Compute Fundamentals

Understanding compute fundamentals is essential for leveraging the cloud effectively in data science. Three primary compute models dominate the cloud landscape: virtual machines (VMs), containers, and serverless compute. Each abstracts hardware resources in different ways, which impacts how you run data science workloads.

Virtual machines provide a full operating system environment on top of virtualized hardware. Each VM is isolated, running its own OS and processes. For data science, this means you can install any libraries or dependencies you need, just as on a local machine. However, VMs are relatively heavyweight, with startup times ranging from seconds to minutes, and they consume more resources due to the overhead of running full operating systems for each instance.
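For example, on AWS a VM is an EC2 instance. The following is a minimal sketch using boto3; the AMI ID, instance type, and key pair name are placeholders you would replace with values from your own account and region:

import boto3

# Rough sketch of launching a VM (EC2 instance) for a data science job.
# The AMI ID, instance type, and key name are placeholders; substitute
# values that exist in your own AWS account and region.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI (e.g. a data science image)
    InstanceType="m5.xlarge",          # general-purpose CPU instance
    KeyName="my-keypair",              # placeholder key pair for SSH access
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched instance {instance_id}; install libraries on it as on a local machine.")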

Containers offer a lighter-weight abstraction. Instead of virtualizing hardware, containers share the host OS kernel but isolate the application and its dependencies. This makes containers much faster to start and stop than VMs, and they are easier to scale horizontally. In data science, containers are ideal for packaging code, libraries, and environment configuration, ensuring consistent execution across development, testing, and production. Tools like Docker build and run containers, while orchestrators such as Kubernetes schedule and scale them, making it straightforward to manage complex workflows.
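As an illustrative sketch, the Docker SDK for Python can start a containerized workload in a few lines, assuming Docker and the docker package are installed; the image tag and the command are placeholders:

import docker

# Minimal sketch: run a data-prep step inside a container so the Python
# version and libraries are fixed by the image, not by the host machine.
client = docker.from_env()

output = client.containers.run(
    image="python:3.11-slim",                              # base image pinning the runtime
    command=["python", "-c", "print('clean data here')"],  # placeholder workload
    remove=True,                                           # clean up the container when done
)
print(output.decode())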

Serverless compute takes abstraction a step further. With serverless, you write and deploy code as functions, and the cloud provider automatically handles provisioning, scaling, and managing the underlying resources. You are billed only for the compute time consumed by your code. This model is highly efficient for event-driven tasks, such as data preprocessing or responding to triggers, but it can be limiting for long-running or stateful data science workloads.
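A serverless function is usually just a handler that the platform invokes once per event. The sketch below follows the AWS Lambda Python handler convention; the event shape and the preprocessing step are assumptions for illustration:

import json

# Sketch of a serverless (AWS Lambda-style) function that preprocesses one
# record per invocation. The provider provisions and scales the compute;
# you are billed only for the execution time of this handler.
def handler(event, context):
    # Assume the triggering event carries a single raw record as JSON.
    record = json.loads(event["body"])

    # Simple, stateless preprocessing: normalize a numeric field.
    record["value_scaled"] = record["value"] / 100.0

    return {
        "statusCode": 200,
        "body": json.dumps(record),
    }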

These models differ in how they abstract resources, impacting flexibility, scalability, and operational complexity. For data science, the choice affects how you package code, manage dependencies, and scale compute for tasks like data cleaning, model training, and inference.

Choosing the right compute model also depends on understanding the underlying hardware and computation patterns. CPUs are general-purpose processors suited for a wide range of tasks, including many data science operations like data wrangling and basic statistical analysis. GPUs (graphics processing units) excel at parallel processing, making them ideal for deep learning, large-scale matrix operations, and other compute-intensive workloads. Accelerators, such as TPUs (tensor processing units), are specialized hardware designed for specific machine learning tasks and can offer even greater performance for supported workloads.
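To make the CPU/GPU distinction concrete, here is a small sketch using PyTorch (assuming it is installed): the same matrix multiplication runs on whichever device is available, and on a GPU this kind of dense, parallel linear algebra is typically far faster.

import torch

# Sketch: the same large matrix multiplication on CPU or GPU. GPUs usually
# finish dense, parallel linear algebra like this much faster; exact speedups
# depend on the hardware, so treat this only as an illustration.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

c = a @ b  # dense matrix multiply: runs on CPU, but well suited to GPU parallelism
print(f"Computed a {tuple(c.shape)} product on {device}")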

Another key architectural distinction is between stateless and stateful computation. Stateless computation does not retain information between executions; each task is independent. Serverless functions are typically stateless, which makes them highly scalable and easy to reproduce: every invocation starts from the same initial state. This is ideal for tasks like batch data transformations or model inference on independent data points.
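As a sketch, a stateless step is simply a pure function of its input plus fixed deployment-time parameters; nothing carries over between calls, so data points can be processed independently on any worker. The model weights below are made up for illustration:

# Sketch of stateless inference: the prediction depends only on the input
# features and fixed model weights, so every invocation starts identically
# and data points can be processed independently, in any order, on any worker.
WEIGHTS = {"bias": 0.5, "slope": 2.0}  # pretend model parameters, fixed at deploy time

def predict(features: dict) -> float:
    return WEIGHTS["bias"] + WEIGHTS["slope"] * features["x"]

batch = [{"x": 1.0}, {"x": 2.5}, {"x": -0.3}]
print([predict(p) for p in batch])  # each point is handled independently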

Stateful computation, on the other hand, retains information or context between executions. This is common in long-running model training jobs or interactive data exploration sessions, where progress and data must be preserved. VMs and containers can support stateful workloads by maintaining persistent storage or in-memory state, but this comes with additional complexity in scaling and recovery. Understanding whether your workload is stateless or stateful helps determine which compute model is most appropriate for your data science task.
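A stateful job, by contrast, must preserve its progress. The sketch below checkpoints a training loop to a file after each epoch so a restarted VM or container can resume where it left off; the checkpoint path and the training step are stand-ins:

import json
import os

CHECKPOINT = "train_state.json"  # placeholder path on persistent storage

# Sketch of a stateful training loop: progress (epoch counter and a running
# metric) is written to persistent storage so a restarted VM or container
# can pick up where it left off instead of starting from scratch.
def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"epoch": 0, "loss": None}

def save_state(state):
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

state = load_state()
for epoch in range(state["epoch"], 10):
    state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}  # stand-in training step
    save_state(state)  # checkpoint after every epoch

print(f"Finished at epoch {state['epoch']} with loss {state['loss']}")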

When deciding between VMs, containers, and serverless compute, you must weigh several trade-offs and limitations. Cost-performance is a primary consideration: VMs offer flexibility and can be cost-effective for long-running, resource-intensive jobs, but may be wasteful for bursty or short-lived tasks due to their persistent resource allocation. Containers improve resource utilization and can reduce costs through faster scaling and higher density. Serverless compute eliminates idle costs, as you pay only for actual usage, but may incur higher per-unit compute charges and can be restricted by execution timeouts or limited hardware options.
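This cost trade-off can be made concrete with a rough back-of-the-envelope comparison. The rates below are hypothetical placeholders rather than real prices, but the structure of the calculation carries over:

# Hypothetical rates for illustration only; substitute real prices from your
# cloud provider before drawing any conclusions.
VM_HOURLY = 0.20                  # $/hour for an always-on VM
SERVERLESS_PER_SECOND = 0.00005   # $/second of function execution, incl. request fees

hours_in_month = 730
seconds_of_actual_compute = 2 * 3600  # bursty workload: only ~2 hours of compute per month

vm_cost = VM_HOURLY * hours_in_month                                  # pay for idle time too
serverless_cost = SERVERLESS_PER_SECOND * seconds_of_actual_compute   # pay only per use

print(f"Always-on VM: ${vm_cost:.2f}/month")
print(f"Serverless:   ${serverless_cost:.2f}/month for the same bursty workload")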

Operational overhead also differs. Managing VMs requires patching, monitoring, and scaling the OS and applications, which increases complexity. Containers simplify deployment and scaling but still require orchestration and monitoring. Serverless minimizes operational tasks, as the provider manages nearly everything, but you may have less control over the runtime environment and face challenges in debugging or handling stateful workloads.

For data science, choose VMs for custom environments, legacy dependencies, or stateful, long-running jobs. Opt for containers when you need consistent environments, rapid scaling, or want to orchestrate complex pipelines. Use serverless for event-driven, stateless tasks where minimal operational effort and cost efficiency are priorities. Each model has its place, and understanding their trade-offs ensures you pick the right tool for your data science workload.
