Compute Provisioning for ML: Production & Test Environments
How to provision compute resources in production environments and test environments (for example, CPU, GPU)
This guide covers the strategic selection and automation of compute resources (CPU, GPU, and specialized ASICs) for machine learning workloads in both production and test environments, aligned with the AWS Certified Machine Learning Engineer - Associate (MLA-C01) exam.
Learning Objectives
By the end of this guide, you should be able to:
- Differentiate between CPU and GPU requirements for training vs. inference.
- Select appropriate EC2 instance families based on specific ML workloads.
- Compare On-Demand, Provisioned, and Spot resource types.
- Identify infrastructure targets including SageMaker, ECS/EKS, and Lambda.
- Understand the role of Infrastructure as Code (IaC) in automating resource provisioning.
Key Terms & Glossary
- CPU (Central Processing Unit): General-purpose processor; excels at sequential tasks and data preprocessing. Example: Data cleaning using Pandas.
- GPU (Graphics Processing Unit): Parallel processor; optimized for matrix operations essential for Deep Learning. Example: Training a ResNet image classifier.
- AWS Inferentia: Custom AWS chips (ASIC) designed specifically to provide high-performance, low-cost ML inference. Example: Running large-scale NLP model predictions.
- On-Demand Resources: Compute capacity paid for by the second/hour with no long-term commitment.
- Spot Instances: Unused EC2 capacity available at up to 90% discount, but subject to interruption by AWS.
- Provisioned Concurrency: A Lambda setting that keeps functions "warm" to eliminate cold starts for low-latency production needs.
The "Big Idea"
In Machine Learning, compute is the largest cost driver. Efficient provisioning is not just about having "enough" power; it's about matching the compute architecture to the specific stage of the ML lifecycle. Training requires high-throughput parallel processing (GPU), while inference often prioritizes low latency and cost-efficiency (CPU or Inferentia). Using Infrastructure as Code (IaC) ensures that these environments are reproducible, moving seamlessly from a data scientist's test environment to a robust production stack.
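The IaC parity idea can be sketched in a few lines: one parameterized definition produces both the test and the production environment, so the only difference between them is the stage name. This is an illustrative sketch, not a specific AWS API call; the stage names, instance choices, and sizes are assumptions that mirror the guide's recommendations.

```python
# Minimal sketch of environment parity via Infrastructure as Code:
# a single definition parameterized by stage, so Test and Prod differ
# only in capacity, never in structure. Stage names and instance sizes
# below are illustrative assumptions.

STAGE_DEFAULTS = {
    "test": {"instance_type": "ml.t3.medium", "instance_count": 1},  # burstable, cheap
    "prod": {"instance_type": "ml.g5.xlarge", "instance_count": 2},  # GPU, redundant
}

def endpoint_spec(stage, model_name="example-model"):
    """Return an endpoint definition for the given stage ('test' or 'prod')."""
    if stage not in STAGE_DEFAULTS:
        raise ValueError(f"unknown stage: {stage}")
    params = STAGE_DEFAULTS[stage]
    return {
        "EndpointConfigName": f"{model_name}-{stage}",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": params["instance_type"],
                "InitialInstanceCount": params["instance_count"],
            }
        ],
    }
```

Because both environments come from the same function, a change tested against `endpoint_spec("test")` deploys identically via `endpoint_spec("prod")`.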
Formula / Concept Box
| Provisioning Strategy | Use Case | Cost Profile |
|---|---|---|
| On-Demand | Spiky workloads, development, new production launches | Baseline price |
| Spot Instances | Fault-tolerant training, batch processing | Lowest (up to 90% off) |
| Reserved / Savings Plans | Steady-state production workloads | Significant discount (1-3 yr) |
| Auto Scaling | Production environments with fluctuating traffic | Dynamic / Efficient |
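The Spot row above maps directly onto SageMaker's managed Spot training. Below is a hedged sketch of the capacity-related fields of a `create_training_job` request (the bucket name and job name are placeholders); note that Spot jobs require a `MaxWaitTimeInSeconds` at least as large as the runtime limit, and checkpointing to survive interruptions.

```python
# Sketch: the capacity-related portion of a SageMaker CreateTrainingJob
# request using managed Spot capacity. Bucket and job names are
# placeholders; field names follow the boto3 create_training_job API.

def training_job_config(job_name, use_spot=True, max_run=3600, max_wait=7200):
    """Return the resource/capacity fields of a CreateTrainingJob request."""
    config = {
        "TrainingJobName": job_name,
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",  # GPU for deep-learning training
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # Managed Spot: up to ~90% cheaper, but interruptible, so
        # checkpointing to S3 is essential for long runs.
        "EnableManagedSpotTraining": use_spot,
        "StoppingCondition": {"MaxRuntimeInSeconds": max_run},
        "CheckpointConfig": {"S3Uri": "s3://example-bucket/checkpoints/"},  # placeholder
    }
    if use_spot:
        # For Spot jobs, MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds.
        config["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_wait
    return config
```

An On-Demand run is the same request with `use_spot=False`, which drops the wait-time field entirely.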
Hierarchical Outline
- Compute Selection Criteria
- CPU-Optimized (C5/M5): Best for feature engineering and classical ML (Random Forest, XGBoost).
- GPU-Optimized (G4dn/G5): Essential for Deep Learning training and high-end computer vision inference.
- Accelerated Inference (Inf1/Inf2): Best for production-scale deep learning inference cost-optimization.
- Deployment Targets
- SageMaker Endpoints: Fully managed, supports auto-scaling and multi-model hosting.
- Serverless (Lambda): Ideal for lightweight, intermittent inference (max 15 min execution).
- Containers (ECS/EKS): Best for microservices architectures requiring fine-grained control.
- Automation & Environment Management
- CloudFormation/CDK: Scripting infrastructure to ensure parity between Test and Prod.
- Scaling Policies: Using metrics like `CPUUtilization` or `InvocationsPerInstance` to trigger scaling.
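The scaling-policy item above can be sketched as the request body an Application Auto Scaling `put_scaling_policy` call takes for a SageMaker endpoint variant. The endpoint/variant names, target value, and cooldowns are illustrative assumptions; the field names follow the Application Auto Scaling API.

```python
# Sketch: a target-tracking scaling policy for a SageMaker endpoint
# variant, keyed on the predefined SageMakerVariantInvocationsPerInstance
# metric. Names and numeric targets are placeholder assumptions.

def invocation_scaling_policy(endpoint, variant, target_per_instance=70.0):
    """Return a put_scaling_policy request body for the given variant."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    return {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_per_instance,  # invocations per instance per minute
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,  # remove capacity cautiously
            "ScaleOutCooldown": 60,  # add capacity quickly on traffic spikes
        },
    }
```

The asymmetric cooldowns reflect the usual production bias: scale out fast to protect latency, scale in slowly to avoid thrashing.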
Visual Anchors
ML Lifecycle Compute Flow
Performance vs. Cost Trade-off
```latex
\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Cost};
  \draw[->] (0,0) -- (0,6) node[above] {Performance};
  \draw[thick, blue] (1,1) .. controls (3,4) .. (5,5.5);
  \node at (1,0.5) {T3 (Test)};
  \node at (3,2.5) {C5 (General)};
  \node at (5,4.5) {G5 (Prod GPU)};
  \draw[dashed] (0,0) -- (5,5) node[midway, sloped, above] {Scaling Curve};
\end{tikzpicture}
```
Definition-Example Pairs
- Provisioned Resources: Compute allocated specifically for a task, regardless of usage.
- Example: A SageMaker `ml.m5.xlarge` instance running 24/7 for a real-time fraud detection API.
- On-Demand Resources: Compute spun up and down based on immediate request.
- Example: A developer launching a notebook instance only during work hours to test a training script.
- Burstable Performance: Instances that provide a baseline level of CPU but can "burst" higher during spikes.
- Example: Using `t2.micro` instances for a small-scale testing API that only receives traffic once an hour.
Worked Examples
Scenario: Transitioning from Test to Production
Task: A team has trained a BERT-based NLP model on a p3.2xlarge (GPU) instance in a test environment. They now need to deploy it for production inference with a requirement for < 100ms latency and minimal cost.
Step-by-Step Selection:
- Evaluate Training Resource: The `p3.2xlarge` is powerful but expensive for 24/7 inference ($3.00+/hr).
- Check Latency Requirements: BERT models are heavy. A standard CPU (C5) might take 300ms+, failing the 100ms requirement.
- Select Production Target:
- Option A: `g4dn.xlarge` (NVIDIA T4 GPU) - balanced cost/performance.
- Option B: `inf1.xlarge` (Inferentia) - higher throughput, lower cost than G4dn.
- Implementation: Use AWS CDK to script the SageMaker endpoint deployment, ensuring the `ProductionVariant` uses the `ml.inf1.xlarge` instance type.
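The implementation step can be sketched as the endpoint configuration the deployment would ultimately produce, whether scripted via CDK or called directly through boto3's `create_endpoint_config`. The model name below is a placeholder for the Neuron-compiled BERT model from the scenario.

```python
# Sketch: the EndpointConfig with a ProductionVariant on ml.inf1.xlarge
# (AWS Inferentia) for cost-optimized BERT inference. The model name is a
# placeholder; field names follow the boto3 create_endpoint_config API.

def inferentia_endpoint_config(model_name="bert-neuron-model"):
    """Return a CreateEndpointConfig request body targeting Inferentia."""
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": "ml.inf1.xlarge",  # Inferentia: low-cost, low-latency inference
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 1.0,
            }
        ],
    }
```

Swapping `InstanceType` to `ml.g4dn.xlarge` yields Option A instead; the rest of the configuration is unchanged, which is exactly why scripting it pays off.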
Checkpoint Questions
- Why is a GPU typically used for Training but potentially swapped for a CPU or Inferentia for Inference?
- Which EC2 instance family is specifically designed for "burstable" performance in test environments?
- What is the primary benefit of using Spot Instances for ML training jobs?
- When should a developer choose AWS Lambda over a SageMaker Real-Time endpoint?
Muddy Points & Cross-Refs
- GPU Underutilization: A common mistake is deploying a small model on a massive GPU. If your GPU utilization is < 20%, consider switching to Amazon Elastic Inference or Inferentia.
- Spot Interruptions: Do not use Spot instances for real-time production endpoints. Use them only for training jobs with checkpointing enabled.
- Cross-Ref: For details on how to scale these resources, see the "Auto Scaling Policies" chapter.
Comparison Tables
SageMaker Endpoint Types
| Type | Best For | Compute Billing |
|---|---|---|
| Real-Time | Persistent, low-latency needs | Per hour, per instance |
| Serverless | Intermittent traffic, small models | Per ms of execution |
| Asynchronous | Large payloads (up to 1GB), long processing | Per hour (can scale to 0) |
| Batch Transform | High-volume offline processing | For duration of the job |
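In a configuration request, the serverless row differs from real-time only in the variant definition: a `ServerlessConfig` block replaces the instance type and count. A hedged sketch, with placeholder memory and concurrency values:

```python
# Sketch: a serverless ProductionVariant for intermittent traffic.
# ServerlessConfig replaces InstanceType/InitialInstanceCount; memory and
# concurrency values below are illustrative assumptions.

def serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Return a ProductionVariant entry for a serverless endpoint."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,      # 1024-6144, in 1 GB increments
            "MaxConcurrency": max_concurrency,  # concurrent invocations before throttling
        },
    }
```

Because billing is per millisecond of execution, a serverless variant can be dramatically cheaper than a per-hour real-time instance when traffic is sparse.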
Instance Family Comparison
| Family | Category | Hardware | ML Phase |
|---|---|---|---|
| T-Series | Burstable | CPU | Dev/Test |
| C-Series | Compute Optimized | CPU | Preprocessing/Inference |
| P/G-Series | Accelerated | NVIDIA GPU | Training/Heavy Inference |
| Inf-Series | Accelerated | AWS Inferentia | High-Scale Inference |