Compute Provisioning for ML: Production & Test Environments
How to provision compute resources in production environments and test environments (for example, CPU, GPU)
This guide covers the strategic selection and automation of compute resources (CPU, GPU, and specialized ASICs) for machine learning workloads in both production and test environments, aligned with the AWS Certified Machine Learning Engineer - Associate (MLA-C01) exam.
Learning Objectives
By the end of this guide, you should be able to:
- Differentiate between CPU and GPU requirements for training vs. inference.
- Select appropriate EC2 instance families based on specific ML workloads.
- Compare On-Demand, Provisioned, and Spot resource types.
- Identify infrastructure targets including SageMaker, ECS/EKS, and Lambda.
- Understand the role of Infrastructure as Code (IaC) in automating resource provisioning.
Key Terms & Glossary
- CPU (Central Processing Unit): General-purpose processor; excels at sequential tasks and data preprocessing. Example: Data cleaning using Pandas.
- GPU (Graphics Processing Unit): Parallel processor; optimized for matrix operations essential for Deep Learning. Example: Training a ResNet image classifier.
- AWS Inferentia: Custom AWS chips (ASIC) designed specifically to provide high-performance, low-cost ML inference. Example: Running large-scale NLP model predictions.
- On-Demand Resources: Compute capacity paid for by the second/hour with no long-term commitment.
- Spot Instances: Unused EC2 capacity available at up to 90% discount, but subject to interruption by AWS.
- Provisioned Concurrency: A Lambda setting that keeps functions "warm" to eliminate cold starts for low-latency production needs.
The "Big Idea"
In Machine Learning, compute is the largest cost driver. Efficient provisioning is not just about having "enough" power; it's about matching the compute architecture to the specific stage of the ML lifecycle. Training requires high-throughput parallel processing (GPU), while inference often prioritizes low latency and cost-efficiency (CPU or Inferentia). Using Infrastructure as Code (IaC) ensures that these environments are reproducible, moving seamlessly from a data scientist's test environment to a robust production stack.
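The IaC parity idea can be sketched in a few lines: one parameterized definition produces both the test and the production environment, so the only difference between them is the stage name. This is an illustrative sketch, not a specific AWS API call; the stage names, instance choices, and sizes are assumptions that mirror the guide's recommendations.

```python
# Minimal sketch of environment parity via Infrastructure as Code:
# a single definition parameterized by stage, so Test and Prod differ
# only in capacity, never in structure. Stage names and instance sizes
# below are illustrative assumptions.

STAGE_DEFAULTS = {
    "test": {"instance_type": "ml.t3.medium", "instance_count": 1},  # burstable, cheap
    "prod": {"instance_type": "ml.g5.xlarge", "instance_count": 2},  # GPU, redundant
}

def endpoint_spec(stage, model_name="example-model"):
    """Return an endpoint definition for the given stage ('test' or 'prod')."""
    if stage not in STAGE_DEFAULTS:
        raise ValueError(f"unknown stage: {stage}")
    params = STAGE_DEFAULTS[stage]
    return {
        "EndpointConfigName": f"{model_name}-{stage}",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": params["instance_type"],
                "InitialInstanceCount": params["instance_count"],
            }
        ],
    }
```

Because both environments come from the same function, a change tested against `endpoint_spec("test")` deploys identically via `endpoint_spec("prod")`.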
Formula / Concept Box
| Provisioning Strategy | Use Case | Cost Profile |
|---|---|---|
| On-Demand | Spiky workloads, development, new production launches | Baseline price |
| Spot Instances | Fault-tolerant training, batch processing | Lowest (up to 90% off) |
| Reserved / Savings Plans | Steady-state production workloads | Significant discount (1-3 yr) |
| Auto Scaling | Production environments with fluctuating traffic | Dynamic / Efficient |
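The Spot row above maps directly onto SageMaker's managed Spot training. Below is a hedged sketch of the capacity-related fields of a `create_training_job` request (the bucket name and job name are placeholders); note that Spot jobs require a `MaxWaitTimeInSeconds` at least as large as the runtime limit, and checkpointing to survive interruptions.

```python
# Sketch: the capacity-related portion of a SageMaker CreateTrainingJob
# request using managed Spot capacity. Bucket and job names are
# placeholders; field names follow the boto3 create_training_job API.

def training_job_config(job_name, use_spot=True, max_run=3600, max_wait=7200):
    """Return the resource/capacity fields of a CreateTrainingJob request."""
    config = {
        "TrainingJobName": job_name,
        "ResourceConfig": {
            "InstanceType": "ml.p3.2xlarge",  # GPU for deep-learning training
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        # Managed Spot: up to ~90% cheaper, but interruptible, so
        # checkpointing to S3 is essential for long runs.
        "EnableManagedSpotTraining": use_spot,
        "StoppingCondition": {"MaxRuntimeInSeconds": max_run},
        "CheckpointConfig": {"S3Uri": "s3://example-bucket/checkpoints/"},  # placeholder
    }
    if use_spot:
        # For Spot jobs, MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds.
        config["StoppingCondition"]["MaxWaitTimeInSeconds"] = max_wait
    return config
```

An On-Demand run is the same request with `use_spot=False`, which drops the wait-time field entirely.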
Hierarchical Outline
- Compute Selection Criteria
- CPU-Optimized (C5/M5): Best for feature engineering and classical ML (Random Forest, XGBoost).
- GPU-Optimized (G4dn/G5): Essential for Deep Learning training and high-end computer vision inference.
- Accelerated Inference (Inf1/Inf2): Best for production-scale deep learning inference cost-optimization.
- Deployment Targets
- SageMaker Endpoints: Fully managed, supports auto-scaling and multi-model hosting.
- Serverless (Lambda): Ideal for lightweight, intermittent inference (max 15 min execution).
- Containers (ECS/EKS): Best for microservices architectures requiring fine-grained control.
- Automation & Environment Management
- CloudFormation/CDK: Scripting infrastructure to ensure parity between Test and Prod.
- Scaling Policies: Using metrics like `CPUUtilization` or `InvocationsPerInstance` to trigger scaling.
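The scaling-policy item above can be sketched as the request body an Application Auto Scaling `put_scaling_policy` call takes for a SageMaker endpoint variant. The endpoint/variant names, target value, and cooldowns are illustrative assumptions; the field names follow the Application Auto Scaling API.

```python
# Sketch: a target-tracking scaling policy for a SageMaker endpoint
# variant, keyed on the predefined SageMakerVariantInvocationsPerInstance
# metric. Names and numeric targets are placeholder assumptions.

def invocation_scaling_policy(endpoint, variant, target_per_instance=70.0):
    """Return a put_scaling_policy request body for the given variant."""
    resource_id = f"endpoint/{endpoint}/variant/{variant}"
    return {
        "PolicyName": f"{endpoint}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_per_instance,  # invocations per instance per minute
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,  # remove capacity cautiously
            "ScaleOutCooldown": 60,  # add capacity quickly on traffic spikes
        },
    }
```

The asymmetric cooldowns reflect the usual production bias: scale out fast to protect latency, scale in slowly to avoid thrashing.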
Visual Anchors
ML Lifecycle Compute Flow
Performance vs. Cost Trade-off
```latex
\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Cost};
  \draw[->] (0,0) -- (0,6) node[above] {Performance};
  \draw[thick, blue] (1,1) .. controls (3,4) .. (5,5.5);
  \node at (1,0.5) {T3 (Test)};
  \node at (3,2.5) {C5 (General)};
  \node at (5,4.5) {G5 (Prod GPU)};
  \draw[dashed] (0,0) -- (5,5) node[midway, sloped, above] {Scaling Curve};
\end{tikzpicture}
```
Definition-Example Pairs
- Provisioned Resources: Compute allocated specifically for a task, regardless of usage.
- Example: A SageMaker `ml.m5.xlarge` instance running 24/7 for a real-time fraud detection API.
- On-Demand Resources: Compute spun up and down based on immediate request.
- Example: A developer launching a notebook instance only during work hours to test a training script.
- Burstable Performance: Instances that provide a baseline level of CPU but can "burst" higher during spikes.
- Example: Using `t2.micro` instances for a small-scale testing API that only receives traffic once an hour.
Worked Examples
Scenario: Transitioning from Test to Production
Task: A team has trained a BERT-based NLP model on a p3.2xlarge (GPU) instance in a test environment. They now need to deploy it for production inference with a requirement for < 100ms latency and minimal cost.
Step-by-Step Selection:
- Evaluate Training Resource: The `p3.2xlarge` is powerful but expensive for 24/7 inference ($3.00+/hr).
- Check Latency Requirements: BERT models are heavy. A standard CPU (C5) might take 300ms+, failing the 100ms requirement.
- Select Production Target:
- Option A: `g4dn.xlarge` (NVIDIA T4 GPU) - balanced cost/performance.
- Option B: `inf1.xlarge` (Inferentia) - higher throughput, lower cost than G4dn.
- Implementation: Use AWS CDK to script the SageMaker endpoint deployment, ensuring the `ProductionVariant` uses the `ml.inf1.xlarge` instance type.
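The implementation step can be sketched as the endpoint configuration the deployment would ultimately produce, whether scripted via CDK or called directly through boto3's `create_endpoint_config`. The model name below is a placeholder for the Neuron-compiled BERT model from the scenario.

```python
# Sketch: the EndpointConfig with a ProductionVariant on ml.inf1.xlarge
# (AWS Inferentia) for cost-optimized BERT inference. The model name is a
# placeholder; field names follow the boto3 create_endpoint_config API.

def inferentia_endpoint_config(model_name="bert-neuron-model"):
    """Return a CreateEndpointConfig request body targeting Inferentia."""
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": "ml.inf1.xlarge",  # Inferentia: low-cost, low-latency inference
                "InitialInstanceCount": 1,
                "InitialVariantWeight": 1.0,
            }
        ],
    }
```

Swapping `InstanceType` to `ml.g4dn.xlarge` yields Option A instead; the rest of the configuration is unchanged, which is exactly why scripting it pays off.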
Checkpoint Questions
- Why is a GPU typically used for Training but potentially swapped for a CPU or Inferentia for Inference?
- Which EC2 instance family is specifically designed for "burstable" performance in test environments?
- What is the primary benefit of using Spot Instances for ML training jobs?
- When should a developer choose AWS Lambda over a SageMaker Real-Time endpoint?
Muddy Points & Cross-Refs
- GPU Underutilization: A common mistake is deploying a small model on a massive GPU. If your GPU utilization is < 20%, consider switching to Amazon Elastic Inference or Inferentia.
- Spot Interruptions: Do not use Spot instances for real-time production endpoints. Use them only for training jobs with checkpointing enabled.
- Cross-Ref: For details on how to scale these resources, see the "Auto Scaling Policies" chapter.
Comparison Tables
SageMaker Endpoint Types
| Type | Best For | Compute Billing |
|---|---|---|
| Real-Time | Persistent, low-latency needs | Per hour, per instance |
| Serverless | Intermittent traffic, small models | Per ms of execution |
| Asynchronous | Large payloads (up to 1GB), long processing | Per hour (can scale to 0) |
| Batch Transform | High-volume offline processing | For duration of the job |
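In a configuration request, the serverless row differs from real-time only in the variant definition: a `ServerlessConfig` block replaces the instance type and count. A hedged sketch, with placeholder memory and concurrency values:

```python
# Sketch: a serverless ProductionVariant for intermittent traffic.
# ServerlessConfig replaces InstanceType/InitialInstanceCount; memory and
# concurrency values below are illustrative assumptions.

def serverless_variant(model_name, memory_mb=2048, max_concurrency=5):
    """Return a ProductionVariant entry for a serverless endpoint."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,      # 1024-6144, in 1 GB increments
            "MaxConcurrency": max_concurrency,  # concurrent invocations before throttling
        },
    }
```

Because billing is per millisecond of execution, a serverless variant can be dramatically cheaper than a per-hour real-time instance when traffic is sparse.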
Instance Family Comparison
| Family | Category | Hardware | ML Phase |
|---|---|---|---|
| T-Series | Burstable | CPU | Dev/Test |
| C-Series | Compute Optimized | CPU | Preprocessing/Inference |
| P/G-Series | Accelerated | NVIDIA GPU | Training/Heavy Inference |
| Inf-Series | Accelerated | AWS Inferentia | High-Scale Inference |