
Compute Provisioning for ML: Production & Test Environments

How to provision compute resources in production environments and test environments (for example, CPU, GPU)


This guide covers the strategic selection and automation of compute resources (CPU, GPU, and specialized ASICs) for machine learning workloads in both production and test environments, aligned with the AWS Certified Machine Learning Engineer - Associate (MLA-C01) exam.

Learning Objectives

By the end of this guide, you should be able to:

  • Differentiate between CPU and GPU requirements for training vs. inference.
  • Select appropriate EC2 instance families based on specific ML workloads.
  • Compare On-Demand, Provisioned, and Spot resource types.
  • Identify infrastructure targets including SageMaker, ECS/EKS, and Lambda.
  • Understand the role of Infrastructure as Code (IaC) in automating resource provisioning.

Key Terms & Glossary

  • CPU (Central Processing Unit): General-purpose processor; excels at sequential tasks and data preprocessing. Example: Data cleaning using Pandas.
  • GPU (Graphics Processing Unit): Parallel processor; optimized for matrix operations essential for Deep Learning. Example: Training a ResNet image classifier.
  • AWS Inferentia: Custom AWS chips (ASIC) designed specifically to provide high-performance, low-cost ML inference. Example: Running large-scale NLP model predictions.
  • On-Demand Resources: Compute capacity paid for by the second/hour with no long-term commitment.
  • Spot Instances: Unused EC2 capacity available at up to 90% discount, but subject to interruption by AWS.
  • Provisioned Concurrency: A Lambda setting that keeps functions "warm" to eliminate cold starts for low-latency production needs.

The "Big Idea"

In Machine Learning, compute is the largest cost driver. Efficient provisioning is not just about having "enough" power; it's about matching the compute architecture to the specific stage of the ML lifecycle. Training requires high-throughput parallel processing (GPU), while inference often prioritizes low latency and cost-efficiency (CPU or Inferentia). Using Infrastructure as Code (IaC) ensures that these environments are reproducible, moving seamlessly from a data scientist's test environment to a robust production stack.

Formula / Concept Box

Provisioning Strategy    | Use Case                                              | Cost Profile
On-Demand                | Spiky workloads, development, new production launches | Baseline price
Spot Instances           | Fault-tolerant training, batch processing             | Lowest (up to 90% off)
Reserved / Savings Plans | Steady-state production workloads                     | Significant discount (1-3 yr)
Auto Scaling             | Production environments with fluctuating traffic      | Dynamic / efficient
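The trade-off in the table above can be made concrete with simple arithmetic. The sketch below compares the effective cost of an On-Demand training run against a Spot run that is interrupted and must redo some work; the hourly rates are illustrative assumptions, not current AWS prices.

```python
def effective_cost(hourly_rate, train_hours, interruptions=0, redo_hours_per_interruption=0.0):
    """Total compute cost, counting work re-done after Spot interruptions."""
    total_hours = train_hours + interruptions * redo_hours_per_interruption
    return hourly_rate * total_hours

# Illustrative rates only -- check current AWS pricing for real numbers.
ON_DEMAND = 3.06          # roughly a p3.2xlarge-class GPU instance, $/hr
SPOT = ON_DEMAND * 0.30   # assuming a ~70% Spot discount

on_demand_cost = effective_cost(ON_DEMAND, train_hours=10)
# Spot job interrupted twice, losing ~0.5 hr of progress each time
spot_cost = effective_cost(SPOT, train_hours=10,
                           interruptions=2, redo_hours_per_interruption=0.5)
print(f"On-Demand: ${on_demand_cost:.2f}, Spot: ${spot_cost:.2f}")
```

Even with the interruption overhead, the Spot run comes out far cheaper, which is why Spot is the default choice for fault-tolerant training.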

Hierarchical Outline

  1. Compute Selection Criteria
    • CPU-based (C5 compute-optimized, M5 general-purpose): Best for feature engineering and classical ML (Random Forest, XGBoost).
    • GPU-Optimized (G4dn/G5): Essential for Deep Learning training and high-end computer vision inference.
    • Accelerated Inference (Inf1/Inf2): Best for production-scale deep learning inference cost-optimization.
  2. Deployment Targets
    • SageMaker Endpoints: Fully managed, supports auto-scaling and multi-model hosting.
    • Serverless (Lambda): Ideal for lightweight, intermittent inference (max 15 min execution).
    • Containers (ECS/EKS): Best for microservices architectures requiring fine-grained control.
  3. Automation & Environment Management
    • CloudFormation/CDK: Scripting infrastructure to ensure parity between Test and Prod.
    • Scaling Policies: Using metrics like CPUUtilization or InvocationsPerInstance to trigger scaling.
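As a sketch of what such a scaling policy looks like in practice, the snippet below assembles the two request payloads that Application Auto Scaling expects for a SageMaker endpoint variant (`register_scalable_target` and `put_scaling_policy`). The endpoint and variant names are hypothetical, and the dicts are only constructed here, not sent to AWS.

```python
def target_tracking_policy(endpoint_name, variant_name, target_invocations,
                           min_capacity=1, max_capacity=4):
    """Build Application Auto Scaling payloads for a SageMaker endpoint variant."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    register_target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    scaling_policy = {
        "PolicyName": f"{variant_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scale so each instance handles ~target_invocations requests/min
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return register_target, scaling_policy

# Hypothetical endpoint; in real use, pass these to
# boto3.client("application-autoscaling").register_scalable_target(**target)
# and .put_scaling_policy(**policy).
target, policy = target_tracking_policy("fraud-detector", "AllTraffic", target_invocations=200)
```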

Visual Anchors

ML Lifecycle Compute Flow

[Diagram: data preparation (CPU) → training (GPU) → production inference (CPU / Inferentia)]

Performance vs. Cost Trade-off

\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Cost};
  \draw[->] (0,0) -- (0,6) node[above] {Performance};
  \draw[thick, blue] (1,1) .. controls (3,4) .. (5,5.5);
  \node at (1,0.5) {T3 (Test)};
  \node at (3,2.5) {C5 (General)};
  \node at (5,4.5) {G5 (Prod GPU)};
  \draw[dashed] (0,0) -- (5,5) node[midway, sloped, above] {Scaling Curve};
\end{tikzpicture}

Definition-Example Pairs

  • Provisioned Resources: Compute allocated specifically for a task, regardless of usage.
    • Example: A SageMaker ml.m5.xlarge instance running 24/7 for a real-time fraud detection API.
  • On-Demand Resources: Compute spun up and down based on immediate request.
    • Example: A developer launching a notebook instance only during work hours to test a training script.
  • Burstable Performance: Instances that provide a baseline level of CPU but can "burst" higher during spikes.
    • Example: Using t2.micro instances for a small-scale testing API that only receives traffic once an hour.
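The burstable behavior above can be sketched as a credit bucket: the instance earns credits at a steady rate and spends them whenever CPU usage exceeds the baseline. All rates below are illustrative assumptions, not the real accounting for any specific T-series size.

```python
def simulate_credits(usage_pct, earn_per_hour, baseline_pct,
                     spend_per_pct=0.5, start=0.0, cap=144.0):
    """Track CPU credits hour by hour; returns the balance after each hour."""
    balance, history = start, []
    for pct in usage_pct:
        balance += earn_per_hour                 # credits earned this hour
        burst = max(0.0, pct - baseline_pct)     # usage above baseline burns credits
        balance -= burst * spend_per_pct
        balance = min(max(balance, 0.0), cap)    # bucket is bounded at both ends
        history.append(balance)
    return history

# Mostly idle test API with one hourly traffic spike (hypothetical workload)
usage = [5, 5, 5, 80, 5, 5]
print(simulate_credits(usage, earn_per_hour=12, baseline_pct=10))
```

The idle hours build up credits, the spike spends them, and the balance recovers afterward, which is exactly why T-series instances suit low-duty-cycle test workloads but not sustained production traffic.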

Worked Examples

Scenario: Transitioning from Test to Production

Task: A team has trained a BERT-based NLP model on a p3.2xlarge (GPU) instance in a test environment. They now need to deploy it for production inference with a requirement for < 100ms latency and minimal cost.

Step-by-Step Selection:

  1. Evaluate Training Resource: The p3.2xlarge is powerful but expensive for 24/7 inference ($3.00+/hr).
  2. Check Latency Requirements: BERT models are heavy. A standard CPU (C5) might take 300ms+, failing the 100ms requirement.
  3. Select Production Target:
    • Option A: g4dn.xlarge (NVIDIA T4 GPU) - Balanced cost/performance.
    • Option B: inf1.xlarge (Inferentia) - Higher throughput, lower cost than G4dn.
  4. Implementation: Use AWS CDK to script the SageMaker endpoint deployment, ensuring the ProductionVariant uses the ml.inf1.xlarge instance type.
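Whether scripted with the CDK or raw boto3, step 4 boils down to an endpoint configuration whose production variant names the Inferentia instance type. The sketch below only assembles the `create_endpoint_config` request (model and config names are hypothetical); in real use you would pass it to `boto3.client("sagemaker").create_endpoint_config(**config)`.

```python
def inf1_endpoint_config(config_name, model_name,
                         instance_type="ml.inf1.xlarge", count=1):
    """Build a SageMaker endpoint-config request with one production variant."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": count,
                "InitialVariantWeight": 1.0,
            }
        ],
    }

# Hypothetical names; note the model must be compiled for Inferentia
# (e.g. with the Neuron SDK / SageMaker Neo) before it can run on inf1.
config = inf1_endpoint_config("bert-nlp-prod", "bert-nlp-model-v1")
```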

Checkpoint Questions

  1. Why is a GPU typically used for Training but potentially swapped for a CPU or Inferentia for Inference?
  2. Which EC2 instance family is specifically designed for "burstable" performance in test environments?
  3. What is the primary benefit of using Spot Instances for ML training jobs?
  4. When should a developer choose AWS Lambda over a SageMaker Real-Time endpoint?

Muddy Points & Cross-Refs

  • GPU Underutilization: A common mistake is deploying a small model on a massive GPU. If your GPU utilization is < 20%, consider switching to a smaller GPU instance or Inferentia. (Amazon Elastic Inference, an older option for this, has been deprecated by AWS.)
  • Spot Interruptions: Do not use Spot instances for real-time production endpoints. Use them only for training jobs with checkpointing enabled.
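In the SageMaker Python SDK, managed Spot training with checkpointing comes down to a few `Estimator` arguments. The sketch below collects them as keyword arguments rather than calling the SDK (the S3 bucket name is hypothetical); note that `max_wait` must be at least `max_run` when Spot is enabled.

```python
def spot_training_kwargs(checkpoint_bucket, max_run_hours=4, extra_wait_hours=2):
    """Keyword arguments that enable managed Spot training with checkpointing
    on a sagemaker.estimator.Estimator."""
    max_run = max_run_hours * 3600
    return {
        "use_spot_instances": True,
        "max_run": max_run,                             # training time limit (seconds)
        "max_wait": max_run + extra_wait_hours * 3600,  # also covers waiting for Spot capacity
        "checkpoint_s3_uri": f"s3://{checkpoint_bucket}/checkpoints/",  # resume point
    }

kwargs = spot_training_kwargs("my-ml-bucket")  # hypothetical bucket name
```

With `checkpoint_s3_uri` set, a training script that saves and restores checkpoints can resume after an interruption instead of starting over.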
  • Cross-Ref: For details on how to scale these resources, see the "Auto Scaling Policies" chapter.

Comparison Tables

SageMaker Endpoint Types

Type            | Best For                                     | Compute Billing
Real-Time       | Persistent, low-latency needs                | Per hour, per instance
Serverless      | Intermittent traffic, small models           | Per ms of execution
Asynchronous    | Large payloads (up to 1 GB), long processing | Per hour (can scale to 0)
Batch Transform | High-volume offline processing               | For the duration of the job

Instance Family Comparison

Family     | Category          | Hardware       | ML Phase
T-Series   | Burstable         | CPU            | Dev/Test
C-Series   | Compute Optimized | CPU            | Preprocessing / Inference
P/G-Series | Accelerated       | NVIDIA GPU     | Training / Heavy Inference
Inf-Series | Accelerated       | AWS Inferentia | High-Scale Inference
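For revision purposes, the table above can be folded into a small lookup helper. The mapping simply mirrors the table rows; it is a study aid, not an official AWS sizing guide.

```python
# Phase-to-family mapping taken directly from the comparison table above.
FAMILY_BY_PHASE = {
    "dev_test": "T-Series (burstable CPU)",
    "preprocessing": "C-Series (compute-optimized CPU)",
    "training": "P/G-Series (NVIDIA GPU)",
    "heavy_inference": "P/G-Series (NVIDIA GPU)",
    "scale_inference": "Inf-Series (AWS Inferentia)",
}

def recommend_family(phase):
    """Return the instance family the comparison table suggests for an ML phase."""
    try:
        return FAMILY_BY_PHASE[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}")

print(recommend_family("training"))
```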
