Selecting Deployment Infrastructure for ML Workflows
Select deployment infrastructure based on existing architecture and requirements
This guide covers the critical task of choosing the right AWS environment for hosting machine learning models. It focuses on balancing performance, cost, and operational complexity as outlined in the AWS Certified Machine Learning Engineer – Associate (MLA-C01) exam.
Learning Objectives
After studying this guide, you should be able to:
- Evaluate tradeoffs between managed (SageMaker) and unmanaged (ECS/EKS) deployment targets.
- Select the appropriate endpoint type (Real-time, Asynchronous, Serverless, or Batch) based on latency and traffic requirements.
- Identify optimal compute resources (CPU vs. GPU) and pricing models (On-demand vs. Spot) for specific inference workloads.
- Apply scaling policies based on metrics like model latency or invocations per instance.
Key Terms & Glossary
- Inference: The process of using a trained model to make predictions on new, unseen data.
- Managed Service: A service where AWS handles the underlying infrastructure (patching, scaling, hardware), such as SageMaker.
- BYOC (Bring Your Own Container): A deployment pattern where you package your own Docker image to run on SageMaker or ECS to support custom libraries.
- Cold Start: The delay experienced when a serverless function (like Lambda or SageMaker Serverless Inference) initializes for the first time after being idle.
- Provisioned Concurrency: Pre-allocated capacity that ensures serverless functions are ready to respond immediately, eliminating cold starts.
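To make "Provisioned Concurrency" concrete, here is a minimal sketch of how it appears in a SageMaker Serverless Inference endpoint configuration. The function builds the payload you would pass to `sagemaker_client.create_endpoint_config(**payload)`; names like `fraud-config` are placeholders, and the field names follow the boto3 `CreateEndpointConfig` API.

```python
# Sketch: EndpointConfig payload for SageMaker Serverless Inference with
# provisioned concurrency. "fraud-config"/"fraud-model" are placeholder names.

def serverless_endpoint_config(config_name, model_name,
                               memory_mb=2048, max_concurrency=5,
                               provisioned_concurrency=2):
    """Build a create_endpoint_config payload for a serverless endpoint."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,       # 1024-6144 MB
                "MaxConcurrency": max_concurrency,
                # Pre-warmed capacity that eliminates cold starts; unlike
                # plain serverless, it is billed even while idle.
                "ProvisionedConcurrency": provisioned_concurrency,
            },
        }],
    }

payload = serverless_endpoint_config("fraud-config", "fraud-model")
```

Without `ProvisionedConcurrency`, the endpoint still scales to zero but pays the cold-start penalty on the first request after idling.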
The "Big Idea"
The transition from training to inference represents a shift from high-throughput, batch-oriented computation to high-availability, low-latency service delivery. While training focuses on "learning" patterns (backward pass), deployment focuses on "serving" predictions (forward pass). The challenge is selecting infrastructure that is "just right"—scaling to meet peak demand without overspending during idle periods.
Formula / Concept Box
SageMaker Endpoint Types Comparison
| Feature | Real-Time | Asynchronous | Serverless | Batch Transform |
|---|---|---|---|---|
| Payload Size | Up to 6 MB | Up to 1 GB | Up to 4 MB | Unlimited (S3) |
| Latency | Milliseconds | Minutes | Seconds (Cold start) | Minutes/Hours |
| Cost Model | Hourly (Instance) | Hourly (Instance) | Per-request | Per-job |
| Best For | Low-latency apps | Large files/CV | Intermittent traffic | Large datasets |
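The table above can be read as a decision procedure. The helper below is an illustrative sketch of that logic (not an AWS API); the payload thresholds assume the commonly documented quotas of 6 MB for real-time and 4 MB for serverless invocations.

```python
# Illustrative decision helper mirroring the endpoint comparison table.
# Payload limits (6 MB real-time, 4 MB serverless) are assumed quotas.

def choose_endpoint_type(payload_mb, needs_low_latency,
                         intermittent_traffic, offline_batch):
    """Return the SageMaker endpoint type suggested by the requirements."""
    if offline_batch:
        return "Batch Transform"      # whole dataset in S3, per-job cost
    if payload_mb > 6:
        return "Asynchronous"         # only option above the real-time cap
    if not needs_low_latency:
        return "Asynchronous"         # queue-based, scales to zero
    if intermittent_traffic and payload_mb <= 4:
        return "Serverless"           # per-request billing, cold starts
    return "Real-Time"                # consistent millisecond latency
```

For example, a 500 MB video payload maps to Asynchronous, while a sub-1 MB fraud check with steady traffic maps to Real-Time.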
Hierarchical Outline
- I. Managed vs. Unmanaged Infrastructure
- Managed (SageMaker AI): Low operational overhead; built-in monitoring; limited deep customization.
- Unmanaged (ECS, EKS, EC2): High flexibility; full control over OS and networking; requires manual patching/scaling management.
- II. Selection Criteria
- Performance: CPU for standard logic; GPU for Deep Learning (Computer Vision/NLP).
- Scalability: Auto-scaling based on InvocationsPerInstance or ModelLatency.
- Availability: Multi-AZ deployments for production workloads.
- III. Infrastructure as Code (IaC)
- CloudFormation: Declarative JSON/YAML templates.
- AWS CDK: Imperative programming (Python/TypeScript) to define cloud resources.
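As a concrete IaC example, here is a hedged sketch of the three CloudFormation resources that make up a real-time endpoint. All ARNs, image URIs, and bucket paths are placeholders; the resource types and attributes follow the `AWS::SageMaker::*` documentation.

```yaml
# Sketch: minimal CloudFormation for a real-time SageMaker endpoint.
# Role ARN, image URI, and S3 path below are placeholders.
Resources:
  DemoModel:
    Type: AWS::SageMaker::Model
    Properties:
      ExecutionRoleArn: arn:aws:iam::123456789012:role/SageMakerRole   # placeholder
      PrimaryContainer:
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest  # placeholder
        ModelDataUrl: s3://my-bucket/model.tar.gz                      # placeholder
  DemoEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - VariantName: AllTraffic
          ModelName: !GetAtt DemoModel.ModelName
          InitialInstanceCount: 1
          InstanceType: ml.c5.large
          InitialVariantWeight: 1.0
  DemoEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointConfigName: !GetAtt DemoEndpointConfig.EndpointConfigName
```

The same stack can be expressed imperatively with the AWS CDK, which synthesizes an equivalent template.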
Visual Anchors
Decision Tree: Selecting a Deployment Target
Logical Architecture of a SageMaker Endpoint
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, rounded corners,
                       minimum width=3cm, minimum height=1cm, align=center}]
  \node (User) [fill=blue!10] {Client Application};
  \node (LB) [below of=User, fill=green!10] {Load Balancer \\ (Managed by SageMaker)};
  \node (Instance1) [below left of=LB, xshift=-1cm, fill=orange!10] {Instance 1 \\ (Model Container)};
  \node (Instance2) [below right of=LB, xshift=1cm, fill=orange!10] {Instance 2 \\ (Model Container)};
  \node (S3) [below of=LB, yshift=-2cm, fill=gray!10] {S3 Artifacts \\ (model.tar.gz)};
  \draw[->, thick] (User) -- (LB);
  \draw[->, thick] (LB) -- (Instance1);
  \draw[->, thick] (LB) -- (Instance2);
  \draw[dashed, ->] (S3) -- (Instance1);
  \draw[dashed, ->] (S3) -- (Instance2);
\end{tikzpicture}
Definition-Example Pairs
- SageMaker Neo: An optimization engine that compiles models for specific hardware.
- Example: Compiling a TensorFlow model to run efficiently on an NVIDIA Jetson edge device in a factory camera.
- Multi-Model Endpoint (MME): Hosting multiple models on a single set of resources to save costs.
- Example: A SaaS company hosting 500 small, customized XGBoost models (one per client) on a single ml.r5.large instance.
- Spot Instances: Using spare AWS capacity at a discount (up to 90%).
- Example: Using Spot instances for a non-urgent nightly Batch Transform job to process 10 million customer reviews.
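The Multi-Model Endpoint pattern above hinges on one request parameter: `TargetModel`, which tells the endpoint which artifact under its S3 prefix to load. A minimal sketch of the invocation kwargs (endpoint name and model key are placeholders) that you would pass to the `sagemaker-runtime` client's `invoke_endpoint(**params)`:

```python
# Sketch: selecting one of many models on a Multi-Model Endpoint.
# "xgb-mme" and the S3 key are placeholder names.

def mme_invoke_params(endpoint_name, model_key, body):
    """Build kwargs for sagemaker_runtime.invoke_endpoint(**params)."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": model_key,      # relative S3 key, e.g. "client-042/model.tar.gz"
        "ContentType": "text/csv",
        "Body": body,
    }

params = mme_invoke_params("xgb-mme", "client-042/model.tar.gz", "5.1,3.5,1.4,0.2")
```

Models not recently used are evicted from memory, so the first request for a cold model is slower; this is the tradeoff for packing hundreds of models onto one instance.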
Worked Examples
Scenario 1: Real-time Credit Card Fraud Detection
- Requirement: Inference must happen in under 100ms during the swipe transaction.
- Target: SageMaker Real-Time Endpoint.
- Instance Choice: Compute-optimized (e.g., ml.c5.large) or GPU (ml.g4dn) if using complex Neural Networks.
- Scaling: Target Tracking Policy on `ModelLatency` to ensure speed stays consistent under load.
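Such a policy is attached through Application Auto Scaling. The sketch below builds the two payloads you would pass to `boto3.client("application-autoscaling")`, first `register_scalable_target(**register)` and then `put_scaling_policy(**policy)`. Endpoint and policy names are placeholders, and since `ModelLatency` is a custom CloudWatch metric (reported in microseconds), it goes in a `CustomizedMetricSpecification`.

```python
# Sketch: target-tracking auto-scaling on the ModelLatency metric.
# "fraud-endpoint" and "keep-latency-low" are placeholder names.

def latency_scaling_policy(endpoint_name, variant="AllTraffic",
                           target_latency_us=100_000):
    """Build register_scalable_target and put_scaling_policy payloads."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 2,    # keep >= 2 instances for availability
        "MaxCapacity": 10,
    }
    policy = {
        "PolicyName": "keep-latency-low",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_latency_us),  # ModelLatency is in microseconds
            "CustomizedMetricSpecification": {
                "MetricName": "ModelLatency",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant},
                ],
                "Statistic": "Average",
            },
        },
    }
    return register, policy
```

If the workload were bottlenecked by request volume rather than latency, the simpler choice is the predefined `SageMakerVariantInvocationsPerInstance` metric instead of a customized specification.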
Scenario 2: Processing Video Uploads for Transcription
- Requirement: Payloads are 500MB; processing takes 3 minutes; client doesn't need an immediate response.
- Target: SageMaker Asynchronous Endpoint.
- Benefit: Supports large payloads and can scale down to zero instances when the queue is empty, saving costs.
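The asynchronous behavior in Scenario 2 comes from one extra block in the endpoint configuration: `AsyncInferenceConfig`. A hedged sketch of that block (bucket path is a placeholder), which you would include in the `create_endpoint_config` call alongside the production variants:

```python
# Sketch: the AsyncInferenceConfig block for create_endpoint_config.
# The S3 output path is a placeholder.

def async_inference_config(output_s3_uri, max_concurrent_per_instance=4):
    """Build the AsyncInferenceConfig value for an async endpoint."""
    return {
        "OutputConfig": {
            # Results are written here; the invoke call returns immediately
            # with only a reference to this output location.
            "S3OutputPath": output_s3_uri,
        },
        "ClientConfig": {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance,
        },
    }

cfg = async_inference_config("s3://my-bucket/async-out/")
```

Scale-to-zero is then achieved by attaching a scaling policy with `MinCapacity: 0` driven by the endpoint's queue-backlog metric, so instances are released once the queue drains.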
Checkpoint Questions
- Which endpoint type should you choose for a model that receives traffic only twice a day for 5 minutes?
- What metric is most appropriate for auto-scaling an endpoint that is bottlenecked by mathematical computations?
- Why would an engineer choose Amazon EKS over SageMaker for model deployment?
- What is the main difference between an "on-demand" resource and a "provisioned" resource in the context of scaling?
Muddy Points & Cross-Refs
- Cold Starts vs. Latency: Students often confuse Serverless cold starts with network latency. Serverless is for cost-saving on intermittent traffic; Real-time is for consistent low latency.
- ECS vs. EKS: If your company already uses Kubernetes for everything, use EKS. If you want simpler container management, use ECS.
- Cross-Ref: See Domain 4: Monitoring for how to use CloudWatch to track the scaling metrics mentioned here.
Comparison Tables
Compute Selection: CPU vs. GPU vs. Inferentia
| Hardware Type | Use Case | Cost | Performance |
|---|---|---|---|
| CPU (C5/M5) | Tabular data (XGBoost, Sklearn) | Low | Moderate |
| GPU (G4dn/G5) | Deep Learning, Image/Video | High | High (Parallel) |
| AWS Inferentia | High-throughput Deep Learning | Medium | Best Price/Perf |
> [!TIP]
> Always check whether your model is supported by SageMaker Neo. It can often make a CPU-based deployment perform as fast as a GPU-based one for a fraction of the cost.