Selecting Deployment Infrastructure for ML Workflows
Select deployment infrastructure based on existing architecture and requirements
This guide covers the critical task of choosing the right AWS environment for hosting machine learning models. It focuses on balancing performance, cost, and operational complexity as outlined in the AWS Certified Machine Learning Engineer – Associate (MLA-C01) exam.
Learning Objectives
After studying this guide, you should be able to:
- Evaluate tradeoffs between managed (SageMaker) and unmanaged (ECS/EKS) deployment targets.
- Select the appropriate endpoint type (Real-time, Asynchronous, Serverless, or Batch) based on latency and traffic requirements.
- Identify optimal compute resources (CPU vs. GPU) and pricing models (On-demand vs. Spot) for specific inference workloads.
- Apply scaling policies based on metrics like model latency or invocations per instance.
Key Terms & Glossary
- Inference: The process of using a trained model to make predictions on new, unseen data.
- Managed Service: A service where AWS handles the underlying infrastructure (patching, scaling, hardware), such as SageMaker.
- BYOC (Bring Your Own Container): A deployment pattern where you package your own Docker image to run on SageMaker or ECS to support custom libraries.
- Cold Start: The delay experienced when a serverless function (like Lambda or SageMaker Serverless Inference) initializes for the first time after being idle.
- Provisioned Concurrency: Pre-allocated capacity that ensures serverless functions are ready to respond immediately, eliminating cold starts.
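To make "Provisioned Concurrency" concrete, here is a minimal sketch of how it appears in a SageMaker Serverless Inference endpoint configuration. The function builds the payload you would pass to `sagemaker_client.create_endpoint_config(**payload)`; names like `fraud-config` are placeholders, and the field names follow the boto3 `CreateEndpointConfig` API.

```python
# Sketch: EndpointConfig payload for SageMaker Serverless Inference with
# provisioned concurrency. "fraud-config"/"fraud-model" are placeholder names.

def serverless_endpoint_config(config_name, model_name,
                               memory_mb=2048, max_concurrency=5,
                               provisioned_concurrency=2):
    """Build a create_endpoint_config payload for a serverless endpoint."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,       # 1024-6144 MB
                "MaxConcurrency": max_concurrency,
                # Pre-warmed capacity that eliminates cold starts; unlike
                # plain serverless, it is billed even while idle.
                "ProvisionedConcurrency": provisioned_concurrency,
            },
        }],
    }

payload = serverless_endpoint_config("fraud-config", "fraud-model")
```

Without `ProvisionedConcurrency`, the endpoint still scales to zero but pays the cold-start penalty on the first request after idling.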
The "Big Idea"
The transition from training to inference represents a shift from high-throughput, batch-oriented computation to high-availability, low-latency service delivery. While training focuses on "learning" patterns (backward pass), deployment focuses on "serving" predictions (forward pass). The challenge is selecting infrastructure that is "just right"—scaling to meet peak demand without overspending during idle periods.
Formula / Concept Box
SageMaker Endpoint Types Comparison
| Feature | Real-Time | Asynchronous | Serverless | Batch Transform |
|---|---|---|---|---|
| Payload Size | Up to 6 MB | Up to 1 GB | Up to 4 MB | Unlimited (S3) |
| Latency | Milliseconds | Minutes | Seconds (Cold start) | Minutes/Hours |
| Cost Model | Hourly (Instance) | Hourly (Instance) | Per-request | Per-job |
| Best For | Low-latency apps | Large files/CV | Intermittent traffic | Large datasets |
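The table above can be read as a decision procedure. The helper below is an illustrative sketch of that logic (not an AWS API); the payload thresholds assume the commonly documented quotas of 6 MB for real-time and 4 MB for serverless invocations.

```python
# Illustrative decision helper mirroring the endpoint comparison table.
# Payload limits (6 MB real-time, 4 MB serverless) are assumed quotas.

def choose_endpoint_type(payload_mb, needs_low_latency,
                         intermittent_traffic, offline_batch):
    """Return the SageMaker endpoint type suggested by the requirements."""
    if offline_batch:
        return "Batch Transform"      # whole dataset in S3, per-job cost
    if payload_mb > 6:
        return "Asynchronous"         # only option above the real-time cap
    if not needs_low_latency:
        return "Asynchronous"         # queue-based, scales to zero
    if intermittent_traffic and payload_mb <= 4:
        return "Serverless"           # per-request billing, cold starts
    return "Real-Time"                # consistent millisecond latency
```

For example, a 500 MB video payload maps to Asynchronous, while a sub-1 MB fraud check with steady traffic maps to Real-Time.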
Hierarchical Outline
- I. Managed vs. Unmanaged Infrastructure
- Managed (SageMaker AI): Low operational overhead; built-in monitoring; limited deep customization.
- Unmanaged (ECS, EKS, EC2): High flexibility; full control over OS and networking; requires manual patching/scaling management.
- II. Selection Criteria
- Performance: CPU for standard logic; GPU for Deep Learning (Computer Vision/NLP).
- Scalability: Auto-scaling based on InvocationsPerInstance or ModelLatency.
- Availability: Multi-AZ deployments for production workloads.
- III. Infrastructure as Code (IaC)
- CloudFormation: Declarative JSON/YAML templates.
- AWS CDK: Imperative programming (Python/TypeScript) to define cloud resources.
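As a concrete IaC example, here is a hedged sketch of the three CloudFormation resources that make up a real-time endpoint. All ARNs, image URIs, and bucket paths are placeholders; the resource types and attributes follow the `AWS::SageMaker::*` documentation.

```yaml
# Sketch: minimal CloudFormation for a real-time SageMaker endpoint.
# Role ARN, image URI, and S3 path below are placeholders.
Resources:
  DemoModel:
    Type: AWS::SageMaker::Model
    Properties:
      ExecutionRoleArn: arn:aws:iam::123456789012:role/SageMakerRole   # placeholder
      PrimaryContainer:
        Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest  # placeholder
        ModelDataUrl: s3://my-bucket/model.tar.gz                      # placeholder
  DemoEndpointConfig:
    Type: AWS::SageMaker::EndpointConfig
    Properties:
      ProductionVariants:
        - VariantName: AllTraffic
          ModelName: !GetAtt DemoModel.ModelName
          InitialInstanceCount: 1
          InstanceType: ml.c5.large
          InitialVariantWeight: 1.0
  DemoEndpoint:
    Type: AWS::SageMaker::Endpoint
    Properties:
      EndpointConfigName: !GetAtt DemoEndpointConfig.EndpointConfigName
```

The same stack can be expressed imperatively with the AWS CDK, which synthesizes an equivalent template.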
Visual Anchors
Decision Tree: Selecting a Deployment Target
Logical Architecture of a SageMaker Endpoint
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, rounded corners,
                       minimum width=3cm, minimum height=1cm, align=center}]
  \node (User) [fill=blue!10] {Client Application};
  \node (LB) [below of=User, fill=green!10] {Load Balancer \\ (Managed by SageMaker)};
  \node (Instance1) [below left of=LB, xshift=-1cm, fill=orange!10] {Instance 1 \\ (Model Container)};
  \node (Instance2) [below right of=LB, xshift=1cm, fill=orange!10] {Instance 2 \\ (Model Container)};
  \node (S3) [below of=LB, yshift=-2cm, fill=gray!10] {S3 Artifacts \\ (model.tar.gz)};
  \draw[->, thick] (User) -- (LB);
  \draw[->, thick] (LB) -- (Instance1);
  \draw[->, thick] (LB) -- (Instance2);
  \draw[dashed, ->] (S3) -- (Instance1);
  \draw[dashed, ->] (S3) -- (Instance2);
\end{tikzpicture}
Definition-Example Pairs
- SageMaker Neo: An optimization engine that compiles models for specific hardware.
- Example: Compiling a TensorFlow model to run efficiently on an NVIDIA Jetson edge device in a factory camera.
- Multi-Model Endpoint (MME): Hosting multiple models on a single set of resources to save costs.
- Example: A SaaS company hosting 500 small, customized XGBoost models (one per client) on a single ml.r5.large instance.
- Spot Instances: Using spare AWS capacity at a discount (up to 90%).
- Example: Using Spot instances for a non-urgent nightly Batch Transform job to process 10 million customer reviews.
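The Multi-Model Endpoint pattern above hinges on one request parameter: `TargetModel`, which tells the endpoint which artifact under its S3 prefix to load. A minimal sketch of the invocation kwargs (endpoint name and model key are placeholders) that you would pass to the `sagemaker-runtime` client's `invoke_endpoint(**params)`:

```python
# Sketch: selecting one of many models on a Multi-Model Endpoint.
# "xgb-mme" and the S3 key are placeholder names.

def mme_invoke_params(endpoint_name, model_key, body):
    """Build kwargs for sagemaker_runtime.invoke_endpoint(**params)."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": model_key,      # relative S3 key, e.g. "client-042/model.tar.gz"
        "ContentType": "text/csv",
        "Body": body,
    }

params = mme_invoke_params("xgb-mme", "client-042/model.tar.gz", "5.1,3.5,1.4,0.2")
```

Models not recently used are evicted from memory, so the first request for a cold model is slower; this is the tradeoff for packing hundreds of models onto one instance.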
Worked Examples
Scenario 1: Real-time Credit Card Fraud Detection
- Requirement: Inference must happen in under 100ms during the swipe transaction.
- Target: SageMaker Real-Time Endpoint.
- Instance Choice: Compute-optimized (e.g., ml.c5.large) or GPU (ml.g4dn) if using complex Neural Networks.
- Scaling: Target Tracking Policy on `ModelLatency` to ensure speed stays consistent under load.
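Such a policy is attached through Application Auto Scaling. The sketch below builds the two payloads you would pass to `boto3.client("application-autoscaling")`, first `register_scalable_target(**register)` and then `put_scaling_policy(**policy)`. Endpoint and policy names are placeholders, and since `ModelLatency` is a custom CloudWatch metric (reported in microseconds), it goes in a `CustomizedMetricSpecification`.

```python
# Sketch: target-tracking auto-scaling on the ModelLatency metric.
# "fraud-endpoint" and "keep-latency-low" are placeholder names.

def latency_scaling_policy(endpoint_name, variant="AllTraffic",
                           target_latency_us=100_000):
    """Build register_scalable_target and put_scaling_policy payloads."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant}"
    register = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": 2,    # keep >= 2 instances for availability
        "MaxCapacity": 10,
    }
    policy = {
        "PolicyName": "keep-latency-low",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_latency_us),  # ModelLatency is in microseconds
            "CustomizedMetricSpecification": {
                "MetricName": "ModelLatency",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [
                    {"Name": "EndpointName", "Value": endpoint_name},
                    {"Name": "VariantName", "Value": variant},
                ],
                "Statistic": "Average",
            },
        },
    }
    return register, policy
```

If the workload were bottlenecked by request volume rather than latency, the simpler choice is the predefined `SageMakerVariantInvocationsPerInstance` metric instead of a customized specification.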
Scenario 2: Processing Video Uploads for Transcription
- Requirement: Payloads are 500MB; processing takes 3 minutes; client doesn't need an immediate response.
- Target: SageMaker Asynchronous Endpoint.
- Benefit: Supports large payloads and can scale down to zero instances when the queue is empty, saving costs.
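The asynchronous behavior in Scenario 2 comes from one extra block in the endpoint configuration: `AsyncInferenceConfig`. A hedged sketch of that block (bucket path is a placeholder), which you would include in the `create_endpoint_config` call alongside the production variants:

```python
# Sketch: the AsyncInferenceConfig block for create_endpoint_config.
# The S3 output path is a placeholder.

def async_inference_config(output_s3_uri, max_concurrent_per_instance=4):
    """Build the AsyncInferenceConfig value for an async endpoint."""
    return {
        "OutputConfig": {
            # Results are written here; the invoke call returns immediately
            # with only a reference to this output location.
            "S3OutputPath": output_s3_uri,
        },
        "ClientConfig": {
            "MaxConcurrentInvocationsPerInstance": max_concurrent_per_instance,
        },
    }

cfg = async_inference_config("s3://my-bucket/async-out/")
```

Scale-to-zero is then achieved by attaching a scaling policy with `MinCapacity: 0` driven by the endpoint's queue-backlog metric, so instances are released once the queue drains.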
Checkpoint Questions
- Which endpoint type should you choose for a model that receives traffic only twice a day for 5 minutes?
- What metric is most appropriate for auto-scaling an endpoint that is bottlenecked by mathematical computations?
- Why would an engineer choose Amazon EKS over SageMaker for model deployment?
- What is the main difference between an "on-demand" resource and a "provisioned" resource in the context of scaling?
Muddy Points & Cross-Refs
- Cold Starts vs. Latency: Students often confuse Serverless cold starts with network latency. Serverless is for cost-saving on intermittent traffic; Real-time is for consistent low latency.
- ECS vs. EKS: If your company already uses Kubernetes for everything, use EKS. If you want simpler container management, use ECS.
- Cross-Ref: See Domain 4: Monitoring for how to use CloudWatch to track the scaling metrics mentioned here.
Comparison Tables
Compute Selection: CPU vs. GPU vs. Inferentia
| Hardware Type | Use Case | Cost | Performance |
|---|---|---|---|
| CPU (C5/M5) | Tabular data (XGBoost, Sklearn) | Low | Moderate |
| GPU (G4dn/G5) | Deep Learning, Image/Video | High | High (Parallel) |
| AWS Inferentia | High-throughput Deep Learning | Medium | Best Price/Perf |
> [!TIP]
> Always check whether your model is supported by SageMaker Neo. It can often make a CPU-based deployment perform as fast as a GPU-based one for a fraction of the cost.