AWS ML Deployment Targets: Managed vs. Unmanaged Solutions
Selecting the correct deployment target (for example, SageMaker AI endpoints, Kubernetes, Amazon Elastic Container Service [Amazon ECS], Amazon Elastic Kubernetes Service [Amazon EKS], AWS Lambda)
This guide explores the critical decision-making process for selecting the appropriate infrastructure for hosting Machine Learning models on AWS, covering everything from fully managed SageMaker endpoints to serverless and containerized alternatives.
Learning Objectives
After studying this guide, you should be able to:
- Distinguish between managed (SageMaker) and unmanaged (EC2, ECS, EKS) deployment targets.
- Evaluate performance, cost, and latency tradeoffs for different compute environments.
- Select the correct deployment target based on specific use cases (e.g., real-time vs. batch, high customization vs. low overhead).
- Identify appropriate scaling metrics and policies for each target.
Key Terms & Glossary
- Inference: The process of using a trained model to make predictions on new, unseen data.
- Managed Service: A service where AWS handles the underlying infrastructure (patching, scaling, maintenance), allowing the user to focus on the application logic.
- Cold Start: The latency experienced in serverless environments (like AWS Lambda) when a function is invoked after being idle, requiring a new container initialization.
- Orchestration: The automated arrangement and coordination of complex computer systems and services (e.g., Kubernetes, ECS).
- BYOC (Bring Your Own Container): A pattern in SageMaker where you provide a custom Docker image for inference rather than using built-in algorithms.
The "Big Idea"
[!IMPORTANT] The central challenge in ML deployment is the Trade-off between Control and Convenience. SageMaker provides "Convenience" (automated scaling, built-in monitoring), while targets like EKS provide "Control" (custom networking, specialized software dependencies). Your choice is defined by how much operational overhead your team can sustain versus how much customization your model requires.
Formula / Concept Box
Core Selection Metrics
| Metric | Description | Application |
|---|---|---|
| Model Latency | Time taken for a single inference call. | Critical for real-time apps. |
| InvocationsPerInstance | Average number of requests each instance receives per minute. | The predefined target-tracking metric for SageMaker auto-scaling. |
| Cold Start Time | Time to initialize compute resources. | Primary concern for Lambda. |
| Overhead Ratio | (Time spent managing infra) / (Time spent modeling). | High in EKS, Low in SageMaker. |
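To make the InvocationsPerInstance metric concrete, here is a minimal sketch of building a target-tracking scaling policy for a SageMaker endpoint variant. The endpoint and variant names are hypothetical placeholders; in practice the returned kwargs would be passed to the Application Auto Scaling API via boto3.

```python
def build_invocations_scaling_policy(endpoint_name, variant_name, target_invocations):
    """Build kwargs for application-autoscaling's put_scaling_policy.

    Targets the SageMakerVariantInvocationsPerInstance predefined metric,
    so SageMaker adds instances when per-instance request volume rises
    above the target and removes them when it falls.
    """
    return {
        "PolicyName": f"{endpoint_name}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # scale in slowly to avoid thrashing
            "ScaleOutCooldown": 60,   # scale out quickly under load
        },
    }

# In a real deployment you would first register the scalable target, then:
# boto3.client("application-autoscaling").put_scaling_policy(**policy)
policy = build_invocations_scaling_policy("fraud-endpoint", "AllTraffic", 70)
```

The asymmetric cooldowns are a common choice: scaling out fast protects latency, while scaling in slowly avoids dropping capacity during brief lulls.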
Hierarchical Outline
- Managed Deployments (Amazon SageMaker AI)
- Real-time Endpoints: Persistent, low-latency, dedicated instances.
- Serverless Inference: Scalable, pay-per-use, best for intermittent traffic.
- Asynchronous Inference: For large payloads (up to 1 GB) and long processing times.
- Batch Transform: High-throughput processing for non-real-time datasets.
- Unmanaged/Containerized Deployments
- Amazon ECS: Simplified container orchestration; deep AWS integration.
- Amazon EKS: Kubernetes-native; high portability and advanced customization.
- AWS Fargate: Serverless compute for containers (works with ECS/EKS).
- Serverless & Edge
- AWS Lambda: Event-driven, best for lightweight models and image/text pre-processing.
- SageMaker Neo: Optimization service for deploying models to edge devices.
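The real-time option in the outline above maps to a small set of boto3 calls. This sketch builds the kwargs for `create_endpoint_config` with a dedicated-instance variant; the model name and instance type are illustrative placeholders, not values prescribed by this guide.

```python
def real_time_endpoint_config(model_name, instance_type="ml.m5.large", count=1):
    """Kwargs for sagemaker create_endpoint_config with a persistent variant.

    Pass the result to boto3.client("sagemaker").create_endpoint_config(**cfg),
    then create_endpoint to stand up the always-on, low-latency endpoint.
    """
    return {
        "EndpointConfigName": f"{model_name}-config",
        "ProductionVariants": [{
            "ModelName": model_name,          # an already-registered SageMaker model
            "VariantName": "AllTraffic",
            "InstanceType": instance_type,    # dedicated, always-on capacity
            "InitialInstanceCount": count,    # billed per instance-hour, even when idle
        }],
    }

cfg = real_time_endpoint_config("fraud-model")
```

The per-instance-hour billing noted in the comments is exactly why this mode suits steady, latency-critical traffic rather than bursty workloads.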
Visual Anchors
Deployment Target Decision Tree
The Control vs. Managed Spectrum
\begin{tikzpicture}[node distance=2cm, auto]
  \draw[thick, <->] (0,0) -- (10,0);
  \node at (0, -0.5) {\textbf{Max Control}};
  \node at (10, -0.5) {\textbf{Max Managed}};
  \draw[fill=blue!20] (0.5, 0.2) rectangle (2.5, 0.8) node[midway] {EC2};
  \draw[fill=blue!40] (3, 0.2) rectangle (5, 0.8) node[midway] {EKS/ECS};
  \draw[fill=blue!60] (5.5, 0.2) rectangle (7.5, 0.8) node[midway] {SageMaker};
  \draw[fill=blue!80] (8, 0.2) rectangle (9.5, 0.8) node[midway, text=white] {Lambda};
  \node at (5, -1.2) {Operational Complexity $\rightarrow$};
\end{tikzpicture}
Definition-Example Pairs
- Target: SageMaker Real-Time
- Definition: Fully managed, persistent endpoint for low-latency inference.
- Example: A credit card fraud detection system that must respond in under 100ms for every transaction.
- Target: AWS Lambda
- Definition: A serverless compute service that runs code in response to events.
- Example: A mobile app that allows users to upload a photo, which triggers a Lambda to run a lightweight Scikit-learn model to categorize the image.
- Target: Amazon EKS
- Definition: A managed service that makes it easy to run Kubernetes on AWS without installing your own control plane.
- Example: A research team that already uses Kubeflow for their pipelines and needs consistent environments across on-premises and AWS.
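The Lambda example above follows a standard pattern: load the model once per container at cold start, cache it for warm invocations, and keep the inference step pure so it can be unit-tested with a stub. This is a hedged sketch; the model path and loading mechanism (a Lambda layer, an S3 download) are hypothetical.

```python
import json

_MODEL = None  # cached across warm invocations of the same container


def _load_model():
    # Hypothetical: e.g. joblib.load("/opt/model.joblib") from a Lambda
    # layer, or an S3 download to /tmp performed at cold start.
    raise NotImplementedError


def get_model():
    """Return the cached model, loading it only on a cold start."""
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_model()
    return _MODEL


def categorize(features, model):
    """Pure inference step: trivially testable with a stub model."""
    label = model.predict([features])[0]
    return {"label": label}


def handler(event, context):
    """Lambda entry point: parse the event, delegate to categorize()."""
    features = json.loads(event["body"])["features"]
    result = categorize(features, get_model())
    return {"statusCode": 200, "body": json.dumps(result)}
```

Keeping `categorize` separate from `handler` means the model-loading side effects never leak into the logic you actually want to test.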
Worked Examples
Scenario: The Bursty Startup
Problem: A startup has a sentiment analysis model. Traffic is near zero at night but spikes to 5,000 requests/minute during marketing campaigns. They want to minimize costs while maintaining a simple architecture.
Step 1: Evaluate Requirements.
- High variability (bursty).
- Cost-sensitive (minimize idle time).
- Operational simplicity needed.
Step 2: Compare Targets.
- SageMaker Real-time: Too expensive (paying for idle instances).
- EC2: High management overhead for a small team.
- SageMaker Serverless: Winner. Automatically scales with traffic and charges $0 when not in use.
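For the bursty-startup scenario, the winning option boils down to one change in the endpoint config: a `ServerlessConfig` block instead of instance settings. The memory and concurrency values below are illustrative, not tuned recommendations.

```python
def serverless_endpoint_config(model_name, memory_mb=3072, max_concurrency=100):
    """Kwargs for sagemaker create_endpoint_config with a serverless variant.

    MaxConcurrency caps simultaneous in-flight requests, so it should be
    sized for the campaign peak; when traffic drops to zero at night, so
    does the bill.
    """
    return {
        "EndpointConfigName": f"{model_name}-serverless-config",
        "ProductionVariants": [{
            "ModelName": model_name,
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": memory_mb,        # 1024-6144, in 1 GB steps
                "MaxConcurrency": max_concurrency,  # per-variant concurrency cap
            },
        }],
    }

cfg = serverless_endpoint_config("sentiment-model")
```

Note there is no `InstanceType` or `InitialInstanceCount`: capacity is provisioned per request, which is exactly what eliminates the idle cost that ruled out the real-time option.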
Checkpoint Questions
- When would you choose Amazon ECS over Amazon SageMaker for a model deployment?
- What is the primary disadvantage of using AWS Lambda for a deep learning model with a large memory footprint (e.g., 8GB)?
- Which scaling metric would you use to trigger auto-scaling for a SageMaker Real-Time endpoint?
- How does SageMaker Neo improve deployments on edge devices?
Answers
- When you need deep integration with other containerized microservices or want more control over the container runtime without the full complexity of Kubernetes.
- Lambda has limited runtime resources (memory/CPU) and may suffer from significant cold starts with large model artifacts.
- InvocationsPerInstance, CPU utilization, or GPU utilization.
- It optimizes the model specifically for the target hardware's processor family, reducing latency and memory footprint.
Muddy Points & Cross-Refs
- Unmanaged vs. Managed: Students often confuse ECS/EKS as "managed." While AWS manages the orchestration plane, they are "unmanaged" in the context of the ML Lifecycle because they don't natively provide model versioning or data capture like SageMaker.
- Fargate vs. Lambda: Use Fargate for long-running containerized tasks (>15 mins); use Lambda for short, event-driven functions (<15 mins).
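The Fargate-vs-Lambda rule of thumb above can be expressed as a tiny decision helper. The two hard limits it encodes are real Lambda quotas (15-minute maximum execution, 10,240 MB maximum memory); the function itself is just an illustrative sketch, not an official heuristic.

```python
def pick_serverless_compute(est_duration_min, memory_mb):
    """Rule of thumb: Lambda for short, modest-memory event handlers;
    Fargate for anything exceeding Lambda's hard limits
    (15 min execution, 10,240 MB memory)."""
    if est_duration_min < 15 and memory_mb <= 10240:
        return "AWS Lambda"
    return "AWS Fargate"
```

A real decision would also weigh cold-start tolerance and invocation volume, but the duration and memory limits are the non-negotiable cutoffs.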
Comparison Tables
Deployment Target Comparison Matrix
| Feature | SageMaker | ECS / EKS | AWS Lambda | Amazon EC2 |
|---|---|---|---|---|
| Management | Fully Managed | Managed Orchestration | Serverless | Self-Managed |
| Scaling | Automatic | Custom / K8s HPA | Automatic (event-driven) | Manual / Auto Scaling group |
| Billing | Per Instance-Hour | Per Resource-Hour | Per Execution / Duration | Per Instance-Hour |
| Customization | Moderate (BYOC) | High | Low | Maximum |
| Best For | Standard ML Workflows | Microservices / K8s | Light Inference | Highly Specialized HW |