Model and Endpoint Deployment Requirements
Model and endpoint requirements across SageMaker deployment options (real-time endpoints, serverless endpoints, asynchronous endpoints, and batch inference)
This guide explores the architectural requirements and selection criteria for deploying machine learning models with Amazon SageMaker. Choosing the correct endpoint type is a core skill tested on the AWS Certified Machine Learning Engineer - Associate exam, which focuses on the trade-offs between latency, cost, and payload size.
Learning Objectives
- Compare and contrast the four primary SageMaker inference options: Real-time, Serverless, Asynchronous, and Batch.
- Identify specific use cases based on traffic patterns (predictable vs. intermittent).
- Distinguish between Multi-Model Endpoints (MMEs) and Multi-Container Endpoints (MCEs).
- Analyze the architectural components of Asynchronous inference, including S3 and SNS integration.
Key Terms & Glossary
- Cold Start: The delay incurred when a Serverless endpoint initializes a new compute instance after a period of inactivity.
- Inference Pipeline: A linear sequence of 2 to 15 containers that process requests (e.g., pre-processing, model prediction, post-processing).
- MME (Multi-Model Endpoint): A cost-optimized endpoint that hosts multiple models sharing a single container and set of resources.
- MCE (Multi-Container Endpoint): An endpoint hosting multiple containers, each potentially with different frameworks or dependencies.
- Provisioned Concurrency: A setting for Serverless endpoints that keeps instances warm to eliminate cold starts.
The "Big Idea"
Deploying a model is not a "one size fits all" task. It is the process of matching Traffic Characteristics (Is it constant or spiky?) and Payload Requirements (Is it a small JSON string or a 1GB video?) to the most cost-effective Compute Pattern. Selecting the wrong endpoint can lead to either excessive costs (idling real-time instances) or poor user experience (high latency for real-time needs).
Formula / Concept Box
| Requirement | Recommended Endpoint |
|---|---|
| Latency < 100ms | Real-time Inference |
| Intermittent / Spiky Traffic | Serverless Inference |
| Large Payloads (> 6MB) / Long Processing Time | Asynchronous Inference |
| Offline / Non-real-time High Volume | Batch Transform |
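The selection table above can be sketched as a small decision helper. The function name and thresholds below are illustrative, drawn from the table rather than from any official API:

```python
from typing import Optional

def recommend_endpoint(payload_mb: float, latency_ms: Optional[float],
                       traffic: str, online: bool) -> str:
    """Illustrative endpoint chooser mirroring the table above.

    traffic: "steady" or "spiky"; online: whether the result is needed
    in the request/response path.
    """
    if not online:
        return "Batch Transform"            # offline, high-volume scoring
    if payload_mb > 6:
        return "Asynchronous Inference"     # real-time caps payloads at 6 MB
    if latency_ms is not None and latency_ms < 100:
        return "Real-time Inference"        # persistent instances, lowest latency
    if traffic == "spiky":
        return "Serverless Inference"       # scales to zero between bursts
    return "Real-time Inference"
```

The check order matters: payload size is a hard constraint, so it is evaluated before the softer latency and traffic heuristics.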
Hierarchical Outline
- Online Inference (Synchronous)
- Real-time Endpoints: Persistent instances, lowest latency, best for interactive apps. Needs Auto-scaling policies (CPU/Invocations per instance).
- Serverless Inference: Scaled automatically by AWS. No instance management. Features Cold Starts but offers the best cost-efficiency for idle models.
- Specialized Inference
- Asynchronous Inference: Queues requests in S3. Supports payloads up to 1GB and processing times up to 1 hour. Scales to zero when idle.
- Offline Processing
- Batch Transform: No persistent endpoint. Processes data from S3, writes results to S3, and shuts down compute. Ideal for weekly/monthly reporting.
- Advanced Hosting
- Multi-Model (MME): Shared memory/CPU. Good for thousands of similar models.
- Multi-Container (MCE): Independent containers. Good for A/B testing or complex pre-processing steps.
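With an MME, a single invocation selects which model artifact to serve via the `TargetModel` parameter of SageMaker Runtime's `invoke_endpoint` call. The sketch below shows only the argument shape; the endpoint name, model key, and payload are hypothetical:

```python
def build_mme_request(endpoint_name: str, target_model: str, body: bytes) -> dict:
    """Argument shape for invoke_endpoint against a Multi-Model Endpoint.

    TargetModel names the model artifact to serve; SageMaker loads it
    from S3 into the shared container on first use and caches it.
    """
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,     # e.g. one of thousands of .tar.gz artifacts
        "ContentType": "application/json",
        "Body": body,
    }

# Hypothetical per-city pricing model hosted alongside its siblings:
request = build_mme_request("house-price-mme", "model-seattle.tar.gz",
                            b'{"sqft": 1850}')
# With boto3: sagemaker_runtime.invoke_endpoint(**request)
```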
Visual Anchors
- Endpoint Selection Logic (diagram)
- Asynchronous Inference Architecture (diagram)
Definition-Example Pairs
- Real-time Inference: Persistent, dedicated compute for sub-second responses.
- Example: A credit card company running a fraud detection model during a live swipe transaction.
- Serverless Inference: On-demand compute that scales based on traffic volume.
- Example: A small business website that offers a "style recommendation" tool used only a few times a day.
- Batch Transform: High-throughput processing of entire datasets.
- Example: An e-commerce platform generating personalized email recommendations for its 5 million users every Sunday night.
Worked Examples
Scenario: The Image Processing Startup
Problem: A company allows users to upload high-resolution medical images (50MB each) for AI analysis. The analysis takes 45 seconds per image. Users do not need the result instantly but should be notified when done.
Solution Breakdown:
- Eliminate Real-time: Payload (50MB) exceeds the 6MB limit for standard real-time invocations.
- Eliminate Serverless: The 50MB payload far exceeds the Serverless limit (4MB), and the 45-second processing time leaves little headroom under the 60-second invocation timeout, especially once cold starts are factored in.
- Select Asynchronous: It supports up to 1GB payloads and 1 hour processing. We can configure an SNS Topic to alert the user's mobile app when the S3 output bucket receives the result.
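Asynchronous invocations pass an S3 URI rather than the raw payload. This sketch shows the shape of the arguments to SageMaker Runtime's `invoke_endpoint_async`; the endpoint name and bucket are hypothetical:

```python
def build_async_request(endpoint_name: str, input_s3_uri: str) -> dict:
    """Argument shape for invoke_endpoint_async.

    The payload stays in S3 (up to 1 GB); the response returns an
    OutputLocation, and the endpoint's configured SNS topic fires
    when the result lands there.
    """
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,   # S3 URI, not the bytes themselves
        "ContentType": "application/octet-stream",
    }

request = build_async_request("medical-image-async",
                              "s3://uploads-bucket/scan-001.dcm")
# With boto3: sagemaker_runtime.invoke_endpoint_async(**request)
```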
Checkpoint Questions
- What is the maximum payload size for a standard SageMaker Real-time endpoint invocation?
- Which endpoint type should you choose if you want the compute to "Scale to Zero" to save costs during periods of no traffic?
- When would you use a Multi-Model Endpoint (MME) instead of a Multi-Container Endpoint (MCE)?
- True or False: Batch Transform requires a persistent endpoint URL to be active at all times.
Answers
- 6 MB.
- Serverless Inference (or Asynchronous Inference).
- Use MME when you have many models (thousands) that share the same framework/container. Use MCE when models have different dependencies or require isolation.
- False. Batch Transform provisions resources only for the duration of the job.
Muddy Points & Cross-Refs
- Cold Starts vs. Latency: In Serverless, the first request after an idle period is slow. If your app is user-facing and requires consistent <200ms, use Real-time with a small instance instead.
- MME vs. MCE: MME is about cost (bin-packing models). MCE is about workflow (linking different containers or A/B testing different environments).
- Cross-Ref: For more on how to trigger these, see the SageMaker Python SDK (ServerlessInferenceConfig) documentation.
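The two knobs on ServerlessInferenceConfig are memory size and max concurrency. The plain-Python mirror below is a sketch, not the SDK class; the limits shown (1GB memory steps up to 6144MB, concurrency up to 200) reflect commonly documented quotas and should be verified against the current docs:

```python
from dataclasses import dataclass

VALID_MEMORY_MB = {1024, 2048, 3072, 4096, 5120, 6144}  # 1 GB increments

@dataclass
class ServerlessConfigSketch:
    """Illustrative stand-in for the SDK's ServerlessInferenceConfig."""
    memory_size_in_mb: int = 2048
    max_concurrency: int = 5

    def validate(self) -> None:
        if self.memory_size_in_mb not in VALID_MEMORY_MB:
            raise ValueError("memory must be 1024-6144 MB in 1 GB steps")
        if not 1 <= self.max_concurrency <= 200:
            raise ValueError("max_concurrency must be between 1 and 200")
```

Larger memory sizes also buy proportionally more vCPU, so memory is the main lever for reducing cold-start and inference time on a Serverless endpoint.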
Comparison Tables
| Feature | Real-time | Serverless | Asynchronous | Batch Transform |
|---|---|---|---|---|
| Billing | Per instance-hour | Per ms of compute duration | Per instance-hour (while processing) | Per instance-hour (job duration) |
| Max Payload | 6 MB | 4 MB | 1 GB | Unlimited (S3) |
| Max Duration | 60 seconds | 60 seconds | 1 hour | N/A (Job-based) |
| Scale to Zero | No | Yes (Automatic) | Yes (Configurable) | Yes (Automatic) |
| Best For | Interactive Apps | Spiky/Low Traffic | Large Data/Long Runs | Bulk Processing |