Model and Endpoint Deployment Requirements
Model and endpoint requirements across SageMaker deployment options (real-time endpoints, serverless endpoints, asynchronous endpoints, and batch inference)
This guide explores the architectural requirements and selection criteria for deploying machine learning models with Amazon SageMaker. Choosing the correct endpoint type is a core skill tested on the AWS Certified Machine Learning Engineer - Associate exam, which focuses on the trade-offs between latency, cost, and payload size.
Learning Objectives
- Compare and contrast the four primary SageMaker inference options: Real-time, Serverless, Asynchronous, and Batch.
- Identify specific use cases based on traffic patterns (predictable vs. intermittent).
- Distinguish between Multi-Model Endpoints (MMEs) and Multi-Container Endpoints (MCEs).
- Analyze the architectural components of Asynchronous inference, including S3 and SNS integration.
Key Terms & Glossary
- Cold Start: The delay incurred when a Serverless endpoint initializes a new compute instance after a period of inactivity.
- Inference Pipeline: A linear sequence of 2 to 15 containers that process requests (e.g., pre-processing, model prediction, post-processing).
- MME (Multi-Model Endpoint): A cost-optimized endpoint that hosts multiple models sharing a single container and set of resources.
- MCE (Multi-Container Endpoint): An endpoint hosting multiple containers, each potentially with different frameworks or dependencies.
- Provisioned Concurrency: A setting for Serverless endpoints that keeps instances warm to eliminate cold starts.
The "Big Idea"
Deploying a model is not a "one size fits all" task. It is the process of matching Traffic Characteristics (Is it constant or spiky?) and Payload Requirements (Is it a small JSON string or a 1GB video?) to the most cost-effective Compute Pattern. Selecting the wrong endpoint can lead to either excessive costs (idling real-time instances) or poor user experience (high latency for real-time needs).
Formula / Concept Box
| Requirement | Recommended Endpoint |
|---|---|
| Latency < 100ms | Real-time Inference |
| Intermittent / Spiky Traffic | Serverless Inference |
| Large Payloads (> 6MB) / Long Processing Time | Asynchronous Inference |
| Offline / Non-real-time High Volume | Batch Transform |
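The selection table above can be sketched as a small decision helper. The function name and thresholds below are illustrative, drawn from the table rather than from any official API:

```python
from typing import Optional

def recommend_endpoint(payload_mb: float, latency_ms: Optional[float],
                       traffic: str, online: bool) -> str:
    """Illustrative endpoint chooser mirroring the table above.

    traffic: "steady" or "spiky"; online: whether the result is needed
    in the request/response path.
    """
    if not online:
        return "Batch Transform"            # offline, high-volume scoring
    if payload_mb > 6:
        return "Asynchronous Inference"     # real-time caps payloads at 6 MB
    if latency_ms is not None and latency_ms < 100:
        return "Real-time Inference"        # persistent instances, lowest latency
    if traffic == "spiky":
        return "Serverless Inference"       # scales to zero between bursts
    return "Real-time Inference"
```

The check order matters: payload size is a hard constraint, so it is evaluated before the softer latency and traffic heuristics.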
Hierarchical Outline
- Online Inference (Synchronous)
- Real-time Endpoints: Persistent instances, lowest latency, best for interactive apps. Needs Auto-scaling policies (CPU/Invocations per instance).
- Serverless Inference: Scaled automatically by AWS. No instance management. Features Cold Starts but offers the best cost-efficiency for idle models.
- Specialized Inference
- Asynchronous Inference: Queues requests in S3. Supports payloads up to 1GB and processing times up to 1 hour. Scales to zero when idle.
- Offline Processing
- Batch Transform: No persistent endpoint. Processes data from S3, writes results to S3, and shuts down compute. Ideal for weekly/monthly reporting.
- Advanced Hosting
- Multi-Model (MME): Shared memory/CPU. Good for thousands of similar models.
- Multi-Container (MCE): Independent containers. Good for A/B testing or complex pre-processing steps.
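With an MME, a single invocation selects which model artifact to serve via the `TargetModel` parameter of SageMaker Runtime's `invoke_endpoint` call. The sketch below shows only the argument shape; the endpoint name, model key, and payload are hypothetical:

```python
def build_mme_request(endpoint_name: str, target_model: str, body: bytes) -> dict:
    """Argument shape for invoke_endpoint against a Multi-Model Endpoint.

    TargetModel names the model artifact to serve; SageMaker loads it
    from S3 into the shared container on first use and caches it.
    """
    return {
        "EndpointName": endpoint_name,
        "TargetModel": target_model,     # e.g. one of thousands of .tar.gz artifacts
        "ContentType": "application/json",
        "Body": body,
    }

# Hypothetical per-city pricing model hosted alongside its siblings:
request = build_mme_request("house-price-mme", "model-seattle.tar.gz",
                            b'{"sqft": 1850}')
# With boto3: sagemaker_runtime.invoke_endpoint(**request)
```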
Visual Anchors
- Endpoint Selection Logic (diagram)
- Asynchronous Inference Architecture (diagram)
Definition-Example Pairs
- Real-time Inference: Persistent, dedicated compute for sub-second responses.
- Example: A credit card company running a fraud detection model during a live swipe transaction.
- Serverless Inference: On-demand compute that scales based on traffic volume.
- Example: A small business website that offers a "style recommendation" tool used only a few times a day.
- Batch Transform: High-throughput processing of entire datasets.
- Example: An e-commerce platform generating personalized email recommendations for its 5 million users every Sunday night.
Worked Examples
Scenario: The Image Processing Startup
Problem: A company allows users to upload high-resolution medical images (50MB each) for AI analysis. The analysis takes 45 seconds per image. Users do not need the result instantly but should be notified when done.
Solution Breakdown:
- Eliminate Real-time: Payload (50MB) exceeds the 6MB limit for standard real-time invocations.
- Eliminate Serverless: The 50MB payload far exceeds the Serverless limit (4MB), and the 45-second processing time leaves little headroom under the 60-second invocation timeout, especially once cold starts are factored in.
- Select Asynchronous: It supports up to 1GB payloads and 1 hour processing. We can configure an SNS Topic to alert the user's mobile app when the S3 output bucket receives the result.
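Asynchronous invocations pass an S3 URI rather than the raw payload. This sketch shows the shape of the arguments to SageMaker Runtime's `invoke_endpoint_async`; the endpoint name and bucket are hypothetical:

```python
def build_async_request(endpoint_name: str, input_s3_uri: str) -> dict:
    """Argument shape for invoke_endpoint_async.

    The payload stays in S3 (up to 1 GB); the response returns an
    OutputLocation, and the endpoint's configured SNS topic fires
    when the result lands there.
    """
    return {
        "EndpointName": endpoint_name,
        "InputLocation": input_s3_uri,   # S3 URI, not the bytes themselves
        "ContentType": "application/octet-stream",
    }

request = build_async_request("medical-image-async",
                              "s3://uploads-bucket/scan-001.dcm")
# With boto3: sagemaker_runtime.invoke_endpoint_async(**request)
```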
Checkpoint Questions
- What is the maximum payload size for a standard SageMaker Real-time endpoint invocation?
- Which endpoint type should you choose if you want the compute to "Scale to Zero" to save costs during periods of no traffic?
- When would you use a Multi-Model Endpoint (MME) instead of a Multi-Container Endpoint (MCE)?
- True or False: Batch Transform requires a persistent endpoint URL to be active at all times.
Answers
- 6 MB.
- Serverless Inference (or Asynchronous Inference).
- Use MME when you have many models (thousands) that share the same framework/container. Use MCE when models have different dependencies or require isolation.
- False. Batch Transform provisions resources only for the duration of the job.
Muddy Points & Cross-Refs
- Cold Starts vs. Latency: In Serverless, the first request after an idle period is slow. If your app is user-facing and requires consistent <200ms, use Real-time with a small instance instead.
- MME vs. MCE: MME is about cost (bin-packing models). MCE is about workflow (linking different containers or A/B testing different environments).
- Cross-Ref: For more on how to trigger these, see the SageMaker Python SDK (ServerlessInferenceConfig) documentation.
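The two knobs on ServerlessInferenceConfig are memory size and max concurrency. The plain-Python mirror below is a sketch, not the SDK class; the limits shown (1GB memory steps up to 6144MB, concurrency up to 200) reflect commonly documented quotas and should be verified against the current docs:

```python
from dataclasses import dataclass

VALID_MEMORY_MB = {1024, 2048, 3072, 4096, 5120, 6144}  # 1 GB increments

@dataclass
class ServerlessConfigSketch:
    """Illustrative stand-in for the SDK's ServerlessInferenceConfig."""
    memory_size_in_mb: int = 2048
    max_concurrency: int = 5

    def validate(self) -> None:
        if self.memory_size_in_mb not in VALID_MEMORY_MB:
            raise ValueError("memory must be 1024-6144 MB in 1 GB steps")
        if not 1 <= self.max_concurrency <= 200:
            raise ValueError("max_concurrency must be between 1 and 200")
```

Larger memory sizes also buy proportionally more vCPU, so memory is the main lever for reducing cold-start and inference time on a Serverless endpoint.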
Comparison Tables
| Feature | Real-time | Serverless | Asynchronous | Batch Transform |
|---|---|---|---|---|
| Billing | Per instance-hour | Per ms of compute duration | Per instance-hour (while processing) | Per instance-hour (job duration) |
| Max Payload | 6 MB | 4 MB | 1 GB | Unlimited (S3) |
| Max Duration | 60 seconds | 60 seconds | 1 hour | N/A (Job-based) |
| Scale to Zero | No | Yes (Automatic) | Yes (Configurable) | Yes (Automatic) |
| Best For | Interactive Apps | Spiky/Low Traffic | Large Data/Long Runs | Bulk Processing |