Serving ML Models: Real-time, Asynchronous, and Batch Strategies
Methods to serve ML models in real time and in batches
This guide explores the methods and infrastructures used to serve Machine Learning (ML) models in production environments, focusing on the trade-offs between speed, cost, and data volume.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between real-time, asynchronous, and batch inference methods.
- Identify the appropriate AWS SageMaker AI deployment target based on latency and payload requirements.
- Compare the infrastructure needs for training versus inference.
- Understand the workflow of a SageMaker Batch Transform job.
Key Terms & Glossary
- Inference: The process of using a trained model to make predictions on new, unseen data.
- Endpoint: A hosted HTTPS URL where a deployed model runs and can receive inference requests.
- Payload: The actual data sent to an inference endpoint (e.g., an image file or a JSON object).
- Latency: The time delay between sending an inference request and receiving a response.
- Throughput: The number of inference requests a system can handle in a given time period.
- Batch Transform: A SageMaker feature for generating predictions on large datasets stored in S3 without managing a persistent endpoint.
The "Big Idea"
In Machine Learning, training is about looking at the past to learn patterns, while serving (inference) is about applying those patterns to the present or future. The "Big Idea" is that there is no one-size-fits-all serving method. You must balance the timeliness of the prediction against the cost and complexity of the infrastructure. Real-time serving is expensive but fast; batch serving is cost-effective but slow.
Formula / Concept Box
| Inference Type | Latency Requirement | Payload Size | Persistent Endpoint? |
|---|---|---|---|
| Real-time | Milliseconds | Small (< 6MB) | Yes |
| Serverless | Seconds (Cold Start) | Small (< 4MB) | No (Managed) |
| Asynchronous | Minutes | Large (up to 1GB) | Yes (Queue-based) |
| Batch | Hours / Scheduled | Very Large (S3) | No |
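The payload limits in the table above can be encoded as a small lookup helper. A minimal sketch in Python; the numeric limits mirror the table and are illustrative, since the exact quotas are service-defined and may change:

```python
# Rough payload limits per SageMaker inference option, mirroring
# the comparison table above (values are illustrative, not quotas).
LIMITS = {
    "real-time":    {"max_payload_mb": 6,    "persistent_endpoint": True},
    "serverless":   {"max_payload_mb": 4,    "persistent_endpoint": False},
    "asynchronous": {"max_payload_mb": 1024, "persistent_endpoint": True},
    "batch":        {"max_payload_mb": None, "persistent_endpoint": False},  # S3-scale
}

def fits_payload(option: str, payload_mb: float) -> bool:
    """Return True if a payload of the given size fits the option's limit."""
    limit = LIMITS[option]["max_payload_mb"]
    return limit is None or payload_mb <= limit
```

For example, `fits_payload("real-time", 8)` is `False`, which is exactly the check that pushes the Worked Example later in this guide toward Asynchronous Inference.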
Hierarchical Outline
- I. Real-time Inference
- Low Latency: Designed for sub-second responses.
- Persistent Infrastructure: Uses SageMaker AI endpoints that run 24/7.
- Use Cases: Fraud detection, recommendation engines, web apps.
- II. Asynchronous Inference
- Queueing System: Requests are placed in a queue and processed as resources allow.
- Large Payloads: Supports up to 1GB, ideal for heavy computer vision models.
- Auto-scaling: Can scale down to zero when no requests are in the queue.
- III. Batch Inference (Batch Transform)
- Offline Processing: No active endpoint is required; resources are provisioned only for the duration of the job.
- Data Source & Sink: Inputs and outputs are handled directly via Amazon S3.
- Efficiency: Ideal for processing millions of records simultaneously.
- IV. Infrastructure Considerations
- Compute: Choosing between CPU (general purpose) and GPU (parallel processing for deep learning).
- Storage: Data ingestion strategies (Streaming for real-time vs. Batch for historical).
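The Batch Transform workflow in section III can be sketched as a job request. A hedged sketch, assuming hypothetical bucket paths, model name, and instance type; building the request as a plain dict keeps it inspectable before it would be passed to boto3's `create_transform_job`:

```python
def build_transform_job_request(job_name: str, model_name: str,
                                input_s3: str, output_s3: str) -> dict:
    """Assemble a Batch Transform request: input and output live in S3,
    and compute is provisioned only for the duration of the job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                            "S3Uri": input_s3}},
            "ContentType": "text/csv",
        },
        "TransformOutput": {"S3OutputPath": output_s3},
        "TransformResources": {"InstanceType": "ml.m5.xlarge",
                               "InstanceCount": 1},
    }

request = build_transform_job_request(
    "weekly-scoring", "my-model",                      # hypothetical names
    "s3://my-bucket/input/", "s3://my-bucket/output/")
# In a real account, this would be submitted with:
# boto3.client("sagemaker").create_transform_job(**request)
```

Note there is no endpoint anywhere in the request: the job reads from S3, writes to S3, and the instances are released when it finishes.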
Visual Anchors
Inference Selection Flowchart
Batch Transform Architecture
Definition-Example Pairs
- Real-time Inference: Immediate prediction for a single request.
- Example: A banking app checking if a single credit card swipe is fraudulent at the point of sale.
- Batch Inference: Predictions generated for a large collection of data points at once.
- Example: A retail company generating personalized email coupons for its 10 million subscribers every Sunday night.
- Asynchronous Inference: A request that is acknowledged immediately but the result is delivered later via a notification.
- Example: Uploading a 5-minute video to an AI service to generate a transcript; the transcript is ready 2 minutes later.
Worked Examples
Problem: Selecting a Strategy for a Mobile App
Scenario: A company is launching a "Plant Identifier" app. Users take a photo (approx. 8MB) and want the result within 5-10 seconds. The app expects inconsistent traffic (spikes at noon, quiet at night).
Step-by-Step Analysis:
- Check Latency: 5-10 seconds is not "millisecond" real-time, but users are waiting, so it can't be a 1-hour batch job.
- Check Payload: 8MB exceeds the 6MB limit for standard SageMaker Real-time endpoints.
- Check Traffic: Spiky traffic suggests a need for auto-scaling to zero to save costs.
- Solution: Asynchronous Inference. It supports the 8MB payload, allows the user to wait a few seconds while the request is queued, and can scale instances to zero when no plants are being identified.
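The three checks above can be sketched as a small selection routine. The thresholds are taken from the comparison table earlier in this guide and are illustrative, not authoritative service quotas:

```python
def select_strategy(latency_need_s: float, payload_mb: float,
                    traffic_is_spiky: bool) -> str:
    """Apply the latency -> payload -> traffic checks from the worked example."""
    if latency_need_s >= 3600:                 # hours away: offline is fine
        return "Batch Transform"
    if payload_mb > 6:                         # over the real-time limit
        return "Asynchronous Inference"        # queued, payloads up to ~1GB
    if traffic_is_spiky and payload_mb <= 4:   # small payload, scale-to-zero
        return "Serverless Inference"
    return "Real-time Inference"
```

Applied to the Plant Identifier scenario, `select_strategy(10, 8, True)` returns `"Asynchronous Inference"`: the 8MB payload fails the real-time check before traffic shape is even considered.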
Checkpoint Questions
- Which inference method does NOT require a persistent endpoint?
- If your payload is 500MB, why would you choose Asynchronous Inference over Real-time Inference?
- True or False: Batch data ingestion is the best choice for training ML models.
- What AWS service acts as the "middleman" for results in Asynchronous Inference?
[!TIP] Answer Key:
1. Batch Transform.
2. Real-time endpoints have a 6MB payload limit; Asynchronous Inference accepts payloads up to 1GB.
3. True (historical datasets are processed in batches).
4. Amazon S3 (for output) and Amazon SNS (for notifications).
Muddy Points & Cross-Refs
- Asynchronous vs. Serverless: Both can scale to zero. The difference is the payload size and the "wait" mechanism. Use Serverless for tiny payloads with fast response times; use Asynchronous for heavy payloads where the client can check back later.
- Cold Starts: In Serverless inference, if the model hasn't been used in a while, there is a delay (latency) while the container spins up. This is a common exam topic.
- Cross-Reference: See "Content Domain 2: Data Engineering" for how Kinesis helps move streaming data into these inference endpoints.
Comparison Tables
Inference vs. Training Infrastructure
| Feature | Training Compute | Inference Compute |
|---|---|---|
| Parallelism | High (Clusters) | Usually lower (Single endpoint) |
| Memory | High (to hold datasets) | Lower (to hold one model) |
| Duration | Job-based (runs only while training) | 24/7 (Real-time) or Job-based (Batch) |
| Location | Cloud-only | Cloud & Edge devices |
| Usage | Infrequent (Retraining) | Ubiquitous (Every time app is used) |