Scalable and Cost-Effective ML Solutions on AWS
Applying best practices to enable maintainable, scalable, and cost-effective ML solutions (for example, automatic scaling on SageMaker AI endpoints, dynamically adding Spot Instances, by using Amazon EC2 instances, by using Lambda behind the endpoints)
This guide covers best practices for deploying machine learning models on AWS in ways that balance performance requirements with cost-efficiency and maintainability.
Learning Objectives
After studying this guide, you should be able to:
- Evaluate the tradeoffs between SageMaker real-time endpoints, serverless (Lambda), and batch inference.
- Configure SageMaker auto-scaling policies using target tracking, scheduled, and step scaling.
- Implement cost-saving measures such as Managed Spot Training and Multi-Model Endpoints (MMEs).
- Identify key metrics (CPU, Memory, Invocations) used to trigger scaling actions.
Key Terms & Glossary
- Scale-Out/In: Adding or removing instances in a cluster to match demand.
- Target Tracking: A scaling policy that maintains a metric (e.g., 50% CPU) by automatically adjusting capacity.
- Managed Spot Training: A SageMaker feature that uses spare AWS capacity for training, saving up to 90% in costs.
- Provisioned Concurrency: A Lambda feature that keeps functions "warm" to eliminate cold start latency for ML inference.
- Multi-Model Endpoint (MME): A single SageMaker endpoint that can host hundreds of models on a shared container, significantly reducing costs for low-traffic models.
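The MME entry above can be made concrete: a client selects which hosted model serves each request via the `TargetModel` parameter of `invoke_endpoint`. A minimal sketch, assuming hypothetical endpoint and artifact names; only the request is built here, since the actual call requires AWS credentials:

```python
# Sketch: invoking one of many models hosted on a Multi-Model Endpoint.
# The endpoint name and model artifact name below are hypothetical.
invoke_request = {
    "EndpointName": "my-mme-endpoint",
    "ContentType": "text/csv",
    "Body": "5.1,3.5,1.4,0.2",
    # TargetModel picks which artifact under the shared S3 prefix to load
    "TargetModel": "model-042.tar.gz",
}

# With credentials configured, the request would be sent as:
# import boto3
# boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_request)
```

On first use of a given `TargetModel`, SageMaker loads that artifact into the shared container (a per-model "cold start"); subsequent requests for the same model are served from memory.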
The "Big Idea"
The core challenge of ML Engineering is the Triple Constraint: Performance (Latency), Scalability (Throughput), and Cost. Effective infrastructure design uses Automation (IaC) to ensure consistency and Elasticity (Auto-scaling) to ensure you only pay for what you use, without manual intervention.
Formula / Concept Box
| Concept | Metric / Formula | Use Case |
|---|---|---|
| Invocations Per Instance | `SageMakerVariantInvocationsPerInstance` (CloudWatch) | Best for scaling based on throughput |
| CPU Utilization | `CPUUtilization` (CloudWatch) | Best for compute-heavy models (e.g., Deep Learning) |
| Model Latency | `ModelLatency` (CloudWatch, microseconds) | Monitoring performance impact during scaling |
| Cost Savings | (On-Demand cost − Spot cost) / On-Demand cost × 100% | Calculating the ROI of Spot Instances |
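The Cost Savings row can be computed directly. A quick sketch of the formula; the example prices are made up for illustration:

```python
def spot_savings(on_demand_cost: float, spot_cost: float) -> float:
    """Percentage saved by using Spot capacity instead of On-Demand."""
    return (on_demand_cost - spot_cost) / on_demand_cost * 100

# Hypothetical: a training job billed at $30.40 On-Demand vs $9.12 on Spot
print(round(spot_savings(30.40, 9.12), 1))  # → 70.0
```

SageMaker reports this figure for you after a Managed Spot Training job completes, as `ManagedSpotTrainingSavings` in the job details.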
Hierarchical Outline
- I. Deployment Targets
- SageMaker Real-Time: Low latency, persistent instances; supports Auto-scaling.
- AWS Lambda: Serverless inference; best for intermittent traffic; uses Provisioned Concurrency for latency.
- SageMaker Batch Transform: Non-real-time; processes large datasets; shuts down after completion.
- II. Auto-Scaling Strategies
- Target Tracking: "Set it and forget it" logic based on a specific metric value.
- Scheduled Scaling: Predictive scaling for known traffic spikes (e.g., business hours).
- Step Scaling: Adjusts capacity in stages based on the size of the metric breach.
- III. Cost Optimization
- Managed Spot Training: Uses `MaxWaitTimeInSeconds` to handle interruptions.
- Inference Recommender: Automates load testing to select the cheapest instance for a latency target.
- Multi-Container Endpoints (MCE): Chains up to 15 containers in a single endpoint.
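Of the scaling strategies in the outline above, scheduled scaling is the easiest to sketch: you tell Application Auto Scaling to change the capacity floor at a fixed time. A hedged example with hypothetical endpoint and action names; only the request is built here, since the call itself needs AWS credentials:

```python
# Sketch: a scheduled scaling action for predictable business-hours traffic.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ScheduledActionName": "business-hours-scale-out",  # hypothetical name
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    # Every weekday at 08:00 UTC, raise the capacity floor to 4 instances
    "Schedule": "cron(0 8 ? * MON-FRI *)",
    "ScalableTargetAction": {"MinCapacity": 4, "MaxCapacity": 10},
}

# With credentials configured:
# import boto3
# boto3.client("application-autoscaling").put_scheduled_action(**scheduled_action)
```

A matching evening action would lower `MinCapacity` again; scheduled actions can coexist with a target-tracking policy, which handles unplanned variation within the scheduled bounds.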
Visual Anchors
Scaling Decision Logic
SageMaker Endpoint Architecture
```latex
\begin{tikzpicture}[node distance=2cm,
    box/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm, align=center}]
  \node (client) [box] {Client \\ Requests};
  \node (lb) [box, right of=client, xshift=1cm] {Load \\ Balancer};
  \node (inst1) [box, above right of=lb, xshift=1.5cm] {Instance A};
  \node (inst2) [box, below right of=lb, xshift=1.5cm] {Instance B};
  \node (asg) [box, right of=lb, xshift=4.5cm, dashed] {Auto-Scaling \\ Policy};
  \draw [->, thick] (client) -- (lb);
  \draw [->] (lb) -- (inst1);
  \draw [->] (lb) -- (inst2);
  \draw [<->, red, thick] (asg) -- (inst1) node[midway, above, sloped] {scale};
  \draw [<->, red, thick] (asg) -- (inst2);
  \draw [->, blue] (inst1) -- ++(2.5,0) node[right] {CloudWatch Metrics};
\end{tikzpicture}
```
Definition-Example Pairs
- Step Scaling: Scaling based on the magnitude of a breach.
- Example: If CPU > 70%, add 2 instances; if CPU > 90%, add 5 instances.
- Cold Start: The delay when a serverless function (Lambda) is invoked after being idle.
- Example: An ML model in Lambda takes 5 seconds to load weights from S3 on the first request but 100ms on subsequent requests.
- Inference Recommender: An AWS tool that suggests instance types.
- Example: SageMaker recommends using `ml.m5.large` instead of `ml.p3.2xlarge` because it meets your 50ms latency goal at 1/10th the cost.
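The cold-start pair above suggests the standard mitigation for ML on Lambda: load the model once at module import time, so only the first invocation of each execution environment pays the load cost. A minimal sketch; `load_model` and the returned model are hypothetical stand-ins for pulling real weights from S3:

```python
import time

def load_model():
    """Stand-in for downloading and deserializing model weights from S3."""
    time.sleep(0.01)        # simulated load delay (the cold-start cost)
    return lambda x: x * 2  # stand-in for a real predict function

# Runs once per execution environment, at import time (the cold start).
MODEL = load_model()

def handler(event, context=None):
    # Warm invocations reuse MODEL and skip load_model() entirely.
    return {"prediction": MODEL(event["x"])}
```

Provisioned Concurrency (see the glossary) goes one step further: it runs this initialization ahead of time, so even the first request hits an already-warm environment.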
Worked Examples
Configuring Auto-Scaling with Boto3
To enable auto-scaling for an existing SageMaker endpoint, you must register the scalable target and then apply the policy.
```python
import boto3

client = boto3.client('application-autoscaling')

# 1. Register the Target (Min: 1, Max: 10 instances)
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)

# 2. Define the Target Tracking Policy
#    (maintain ~50 invocations per instance per minute)
client.put_scaling_policy(
    PolicyName='InvocationsScaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 50.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)
```

> [!NOTE]
> `ScaleOutCooldown` is usually shorter than `ScaleInCooldown` so the system responds quickly to traffic spikes but remains stable during traffic drops.
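Target tracking can be contrasted with a step-scaling policy, where bigger metric breaches trigger bigger capacity adjustments. A hedged sketch with hypothetical names; note that the interval bounds are offsets from the triggering CloudWatch alarm's threshold (here, assumed to be a 70% CPU alarm), and only the request is built since the call needs AWS credentials:

```python
# Sketch: step scaling - breach of 70-90% CPU adds 2 instances, >90% adds 5.
step_policy = {
    "PolicyName": "cpu-step-scaling",  # hypothetical name
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [
            # Offsets from the 70% alarm threshold: 70-90% -> +2 instances
            {"MetricIntervalLowerBound": 0,
             "MetricIntervalUpperBound": 20,
             "ScalingAdjustment": 2},
            # Above 90% -> +5 instances
            {"MetricIntervalLowerBound": 20,
             "ScalingAdjustment": 5},
        ],
        "Cooldown": 60,
    },
}

# With credentials configured:
# import boto3
# boto3.client("application-autoscaling").put_scaling_policy(**step_policy)
```

Unlike target tracking, step scaling requires you to create and manage the CloudWatch alarm yourself, which is why target tracking is the usual "set it and forget it" default.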
Comparison Tables
| Feature | Real-Time Endpoint | AWS Lambda | Batch Transform |
|---|---|---|---|
| Scaling | Horizontal (Instances) | Concurrent Executions | N/A (One-off) |
| Cost Model | Hourly per Instance | Per Request / Duration | Per Instance Hour |
| Max Timeout | 60 Seconds | 15 Minutes | No strict limit |
| Best For | Millisecond Latency | Intermittent Traffic | Massive Datasets |
Checkpoint Questions
- What is the difference between `ScaleInCooldown` and `ScaleOutCooldown`?
- Why would you choose `InvocationsPerInstance` over `CPUUtilization` for scaling an MME?
- How does Managed Spot Training handle an instance interruption?
- What tool would you use to find the most cost-effective instance size for a specific model?
Muddy Points & Cross-Refs
- MME Scaling: When using Multi-Model Endpoints, auto-scaling happens at the instance level, not the model level. If one model gets all the traffic, the entire instance cluster scales out, which may be inefficient if other models are idle.
- Spot Interruption: Remember that Spot Instances can be reclaimed with a 2-minute warning. Always use Checkpoints in your training code to ensure progress is not lost.
- Deep Dive: For more on Infrastructure as Code, see the CloudFormation vs. CDK guide.
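The Spot-interruption point above can be sketched as a minimal checkpoint/resume loop. This is a pure-Python stand-in: on SageMaker you would write checkpoints to `/opt/ml/checkpoints` (synced to the estimator's `checkpoint_s3_uri`) and set `MaxWaitTimeInSeconds` at least as large as `MaxRuntimeInSeconds`; the training step and file path here are hypothetical:

```python
import json
import os

CKPT = "/tmp/ckpt.json"  # on SageMaker, use /opt/ml/checkpoints instead

def save_checkpoint(epoch, weights):
    with open(CKPT, "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "weights": [0.0]}  # no checkpoint: start fresh

def train(total_epochs=3, stop_after=None):
    """Resume from the last checkpoint; stop_after simulates a Spot reclaim."""
    state = load_checkpoint()
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state                      # simulated 2-minute-warning exit
        state["weights"] = [w + 1 for w in state["weights"]]  # stand-in step
        save_checkpoint(epoch + 1, state["weights"])
    return state

if os.path.exists(CKPT):
    os.remove(CKPT)        # clean slate for the demo
train(stop_after=2)        # Spot instance reclaimed after epoch 2
final = train()            # replacement instance resumes at epoch 2, not 0
```

Because the second `train()` call picks up at the saved epoch, only the interrupted epoch is repeated; this is exactly the progress-preservation that makes Spot capacity safe for long training jobs.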