AWS SageMaker Auto Scaling: Comparing Scaling Policies
This guide covers the fundamental strategies for dynamically managing SageMaker endpoint resources to balance performance requirements with cost efficiency, as required for the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam.
Learning Objectives
- Differentiate between Target Tracking, Step, Scheduled, and On-Demand scaling.
- Identify the appropriate metrics (CPU, Invocations) for different model workloads.
- Understand the role of Scale-In and Scale-Out cooldown periods.
- Select the correct policy based on traffic predictability and volatility.
Key Terms & Glossary
- Auto Scaling: The process of automatically adjusting the number of instances in a SageMaker endpoint based on demand.
- Scale-Out: Adding instances to handle increased load.
- Scale-In: Removing instances to save costs when load decreases.
- Target Tracking: A policy that adjusts capacity to maintain a specific metric value (e.g., 50% CPU).
- Step Scaling: A policy that increases/decreases capacity in chunks based on predefined threshold breaches.
- Cooldown Period: A mandatory waiting period after a scaling activity before another can start, preventing "flapping."
The "Big Idea"
Scaling is the bridge between Model Performance and Cost Optimization. In an ML production environment, traffic is rarely static. Without scaling, you either pay for idle compute (over-provisioning) or risk endpoint latency/failures (under-provisioning). Auto Scaling policies automate the decision-making process, ensuring the infrastructure reacts to real-time telemetry without manual intervention.
Formula / Concept Box
| Concept | Logic / Rule |
|---|---|
| Target Tracking | Adjust capacity so the metric stays near the target: $\text{metric} \approx \text{TargetValue}$ (e.g., 50% CPU). |
| Step Scaling | If metric $> T_1$, add $N$ instances; if metric $> T_2$, add $M$ instances. |
| Cooldown | No scaling actions allowed for $T$ seconds after a change (Default: 300s). |
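A common mental model for the target-tracking rule above (a simplification; the actual service works through CloudWatch alarms rather than this exact formula) is a proportional adjustment: scale current capacity by the ratio of the observed metric to the target, rounding up.

```python
import math

def desired_capacity(current_instances: int, current_metric: float, target: float) -> int:
    """Proportional estimate of the capacity needed to bring the metric back to target."""
    return math.ceil(current_instances * current_metric / target)

# 4 instances averaging 70% CPU against a 50% target -> scale out to 6
print(desired_capacity(4, 70.0, 50.0))  # 6
```

Rounding up biases toward scale-out, which matches the service's preference for availability over cost during a breach.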
Hierarchical Outline
- Dynamic Scaling (Metrics-Based)
- Target Tracking Scaling: Continuous adjustment to hit a "moving target."
- Step Scaling: Discrete adjustments based on "steps" or tiers of demand.
- Predictive & Manual Scaling
- Scheduled Scaling: Time-based (e.g., scale up at 9:00 AM every Monday).
- On-Demand/Manual: Human-triggered for one-off events.
- Key Metrics for SageMaker
  - `SageMakerVariantInvocationsPerInstance`: Best for request-heavy workloads.
  - `CPUUtilization`: Best for compute-intensive models (e.g., deep learning inference).
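The scheduled-scaling idea from the outline (scale up at 9:00 AM every Monday) maps onto the Application Auto Scaling `put_scheduled_action` API. Below is a sketch of the request parameters only; the action name, capacity values, and endpoint/variant names are placeholders, and the actual call would be `boto3.client("application-autoscaling").put_scheduled_action(**scheduled_action)`.

```python
# Request parameters for Application Auto Scaling's put_scheduled_action.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ScheduledActionName": "monday-morning-scale-up",   # placeholder name
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    # Every Monday at 09:00 UTC (6-field AWS cron: min hr day-of-month month day-of-week year)
    "Schedule": "cron(0 9 ? * MON *)",
    # Raise the floor so capacity is already in place before the rush (illustrative values)
    "ScalableTargetAction": {"MinCapacity": 4, "MaxCapacity": 10},
}
```

Note that a scheduled action changes the Min/Max bounds rather than setting an exact count, so it composes cleanly with a dynamic policy running inside those bounds.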
Visual Anchors
- Decision Flow: Choosing a Scaling Policy (diagram)
- Step Scaling Visualization (diagram)
Definition-Example Pairs
- Target Tracking: Maintaining a steady-state metric level.
- Example: Keep average CPU at 50%. If it hits 70%, add instances until it drops back to 50%.
- Step Scaling: Jumping capacity based on specific boundaries.
- Example: Add 2 instances immediately if requests per minute exceed 1,000; add 5 if they exceed 5,000.
- Scheduled Scaling: Matching known business cycles.
- Example: A retail model scales up 5x during Black Friday starting at 12:00 AM.
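The step-scaling example above (add 2 instances past 1,000 requests/minute, add 5 past 5,000) can be expressed as a `put_scaling_policy` request. This is a sketch of the parameters only: the CloudWatch alarm that triggers the policy (assumed threshold: 1,000 requests/min) is configured separately, and step bounds are offsets from that alarm threshold.

```python
# Request parameters for Application Auto Scaling's put_scaling_policy,
# called as boto3.client("application-autoscaling").put_scaling_policy(**step_policy).
step_policy = {
    "PolicyName": "AggressiveSpikeScaling",  # placeholder name
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        # Bounds are offsets from the alarm threshold of 1,000 req/min:
        # 1,000-5,000 req/min -> +2 instances; above 5,000 -> +5 instances
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 4000, "ScalingAdjustment": 2},
            {"MetricIntervalLowerBound": 4000, "ScalingAdjustment": 5},
        ],
        "Cooldown": 300,
    },
}
```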
Worked Examples
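Before any scaling policy can be attached, the endpoint variant must be registered as a scalable target with Application Auto Scaling. A sketch of that prerequisite call's parameters (capacity bounds are illustrative); the call itself would be `boto3.client("application-autoscaling").register_scalable_target(**scalable_target)`.

```python
# Request parameters for Application Auto Scaling's register_scalable_target,
# the prerequisite step before put_scaling_policy.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,  # never scale in below one serving instance
    "MaxCapacity": 8,  # cost ceiling for scale-out (illustrative)
}
```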
Configuring Target Tracking (Boto3)
This example shows how to maintain an average of 70 invocations per instance.
```python
import boto3

client = boto3.client("application-autoscaling")

response = client.put_scaling_policy(
    PolicyName='InvocationsScaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 300,   # Wait 5 mins after scaling in
        'ScaleOutCooldown': 300,  # Wait 5 mins after scaling out
    }
)
```
Checkpoint Questions
- Which policy is best for a sudden, massive spike in traffic?
- Answer: Step Scaling, as it allows for aggressive increases (e.g., +10 instances) immediately upon breaching a high threshold.
- What is the primary purpose of the `ScaleInCooldown`?
- Answer: To prevent the system from removing instances too quickly, ensuring the system remains stable and doesn't "oscillate" if traffic fluctuates slightly.
- When using Multi-Container Endpoints (MCE), why is `InvocationsPerInstance` potentially risky?
- Answer: If different containers have vastly different CPU requirements, a high volume of "light" requests might not trigger scaling even if a few "heavy" requests are maxing out the CPU.
Muddy Points & Cross-Refs
> [!WARNING]
> Flapping: This occurs when a system scales out and then immediately scales in because the new capacity lowered the metric too far. Solve this by increasing the Cooldown period.
- Cross-Ref: For edge deployment scaling, see SageMaker Neo and IoT Greengrass (Chapter 6).
- Cross-Ref: To understand the underlying compute, review EC2 Instance Types (Task 3.2).
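The flapping behavior flagged in the warning above can be demonstrated with a tiny simulation (illustrative only: the thresholds, metric trace, and starting capacity are invented). Without a cooldown, an oscillating metric triggers a scaling action every minute; a cooldown suppresses most of them.

```python
def count_scaling_actions(metric_series, threshold_out=70, threshold_in=40, cooldown=0):
    """Count scale-out/scale-in actions over a per-minute metric trace.

    A cooldown of N means no action may occur within N minutes of the last one.
    """
    capacity, last_action, actions = 2, -10**9, 0
    for t, m in enumerate(metric_series):
        if t - last_action < cooldown:
            continue  # still cooling down; ignore the breach
        if m > threshold_out:
            capacity += 1; last_action = t; actions += 1
        elif m < threshold_in and capacity > 1:
            capacity -= 1; last_action = t; actions += 1
    return actions

# A metric that oscillates across both thresholds every minute
trace = [75, 35, 75, 35, 75, 35, 75, 35]
print(count_scaling_actions(trace, cooldown=0))  # 8 actions: flapping
print(count_scaling_actions(trace, cooldown=5))  # 2 actions: cooldown absorbs the noise
```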
Comparison Tables
| Feature | Target Tracking | Step Scaling | Scheduled Scaling |
|---|---|---|---|
| Best For | Stable, varying load | Rapid, aggressive spikes | Recurring events |
| Complexity | Low (Set one value) | Medium (Set steps) | Low (Set time) |
| Mechanism | Continuous adjustment | Threshold-based steps | Cron-style schedule |
| Cost Control | Excellent | Good (depends on steps) | Poor if the schedule mismatches actual traffic |