AWS SageMaker Auto Scaling: Comparing Scaling Policies
This guide covers the fundamental strategies for dynamically managing SageMaker endpoint resources to balance performance requirements with cost efficiency, as required for the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam.
Learning Objectives
- Differentiate between Target Tracking, Step, Scheduled, and On-Demand scaling.
- Identify the appropriate metrics (CPU, Invocations) for different model workloads.
- Understand the role of Scale-In and Scale-Out cooldown periods.
- Select the correct policy based on traffic predictability and volatility.
Key Terms & Glossary
- Auto Scaling: The process of automatically adjusting the number of instances in a SageMaker endpoint based on demand.
- Scale-Out: Adding instances to handle increased load.
- Scale-In: Removing instances to save costs when load decreases.
- Target Tracking: A policy that adjusts capacity to maintain a specific metric value (e.g., 50% CPU).
- Step Scaling: A policy that increases/decreases capacity in chunks based on predefined threshold breaches.
- Cooldown Period: A mandatory waiting period after a scaling activity before another can start, preventing "flapping."
The "Big Idea"
Scaling is the bridge between Model Performance and Cost Optimization. In an ML production environment, traffic is rarely static. Without scaling, you either pay for idle compute (over-provisioning) or risk endpoint latency/failures (under-provisioning). Auto Scaling policies automate the decision-making process, ensuring the infrastructure reacts to real-time telemetry without manual intervention.
Formula / Concept Box
| Concept | Logic / Rule |
|---|---|
| Target Tracking | Adjust capacity so the metric stays near the target: $\text{metric} \approx \text{TargetValue}$ (e.g., 50% CPU). |
| Step Scaling | If metric $> T_1$, add $N$ instances; if metric $> T_2$, add $M$ instances. |
| Cooldown | No scaling actions allowed for $T$ seconds after a change (Default: 300s). |
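A common mental model for the target-tracking rule above (a simplification; the actual service works through CloudWatch alarms rather than this exact formula) is a proportional adjustment: scale current capacity by the ratio of the observed metric to the target, rounding up.

```python
import math

def desired_capacity(current_instances: int, current_metric: float, target: float) -> int:
    """Proportional estimate of the capacity needed to bring the metric back to target."""
    return math.ceil(current_instances * current_metric / target)

# 4 instances averaging 70% CPU against a 50% target -> scale out to 6
print(desired_capacity(4, 70.0, 50.0))  # 6
```

Rounding up biases toward scale-out, which matches the service's preference for availability over cost during a breach.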
Hierarchical Outline
- Dynamic Scaling (Metrics-Based)
- Target Tracking Scaling: Continuous adjustment to hit a "moving target."
- Step Scaling: Discrete adjustments based on "steps" or tiers of demand.
- Predictive & Manual Scaling
- Scheduled Scaling: Time-based (e.g., scale up at 9:00 AM every Monday).
- On-Demand/Manual: Human-triggered for one-off events.
- Key Metrics for SageMaker
  - `SageMakerVariantInvocationsPerInstance`: Best for request-heavy workloads.
  - `CPUUtilization`: Best for compute-intensive models (e.g., deep learning inference).
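The scheduled-scaling idea from the outline (scale up at 9:00 AM every Monday) maps onto the Application Auto Scaling `put_scheduled_action` API. Below is a sketch of the request parameters only; the action name, capacity values, and endpoint/variant names are placeholders, and the actual call would be `boto3.client("application-autoscaling").put_scheduled_action(**scheduled_action)`.

```python
# Request parameters for Application Auto Scaling's put_scheduled_action.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ScheduledActionName": "monday-morning-scale-up",   # placeholder name
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    # Every Monday at 09:00 UTC (6-field AWS cron: min hr day-of-month month day-of-week year)
    "Schedule": "cron(0 9 ? * MON *)",
    # Raise the floor so capacity is already in place before the rush (illustrative values)
    "ScalableTargetAction": {"MinCapacity": 4, "MaxCapacity": 10},
}
```

Note that a scheduled action changes the Min/Max bounds rather than setting an exact count, so it composes cleanly with a dynamic policy running inside those bounds.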
Visual Anchors
- Decision Flow: Choosing a Scaling Policy (diagram)
- Step Scaling Visualization (diagram)
Definition-Example Pairs
- Target Tracking: Maintaining a steady-state metric level.
- Example: Keep average CPU at 50%. If it hits 70%, add instances until it drops back to 50%.
- Step Scaling: Jumping capacity based on specific boundaries.
- Example: Add 2 instances immediately if requests per minute exceed 1,000; add 5 if they exceed 5,000.
- Scheduled Scaling: Matching known business cycles.
- Example: A retail model scales up 5x during Black Friday starting at 12:00 AM.
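The step-scaling example above (add 2 instances past 1,000 requests/minute, add 5 past 5,000) can be expressed as a `put_scaling_policy` request. This is a sketch of the parameters only: the CloudWatch alarm that triggers the policy (assumed threshold: 1,000 requests/min) is configured separately, and step bounds are offsets from that alarm threshold.

```python
# Request parameters for Application Auto Scaling's put_scaling_policy,
# called as boto3.client("application-autoscaling").put_scaling_policy(**step_policy).
step_policy = {
    "PolicyName": "AggressiveSpikeScaling",  # placeholder name
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        # Bounds are offsets from the alarm threshold of 1,000 req/min:
        # 1,000-5,000 req/min -> +2 instances; above 5,000 -> +5 instances
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 4000, "ScalingAdjustment": 2},
            {"MetricIntervalLowerBound": 4000, "ScalingAdjustment": 5},
        ],
        "Cooldown": 300,
    },
}
```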
Worked Examples
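Before any scaling policy can be attached, the endpoint variant must be registered as a scalable target with Application Auto Scaling. A sketch of that prerequisite call's parameters (capacity bounds are illustrative); the call itself would be `boto3.client("application-autoscaling").register_scalable_target(**scalable_target)`.

```python
# Request parameters for Application Auto Scaling's register_scalable_target,
# the prerequisite step before put_scaling_policy.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,  # never scale in below one serving instance
    "MaxCapacity": 8,  # cost ceiling for scale-out (illustrative)
}
```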
Configuring Target Tracking (Boto3)
This example shows how to maintain an average of 70 invocations per instance.
```python
import boto3

client = boto3.client("application-autoscaling")

response = client.put_scaling_policy(
    PolicyName='InvocationsScaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 300,   # Wait 5 mins after scaling in
        'ScaleOutCooldown': 300,  # Wait 5 mins after scaling out
    }
)
```
Checkpoint Questions
- Which policy is best for a sudden, massive spike in traffic?
- Answer: Step Scaling, as it allows for aggressive increases (e.g., +10 instances) immediately upon breaching a high threshold.
- What is the primary purpose of the `ScaleInCooldown`?
- Answer: To prevent the system from removing instances too quickly, ensuring the system remains stable and doesn't "oscillate" if traffic fluctuates slightly.
- When using Multi-Container Endpoints (MCE), why is `InvocationsPerInstance` potentially risky?
- Answer: If different containers have vastly different CPU requirements, a high volume of "light" requests might not trigger scaling even if a few "heavy" requests are maxing out the CPU.
Muddy Points & Cross-Refs
> [!WARNING]
> Flapping: This occurs when a system scales out and then immediately scales in because the new capacity lowered the metric too far. Solve this by increasing the Cooldown period.
- Cross-Ref: For edge deployment scaling, see SageMaker Neo and IoT Greengrass (Chapter 6).
- Cross-Ref: To understand the underlying compute, review EC2 Instance Types (Task 3.2).
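The flapping behavior flagged in the warning above can be demonstrated with a tiny simulation (illustrative only: the thresholds, metric trace, and starting capacity are invented). Without a cooldown, an oscillating metric triggers a scaling action every minute; a cooldown suppresses most of them.

```python
def count_scaling_actions(metric_series, threshold_out=70, threshold_in=40, cooldown=0):
    """Count scale-out/scale-in actions over a per-minute metric trace.

    A cooldown of N means no action may occur within N minutes of the last one.
    """
    capacity, last_action, actions = 2, -10**9, 0
    for t, m in enumerate(metric_series):
        if t - last_action < cooldown:
            continue  # still cooling down; ignore the breach
        if m > threshold_out:
            capacity += 1; last_action = t; actions += 1
        elif m < threshold_in and capacity > 1:
            capacity -= 1; last_action = t; actions += 1
    return actions

# A metric that oscillates across both thresholds every minute
trace = [75, 35, 75, 35, 75, 35, 75, 35]
print(count_scaling_actions(trace, cooldown=0))  # 8 actions: flapping
print(count_scaling_actions(trace, cooldown=5))  # 2 actions: cooldown absorbs the noise
```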
Comparison Tables
| Feature | Target Tracking | Step Scaling | Scheduled Scaling |
|---|---|---|---|
| Best For | Stable, varying load | Rapid, aggressive spikes | Recurring events |
| Complexity | Low (Set one value) | Medium (Set steps) | Low (Set time) |
| Mechanism | Continuous adjustment | Threshold-based steps | Cron-style schedule |
| Cost Control | Excellent | Good (depends on steps) | Poor if the schedule mismatches actual traffic |