
AWS SageMaker Auto Scaling: Comparing Scaling Policies

This guide covers the fundamental strategies for dynamically managing SageMaker endpoint resources to balance performance requirements with cost efficiency, as required for the AWS Certified Machine Learning Engineer Associate (MLA-C01) exam.

Learning Objectives

  • Differentiate between Target Tracking, Step, Scheduled, and On-Demand scaling.
  • Identify the appropriate metrics (CPU, Invocations) for different model workloads.
  • Understand the role of Scale-In and Scale-Out cooldown periods.
  • Select the correct policy based on traffic predictability and volatility.

Key Terms & Glossary

  • Auto Scaling: The process of automatically adjusting the number of instances in a SageMaker endpoint based on demand.
  • Scale-Out: Adding instances to handle increased load.
  • Scale-In: Removing instances to save costs when load decreases.
  • Target Tracking: A policy that adjusts capacity to maintain a specific metric value (e.g., 50% CPU).
  • Step Scaling: A policy that increases/decreases capacity in chunks based on predefined threshold breaches.
  • Cooldown Period: A mandatory waiting period after a scaling activity before another can start, preventing "flapping."

The "Big Idea"

Scaling is the bridge between Model Performance and Cost Optimization. In an ML production environment, traffic is rarely static. Without scaling, you either pay for idle compute (over-provisioning) or risk endpoint latency/failures (under-provisioning). Auto Scaling policies automate the decision-making process, ensuring the infrastructure reacts to real-time telemetry without manual intervention.

Formula / Concept Box

| Concept | Logic / Rule |
| --- | --- |
| Target Tracking | $NewCapacity = CurrentCapacity \times (ActualMetric / TargetValue)$ |
| Step Scaling | If $Metric > Threshold_1$, add $N$ instances; if $Metric > Threshold_2$, add $M$ instances. |
| Cooldown | No scaling actions allowed for $T$ seconds after a change (default: 300 s). |
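The target-tracking rule can be sketched in plain Python. This is a simplified model of the proportional adjustment (the function name is illustrative; it ignores cooldowns and the registered min/max capacity bounds that the real service enforces):

```python
import math

def target_tracking_capacity(current_capacity: int, actual_metric: float,
                             target_value: float) -> int:
    """Proportional estimate: CurrentCapacity * (ActualMetric / TargetValue).

    Rounds up so scale-out does not undershoot the needed capacity; a
    simplified sketch that ignores cooldowns and Min/MaxCapacity bounds.
    """
    return math.ceil(current_capacity * (actual_metric / target_value))

# 2 instances running at 90% CPU against a 50% target -> scale out to 4
print(target_tracking_capacity(2, 90.0, 50.0))
```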

Hierarchical Outline

  1. Dynamic Scaling (Metrics-Based)
    • Target Tracking Scaling: Continuous adjustment to hit a "moving target."
    • Step Scaling: Discrete adjustments based on "steps" or tiers of demand.
  2. Predictive & Manual Scaling
    • Scheduled Scaling: Time-based (e.g., scale up at 9:00 AM every Monday).
    • On-Demand/Manual: Human-triggered for one-off events.
  3. Key Metrics for SageMaker
    • SageMakerVariantInvocationsPerInstance: Best for request-heavy workloads.
    • CPUUtilization: Best for compute-intensive models (e.g., deep learning inference).
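The step tiers described in item 1 map onto a `StepScalingPolicyConfiguration`. Below is a hedged sketch of that configuration as a plain dict; the tier values mirror the 1,000/5,000 requests-per-minute example used later in this guide, and the endpoint name is illustrative. In practice the policy is attached with `put_scaling_policy` and triggered by a CloudWatch alarm on the chosen metric:

```python
# Step scaling: add capacity in tiers as the alarm breach grows.
# Interval bounds are offsets from the CloudWatch alarm threshold
# (assumed here to be 1,000 requests per minute).
step_policy_config = {
    "AdjustmentType": "ChangeInCapacity",
    "MetricAggregationType": "Average",
    "Cooldown": 300,  # seconds to wait before the next step action
    "StepAdjustments": [
        # between threshold and threshold + 4,000: add 2 instances
        {"MetricIntervalLowerBound": 0.0,
         "MetricIntervalUpperBound": 4000.0,
         "ScalingAdjustment": 2},
        # more than threshold + 4,000 over: add 5 instances
        {"MetricIntervalLowerBound": 4000.0,
         "ScalingAdjustment": 5},
    ],
}

# Attached to the endpoint variant roughly like:
# boto3.client("application-autoscaling").put_scaling_policy(
#     PolicyName="StepScaling",
#     ServiceNamespace="sagemaker",
#     ResourceId="endpoint/my-endpoint/variant/AllTraffic",
#     ScalableDimension="sagemaker:variant:DesiredInstanceCount",
#     PolicyType="StepScaling",
#     StepScalingPolicyConfiguration=step_policy_config,
# )
```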

Visual Anchors

Decision Flow: Choosing a Scaling Policy


Step Scaling Visualization


Definition-Example Pairs

  • Target Tracking: Maintaining a steady-state metric level.
    • Example: Keep average CPU at 50%. If it hits 70%, add instances until it drops back to 50%.
  • Step Scaling: Jumping capacity based on specific boundaries.
    • Example: Add 2 instances immediately if requests per minute exceed 1,000; add 5 if they exceed 5,000.
  • Scheduled Scaling: Matching known business cycles.
    • Example: A retail model scales up 5x during Black Friday starting at 12:00 AM.
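Recurring patterns like the Monday-morning example are expressed as scheduled actions in the Application Auto Scaling API. A sketch of the request parameters (endpoint name, schedule, and capacities are illustrative assumptions):

```python
# Scheduled scaling: raise the capacity floor before a known traffic peak.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ScheduledActionName": "monday-morning-scale-up",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "Schedule": "cron(0 9 ? * MON *)",  # every Monday at 09:00 UTC
    "ScalableTargetAction": {"MinCapacity": 5, "MaxCapacity": 10},
}

# Applied with the Application Auto Scaling client, roughly:
# boto3.client("application-autoscaling").put_scheduled_action(**scheduled_action)
```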

Worked Examples

Configuring Target Tracking (Boto3)

This example shows how to maintain an average of 70 invocations per instance.

```python
import boto3

# Scaling for SageMaker endpoints is managed through Application Auto Scaling,
# not the SageMaker client itself.
client = boto3.client('application-autoscaling')

response = client.put_scaling_policy(
    PolicyName='InvocationsScaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 300,   # wait 5 min after scaling in
        'ScaleOutCooldown': 300,  # wait 5 min after scaling out
    },
)
```
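Before any policy can be attached, the endpoint variant must be registered as a scalable target with `register_scalable_target`, which also sets the hard min/max bounds every policy must respect. A sketch of the prerequisite parameters (the endpoint name and capacity values are illustrative):

```python
# Hard bounds that every subsequent scaling policy must respect.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,  # never scale in below one instance
    "MaxCapacity": 4,  # cap spend during runaway traffic
}

# Registered before calling put_scaling_policy, roughly:
# boto3.client("application-autoscaling").register_scalable_target(**scalable_target)
```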

Checkpoint Questions

  1. Which policy is best for a sudden, massive spike in traffic?
    • Answer: Step Scaling, as it allows for aggressive increases (e.g., +10 instances) immediately upon breaching a high threshold.
  2. What is the primary purpose of the ScaleInCooldown?
    • Answer: To prevent the system from removing instances too quickly, ensuring the system remains stable and doesn't "oscillate" if traffic fluctuates slightly.
  3. When using Multi-Container Endpoints (MCE), why is InvocationsPerInstance potentially risky?
    • Answer: If different containers have vastly different CPU requirements, a high volume of "light" requests might not trigger scaling even while a few "heavy" requests are maxing out the CPU.

Muddy Points & Cross-Refs

[!WARNING] Flapping: This occurs when a system scales out and then immediately scales in because the new capacity lowered the metric too far. Solve this by increasing the Cooldown period.
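The effect of a cooldown on flapping can be seen in a toy simulation. This is pure Python with no AWS dependency, and the single-threshold controller is deliberately simplified compared to a real scaling policy:

```python
def simulate(metric_series, threshold=70.0, cooldown=3):
    """Return (actions_without_cooldown, actions_with_cooldown).

    A naive controller adds an instance whenever the metric is above the
    threshold and removes one whenever it is below; the cooldown blocks
    any new action for `cooldown` steps after the last one.
    """
    def run(cooldown_steps):
        capacity, actions, last_action = 2, 0, -10**9
        for t, m in enumerate(metric_series):
            if t - last_action < cooldown_steps:
                continue  # still cooling down: no scaling allowed
            if m > threshold:
                capacity, actions, last_action = capacity + 1, actions + 1, t
            elif m < threshold:
                capacity = max(1, capacity - 1)
                actions, last_action = actions + 1, t
        return actions

    return run(0), run(cooldown)

# A metric oscillating around the threshold: without a cooldown every
# sample triggers a scale action (flapping); the cooldown suppresses most.
noisy = [70 + (5 if t % 2 == 0 else -5) for t in range(12)]
no_cd, with_cd = simulate(noisy)
print(no_cd, with_cd)
```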

  • Cross-Ref: For edge deployment scaling, see SageMaker Neo and IoT Greengrass (Chapter 6).
  • Cross-Ref: To understand the underlying compute, review EC2 Instance Types (Task 3.2).

Comparison Tables

| Feature | Target Tracking | Step Scaling | Scheduled Scaling |
| --- | --- | --- | --- |
| Best For | Stable, varying load | Rapid, aggressive spikes | Recurring events |
| Complexity | Low (set one value) | Medium (set steps) | Low (set time) |
| Mechanism | Continuous adjustment | Threshold-based steps | Cron-style schedule |
| Cost Control | Excellent | Good (depends on steps) | Poor if the schedule is wrong |
