Mastering Auto Scaling Metrics for SageMaker Endpoints
Choosing specific metrics for auto scaling (for example, model latency, CPU utilization, invocations per instance)
This guide covers the critical decision-making process for selecting auto-scaling metrics to ensure AWS SageMaker endpoints remain performant and cost-effective.
Learning Objectives
- Identify the core CloudWatch metrics used for SageMaker auto scaling.
- Choose the appropriate metric based on model behavior (compute-bound vs. I/O-bound).
- Configure Target Tracking policies using boto3.
- Understand the specific constraints of scaling Multi-Model Endpoints (MME) and Multi-Container Endpoints (MCE).
Key Terms & Glossary
- Target Tracking Scaling: A policy that scales the number of instances to maintain a specific metric value (e.g., keep CPU at 50%).
- Cooldown Period: The amount of time to wait after a scaling activity completes before another can start. This prevents "flapping" (rapid oscillating scaling).
- InvocationsPerInstance: The average number of requests sent to a single instance per minute.
- ModelLatency: The time the model container takes to respond to an inference request, as measured by SageMaker (excludes SageMaker routing overhead; reported in microseconds).
- Scale-Out: Adding instances to handle increased load.
- Scale-In: Removing instances to save costs when load is low.
The "Big Idea"
Auto scaling is the bridge between Performance and Cost. Without it, you must over-provision for peak traffic (expensive) or risk failure during spikes (poor UX). Choosing the right metric is the most important step: a compute-heavy model might fail if scaled by invocations, while a lightweight model might under-utilize resources if scaled by CPU.
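The cost side of this trade-off is easy to quantify. The sketch below compares static peak provisioning against auto scaling for a hypothetical daily traffic profile; the hourly price and instance counts are made-up illustrative numbers, not values from this guide:

```python
# Illustrative cost comparison: static peak provisioning vs. auto scaling.
# The hourly price and traffic shape below are hypothetical example values.

HOURLY_PRICE = 0.23          # assumed $/hour for one CPU inference instance
PEAK_INSTANCES = 10          # instances needed to survive the daily peak

# Hypothetical day: 4 peak hours need 10 instances, the other 20 need only 2.
hourly_demand = [10] * 4 + [2] * 20

static_cost = PEAK_INSTANCES * 24 * HOURLY_PRICE   # pay for peak all day
autoscaled_cost = sum(hourly_demand) * HOURLY_PRICE  # pay only for demand

print(f"Static (peak) provisioning: ${static_cost:.2f}/day")
print(f"Auto-scaled provisioning:   ${autoscaled_cost:.2f}/day")
```

Even in this toy profile, tracking demand cuts the daily bill to roughly a third of static peak provisioning.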
Formula / Concept Box
| Concept | Metric Name | Logic |
|---|---|---|
| Compute Bound | CPUUtilization | Scale out when average CPU exceeds the target (e.g., > 70%) |
| Throughput Bound | InvocationsPerInstance | Scale out when requests per instance per minute exceed the target |
| Experience Bound | ModelLatency | Scale out when average latency exceeds the target (e.g., > 100 ms) |
Hierarchical Outline
- I. SageMaker Scaling Fundamentals
- Target Tracking (Recommended default)
- Step Scaling (Based on predefined CloudWatch Alarms)
- Scheduled Scaling (Best for predictable events like "Cyber Monday")
- II. Core Metrics for Selection
- CPUUtilization: Best for standard ML models running on CPU.
- GPUUtilization: Required for deep learning models on GPU instances (P- and G-families); typically configured via a customized metric specification rather than a predefined one.
- InvocationsPerInstance: Best for lightweight models where throughput is the bottleneck.
- ModelLatency: Directly correlates to user experience.
- III. Configuration Parameters
- Min/Max Capacity: Safety rails for instance counts.
- Cooldowns: A common pattern is 300 seconds for scale-in and a shorter value (e.g., 60 seconds) for scale-out.
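The Scheduled Scaling option from the outline can be sketched with Application Auto Scaling's `put_scheduled_action`. The endpoint name, timestamp, and capacities below are hypothetical placeholders:

```python
# Sketch of a Scheduled Scaling action: pre-warm capacity before a known
# event ("Cyber Monday") by raising the minimum instance count. In practice
# the dict is passed to Application Auto Scaling:
#   boto3.client('application-autoscaling').put_scheduled_action(**action)

def prewarm_action():
    return dict(
        ServiceNamespace='sagemaker',
        ScheduledActionName='cyber-monday-prewarm',
        ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        # One-time schedule; cron(...) and rate(...) expressions also work.
        Schedule='at(2025-12-01T06:00:00)',
        ScalableTargetAction={'MinCapacity': 8, 'MaxCapacity': 20},
    )

action = prewarm_action()
print(action['ScheduledActionName'], action['Schedule'])
```

Raising MinCapacity (rather than disabling target tracking) lets the reactive policy keep working on top of the scheduled floor.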
Visual Anchors
Scaling Decision Logic
Instance Capacity vs. Traffic
\begin{tikzpicture}
  % Axes
  \draw[->] (0,0) -- (6,0) node[right] {Time};
  \draw[->] (0,0) -- (0,4) node[above] {Load/Instances};
  % Traffic curve
  \draw[blue, thick] (0,0.5) .. controls (2,0.5) and (3,3.5) .. (4,3.5)
    .. controls (5,3.5) and (5.5,1) .. (6,1);
  \node[blue] at (5,3.8) {Traffic (Metric)};
  % Instance steps
  \draw[red, dashed, thick] (0,1) -- (2.5,1) -- (2.5,2) -- (3.2,2)
    -- (3.2,3) -- (5,3) -- (5,2) -- (6,2);
  \node[red] at (1,1.3) {Instances};
\end{tikzpicture}
Definition-Example Pairs
- CPUUtilization: Percentage of CPU consumed.
- Example: An XGBoost model performing heavy feature engineering on-the-fly will saturate the CPU before the network limit is reached.
- InvocationsPerInstance: The count of requests per instance.
- Example: A small Scikit-learn Logistic Regression model handles requests in 2ms; you scale when each instance hits 500 requests per minute to prevent network congestion.
- ModelLatency: Time in microseconds to complete inference.
- Example: A real-time translation app must stay under 200 ms; scaling is triggered when average latency reaches 150 ms (a target value of 150,000, since the metric is reported in microseconds).
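The InvocationsPerInstance example above maps directly onto a target-tracking configuration. A minimal sketch, assuming a hypothetical endpoint/variant ResourceId; in practice the dict is passed to `put_scaling_policy` as in the worked example that follows:

```python
# Target-tracking policy for the 500-requests/minute example above.
# The ResourceId is a hypothetical placeholder; the dict would be passed to
#   boto3.client('application-autoscaling').put_scaling_policy(**policy)

def invocations_policy(target_per_minute: float = 500.0):
    return dict(
        PolicyName=f'Invocations-Target-{int(target_per_minute)}',
        ServiceNamespace='sagemaker',
        ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            # Add instances when average invocations/instance/min exceed 500
            'TargetValue': target_per_minute,
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
            },
            'ScaleInCooldown': 300,
            'ScaleOutCooldown': 60,
        },
    )

policy = invocations_policy()
print(policy['PolicyName'])
```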
Worked Examples
Implementing CPU-Based Scaling via Boto3

```python
import boto3

client = boto3.client('application-autoscaling')

# 1. Register the SageMaker variant as a scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,
    MaxCapacity=10
)

# 2. Apply a target-tracking policy that holds average CPU near 50%
client.put_scaling_policy(
    PolicyName='CPU-Target-50',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 50.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantUtilization'
        },
        'ScaleInCooldown': 300,   # wait 5 min before removing instances
        'ScaleOutCooldown': 60    # react quickly to rising load
    }
)
```

Checkpoint Questions
- Why would you use a longer Cooldown for Scale-In than for Scale-Out?
- For a Multi-Container Endpoint (MCE), why is it dangerous to use InvocationsPerInstance if models have different CPU profiles?
- Which metric is most appropriate for a Large Language Model (LLM) running on ml.g5.xlarge instances?
Muddy Points & Cross-Refs
- MME/MCE Scaling: If using InvocationsPerInstance, models must have similar latency/CPU profiles. If one model is 10x heavier, the metric will be skewed, leading to under-scaling.
- SageMakerVariantUtilization: This is a predefined metric in CloudWatch that specifically maps to the average CPU utilization of a SageMaker variant.
- Load Testing: You cannot guess these values. You must use tools like Locust or JMeter to find the "Breaking Point" of your instance type before setting the TargetValue.
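Locust and JMeter are the standard tools, but the core idea behind finding a breaking point can be sketched in a few lines: fire concurrent requests and inspect tail latency. The `send` callable below is injectable; against a real endpoint it would wrap a sagemaker-runtime `invoke_endpoint` call (the endpoint name in the comment is illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latencies(send, n_requests=50, concurrency=10):
    """Fire n_requests through `send` with the given concurrency and
    return sorted per-request latencies in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sorted(pool.map(timed_call, range(n_requests)))

def p99(latencies):
    # Crude percentile: the value below which ~99% of samples fall
    return latencies[int(len(latencies) * 0.99) - 1]

# Against a live endpoint, `send` would look something like:
#   runtime = boto3.client('sagemaker-runtime')
#   send = lambda: runtime.invoke_endpoint(
#       EndpointName='sentiment-prod', ContentType='text/csv', Body='...')

# Stand-in workload: a 5 ms sleep simulates a fast model's response time.
latencies = measure_latencies(lambda: time.sleep(0.005))
print(f"p99 latency: {p99(latencies) * 1000:.1f} ms")
```

Re-running this at increasing concurrency on a single instance reveals where latency collapses, which is the data you need before setting the TargetValue.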
Comparison Tables
| Metric | Best For | Pros | Cons |
|---|---|---|---|
| CPU Utilization | Most standard ML | Accurate measure of load | May lag behind sudden traffic spikes |
| Invocations | Lightweight models | Immediate reaction to traffic volume | Doesn't account for varying request complexity |
| GPU Utilization | Deep Learning | Essential for GPU-bound tasks | Only available on GPU-enabled instances |
| Latency | SLA-sensitive apps | Directly measures user experience | High variance can cause unnecessary scaling |
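The "high variance" caveat in the last row can be checked before committing to ModelLatency as a scaling metric: compute the coefficient of variation of recent latency samples. The sample data and the 0.5 threshold below are illustrative, not prescriptive:

```python
import statistics

def latency_cv(samples_us):
    """Coefficient of variation (std dev / mean) of latency samples.
    A high CV (e.g., > 0.5) suggests latency-based target tracking may
    oscillate; a throughput or CPU metric could be steadier."""
    return statistics.stdev(samples_us) / statistics.mean(samples_us)

# Hypothetical ModelLatency samples in microseconds
steady = [100_000, 105_000, 98_000, 102_000]   # tight distribution
spiky  = [40_000, 300_000, 55_000, 250_000]    # heavy-tailed

print(f"steady CV: {latency_cv(steady):.2f}")
print(f"spiky  CV: {latency_cv(spiky):.2f}")
```

The steady profile is a good candidate for latency-based scaling; the spiky one would keep triggering scale-outs on noise rather than sustained load.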