Mastering Auto Scaling Metrics for SageMaker Endpoints
Choosing specific metrics for auto scaling (for example, model latency, CPU utilization, invocations per instance)
This guide covers the critical decision-making process for selecting auto-scaling metrics to ensure AWS SageMaker endpoints remain performant and cost-effective.
Learning Objectives
- Identify the core CloudWatch metrics used for SageMaker auto scaling.
- Choose the appropriate metric based on model behavior (compute-bound vs. I/O-bound).
- Configure Target Tracking policies using boto3.
- Understand the specific constraints of scaling Multi-Model Endpoints (MME) and Multi-Container Endpoints (MCE).
Key Terms & Glossary
- Target Tracking Scaling: A policy that scales the number of instances to maintain a specific metric value (e.g., keep CPU at 50%).
- Cooldown Period: The amount of time to wait after a scaling activity completes before another can start. This prevents "flapping" (rapid oscillating scaling).
- InvocationsPerInstance: The average number of requests sent to a single instance per minute.
- ModelLatency: The time the model container takes to respond to an inference request, as measured by SageMaker (excludes SageMaker routing overhead; reported in microseconds).
- Scale-Out: Adding instances to handle increased load.
- Scale-In: Removing instances to save costs when load is low.
The "Big Idea"
Auto scaling is the bridge between Performance and Cost. Without it, you must over-provision for peak traffic (expensive) or risk failure during spikes (poor UX). Choosing the right metric is the most important step: a compute-heavy model might fail if scaled by invocations, while a lightweight model might under-utilize resources if scaled by CPU.
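The cost side of this trade-off is easy to quantify. The sketch below compares static peak provisioning against auto scaling for a hypothetical daily traffic profile; the hourly price and instance counts are made-up illustrative numbers, not values from this guide:

```python
# Illustrative cost comparison: static peak provisioning vs. auto scaling.
# The hourly price and traffic shape below are hypothetical example values.

HOURLY_PRICE = 0.23          # assumed $/hour for one CPU inference instance
PEAK_INSTANCES = 10          # instances needed to survive the daily peak

# Hypothetical day: 4 peak hours need 10 instances, the other 20 need only 2.
hourly_demand = [10] * 4 + [2] * 20

static_cost = PEAK_INSTANCES * 24 * HOURLY_PRICE   # pay for peak all day
autoscaled_cost = sum(hourly_demand) * HOURLY_PRICE  # pay only for demand

print(f"Static (peak) provisioning: ${static_cost:.2f}/day")
print(f"Auto-scaled provisioning:   ${autoscaled_cost:.2f}/day")
```

Even in this toy profile, tracking demand cuts the daily bill to roughly a third of static peak provisioning.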
Formula / Concept Box
| Concept | Metric Name | Logic |
|---|---|---|
| Compute Bound | CPUUtilization | Scale out when average CPU exceeds the target (e.g., > 70%) |
| Throughput Bound | InvocationsPerInstance | Scale out when requests per instance per minute exceed the target |
| Experience Bound | ModelLatency | Scale out when average latency exceeds the target (e.g., > 100 ms) |
Hierarchical Outline
- I. SageMaker Scaling Fundamentals
- Target Tracking (Recommended default)
- Step Scaling (Based on predefined CloudWatch Alarms)
- Scheduled Scaling (Best for predictable events like "Cyber Monday")
- II. Core Metrics for Selection
- CPUUtilization: Best for standard ML models running on CPU.
- GPUUtilization: Required for deep learning models on GPU instances (P- and G-families); typically configured via a customized metric specification rather than a predefined one.
- InvocationsPerInstance: Best for lightweight models where throughput is the bottleneck.
- ModelLatency: Directly correlates to user experience.
- III. Configuration Parameters
- Min/Max Capacity: Safety rails for instance counts.
- Cooldowns: A common pattern is 300 seconds for scale-in and a shorter value (e.g., 60 seconds) for scale-out.
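The Scheduled Scaling option from the outline can be sketched with Application Auto Scaling's `put_scheduled_action`. The endpoint name, timestamp, and capacities below are hypothetical placeholders:

```python
# Sketch of a Scheduled Scaling action: pre-warm capacity before a known
# event ("Cyber Monday") by raising the minimum instance count. In practice
# the dict is passed to Application Auto Scaling:
#   boto3.client('application-autoscaling').put_scheduled_action(**action)

def prewarm_action():
    return dict(
        ServiceNamespace='sagemaker',
        ScheduledActionName='cyber-monday-prewarm',
        ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        # One-time schedule; cron(...) and rate(...) expressions also work.
        Schedule='at(2025-12-01T06:00:00)',
        ScalableTargetAction={'MinCapacity': 8, 'MaxCapacity': 20},
    )

action = prewarm_action()
print(action['ScheduledActionName'], action['Schedule'])
```

Raising MinCapacity (rather than disabling target tracking) lets the reactive policy keep working on top of the scheduled floor.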
Visual Anchors
Scaling Decision Logic
Instance Capacity vs. Traffic
\begin{tikzpicture}
  % Axes
  \draw[->] (0,0) -- (6,0) node[right] {Time};
  \draw[->] (0,0) -- (0,4) node[above] {Load/Instances};
  % Traffic curve
  \draw[blue, thick] (0,0.5) .. controls (2,0.5) and (3,3.5) .. (4,3.5)
    .. controls (5,3.5) and (5.5,1) .. (6,1);
  \node[blue] at (5,3.8) {Traffic (Metric)};
  % Instance steps
  \draw[red, dashed, thick] (0,1) -- (2.5,1) -- (2.5,2) -- (3.2,2)
    -- (3.2,3) -- (5,3) -- (5,2) -- (6,2);
  \node[red] at (1,1.3) {Instances};
\end{tikzpicture}
Definition-Example Pairs
- CPUUtilization: Percentage of CPU consumed.
- Example: An XGBoost model performing heavy feature engineering on-the-fly will saturate the CPU before the network limit is reached.
- InvocationsPerInstance: The count of requests per instance.
- Example: A small Scikit-learn Logistic Regression model handles requests in 2ms; you scale when each instance hits 500 requests per minute to prevent network congestion.
- ModelLatency: Time in microseconds to complete inference.
- Example: A real-time translation app must stay under 200 ms; scaling is triggered when average latency reaches 150 ms (a target value of 150,000, since the metric is reported in microseconds).
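The InvocationsPerInstance example above maps directly onto a target-tracking configuration. A minimal sketch, assuming a hypothetical endpoint/variant ResourceId; in practice the dict is passed to `put_scaling_policy` as in the worked example that follows:

```python
# Target-tracking policy for the 500-requests/minute example above.
# The ResourceId is a hypothetical placeholder; the dict would be passed to
#   boto3.client('application-autoscaling').put_scaling_policy(**policy)

def invocations_policy(target_per_minute: float = 500.0):
    return dict(
        PolicyName=f'Invocations-Target-{int(target_per_minute)}',
        ServiceNamespace='sagemaker',
        ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            # Add instances when average invocations/instance/min exceed 500
            'TargetValue': target_per_minute,
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
            },
            'ScaleInCooldown': 300,
            'ScaleOutCooldown': 60,
        },
    )

policy = invocations_policy()
print(policy['PolicyName'])
```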
Worked Examples
Implementing CPU-Based Scaling via Boto3

```python
import boto3

client = boto3.client('application-autoscaling')

# 1. Register the SageMaker variant as a scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=2,
    MaxCapacity=10
)

# 2. Apply a target-tracking policy that holds average CPU near 50%
client.put_scaling_policy(
    PolicyName='CPU-Target-50',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/sentiment-prod/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 50.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantUtilization'
        },
        'ScaleInCooldown': 300,   # wait 5 min before removing instances
        'ScaleOutCooldown': 60    # react quickly to rising load
    }
)
```

Checkpoint Questions
- Why would you use a longer Cooldown for Scale-In than for Scale-Out?
- For a Multi-Container Endpoint (MCE), why is it dangerous to use InvocationsPerInstance if models have different CPU profiles?
- Which metric is most appropriate for a Large Language Model (LLM) running on ml.g5.xlarge instances?
Muddy Points & Cross-Refs
- MME/MCE Scaling: If using InvocationsPerInstance, models must have similar latency/CPU profiles. If one model is 10x heavier, the metric will be skewed, leading to under-scaling.
- SageMakerVariantUtilization: This is a predefined metric in CloudWatch that specifically maps to the average CPU utilization of a SageMaker variant.
- Load Testing: You cannot guess these values. You must use tools like Locust or JMeter to find the "Breaking Point" of your instance type before setting the TargetValue.
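Locust and JMeter are the standard tools, but the core idea behind finding a breaking point can be sketched in a few lines: fire concurrent requests and inspect tail latency. The `send` callable below is injectable; against a real endpoint it would wrap a sagemaker-runtime `invoke_endpoint` call (the endpoint name in the comment is illustrative):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_latencies(send, n_requests=50, concurrency=10):
    """Fire n_requests through `send` with the given concurrency and
    return sorted per-request latencies in seconds."""
    def timed_call(_):
        start = time.perf_counter()
        send()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sorted(pool.map(timed_call, range(n_requests)))

def p99(latencies):
    # Crude percentile: the value below which ~99% of samples fall
    return latencies[int(len(latencies) * 0.99) - 1]

# Against a live endpoint, `send` would look something like:
#   runtime = boto3.client('sagemaker-runtime')
#   send = lambda: runtime.invoke_endpoint(
#       EndpointName='sentiment-prod', ContentType='text/csv', Body='...')

# Stand-in workload: a 5 ms sleep simulates a fast model's response time.
latencies = measure_latencies(lambda: time.sleep(0.005))
print(f"p99 latency: {p99(latencies) * 1000:.1f} ms")
```

Re-running this at increasing concurrency on a single instance reveals where latency collapses, which is the data you need before setting the TargetValue.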
Comparison Tables
| Metric | Best For | Pros | Cons |
|---|---|---|---|
| CPU Utilization | Most standard ML | Accurate measure of load | May lag behind sudden traffic spikes |
| Invocations | Lightweight models | Immediate reaction to traffic volume | Doesn't account for varying request complexity |
| GPU Utilization | Deep Learning | Essential for GPU-bound tasks | Only available on GPU-enabled instances |
| Latency | SLA-sensitive apps | Directly measures user experience | High variance can cause unnecessary scaling |
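The "high variance" caveat in the last row can be checked before committing to ModelLatency as a scaling metric: compute the coefficient of variation of recent latency samples. The sample data and the 0.5 threshold below are illustrative, not prescriptive:

```python
import statistics

def latency_cv(samples_us):
    """Coefficient of variation (std dev / mean) of latency samples.
    A high CV (e.g., > 0.5) suggests latency-based target tracking may
    oscillate; a throughput or CPU metric could be steadier."""
    return statistics.stdev(samples_us) / statistics.mean(samples_us)

# Hypothetical ModelLatency samples in microseconds
steady = [100_000, 105_000, 98_000, 102_000]   # tight distribution
spiky  = [40_000, 300_000, 55_000, 250_000]    # heavy-tailed

print(f"steady CV: {latency_cv(steady):.2f}")
print(f"spiky  CV: {latency_cv(spiky):.2f}")
```

The steady profile is a good candidate for latency-based scaling; the spiky one would keep triggering scale-outs on noise rather than sustained load.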