AWS ML Troubleshooting: Capacity, Cost, and Performance
Troubleshooting capacity concerns that involve cost and performance (for example, provisioned concurrency, service quotas, auto scaling)
This guide focuses on the critical balance between maintaining high-performance machine learning (ML) environments and optimizing cloud expenditure. Specifically, we cover troubleshooting bottlenecks, managing service limits, and configuring scaling strategies within the AWS ecosystem.
Learning Objectives
By the end of this guide, you will be able to:
- Identify performance bottlenecks using CloudWatch metrics (CPU, Memory, GPU).
- Troubleshoot and resolve Service Quota exhaustion issues.
- Configure Auto Scaling and Provisioned Concurrency to mitigate latency and cold starts.
- Select optimal purchasing models (Spot vs. On-Demand) to balance cost and reliability.
Key Terms & Glossary
- Provisioned Concurrency: A configuration for SageMaker Serverless Inference that keeps a specific number of instances "warm" to eliminate cold start latency.
- Service Quotas: Regional limits imposed by AWS on resource usage (e.g., the number of ml.m5.xlarge instances) to prevent runaway costs and protect shared infrastructure.
- Capacity Blocks: A feature allowing users to reserve GPU instances for a specific duration, ensuring availability for critical training or fine-tuning jobs.
- Right-Sizing: The process of matching instance types and sizes to your workload's actual resource requirements.
- Cooldown Period: A set time during which Auto Scaling does not perform further scaling actions, allowing the system to stabilize after a change.
The "Big Idea"
In the AWS ML environment, the goal is Elastic Efficiency. If you have too much capacity, you are wasting money (Overprovisioning). If you have too little, your models fail to respond or train effectively (Underprovisioning). Troubleshooting capacity is essentially the art of monitoring resource "signals" (metrics) and applying "levers" (scaling, quotas, or instance types) to reach an equilibrium where cost and performance meet.
Formula / Concept Box
| Concept | Metric / Rule | Purpose |
|---|---|---|
| Target Tracking | Desired capacity = current load ÷ target metric value | Adjusts capacity to keep a metric (like CPU) at a specific set point. |
| Cold Start Latency | Seconds to initialize a new instance | The delay experienced when a serverless endpoint initializes a new instance; mitigated by Provisioned Concurrency. |
| Scaling Cooldown | Default: 300 seconds | Prevents "flapping" (rapid scaling up and down) by pausing actions. |
| MME Limit | 2 GB | Maximum model size supported for Multi-Model Endpoints in certain configurations. |
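To make the cold-start mitigation concrete, here is a minimal sketch of building the serverless variant for a SageMaker endpoint configuration with Provisioned Concurrency. The model name, memory size, and concurrency values are placeholder assumptions, not recommendations:

```python
def build_serverless_variant(model_name, memory_mb=2048, max_conc=20, provisioned=5):
    """Return a ProductionVariant dict for SageMaker create_endpoint_config.

    ProvisionedConcurrency keeps `provisioned` workers warm, so requests up
    to that level never pay the cold start penalty; traffic beyond it scales
    on demand up to MaxConcurrency.
    """
    assert provisioned <= max_conc, "provisioned concurrency cannot exceed max"
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,  # placeholder name
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_conc,
            "ProvisionedConcurrency": provisioned,
        },
    }

variant = build_serverless_variant("my-model")
# In practice you would pass this to boto3, e.g.:
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="my-config", ProductionVariants=[variant])
```

Note the trade-off from the table above: every provisioned worker is billed whether or not it serves traffic, so set `provisioned` to your baseline load, not your peak.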
Hierarchical Outline
- Monitoring & Identification
- CloudWatch Dashboards: Centralizing CPU, GPU, and Memory utilization.
- Anomalies: Identifying underutilized (cost waste) vs. overburdened (performance drop) resources.
- Scaling Strategies
- Dynamic Auto Scaling: Adjusting instances based on real-time load.
- Scheduled Scaling: Anticipating known traffic spikes (e.g., Black Friday).
- Step Scaling: Aggressive scaling that applies larger adjustments the further a metric breaches its alarm threshold.
- Inference Optimization
- Provisioned Concurrency: Resolving cold starts in Serverless Inference.
- Traffic Shifting: Using Canary or Linear deployments to test capacity before full rollout.
- Resource & Cost Control
- Service Quotas: Monitoring limits via AWS Trusted Advisor and Service Quotas console.
- Purchasing Models: Using Spot Instances for non-critical training to save up to 90% cost.
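The dynamic auto scaling strategy in the outline is typically wired up through Application Auto Scaling. Below is a sketch of a target-tracking policy payload for a SageMaker endpoint variant; the endpoint name and target value are placeholders:

```python
# Placeholder resource id: endpoint name and variant are assumptions.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

policy = {
    "PolicyName": "keep-invocations-at-target",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Keep the average invocations per instance near this set point.
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        # Cooldowns prevent "flapping" (300 seconds is the common default).
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
}
# In practice:
# boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

The variant must first be registered as a scalable target (with min/max instance counts) before the policy takes effect.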
Visual Anchors
Capacity Troubleshooting Flow
Latency vs. Cost with Provisioned Concurrency
Definition-Example Pairs
- Term: Spot Instances
- Definition: Spare AWS capacity available at a discount, which can be reclaimed by AWS with a 2-minute notice.
- Example: Running a 24-hour hyperparameter tuning job that can be checkpointed and resumed later, saving 70% compared to On-Demand pricing.
- Term: Canary Deployment
- Definition: Shifting a small percentage of traffic (e.g., 5%) to a new model version to monitor performance before a full rollout.
- Example: Testing if a new, larger ResNet model causes memory errors on your current instance type by only exposing it to 1 out of every 20 users.
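The Canary pattern above maps to SageMaker's blue/green deployment configuration. A sketch, assuming the `DeploymentConfig` shape accepted by `update_endpoint`; the alarm and endpoint names are placeholders:

```python
deployment_config = {
    "BlueGreenUpdatePolicy": {
        "TrafficRoutingConfiguration": {
            "Type": "CANARY",
            # Send 5% of traffic to the new fleet first...
            "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 5},
            # ...and hold for 10 minutes before shifting the remainder.
            "WaitIntervalInSeconds": 600,
        },
        "TerminationWaitInSeconds": 300,
    },
    "AutoRollbackConfiguration": {
        # Roll back automatically if this CloudWatch alarm fires
        # (alarm name is a placeholder).
        "Alarms": [{"AlarmName": "new-model-5xx-errors"}]
    },
}
# In practice:
# boto3.client("sagemaker").update_endpoint(
#     EndpointName="my-endpoint", EndpointConfigName="new-config",
#     DeploymentConfig=deployment_config)
```

Pairing the canary step with an auto-rollback alarm is what makes the pattern low risk: a memory error on the new model affects only the canary slice before traffic reverts.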
Worked Examples
Example 1: Scaling Threshold Calculation
Scenario: You have a SageMaker endpoint currently running on 2 instances. Your max CPU utilization is 80% during peak. You want to maintain a 50% utilization target.
- Current State: 2 instances @ 80% = 160% total "units" of CPU work.
- Desired State: Total Units / Target = $160 / 50 = 3.2$ instances.
- Action: The Auto Scaling policy will scale out to 4 instances to ensure the average stays near the 50% target.
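The arithmetic above generalizes: desired capacity is current load divided by the target utilization, rounded up to a whole instance. A minimal sketch:

```python
import math

def desired_instances(current_instances, current_util_pct, target_util_pct):
    # Total "units" of work: 2 instances * 80% = 160.
    total_load = current_instances * current_util_pct
    # Round up -- fractional instances don't exist, and rounding down
    # would leave utilization above target.
    return math.ceil(total_load / target_util_pct)

print(desired_instances(2, 80, 50))  # → 4
```

This mirrors what a target-tracking policy computes internally; the ceiling is why the answer is 4 instances rather than the raw 3.2.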
Example 2: Managing Quotas
Scenario: An ML Engineer attempts to launch a large training job using p4d.24xlarge instances but receives an InsufficientInstanceCapacity error.
- Check Quotas: Navigate to the Service Quotas console.
- Verify Limit: See if the regional limit for "P instance family" is set to 0.
- Resolution: Submit a quota increase request. If the error persists due to AWS-wide demand, use Capacity Blocks to reserve the instances for a scheduled time window.
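The quota check-and-request flow in Example 2 can be scripted against the Service Quotas API. The quota code below is a placeholder, not a real code; in practice you would discover it by listing quotas for `ServiceCode="sagemaker"`:

```python
# Placeholder request payload; QuotaCode is NOT a real code -- look it up
# via list_service_quotas for the sagemaker service.
request = {
    "ServiceCode": "sagemaker",
    "QuotaCode": "L-XXXXXXXX",  # placeholder for the p4d training quota
    "DesiredValue": 4.0,
}
# In practice:
# client = boto3.client("service-quotas")
# current = client.get_service_quota(ServiceCode=request["ServiceCode"],
#                                    QuotaCode=request["QuotaCode"])
# if current["Quota"]["Value"] < request["DesiredValue"]:
#     client.request_service_quota_increase(**request)
```

Note the distinction the scenario glosses over: a quota of 0 produces a limit error you can fix yourself, while `InsufficientInstanceCapacity` means AWS itself lacks spare hardware, which is where Capacity Blocks come in.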
Checkpoint Questions
- What is the main advantage of using Provisioned Concurrency for SageMaker Serverless Inference?
- Which metric would you monitor in CloudWatch to decide if an instance is underprovisioned for a training job?
- True or False: All AWS Service Quotas can be increased by the user manually through the console.
- When should you prefer Step Scaling over Target Tracking?
Muddy Points & Cross-Refs
- Multi-Model Endpoints (MME) Scaling: Scaling MMEs is tricky because one model might be "hot" while others are "cold." If using InvocationsPerInstance for scaling, ensure all models have similar CPU/latency profiles; otherwise, the scaling action might be inaccurate for the specific model receiving the traffic.
- Cross-Reference: For more on cost monitoring, see the AWS Cost Explorer and AWS Budgets documentation in the "Cloud Operations" chapter.
Comparison Tables
Scaling Types
| Type | Best For... | Key Trigger |
|---|---|---|
| Target Tracking | General consistency | Metric average (e.g., 70% RAM) |
| Step Scaling | Aggressive/Variable loads | Alarms with specific "steps" (+2, +5) |
| Scheduled Scaling | Predictable events | Time/Date (e.g., 9:00 AM daily) |
Traffic Shifting Patterns
| Pattern | Risk Level | Speed | Description |
|---|---|---|---|
| All At Once | High | Fast | 100% traffic shifts immediately. |
| Canary | Low | Slow | Small subset first, then the rest. |
| Linear | Medium | Moderate | Incremental steps (e.g., 10% every 5 mins). |