AWS ML Troubleshooting: Capacity, Cost, and Performance
Troubleshooting capacity concerns that involve cost and performance (for example, provisioned concurrency, service quotas, auto scaling)
This guide focuses on the critical balance between maintaining high-performance machine learning (ML) environments and optimizing cloud expenditure. Specifically, we cover troubleshooting bottlenecks, managing service limits, and configuring scaling strategies within the AWS ecosystem.
Learning Objectives
By the end of this guide, you will be able to:
- Identify performance bottlenecks using CloudWatch metrics (CPU, Memory, GPU).
- Troubleshoot and resolve Service Quota exhaustion issues.
- Configure Auto Scaling and Provisioned Concurrency to mitigate latency and cold starts.
- Select optimal purchasing models (Spot vs. On-Demand) to balance cost and reliability.
Key Terms & Glossary
- Provisioned Concurrency: A configuration for SageMaker Serverless Inference that keeps a specific number of instances "warm" to eliminate cold start latency.
- Service Quotas: Regional limits imposed by AWS on resource usage (e.g., the number of ml.m5.xlarge instances) to prevent runaway costs and protect shared infrastructure.
- Capacity Blocks: A feature allowing users to reserve GPU instances for a specific duration, ensuring availability for critical training or fine-tuning jobs.
- Right-Sizing: The process of matching instance types and sizes to your workload's actual resource requirements.
- Cooldown Period: A set time during which Auto Scaling does not perform further scaling actions, allowing the system to stabilize after a change.
The "Big Idea"
In the AWS ML environment, the goal is Elastic Efficiency. If you have too much capacity, you are wasting money (Overprovisioning). If you have too little, your models fail to respond or train effectively (Underprovisioning). Troubleshooting capacity is essentially the art of monitoring resource "signals" (metrics) and applying "levers" (scaling, quotas, or instance types) to reach an equilibrium where cost and performance meet.
Formula / Concept Box
| Concept | Metric / Rule | Purpose |
|---|---|---|
| Target Tracking | Desired capacity = current load ÷ target metric value | Adjusts capacity to keep a metric (like CPU) at a specific set point. |
| Cold Start Latency | Seconds to initialize a new instance | The delay experienced when a serverless endpoint initializes a new instance; mitigated by Provisioned Concurrency. |
| Scaling Cooldown | Default: 300 seconds | Prevents "flapping" (rapid scaling up and down) by pausing actions. |
| MME Limit | 2 GB | Maximum model size supported for Multi-Model Endpoints in certain configurations. |
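To make the cold-start mitigation concrete, here is a minimal sketch of building the serverless variant for a SageMaker endpoint configuration with Provisioned Concurrency. The model name, memory size, and concurrency values are placeholder assumptions, not recommendations:

```python
def build_serverless_variant(model_name, memory_mb=2048, max_conc=20, provisioned=5):
    """Return a ProductionVariant dict for SageMaker create_endpoint_config.

    ProvisionedConcurrency keeps `provisioned` workers warm, so requests up
    to that level never pay the cold start penalty; traffic beyond it scales
    on demand up to MaxConcurrency.
    """
    assert provisioned <= max_conc, "provisioned concurrency cannot exceed max"
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,  # placeholder name
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_conc,
            "ProvisionedConcurrency": provisioned,
        },
    }

variant = build_serverless_variant("my-model")
# In practice you would pass this to boto3, e.g.:
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="my-config", ProductionVariants=[variant])
```

Note the trade-off from the table above: every provisioned worker is billed whether or not it serves traffic, so set `provisioned` to your baseline load, not your peak.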
Hierarchical Outline
- Monitoring & Identification
- CloudWatch Dashboards: Centralizing CPU, GPU, and Memory utilization.
- Anomalies: Identifying underutilized (cost waste) vs. overburdened (performance drop) resources.
- Scaling Strategies
- Dynamic Auto Scaling: Adjusting instances based on real-time load.
- Scheduled Scaling: Anticipating known traffic spikes (e.g., Black Friday).
- Step Scaling: Aggressive scaling that applies larger adjustments the further a metric breaches its alarm threshold.
- Inference Optimization
- Provisioned Concurrency: Resolving cold starts in Serverless Inference.
- Traffic Shifting: Using Canary or Linear deployments to test capacity before full rollout.
- Resource & Cost Control
- Service Quotas: Monitoring limits via AWS Trusted Advisor and Service Quotas console.
- Purchasing Models: Using Spot Instances for non-critical training to save up to 90% cost.
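The dynamic auto scaling strategy in the outline is typically wired up through Application Auto Scaling. Below is a sketch of a target-tracking policy payload for a SageMaker endpoint variant; the endpoint name and target value are placeholders:

```python
# Placeholder resource id: endpoint name and variant are assumptions.
resource_id = "endpoint/my-endpoint/variant/AllTraffic"

policy = {
    "PolicyName": "keep-invocations-at-target",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Keep the average invocations per instance near this set point.
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        # Cooldowns prevent "flapping" (300 seconds is the common default).
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
}
# In practice:
# boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

The variant must first be registered as a scalable target (with min/max instance counts) before the policy takes effect.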
Visual Anchors
Capacity Troubleshooting Flow
Latency vs. Cost with Provisioned Concurrency
Definition-Example Pairs
- Term: Spot Instances
- Definition: Spare AWS capacity available at a discount, which can be reclaimed by AWS with a 2-minute notice.
- Example: Running a 24-hour hyperparameter tuning job that can be checkpointed and resumed later, saving 70% compared to On-Demand pricing.
- Term: Canary Deployment
- Definition: Shifting a small percentage of traffic (e.g., 5%) to a new model version to monitor performance before a full rollout.
- Example: Testing if a new, larger ResNet model causes memory errors on your current instance type by only exposing it to 1 out of every 20 users.
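The Canary pattern above maps to SageMaker's blue/green deployment configuration. A sketch, assuming the `DeploymentConfig` shape accepted by `update_endpoint`; the alarm and endpoint names are placeholders:

```python
deployment_config = {
    "BlueGreenUpdatePolicy": {
        "TrafficRoutingConfiguration": {
            "Type": "CANARY",
            # Send 5% of traffic to the new fleet first...
            "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 5},
            # ...and hold for 10 minutes before shifting the remainder.
            "WaitIntervalInSeconds": 600,
        },
        "TerminationWaitInSeconds": 300,
    },
    "AutoRollbackConfiguration": {
        # Roll back automatically if this CloudWatch alarm fires
        # (alarm name is a placeholder).
        "Alarms": [{"AlarmName": "new-model-5xx-errors"}]
    },
}
# In practice:
# boto3.client("sagemaker").update_endpoint(
#     EndpointName="my-endpoint", EndpointConfigName="new-config",
#     DeploymentConfig=deployment_config)
```

Pairing the canary step with an auto-rollback alarm is what makes the pattern low risk: a memory error on the new model affects only the canary slice before traffic reverts.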
Worked Examples
Example 1: Scaling Threshold Calculation
Scenario: You have a SageMaker endpoint currently running on 2 instances. Your max CPU utilization is 80% during peak. You want to maintain a 50% utilization target.
- Current State: 2 instances @ 80% = 160% total "units" of CPU work.
- Desired State: Total Units / Target = $160 / 50 = 3.2$ instances.
- Action: The Auto Scaling policy will scale out to 4 instances to ensure the average stays near the 50% target.
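The arithmetic above generalizes: desired capacity is current load divided by the target utilization, rounded up to a whole instance. A minimal sketch:

```python
import math

def desired_instances(current_instances, current_util_pct, target_util_pct):
    # Total "units" of work: 2 instances * 80% = 160.
    total_load = current_instances * current_util_pct
    # Round up -- fractional instances don't exist, and rounding down
    # would leave utilization above target.
    return math.ceil(total_load / target_util_pct)

print(desired_instances(2, 80, 50))  # → 4
```

This mirrors what a target-tracking policy computes internally; the ceiling is why the answer is 4 instances rather than the raw 3.2.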
Example 2: Managing Quotas
Scenario: An ML Engineer attempts to launch a large training job using p4d.24xlarge instances but receives an InsufficientInstanceCapacity error.
- Check Quotas: Navigate to the Service Quotas console.
- Verify Limit: See if the regional limit for "P instance family" is set to 0.
- Resolution: Submit a quota increase request. If the error persists due to AWS-wide demand, use Capacity Blocks to reserve the instances for a scheduled time window.
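The quota check-and-request flow in Example 2 can be scripted against the Service Quotas API. The quota code below is a placeholder, not a real code; in practice you would discover it by listing quotas for `ServiceCode="sagemaker"`:

```python
# Placeholder request payload; QuotaCode is NOT a real code -- look it up
# via list_service_quotas for the sagemaker service.
request = {
    "ServiceCode": "sagemaker",
    "QuotaCode": "L-XXXXXXXX",  # placeholder for the p4d training quota
    "DesiredValue": 4.0,
}
# In practice:
# client = boto3.client("service-quotas")
# current = client.get_service_quota(ServiceCode=request["ServiceCode"],
#                                    QuotaCode=request["QuotaCode"])
# if current["Quota"]["Value"] < request["DesiredValue"]:
#     client.request_service_quota_increase(**request)
```

Note the distinction the scenario glosses over: a quota of 0 produces a limit error you can fix yourself, while `InsufficientInstanceCapacity` means AWS itself lacks spare hardware, which is where Capacity Blocks come in.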
Checkpoint Questions
- What is the main advantage of using Provisioned Concurrency for SageMaker Serverless Inference?
- Which metric would you monitor in CloudWatch to decide if an instance is underprovisioned for a training job?
- True or False: All AWS Service Quotas can be increased by the user manually through the console.
- When should you prefer Step Scaling over Target Tracking?
Muddy Points & Cross-Refs
- Multi-Model Endpoints (MME) Scaling: Scaling MMEs is tricky because one model might be "hot" while others are "cold." If using InvocationsPerInstance for scaling, ensure all models have similar CPU/latency profiles; otherwise, the scaling action might be inaccurate for the specific model receiving the traffic.
- Cross-Reference: For more on cost monitoring, see the AWS Cost Explorer and AWS Budgets documentation in the "Cloud Operations" chapter.
Comparison Tables
Scaling Types
| Type | Best For... | Key Trigger |
|---|---|---|
| Target Tracking | General consistency | Metric average (e.g., 70% RAM) |
| Step Scaling | Aggressive/Variable loads | Alarms with specific "steps" (+2, +5) |
| Scheduled Scaling | Predictable events | Time/Date (e.g., 9:00 AM daily) |
Traffic Shifting Patterns
| Pattern | Risk Level | Speed | Description |
|---|---|---|---|
| All At Once | High | Fast | 100% traffic shifts immediately. |
| Canary | Low | Slow | Small subset first, then the rest. |
| Linear | Medium | Moderate | Incremental steps (e.g., 10% every 5 mins). |