Scalable and Cost-Effective ML Solutions on AWS
Applying best practices to enable maintainable, scalable, and cost-effective ML solutions (for example, automatic scaling on SageMaker AI endpoints, dynamically adding Spot Instances, by using Amazon EC2 instances, by using Lambda behind the endpoints)
This guide covers best practices for deploying machine learning models on AWS in ways that balance performance requirements with cost-efficiency and maintainability.
Learning Objectives
After studying this guide, you should be able to:
- Evaluate the tradeoffs between SageMaker real-time endpoints, serverless (Lambda), and batch inference.
- Configure SageMaker auto-scaling policies using target tracking, scheduled, and step scaling.
- Implement cost-saving measures such as Managed Spot Training and Multi-Model Endpoints (MMEs).
- Identify key metrics (CPU, Memory, Invocations) used to trigger scaling actions.
Key Terms & Glossary
- Scale-Out/In: Adding or removing instances in a cluster to match demand.
- Target Tracking: A scaling policy that maintains a metric (e.g., 50% CPU) by automatically adjusting capacity.
- Managed Spot Training: A SageMaker feature that uses spare AWS capacity for training, saving up to 90% in costs.
- Provisioned Concurrency: A Lambda feature that keeps functions "warm" to eliminate cold start latency for ML inference.
- Multi-Model Endpoint (MME): A single SageMaker endpoint that can host hundreds of models on a shared container, significantly reducing costs for low-traffic models.
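The MME entry above can be made concrete: a client selects which hosted model serves each request via the `TargetModel` parameter of `invoke_endpoint`. A minimal sketch, assuming hypothetical endpoint and artifact names; only the request is built here, since the actual call requires AWS credentials:

```python
# Sketch: invoking one of many models hosted on a Multi-Model Endpoint.
# The endpoint name and model artifact name below are hypothetical.
invoke_request = {
    "EndpointName": "my-mme-endpoint",
    "ContentType": "text/csv",
    "Body": "5.1,3.5,1.4,0.2",
    # TargetModel picks which artifact under the shared S3 prefix to load
    "TargetModel": "model-042.tar.gz",
}

# With credentials configured, the request would be sent as:
# import boto3
# boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_request)
```

On first use of a given `TargetModel`, SageMaker loads that artifact into the shared container (a per-model "cold start"); subsequent requests for the same model are served from memory.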
The "Big Idea"
The core challenge of ML Engineering is the Triple Constraint: Performance (Latency), Scalability (Throughput), and Cost. Effective infrastructure design uses Automation (IaC) to ensure consistency and Elasticity (Auto-scaling) to ensure you only pay for what you use, without manual intervention.
Formula / Concept Box
| Concept | Metric / Formula | Use Case |
|---|---|---|
| Invocations Per Instance | `SageMakerVariantInvocationsPerInstance` (CloudWatch) | Best for scaling based on throughput |
| CPU Utilization | `CPUUtilization` (CloudWatch) | Best for compute-heavy models (e.g., Deep Learning) |
| Model Latency | `ModelLatency` (CloudWatch, microseconds) | Monitoring performance impact during scaling |
| Cost Savings | (On-Demand cost − Spot cost) / On-Demand cost × 100% | Calculating the ROI of Spot Instances |
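The Cost Savings row can be computed directly. A quick sketch of the formula; the example prices are made up for illustration:

```python
def spot_savings(on_demand_cost: float, spot_cost: float) -> float:
    """Percentage saved by using Spot capacity instead of On-Demand."""
    return (on_demand_cost - spot_cost) / on_demand_cost * 100

# Hypothetical: a training job billed at $30.40 On-Demand vs $9.12 on Spot
print(round(spot_savings(30.40, 9.12), 1))  # → 70.0
```

SageMaker reports this figure for you after a Managed Spot Training job completes, as `ManagedSpotTrainingSavings` in the job details.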
Hierarchical Outline
- I. Deployment Targets
- SageMaker Real-Time: Low latency, persistent instances; supports Auto-scaling.
- AWS Lambda: Serverless inference; best for intermittent traffic; uses Provisioned Concurrency for latency.
- SageMaker Batch Transform: Non-real-time; processes large datasets; shuts down after completion.
- II. Auto-Scaling Strategies
- Target Tracking: "Set it and forget it" logic based on a specific metric value.
- Scheduled Scaling: Predictive scaling for known traffic spikes (e.g., business hours).
- Step Scaling: Adjusts capacity in stages based on the size of the metric breach.
- III. Cost Optimization
- Managed Spot Training: Uses `MaxWaitTimeInSeconds` to handle interruptions.
- Inference Recommender: Automates load testing to select the cheapest instance for a latency target.
- Multi-Container Endpoints (MCE): Chains up to 15 containers in a single endpoint.
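Of the scaling strategies in the outline above, scheduled scaling is the easiest to sketch: you tell Application Auto Scaling to change the capacity floor at a fixed time. A hedged example with hypothetical endpoint and action names; only the request is built here, since the call itself needs AWS credentials:

```python
# Sketch: a scheduled scaling action for predictable business-hours traffic.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ScheduledActionName": "business-hours-scale-out",  # hypothetical name
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    # Every weekday at 08:00 UTC, raise the capacity floor to 4 instances
    "Schedule": "cron(0 8 ? * MON-FRI *)",
    "ScalableTargetAction": {"MinCapacity": 4, "MaxCapacity": 10},
}

# With credentials configured:
# import boto3
# boto3.client("application-autoscaling").put_scheduled_action(**scheduled_action)
```

A matching evening action would lower `MinCapacity` again; scheduled actions can coexist with a target-tracking policy, which handles unplanned variation within the scheduled bounds.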
Visual Anchors
Scaling Decision Logic
SageMaker Endpoint Architecture
```latex
\begin{tikzpicture}[node distance=2cm,
    box/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm, align=center}]
  \node (client) [box] {Client \\ Requests};
  \node (lb) [box, right of=client, xshift=1cm] {Load \\ Balancer};
  \node (inst1) [box, above right of=lb, xshift=1.5cm] {Instance A};
  \node (inst2) [box, below right of=lb, xshift=1.5cm] {Instance B};
  \node (asg) [box, right of=lb, xshift=4.5cm, dashed] {Auto-Scaling \\ Policy};
  \draw [->, thick] (client) -- (lb);
  \draw [->] (lb) -- (inst1);
  \draw [->] (lb) -- (inst2);
  \draw [<->, red, thick] (asg) -- (inst1) node[midway, above, sloped] {scale};
  \draw [<->, red, thick] (asg) -- (inst2);
  \draw [->, blue] (inst1) -- ++(2.5,0) node[right] {CloudWatch Metrics};
\end{tikzpicture}
```
Definition-Example Pairs
- Step Scaling: Scaling based on the magnitude of a breach.
- Example: If CPU > 70%, add 2 instances; if CPU > 90%, add 5 instances.
- Cold Start: The delay when a serverless function (Lambda) is invoked after being idle.
- Example: An ML model in Lambda takes 5 seconds to load weights from S3 on the first request but 100ms on subsequent requests.
- Inference Recommender: An AWS tool that suggests instance types.
- Example: SageMaker recommends using `ml.m5.large` instead of `ml.p3.2xlarge` because it meets your 50ms latency goal at 1/10th the cost.
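The cold-start pair above suggests the standard mitigation for ML on Lambda: load the model once at module import time, so only the first invocation of each execution environment pays the load cost. A minimal sketch; `load_model` and the returned model are hypothetical stand-ins for pulling real weights from S3:

```python
import time

def load_model():
    """Stand-in for downloading and deserializing model weights from S3."""
    time.sleep(0.01)        # simulated load delay (the cold-start cost)
    return lambda x: x * 2  # stand-in for a real predict function

# Runs once per execution environment, at import time (the cold start).
MODEL = load_model()

def handler(event, context=None):
    # Warm invocations reuse MODEL and skip load_model() entirely.
    return {"prediction": MODEL(event["x"])}
```

Provisioned Concurrency (see the glossary) goes one step further: it runs this initialization ahead of time, so even the first request hits an already-warm environment.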
Worked Examples
Configuring Auto-Scaling with Boto3
To enable auto-scaling for an existing SageMaker endpoint, you must register the scalable target and then apply the policy.
```python
import boto3

client = boto3.client('application-autoscaling')

# 1. Register the Target (Min: 1, Max: 10 instances)
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)

# 2. Define the Target Tracking Policy
#    (maintain ~50 invocations per instance per minute)
client.put_scaling_policy(
    PolicyName='InvocationsScaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 50.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)
```

> [!NOTE]
> `ScaleOutCooldown` is usually shorter than `ScaleInCooldown` so the system responds quickly to traffic spikes but remains stable during traffic drops.
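Target tracking can be contrasted with a step-scaling policy, where bigger metric breaches trigger bigger capacity adjustments. A hedged sketch with hypothetical names; note that the interval bounds are offsets from the triggering CloudWatch alarm's threshold (here, assumed to be a 70% CPU alarm), and only the request is built since the call needs AWS credentials:

```python
# Sketch: step scaling - breach of 70-90% CPU adds 2 instances, >90% adds 5.
step_policy = {
    "PolicyName": "cpu-step-scaling",  # hypothetical name
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [
            # Offsets from the 70% alarm threshold: 70-90% -> +2 instances
            {"MetricIntervalLowerBound": 0,
             "MetricIntervalUpperBound": 20,
             "ScalingAdjustment": 2},
            # Above 90% -> +5 instances
            {"MetricIntervalLowerBound": 20,
             "ScalingAdjustment": 5},
        ],
        "Cooldown": 60,
    },
}

# With credentials configured:
# import boto3
# boto3.client("application-autoscaling").put_scaling_policy(**step_policy)
```

Unlike target tracking, step scaling requires you to create and manage the CloudWatch alarm yourself, which is why target tracking is the usual "set it and forget it" default.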
Comparison Tables
| Feature | Real-Time Endpoint | AWS Lambda | Batch Transform |
|---|---|---|---|
| Scaling | Horizontal (Instances) | Concurrent Executions | N/A (One-off) |
| Cost Model | Hourly per Instance | Per Request / Duration | Per Instance Hour |
| Max Timeout | 60 Seconds | 15 Minutes | No strict limit |
| Best For | Millisecond Latency | Intermittent Traffic | Massive Datasets |
Checkpoint Questions
- What is the difference between `ScaleInCooldown` and `ScaleOutCooldown`?
- Why would you choose `InvocationsPerInstance` over `CPUUtilization` for scaling an MME?
- How does Managed Spot Training handle an instance interruption?
- What tool would you use to find the most cost-effective instance size for a specific model?
Muddy Points & Cross-Refs
- MME Scaling: When using Multi-Model Endpoints, auto-scaling happens at the instance level, not the model level. If one model gets all the traffic, the entire instance cluster scales out, which may be inefficient if other models are idle.
- Spot Interruption: Remember that Spot Instances can be reclaimed with a 2-minute warning. Always use Checkpoints in your training code to ensure progress is not lost.
- Deep Dive: For more on Infrastructure as Code, see the CloudFormation vs. CDK guide.
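The Spot-interruption point above can be sketched as a minimal checkpoint/resume loop. This is a pure-Python stand-in: on SageMaker you would write checkpoints to `/opt/ml/checkpoints` (synced to the estimator's `checkpoint_s3_uri`) and set `MaxWaitTimeInSeconds` at least as large as `MaxRuntimeInSeconds`; the training step and file path here are hypothetical:

```python
import json
import os

CKPT = "/tmp/ckpt.json"  # on SageMaker, use /opt/ml/checkpoints instead

def save_checkpoint(epoch, weights):
    with open(CKPT, "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "weights": [0.0]}  # no checkpoint: start fresh

def train(total_epochs=3, stop_after=None):
    """Resume from the last checkpoint; stop_after simulates a Spot reclaim."""
    state = load_checkpoint()
    for epoch in range(state["epoch"], total_epochs):
        if stop_after is not None and epoch >= stop_after:
            return state                      # simulated 2-minute-warning exit
        state["weights"] = [w + 1 for w in state["weights"]]  # stand-in step
        save_checkpoint(epoch + 1, state["weights"])
    return state

if os.path.exists(CKPT):
    os.remove(CKPT)        # clean slate for the demo
train(stop_after=2)        # Spot instance reclaimed after epoch 2
final = train()            # replacement instance resumes at epoch 2, not 0
```

Because the second `train()` call picks up at the saved epoch, only the interrupted epoch is repeated; this is exactly the progress-preservation that makes Spot capacity safe for long training jobs.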