
Scalable and Cost-Effective ML Solutions on AWS

Applying best practices to enable maintainable, scalable, and cost-effective ML solutions (for example, automatic scaling on SageMaker AI endpoints, dynamically adding Spot Instances, by using Amazon EC2 instances, by using Lambda behind the endpoints)


This guide covers the best practices for deploying machine learning models on AWS that balance performance requirements with cost-efficiency and maintainability.

Learning Objectives

After studying this guide, you should be able to:

  • Evaluate the tradeoffs between SageMaker real-time endpoints, serverless (Lambda), and batch inference.
  • Configure SageMaker auto-scaling policies using target tracking, scheduled, and step scaling.
  • Implement cost-saving measures such as Managed Spot Training and Multi-Model Endpoints (MMEs).
  • Identify key metrics (CPU, Memory, Invocations) used to trigger scaling actions.

Key Terms & Glossary

  • Scale-Out/In: Adding or removing instances in a cluster to match demand.
  • Target Tracking: A scaling policy that maintains a metric (e.g., 50% CPU) by automatically adjusting capacity.
  • Managed Spot Training: A SageMaker feature that uses spare AWS capacity for training, saving up to 90% in costs.
  • Provisioned Concurrency: A Lambda feature that keeps functions "warm" to eliminate cold start latency for ML inference.
  • Multi-Model Endpoint (MME): A single SageMaker endpoint that can host hundreds of models on a shared container, significantly reducing costs for low-traffic models.
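To make the MME concept concrete, invoking a specific model differs from a standard endpoint invocation only by the TargetModel parameter. A minimal sketch of the request (the endpoint name and artifact path are hypothetical):

```python
# Request parameters for sagemaker-runtime's invoke_endpoint API.
# Endpoint name and model artifact are hypothetical examples.
request = {
    "EndpointName": "my-multi-model-endpoint",
    "ContentType": "application/json",
    # Relative path of the model artifact under the MME's S3 model prefix;
    # SageMaker lazily loads it into the shared container on first use.
    "TargetModel": "model-42.tar.gz",
    "Body": b'{"features": [1.0, 2.0, 3.0]}',
}

# In a real deployment, the call would look like:
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**request)
```

Because models are loaded on demand, the first invocation of a cold model incurs extra latency while the artifact is fetched from S3.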

The "Big Idea"

The core challenge of ML Engineering is the Triple Constraint: Performance (Latency), Scalability (Throughput), and Cost. Effective infrastructure design uses Automation (IaC) to ensure consistency and Elasticity (Auto-scaling) to ensure you only pay for what you use, without manual intervention.

Formula / Concept Box

| Concept | Metric / Formula | Use Case |
| --- | --- | --- |
| Invocations Per Instance | $\frac{\text{Total Invocations}}{\text{Instance Count}}$ | Best for scaling based on throughput |
| CPU Utilization | % of CPU used | Best for compute-heavy models (e.g., deep learning) |
| Model Latency | Time per inference (ms) | Monitoring performance impact during scaling |
| Cost Savings | $\left(1 - \frac{\text{Spot Price}}{\text{On-Demand Price}}\right) \times 100$ | Calculating the ROI of Spot Instances |
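Applying the cost-savings formula is straightforward; the prices below are hypothetical and used only for illustration:

```python
def spot_savings_pct(spot_price: float, on_demand_price: float) -> float:
    """Percent saved by running on Spot versus On-Demand capacity."""
    return (1 - spot_price / on_demand_price) * 100

# Hypothetical hourly prices in USD, for illustration only.
print(round(spot_savings_pct(0.92, 3.06), 1))  # 69.9 (% saved)
```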

Hierarchical Outline

  • I. Deployment Targets
    • SageMaker Real-Time: Low latency, persistent instances; supports Auto-scaling.
    • AWS Lambda: Serverless inference; best for intermittent traffic; uses Provisioned Concurrency for latency.
    • SageMaker Batch Transform: Non-real-time; processes large datasets; shuts down after completion.
  • II. Auto-Scaling Strategies
    • Target Tracking: "Set it and forget it" logic based on a specific metric value.
    • Scheduled Scaling: Predictive scaling for known traffic spikes (e.g., business hours).
    • Step Scaling: Adjusts capacity in stages based on the size of the metric breach.
  • III. Cost Optimization
    • Managed Spot Training: Uses MaxWaitTimeInSeconds to handle interruptions.
    • Inference Recommender: Automates load testing to select the cheapest instance for a latency target.
    • Multi-Container Endpoints (MCE): Hosts up to 15 distinct containers on a single endpoint, invoked directly or chained as a serial inference pipeline.
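Scheduled Scaling from the outline above uses the same Application Auto Scaling service as target tracking, via the put_scheduled_action API. A sketch of the request parameters (endpoint, variant, and action names are hypothetical):

```python
# Parameters for application-autoscaling's put_scheduled_action API.
# Resource and action names are hypothetical; cron expressions are in UTC.
scheduled_action = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/AllTraffic",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "ScheduledActionName": "scale-out-business-hours",
    "Schedule": "cron(0 8 ? * MON-FRI *)",  # 08:00 UTC every weekday
    # Raise the floor ahead of the known traffic spike.
    "ScalableTargetAction": {"MinCapacity": 4, "MaxCapacity": 10},
}

# Applied with:
# client = boto3.client("application-autoscaling")
# client.put_scheduled_action(**scheduled_action)
```

A matching action scheduled for the evening would lower MinCapacity again, so the cluster shrinks outside business hours.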

Visual Anchors

Scaling Decision Logic


SageMaker Endpoint Architecture

```latex
\begin{tikzpicture}[node distance=2cm,
    box/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm, align=center}]
  \node (client) [box] {Client \\ Requests};
  \node (lb)     [box, right of=client, xshift=1cm] {Load \\ Balancer};
  \node (inst1)  [box, above right of=lb, xshift=1.5cm] {Instance A};
  \node (inst2)  [box, below right of=lb, xshift=1.5cm] {Instance B};
  \node (asg)    [box, right of=lb, xshift=4.5cm, dashed] {Auto-Scaling \\ Policy};

  \draw [->, thick] (client) -- (lb);
  \draw [->] (lb) -- (inst1);
  \draw [->] (lb) -- (inst2);
  \draw [<->, red, thick] (asg) -- (inst1) node[midway, above, sloped] {scale};
  \draw [<->, red, thick] (asg) -- (inst2);
  \draw [->, blue] (inst1) -- ++(2.5,0) node[right] {CloudWatch Metrics};
\end{tikzpicture}
```

Definition-Example Pairs

  • Step Scaling: Scaling based on the magnitude of a breach.
    • Example: If CPU > 70%, add 2 instances; if CPU > 90%, add 5 instances.
  • Cold Start: The delay when a serverless function (Lambda) is invoked after being idle.
    • Example: An ML model in Lambda takes 5 seconds to load weights from S3 on the first request but 100ms on subsequent requests.
  • Inference Recommender: An AWS tool that suggests instance types.
    • Example: SageMaker recommends using ml.m5.large instead of ml.p3.2xlarge because it meets your 50ms latency goal at 1/10th the cost.
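The step-scaling example above ("if CPU > 70%, add 2; if CPU > 90%, add 5") maps onto a StepScalingPolicyConfiguration, where each step is defined by its distance from the alarm threshold. A small helper makes the step logic explicit:

```python
# Step adjustments are expressed relative to a 70% CPU alarm threshold:
# a breach of 0-20 points adds 2 instances; 20+ points adds 5.
step_policy = {
    "AdjustmentType": "ChangeInCapacity",
    "StepAdjustments": [
        {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 20.0,
         "ScalingAdjustment": 2},
        {"MetricIntervalLowerBound": 20.0, "ScalingAdjustment": 5},
    ],
    "Cooldown": 120,
}

def instances_to_add(cpu_pct: float, threshold: float = 70.0) -> int:
    """Evaluate the step adjustments for a given CPU reading."""
    breach = cpu_pct - threshold
    if breach < 0:
        return 0  # no alarm, no scaling action
    for step in step_policy["StepAdjustments"]:
        upper = step.get("MetricIntervalUpperBound", float("inf"))
        if step["MetricIntervalLowerBound"] <= breach < upper:
            return step["ScalingAdjustment"]
    return 0

print(instances_to_add(75))  # 2
print(instances_to_add(95))  # 5
```

The helper is only a local illustration of how the service evaluates the steps; in practice the configuration is passed to put_scaling_policy with PolicyType='StepScaling'.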

Worked Examples

Configuring Auto-Scaling with Boto3

To enable auto-scaling for an existing SageMaker endpoint, you must register the scalable target and then apply the policy.

```python
import boto3

client = boto3.client('application-autoscaling')

# 1. Register the Target (Min: 1, Max: 10 instances)
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10
)

# 2. Define the Target Tracking Policy
#    (maintain an average of 50 invocations per instance)
client.put_scaling_policy(
    PolicyName='InvocationScaling',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 50.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)
```

> [!NOTE]
> ScaleOutCooldown is usually shorter than ScaleInCooldown to allow the system to respond quickly to traffic spikes but remain stable during traffic drops.

Comparison Tables

| Feature | Real-Time Endpoint | AWS Lambda | Batch Transform |
| --- | --- | --- | --- |
| Scaling | Horizontal (Instances) | Concurrent Executions | N/A (one-off jobs) |
| Cost Model | Hourly per Instance | Per Request / Duration | Per Instance-Hour |
| Max Timeout | 60 Seconds | 15 Minutes | No strict limit |
| Best For | Millisecond Latency | Intermittent Traffic | Massive Datasets |
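For the Lambda column, Provisioned Concurrency is configured on a published version or alias of the function, never on $LATEST. A sketch of the request (function name and alias are hypothetical):

```python
# Parameters for Lambda's put_provisioned_concurrency_config API.
# Function name and alias are hypothetical examples.
pc_config = {
    "FunctionName": "ml-inference-fn",
    "Qualifier": "prod",  # must be a published version or alias, not $LATEST
    # Number of execution environments kept initialized ("warm").
    "ProvisionedConcurrentExecutions": 5,
}

# Applied with:
# client = boto3.client("lambda")
# client.put_provisioned_concurrency_config(**pc_config)
```

Warm environments are billed while provisioned, so this trades a fixed hourly cost for the elimination of cold-start latency.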

Checkpoint Questions

  1. What is the difference between ScaleInCooldown and ScaleOutCooldown?
  2. Why would you choose InvocationsPerInstance over CPUUtilization for scaling an MME?
  3. How does Managed Spot Training handle an instance interruption?
  4. What tool would you use to find the most cost-effective instance size for a specific model?

Muddy Points & Cross-Refs

  • MME Scaling: When using Multi-Model Endpoints, auto-scaling happens at the instance level, not the model level. If one model gets all the traffic, the entire instance cluster scales out, which may be inefficient if other models are idle.
  • Spot Interruption: Remember that Spot Instances can be reclaimed with a 2-minute warning. Always use Checkpoints in your training code to ensure progress is not lost.
  • Deep Dive: For more on Infrastructure as Code, see the CloudFormation vs. CDK guide.
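The checkpoint pattern mentioned above can be sketched framework-agnostically. SageMaker syncs a local checkpoint directory (commonly /opt/ml/checkpoints) to S3; here a temp-file stand-in and a simulated interruption illustrate the save-and-resume logic:

```python
import json
import os
import tempfile

# Stand-in for SageMaker's checkpoint directory (e.g., /opt/ml/checkpoints),
# which the service continuously syncs to S3 during training.
CKPT = os.path.join(tempfile.gettempdir(), "ckpt.json")

def save_checkpoint(epoch: int, state: dict) -> None:
    """Persist training progress so a reclaimed Spot instance can resume."""
    with open(CKPT, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)

def load_checkpoint() -> dict:
    """Return the last saved progress, or a fresh start if none exists."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)
    return {"epoch": 0, "state": {}}

# First run: checkpoint after every epoch, "interrupted" after epoch 3.
for epoch in range(1, 4):
    save_checkpoint(epoch, {"loss": 1.0 / epoch})

# Resumed run (on a replacement instance) picks up at the last checkpoint.
resumed = load_checkpoint()
print(resumed["epoch"])  # 3
```

Real training scripts would serialize model weights and optimizer state (e.g., with the framework's own save/load utilities) instead of a JSON dict, but the resume-from-last-checkpoint flow is the same.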
