
On-Demand vs. Provisioned Resources: A Study Guide for AWS Machine Learning

This guide explores the architectural trade-offs between on-demand and provisioned resources within the AWS Machine Learning ecosystem, specifically focusing on Amazon SageMaker endpoints and serverless inference.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between On-Demand Serverless Inference and Provisioned Concurrency.
  • Select the appropriate scaling policy (Target, Step, or Scheduled) based on workload predictability.
  • Evaluate the cost-benefit ratio of pre-warming resources for low-latency applications.
  • Identify use cases for manual on-demand scaling versus automated scaling.

Key Terms & Glossary

  • Provisioned Concurrency: A setting that keeps a specific number of serverless instances "warm" and ready to respond immediately to API calls.
  • Cold Start: The latency delay experienced when an on-demand resource must be initialized (downloading model, starting container) before processing a request.
  • Target Tracking: An auto-scaling policy that adjusts capacity to maintain a specific metric, like 70% CPU utilization.
  • Scheduled Scaling: A proactive scaling approach that adjusts capacity based on known time-based patterns using cron expressions.
  • On-Demand Scaling: A manual or reactive increase in resources to handle unpredictable surges.

The "Big Idea"

The core challenge in cloud infrastructure is the Latency-Cost Trade-off. Provisioned resources provide the lowest latency by eliminating "cold starts" but incur costs even when idle. On-demand resources optimize for cost by only running when needed but introduce latency during initialization. Choosing the right one depends entirely on your traffic's predictability and your application's sensitivity to delay.

Formula / Concept Box

| Concept | Metric / Rule | Primary Use Case |
| --- | --- | --- |
| Cost Optimization | $\text{Cost} \propto \text{Active Execution Time}$ | Intermittent, spiky traffic |
| Latency Optimization | $\text{Latency} \approx \text{Inference Time}$ (no cold start) | Real-time, user-facing apps |
| Scheduled Scaling | $t = \text{Cron Expression}$ | Predictable daily/weekly cycles |
| Provisioned Concurrency | $\text{Capacity} = \text{Constant} > 0$ | Low-latency serverless |
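The cost rule in the first row can be made concrete with a back-of-envelope comparison. The prices and traffic figures below are illustrative placeholders, not real AWS rates:

```python
# Rough monthly cost comparison: on-demand serverless vs. a provisioned
# instance. All prices here are made-up placeholders for illustration.

HOURS_PER_MONTH = 730

def serverless_monthly_cost(requests_per_month, seconds_per_request,
                            price_per_gb_second, memory_gb):
    """Serverless: cost is proportional to active execution time only."""
    active_gb_seconds = requests_per_month * seconds_per_request * memory_gb
    return active_gb_seconds * price_per_gb_second

def provisioned_monthly_cost(price_per_hour):
    """Provisioned: cost accrues every hour, busy or idle."""
    return HOURS_PER_MONTH * price_per_hour

# Low, intermittent traffic: serverless comes out far cheaper.
low_traffic = serverless_monthly_cost(10_000, 0.5, 0.00002, 2)
always_on = provisioned_monthly_cost(0.10)
print(f"serverless: ${low_traffic:.2f}/mo, provisioned: ${always_on:.2f}/mo")
```

Rerunning with high, steady traffic flips the comparison, which is exactly the "Big Idea" trade-off above.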

Hierarchical Outline

  1. Inference Models
    • Real-Time Inference: Managed endpoints with dedicated instances; supports auto-scaling.
    • Serverless Inference: Infrastructure managed by AWS; scales based on demand.
  2. Scaling Strategies
    • Predictable Patterns: Use Scheduled Scaling (e.g., more instances on Monday mornings).
    • Unpredictable Patterns: Use On-Demand or Target Tracking.
    • High Sensitivity: Use Provisioned Concurrency to prevent cold starts.
  3. Scaling Mechanisms
    • Target Tracking: Maintains a metric (e.g., InvocationsPerInstance).
    • Step Scaling: Increases capacity in "steps" based on the size of an alarm breach.
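The Target Tracking mechanism above maps onto an Application Auto Scaling policy for a SageMaker endpoint variant. The sketch below builds the request parameters; the endpoint and variant names are placeholders, and the commented-out boto3 call is what would actually apply the policy:

```python
# Build a Target Tracking scaling policy for a SageMaker endpoint variant.
# "my-endpoint" / "AllTraffic" are hypothetical names.

def target_tracking_policy(endpoint_name, variant_name, target_invocations):
    return {
        "PolicyName": "keep-invocations-per-instance-steady",
        "ServiceNamespace": "sagemaker",
        # Application Auto Scaling identifies the variant by this resource ID.
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Add/remove instances to hold this many invocations per instance.
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,   # scale in cautiously
            "ScaleOutCooldown": 60,   # scale out quickly
        },
    }

policy = target_tracking_policy("my-endpoint", "AllTraffic", 70.0)
# With credentials configured, you would apply it with:
# boto3.client("application-autoscaling").put_scaling_policy(**policy)
```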

Visual Anchors

Decision Logic for Resource Selection

(Decision diagram unavailable. The logic it depicts: predictable traffic → Scheduled Scaling; latency-sensitive workloads → Provisioned Concurrency; intermittent traffic → On-Demand Serverless.)

Performance vs. Demand Graph

This TikZ diagram illustrates how Provisioned capacity stays ahead of demand compared to reactive On-Demand scaling.

```latex
\begin{tikzpicture}
  % Axes
  \draw[->] (0,0) -- (6,0) node[right] {Time};
  \draw[->] (0,0) -- (0,4) node[above] {Capacity/Demand};
  % Demand curve (sinusoidal)
  \draw[thick, blue] (0,0.5) .. controls (1,3.5) and (3,0.5) .. (5,3)
    node[right] {Traffic Demand};
  % Provisioned concurrency (constant line, never dips below demand)
  \draw[red, dashed] (0,2.5) -- (5,2.5) node[right] {Provisioned Concurrency};
  % On-Demand scaling (reactive staircase, lags behind demand)
  \draw[green!60!black, ultra thick] (0,0.5) -- (1,0.5) -- (1,2) -- (2.5,2)
    -- (2.5,1) -- (4,1) -- (4,3) -- (5,3) node[right] {On-Demand Scaling};
\end{tikzpicture}
```

Definition-Example Pairs

  • Scheduled Scaling: Defining capacity based on time.
    • Example: An e-commerce site triples its endpoint instances at 8:00 AM every Tuesday to handle a weekly "Flash Sale" email blast.
  • On-Demand (Manual): Manually adjusting instance counts.
    • Example: A marketing team manually increases SageMaker instances just before the Super Bowl kickoff because the traffic spike is a one-time, unpredictable event.
  • Serverless On-Demand: AWS handles scaling automatically from zero.
    • Example: A telehealth app that sends appointment reminders only a few times a day; the system sleeps in between.
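The manual on-demand case can be scripted with the SageMaker `update_endpoint_weights_and_capacities` API rather than clicked through the console. A minimal sketch, assuming a hypothetical endpoint and variant name:

```python
# Build the request for a one-off manual scale-up of a SageMaker endpoint
# variant. "superbowl-endpoint" / "AllTraffic" are hypothetical names.

def manual_scale_request(endpoint_name, variant_name, instance_count):
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {
                "VariantName": variant_name,
                # The new dedicated instance count for this variant.
                "DesiredInstanceCount": instance_count,
            }
        ],
    }

req = manual_scale_request("superbowl-endpoint", "AllTraffic", 12)
# To apply (credentials required):
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(**req)
```

The drawback called out in the checkpoint answers applies here: someone has to remember to run the scale-down afterwards.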

Worked Examples

Scenario 1: The Chatbot Latency Issue

Problem: A company uses SageMaker Serverless Inference for a customer support chatbot. Users complain that the first message of the day takes 10 seconds to respond, while subsequent messages are fast.

Solution Step-by-Step:

  1. Identify the cause: This is a "Cold Start." The serverless container is spun down after inactivity.
  2. Select the tool: Provisioned Concurrency.
  3. Implementation: Set a minimum Provisioned Concurrency of 1. This keeps one container always warm.
  4. Result: The first user of the day receives an immediate response.
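Step 3 corresponds to setting `ProvisionedConcurrency` inside the `ServerlessConfig` of the endpoint configuration. A sketch of the production-variant block, with hypothetical model and config names:

```python
# Build a serverless production variant with one warm container.
# "chatbot-model" is a hypothetical model name.

def serverless_variant(model_name, memory_mb=2048, max_concurrency=20,
                       provisioned_concurrency=1):
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
            # Keeps this many containers warm so the first request of the
            # day skips the cold start.
            "ProvisionedConcurrency": provisioned_concurrency,
        },
    }

variant = serverless_variant("chatbot-model")
# To apply (credentials required):
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="chatbot-config", ProductionVariants=[variant])
```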

Scenario 2: Predictable Weekly Spikes

Problem: An analytics dashboard sees a 400% increase in traffic every Wednesday and Thursday, but is nearly idle on weekends.

Solution Step-by-Step:

  1. Analysis: The pattern is predictable and recurring.
  2. Select the tool: Scheduled Scaling.
  3. Configuration: Use a cron expression to scale up at 07:00 UTC on Wednesday and scale down at 18:00 UTC on Thursday.
  4. Benefit: Resources are ready before the spike hits, and costs are saved on weekends.
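Step 3 can be expressed as two Application Auto Scaling scheduled actions. Endpoint, variant, and action names below are placeholders, as are the chosen capacities; the six-field `cron(...)` syntax is the one Application Auto Scaling schedules use:

```python
# Build scheduled scale-up / scale-down actions for a SageMaker endpoint
# variant. "dashboard-endpoint" / "AllTraffic" are hypothetical names.

def scheduled_action(endpoint_name, variant_name, name, cron, min_cap, max_cap):
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": name,
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Schedule": cron,
        # At the scheduled time, capacity is forced into this range.
        "ScalableTargetAction": {"MinCapacity": min_cap, "MaxCapacity": max_cap},
    }

# Scale up before the Wednesday spike, scale down Thursday evening.
scale_up = scheduled_action("dashboard-endpoint", "AllTraffic",
                            "wed-scale-up", "cron(0 7 ? * WED *)", 4, 8)
scale_down = scheduled_action("dashboard-endpoint", "AllTraffic",
                              "thu-scale-down", "cron(0 18 ? * THU *)", 1, 2)
# To apply (credentials required):
# client = boto3.client("application-autoscaling")
# client.put_scheduled_action(**scale_up)
# client.put_scheduled_action(**scale_down)
```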

Checkpoint Questions

  1. Which scaling policy is best for maintaining a steady CPU utilization of 60%?
  2. What is the primary disadvantage of On-Demand Manual scaling?
  3. Why is Provisioned Concurrency preferred for fraud detection systems?
  4. When would On-Demand Serverless be more cost-effective than Real-Time Provisioned endpoints?

[!TIP] Answers: 1. Target Tracking. 2. It requires active monitoring and manual intervention. 3. Because fraud detection requires near-instantaneous (low-latency) inference. 4. When traffic is infrequent or has long idle periods.

Muddy Points & Cross-Refs

  • Provisioned Concurrency vs. Real-Time Endpoints: Both offer low latency. However, Real-Time endpoints use dedicated EC2-like instances, whereas Provisioned Concurrency is a feature of Serverless inference to mitigate cold starts.
  • Scaling vs. Instance Size: Scaling adds more units (horizontal), while instance size changes the power of one unit (vertical). This guide focuses on horizontal scaling.

Comparison Tables

| Feature | On-Demand (Serverless) | Provisioned Concurrency | Real-Time (Provisioned) |
| --- | --- | --- | --- |
| Cold Start | Yes (after idle) | No | No |
| Cost Model | Pay-per-use (duration) | Hourly fee + pay-per-use | Hourly fee per instance |
| Best For | Intermittent traffic | Spiky traffic, low latency | Constant, high-volume traffic |
| Management | Zero infrastructure | Partial (set warm pool) | Full (choose instance types) |

[!IMPORTANT] For the exam, remember: Predictable = Scheduled; Low Latency = Provisioned; Intermittent = On-Demand Serverless.
