On-Demand vs. Provisioned Resources: A Study Guide for AWS Machine Learning
This guide explores the architectural trade-offs between on-demand and provisioned resources within the AWS Machine Learning ecosystem, specifically focusing on Amazon SageMaker endpoints and serverless inference.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between On-Demand Serverless Inference and Provisioned Concurrency.
- Select the appropriate scaling policy (Target, Step, or Scheduled) based on workload predictability.
- Evaluate the cost-benefit ratio of pre-warming resources for low-latency applications.
- Identify use cases for manual on-demand scaling versus automated scaling.
Key Terms & Glossary
- Provisioned Concurrency: A setting that keeps a specific number of serverless instances "warm" and ready to respond immediately to API calls.
- Cold Start: The latency delay experienced when an on-demand resource must be initialized (downloading model, starting container) before processing a request.
- Target Tracking: An auto-scaling policy that adjusts capacity to maintain a specific metric, like 70% CPU utilization.
- Scheduled Scaling: A proactive scaling approach that adjusts capacity based on known time-based patterns using cron expressions.
- On-Demand Scaling: A manual or reactive increase in resources to handle unpredictable surges.
The "Big Idea"
The core challenge in cloud infrastructure is the Latency-Cost Trade-off. Provisioned resources provide the lowest latency by eliminating "cold starts" but incur costs even when idle. On-demand resources optimize for cost by only running when needed but introduce latency during initialization. Choosing the right one depends entirely on your traffic's predictability and your application's sensitivity to delay.
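The trade-off can be made concrete with a back-of-the-envelope calculation. The prices below are placeholders, not real AWS rates; the point is the break-even structure, not the dollar amounts.

```python
def monthly_cost_on_demand(requests, seconds_per_request, price_per_second):
    """On-demand serverless: pay only for compute time actually consumed."""
    return requests * seconds_per_request * price_per_second

def monthly_cost_provisioned(instances, hourly_rate, hours=730):
    """Provisioned: pay for every hour the capacity exists, busy or idle."""
    return instances * hourly_rate * hours

# Placeholder prices (NOT real AWS rates), chosen only to show the crossover.
sparse_traffic = monthly_cost_on_demand(10_000, 0.5, 0.0001)      # 0.50
heavy_traffic  = monthly_cost_on_demand(50_000_000, 0.5, 0.0001)  # 2500.00
one_instance   = monthly_cost_provisioned(1, 0.25)                # 182.50
```

With sparse traffic, on-demand is far cheaper than keeping even one instance warm; with constant heavy traffic, the flat hourly fee wins. The crossover point is where "depends entirely on your traffic's predictability" becomes a concrete calculation.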
Formula / Concept Box
| Concept | Metric / Rule | Primary Use Case |
|---|---|---|
| Cost Optimization | Pay only for compute actually used | Intermittent, spiky traffic |
| Latency Optimization | No cold start | Real-time, user-facing apps |
| Scheduled Scaling | Cron-based capacity changes | Predictable daily/weekly cycles |
| Provisioned Concurrency | Pre-warmed containers | Low-latency serverless |
Hierarchical Outline
- Inference Models
- Real-Time Inference: Managed endpoints with dedicated instances; supports auto-scaling.
- Serverless Inference: Infrastructure managed by AWS; scales based on demand.
- Scaling Strategies
- Predictable Patterns: Use Scheduled Scaling (e.g., more instances on Monday mornings).
- Unpredictable Patterns: Use On-Demand or Target Tracking.
- High Sensitivity: Use Provisioned Concurrency to prevent cold starts.
- Scaling Mechanisms
- Target Tracking: Maintains a metric (e.g., InvocationsPerInstance).
- Step Scaling: Increases capacity in "steps" based on the size of an alarm breach.
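A Target Tracking policy for a SageMaker variant is registered through the Application Auto Scaling API. The sketch below only builds the request payload rather than calling AWS; the endpoint and variant names are placeholders, and the payload shape should be verified against the current boto3 documentation.

```python
def target_tracking_policy(endpoint_name, variant_name, target_invocations=70.0):
    """Build a put_scaling_policy request that keeps InvocationsPerInstance near a target."""
    return {
        "PolicyName": f"{endpoint_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,  # wait before removing capacity
            "ScaleOutCooldown": 60,  # react quickly to load increases
        },
    }

policy = target_tracking_policy("churn-endpoint", "AllTraffic")
# With boto3 this would be sent as:
# boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

Note the asymmetric cooldowns: scaling out fast protects latency, while scaling in slowly avoids thrashing when traffic oscillates.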
Visual Anchors
Decision Logic for Resource Selection
Performance vs. Demand Graph
This TikZ diagram illustrates how Provisioned capacity stays ahead of demand compared to reactive On-Demand scaling.
\begin{tikzpicture}
  % Axes
  \draw [->] (0,0) -- (6,0) node[right] {Time};
  \draw [->] (0,0) -- (0,4) node[above] {Capacity/Demand};
  % Demand curve (sinusoidal)
  \draw [thick, blue] (0,0.5) .. controls (1,3.5) and (3,0.5) .. (5,3) node[right] {Traffic Demand};
  % Provisioned capacity (flat line, paid for even when demand dips)
  \draw [red, dashed] (0,2.5) -- (5,2.5) node[right] {Provisioned Concurrency};
  % On-Demand capacity (staircase, reacting after demand changes)
  \draw [green!60!black, ultra thick] (0,0.5) -- (1,0.5) -- (1,2) -- (2.5,2) -- (2.5,1) -- (4,1) -- (4,3) -- (5,3) node[right] {On-Demand Scaling};
\end{tikzpicture}
Definition-Example Pairs
- Scheduled Scaling: Defining capacity based on time.
- Example: An e-commerce site triples its endpoint instances at 8:00 AM every Tuesday to handle a weekly "Flash Sale" email blast.
- On-Demand (Manual): Manually adjusting instance counts.
- Example: A marketing team manually increases SageMaker instances just before the Super Bowl kickoff because the traffic spike is a one-time, unpredictable event.
- Serverless On-Demand: AWS handles scaling automatically from zero.
- Example: A telehealth app that sends appointment reminders only a few times a day; the system sleeps in between.
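The manual on-demand adjustment in the pairs above maps to a single SageMaker API call. The sketch builds the request payload only; the endpoint name and instance count are placeholders, and the call itself is shown in a comment.

```python
def manual_scale_request(endpoint_name, variant_name, desired_instances):
    """Build an update_endpoint_weights_and_capacities request for a one-off scale-up."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": variant_name, "DesiredInstanceCount": desired_instances}
        ],
    }

req = manual_scale_request("ad-ranker", "AllTraffic", 12)
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(**req)
```

The drawback flagged in the checkpoint answers applies here: someone has to remember to issue the reverse call after the Super Bowl ends, or the extra instances keep billing.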
Worked Examples
Scenario 1: The Chatbot Latency Issue
Problem: A company uses SageMaker Serverless Inference for a customer support chatbot. Users complain that the first message of the day takes 10 seconds to respond, while subsequent messages are fast.
Solution Step-by-Step:
- Identify the cause: This is a "Cold Start." The serverless container is spun down after inactivity.
- Select the tool: Provisioned Concurrency.
- Implementation: Set a minimum Provisioned Concurrency of 1. This keeps one container always warm.
- Result: The first user of the day receives an immediate response.
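The fix in Scenario 1 can be sketched as the serverless production-variant fragment passed to `create_endpoint_config`. The model name and sizing values are placeholders; one constraint worth remembering is that `ProvisionedConcurrency` must not exceed `MaxConcurrency`.

```python
def serverless_variant(model_name, memory_mb=2048, max_concurrency=5, provisioned=1):
    """Production variant for Serverless Inference with pre-warmed containers."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
            "ProvisionedConcurrency": provisioned,  # keeps this many containers warm
        },
    }

variant = serverless_variant("chatbot-model")
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="chatbot-config", ProductionVariants=[variant])
```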
Scenario 2: Predictable Weekly Spikes
Problem: An analytics dashboard sees a 400% increase in traffic every Wednesday and Thursday, but is nearly idle on weekends.
Solution Step-by-Step:
- Analysis: The pattern is predictable and recurring.
- Select the tool: Scheduled Scaling.
- Configuration: Use a cron expression to scale up at 07:00 UTC on Wednesday and scale down at 18:00 UTC on Thursday.
- Benefit: Resources are ready before the spike hits, and costs are saved on weekends.
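The configuration step above can be sketched as two `put_scheduled_action` payloads for the Application Auto Scaling service. The endpoint name and capacities are placeholders; the cron strings use the six-field `cron(min hour dom month dow year)` format that Application Auto Scaling expects.

```python
def scheduled_action(endpoint_name, variant_name, name, cron, min_cap, max_cap):
    """Build a put_scheduled_action request that adjusts capacity on a timetable."""
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": name,
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Schedule": cron,
        "ScalableTargetAction": {"MinCapacity": min_cap, "MaxCapacity": max_cap},
    }

# Scale up before the Wednesday spike, back down after Thursday.
scale_up   = scheduled_action("dashboard", "AllTraffic", "wed-scale-up",
                              "cron(0 7 ? * WED *)", 4, 8)
scale_down = scheduled_action("dashboard", "AllTraffic", "thu-scale-down",
                              "cron(0 18 ? * THU *)", 1, 2)
# boto3.client("application-autoscaling").put_scheduled_action(**scale_up)
```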
Checkpoint Questions
- Which scaling policy is best for maintaining a steady CPU utilization of 60%?
- What is the primary disadvantage of On-Demand Manual scaling?
- Why is Provisioned Concurrency preferred for fraud detection systems?
- When would On-Demand Serverless be more cost-effective than Real-Time Provisioned endpoints?
> [!TIP]
> Answers: 1. Target Tracking. 2. It requires active monitoring and manual intervention. 3. Fraud detection requires near-instantaneous (low-latency) inference. 4. When traffic is infrequent or has long idle periods.
Muddy Points & Cross-Refs
- Provisioned Concurrency vs. Real-Time Endpoints: Both offer low latency. However, Real-Time endpoints use dedicated EC2-like instances, whereas Provisioned Concurrency is a feature of Serverless inference to mitigate cold starts.
- Scaling vs. Instance Size: Scaling adds more units (horizontal), while instance size changes the power of one unit (vertical). This guide focuses on horizontal scaling.
Comparison Tables
| Feature | On-Demand (Serverless) | Provisioned Concurrency | Real-Time (Provisioned) |
|---|---|---|---|
| Cold Start | Yes (after idle) | No | No |
| Cost Model | Pay-per-use (Duration) | Hourly fee + Pay-per-use | Hourly fee per instance |
| Best For | Intermittent traffic | Spiky traffic, low latency | Constant, high-volume traffic |
| Management | Zero infrastructure | Partial (set warm pool) | Full (choose instance types) |
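The comparison table's decision rules can be condensed into a small helper, useful for self-testing before the exam. The category names are this guide's own labels, not AWS terminology.

```python
def recommend(traffic, latency_sensitive):
    """Map a traffic pattern to a deployment option, per the comparison table.

    traffic: one of "intermittent", "spiky", "predictable", "constant".
    """
    if traffic == "constant":
        return "Real-Time endpoint with Target Tracking"
    if traffic == "predictable":
        return "Real-Time endpoint with Scheduled Scaling"
    if latency_sensitive:
        return "Serverless with Provisioned Concurrency"
    return "On-Demand Serverless"
```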
> [!IMPORTANT]
> For the exam, remember: Predictable = Scheduled; Low Latency = Provisioned; Intermittent = On-Demand Serverless.