On-Demand vs. Provisioned Resources: A Study Guide for AWS Machine Learning
This guide explores the architectural trade-offs between on-demand and provisioned resources within the AWS Machine Learning ecosystem, specifically focusing on Amazon SageMaker endpoints and serverless inference.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between On-Demand Serverless Inference and Provisioned Concurrency.
- Select the appropriate scaling policy (Target, Step, or Scheduled) based on workload predictability.
- Evaluate the cost-benefit ratio of pre-warming resources for low-latency applications.
- Identify use cases for manual on-demand scaling versus automated scaling.
Key Terms & Glossary
- Provisioned Concurrency: A setting that keeps a specific number of serverless instances "warm" and ready to respond immediately to API calls.
- Cold Start: The latency delay experienced when an on-demand resource must be initialized (downloading model, starting container) before processing a request.
- Target Tracking: An auto-scaling policy that adjusts capacity to maintain a specific metric, like 70% CPU utilization.
- Scheduled Scaling: A proactive scaling approach that adjusts capacity based on known time-based patterns using cron expressions.
- On-Demand Scaling: A manual or reactive increase in resources to handle unpredictable surges.
The "Big Idea"
The core challenge in cloud infrastructure is the Latency-Cost Trade-off. Provisioned resources provide the lowest latency by eliminating "cold starts" but incur costs even when idle. On-demand resources optimize for cost by only running when needed but introduce latency during initialization. Choosing the right one depends entirely on your traffic's predictability and your application's sensitivity to delay.
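The trade-off can be made concrete with a back-of-the-envelope calculation. The prices below are placeholders, not real AWS rates; the point is the break-even structure, not the dollar amounts.

```python
def monthly_cost_on_demand(requests, seconds_per_request, price_per_second):
    """On-demand serverless: pay only for compute time actually consumed."""
    return requests * seconds_per_request * price_per_second

def monthly_cost_provisioned(instances, hourly_rate, hours=730):
    """Provisioned: pay for every hour the capacity exists, busy or idle."""
    return instances * hourly_rate * hours

# Placeholder prices (NOT real AWS rates), chosen only to show the crossover.
sparse_traffic = monthly_cost_on_demand(10_000, 0.5, 0.0001)      # 0.50
heavy_traffic  = monthly_cost_on_demand(50_000_000, 0.5, 0.0001)  # 2500.00
one_instance   = monthly_cost_provisioned(1, 0.25)                # 182.50
```

With sparse traffic, on-demand is far cheaper than keeping even one instance warm; with constant heavy traffic, the flat hourly fee wins. The crossover point is where "depends entirely on your traffic's predictability" becomes a concrete calculation.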
Formula / Concept Box
| Concept | Metric / Rule | Primary Use Case |
|---|---|---|
| Cost Optimization | Pay only for compute actually used | Intermittent, spiky traffic |
| Latency Optimization | No cold start | Real-time, user-facing apps |
| Scheduled Scaling | Cron-based capacity changes | Predictable daily/weekly cycles |
| Provisioned Concurrency | Pre-warmed containers | Low-latency serverless |
Hierarchical Outline
- Inference Models
- Real-Time Inference: Managed endpoints with dedicated instances; supports auto-scaling.
- Serverless Inference: Infrastructure managed by AWS; scales based on demand.
- Scaling Strategies
- Predictable Patterns: Use Scheduled Scaling (e.g., more instances on Monday mornings).
- Unpredictable Patterns: Use On-Demand or Target Tracking.
- High Sensitivity: Use Provisioned Concurrency to prevent cold starts.
- Scaling Mechanisms
- Target Tracking: Maintains a metric (e.g., InvocationsPerInstance).
- Step Scaling: Increases capacity in "steps" based on the size of an alarm breach.
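A Target Tracking policy for a SageMaker variant is registered through the Application Auto Scaling API. The sketch below only builds the request payload rather than calling AWS; the endpoint and variant names are placeholders, and the payload shape should be verified against the current boto3 documentation.

```python
def target_tracking_policy(endpoint_name, variant_name, target_invocations=70.0):
    """Build a put_scaling_policy request that keeps InvocationsPerInstance near a target."""
    return {
        "PolicyName": f"{endpoint_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,  # wait before removing capacity
            "ScaleOutCooldown": 60,  # react quickly to load increases
        },
    }

policy = target_tracking_policy("churn-endpoint", "AllTraffic")
# With boto3 this would be sent as:
# boto3.client("application-autoscaling").put_scaling_policy(**policy)
```

Note the asymmetric cooldowns: scaling out fast protects latency, while scaling in slowly avoids thrashing when traffic oscillates.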
Visual Anchors
Decision Logic for Resource Selection
Performance vs. Demand Graph
This TikZ diagram illustrates how Provisioned capacity stays ahead of demand compared to reactive On-Demand scaling.
\begin{tikzpicture}
  % Axes
  \draw [->] (0,0) -- (6,0) node[right] {Time};
  \draw [->] (0,0) -- (0,4) node[above] {Capacity/Demand};
  % Demand curve (sinusoidal)
  \draw [thick, blue] (0,0.5) .. controls (1,3.5) and (3,0.5) .. (5,3) node[right] {Traffic Demand};
  % Provisioned capacity (flat line, paid for even when demand dips)
  \draw [red, dashed] (0,2.5) -- (5,2.5) node[right] {Provisioned Concurrency};
  % On-Demand capacity (staircase, reacting after demand changes)
  \draw [green!60!black, ultra thick] (0,0.5) -- (1,0.5) -- (1,2) -- (2.5,2) -- (2.5,1) -- (4,1) -- (4,3) -- (5,3) node[right] {On-Demand Scaling};
\end{tikzpicture}
Definition-Example Pairs
- Scheduled Scaling: Defining capacity based on time.
- Example: An e-commerce site triples its endpoint instances at 8:00 AM every Tuesday to handle a weekly "Flash Sale" email blast.
- On-Demand (Manual): Manually adjusting instance counts.
- Example: A marketing team manually increases SageMaker instances just before the Super Bowl kickoff because the traffic spike is a one-time, unpredictable event.
- Serverless On-Demand: AWS handles scaling automatically from zero.
- Example: A telehealth app that sends appointment reminders only a few times a day; the system sleeps in between.
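The manual on-demand adjustment in the pairs above maps to a single SageMaker API call. The sketch builds the request payload only; the endpoint name and instance count are placeholders, and the call itself is shown in a comment.

```python
def manual_scale_request(endpoint_name, variant_name, desired_instances):
    """Build an update_endpoint_weights_and_capacities request for a one-off scale-up."""
    return {
        "EndpointName": endpoint_name,
        "DesiredWeightsAndCapacities": [
            {"VariantName": variant_name, "DesiredInstanceCount": desired_instances}
        ],
    }

req = manual_scale_request("ad-ranker", "AllTraffic", 12)
# boto3.client("sagemaker").update_endpoint_weights_and_capacities(**req)
```

The drawback flagged in the checkpoint answers applies here: someone has to remember to issue the reverse call after the Super Bowl ends, or the extra instances keep billing.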
Worked Examples
Scenario 1: The Chatbot Latency Issue
Problem: A company uses SageMaker Serverless Inference for a customer support chatbot. Users complain that the first message of the day takes 10 seconds to respond, while subsequent messages are fast.
Solution Step-by-Step:
- Identify the cause: This is a "Cold Start." The serverless container is spun down after inactivity.
- Select the tool: Provisioned Concurrency.
- Implementation: Set a minimum Provisioned Concurrency of 1. This keeps one container always warm.
- Result: The first user of the day receives an immediate response.
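The fix in Scenario 1 can be sketched as the serverless production-variant fragment passed to `create_endpoint_config`. The model name and sizing values are placeholders; one constraint worth remembering is that `ProvisionedConcurrency` must not exceed `MaxConcurrency`.

```python
def serverless_variant(model_name, memory_mb=2048, max_concurrency=5, provisioned=1):
    """Production variant for Serverless Inference with pre-warmed containers."""
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
            "ProvisionedConcurrency": provisioned,  # keeps this many containers warm
        },
    }

variant = serverless_variant("chatbot-model")
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="chatbot-config", ProductionVariants=[variant])
```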
Scenario 2: Predictable Weekly Spikes
Problem: An analytics dashboard sees a 400% increase in traffic every Wednesday and Thursday, but is nearly idle on weekends.
Solution Step-by-Step:
- Analysis: The pattern is predictable and recurring.
- Select the tool: Scheduled Scaling.
- Configuration: Use a cron expression to scale up at 07:00 UTC on Wednesday and scale down at 18:00 UTC on Thursday.
- Benefit: Resources are ready before the spike hits, and costs are saved on weekends.
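The configuration step above can be sketched as two `put_scheduled_action` payloads for the Application Auto Scaling service. The endpoint name and capacities are placeholders; the cron strings use the six-field `cron(min hour dom month dow year)` format that Application Auto Scaling expects.

```python
def scheduled_action(endpoint_name, variant_name, name, cron, min_cap, max_cap):
    """Build a put_scheduled_action request that adjusts capacity on a timetable."""
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": name,
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Schedule": cron,
        "ScalableTargetAction": {"MinCapacity": min_cap, "MaxCapacity": max_cap},
    }

# Scale up before the Wednesday spike, back down after Thursday.
scale_up   = scheduled_action("dashboard", "AllTraffic", "wed-scale-up",
                              "cron(0 7 ? * WED *)", 4, 8)
scale_down = scheduled_action("dashboard", "AllTraffic", "thu-scale-down",
                              "cron(0 18 ? * THU *)", 1, 2)
# boto3.client("application-autoscaling").put_scheduled_action(**scale_up)
```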
Checkpoint Questions
- Which scaling policy is best for maintaining a steady CPU utilization of 60%?
- What is the primary disadvantage of On-Demand Manual scaling?
- Why is Provisioned Concurrency preferred for fraud detection systems?
- When would On-Demand Serverless be more cost-effective than Real-Time Provisioned endpoints?
> [!TIP]
> Answers: 1. Target Tracking. 2. It requires active monitoring and manual intervention. 3. Fraud detection requires near-instantaneous (low-latency) inference. 4. When traffic is infrequent or has long idle periods.
Muddy Points & Cross-Refs
- Provisioned Concurrency vs. Real-Time Endpoints: Both offer low latency. However, Real-Time endpoints use dedicated EC2-like instances, whereas Provisioned Concurrency is a feature of Serverless inference to mitigate cold starts.
- Scaling vs. Instance Size: Scaling adds more units (horizontal), while instance size changes the power of one unit (vertical). This guide focuses on horizontal scaling.
Comparison Tables
| Feature | On-Demand (Serverless) | Provisioned Concurrency | Real-Time (Provisioned) |
|---|---|---|---|
| Cold Start | Yes (after idle) | No | No |
| Cost Model | Pay-per-use (Duration) | Hourly fee + Pay-per-use | Hourly fee per instance |
| Best For | Intermittent traffic | Spiky traffic, low latency | Constant, high-volume traffic |
| Management | Zero infrastructure | Partial (set warm pool) | Full (choose instance types) |
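The comparison table's decision rules can be condensed into a small helper, useful for self-testing before the exam. The category names are this guide's own labels, not AWS terminology.

```python
def recommend(traffic, latency_sensitive):
    """Map a traffic pattern to a deployment option, per the comparison table.

    traffic: one of "intermittent", "spiky", "predictable", "constant".
    """
    if traffic == "constant":
        return "Real-Time endpoint with Target Tracking"
    if traffic == "predictable":
        return "Real-Time endpoint with Scheduled Scaling"
    if latency_sensitive:
        return "Serverless with Provisioned Concurrency"
    return "On-Demand Serverless"
```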
> [!IMPORTANT]
> For the exam, remember: Predictable = Scheduled; Low Latency = Provisioned; Intermittent = On-Demand Serverless.