Study Guide: Deployment Strategies and Rollback Actions in AWS ML

Learning Objectives

After studying this guide, you should be able to:

Differentiate between All At Once, Canary, and Linear deployment strategies.
Identify the role of Amazon CloudWatch alarms in triggering automatic rollbacks.
Configure traffic-shifting parameters such as CanarySize and WaitIntervalInSeconds.
Evaluate the trade-offs between deployment speed and risk management for ML models.

Key Terms & Glossary

Blue Fleet: The stable version of the model endpoint that is currently serving production traffic.
Green Fleet: The new version of the model endpoint being deployed to replace the old one.
Baking Period: A predetermined interval during which CloudWatch alarms monitor the new fleet for errors or performance issues.
Traffic Shifting: The process of redirecting incoming requests from the blue fleet to the green fleet.
Rollback: The automated process of reverting all traffic back to the blue fleet if the green fleet fails its health checks.

The "Big Idea"

In machine learning, deploying a new model is high-risk because its behavior in production may differ from what was observed during training and validation. Blue/Green deployment strategies reduce that risk: traffic is shifted to the new fleet in controlled steps while CloudWatch alarms monitor it. If the new model version (Green) performs poorly, SageMaker automatically reverts traffic to the previous stable version (Blue), minimizing user impact and downtime.

Formula / Concept Box

| Parameter | Description | Valid Values / Constraints |
| --- | --- | --- |
| `Type` | Method of measuring traffic size | `CAPACITY_PERCENT` or `INSTANCE_COUNT` |
| `CanarySize` | Portion of traffic sent to Canary | Must be $\le 50\%$ of green fleet capacity |
| `WaitIntervalInSeconds` | Time to monitor before final shift | Measured in seconds (e.g., 600 s = 10 min) |
| Rollback trigger | Condition to revert traffic | CloudWatch alarm in the `ALARM` state |
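The constraints in the box can be written down as a small check. This is a minimal sketch in plain Python, not part of any AWS SDK: `validate_canary_config` and `fleet_instances` are hypothetical names, and the dictionary keys are assumed to mirror the parameter names in the table above.

```python
def validate_canary_config(routing: dict, fleet_instances: int) -> list:
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    size = routing.get("CanarySize", {})
    size_type, value = size.get("Type"), size.get("Value")

    if size_type not in ("CAPACITY_PERCENT", "INSTANCE_COUNT"):
        problems.append("Type must be CAPACITY_PERCENT or INSTANCE_COUNT")
    elif not isinstance(value, int) or value <= 0:
        problems.append("CanarySize Value must be a positive integer")
    elif size_type == "CAPACITY_PERCENT" and value > 50:
        problems.append("CanarySize must be at most 50% of capacity")
    elif size_type == "INSTANCE_COUNT" and value * 2 > fleet_instances:
        problems.append("CanarySize must be at most half of the fleet")

    wait = routing.get("WaitIntervalInSeconds")
    if not isinstance(wait, int) or wait < 0:
        problems.append("WaitIntervalInSeconds must be a non-negative integer")
    return problems


# Hypothetical example: 30% canary, 10-minute baking period.
config = {
    "Type": "CANARY",
    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 30},
    "WaitIntervalInSeconds": 600,
}
print(validate_canary_config(config, fleet_instances=10))
```

Running a check like this before calling the API catches an oversized canary locally instead of at deployment time.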

Hierarchical Outline

Blue/Green Deployment Overview
- Requires Amazon CloudWatch for health monitoring.
- Objective: Zero-downtime updates with automated recovery.
Traffic-Shifting Modes
- All At Once
  - Instant 100% transition to Green fleet.
  - Highest speed, highest risk.
- Canary
  - Two-step process: subset first, then the remainder.
  - Balanced risk management.
- Linear
  - Most granular control.
  - Incremental steps until 100% is reached.
Rollback Mechanisms
- Triggered automatically by alarms during the baking period.
- Immediate redirection of 100% traffic back to Blue.
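
The three modes and the rollback rule in this outline can be illustrated with a toy simulation. This is a sketch for intuition only and does not call AWS; `traffic_steps` and `simulate_rollout` are hypothetical helpers, and the mode strings are labels chosen for this example.

```python
def traffic_steps(mode: str, step_percent: int = 10) -> list:
    """Cumulative percentage of traffic on the green fleet after each shift."""
    if mode == "ALL_AT_ONCE":
        return [100]
    if mode == "CANARY":
        # Two steps: the canary slice first, then the remainder.
        return [step_percent, 100]
    if mode == "LINEAR":
        # Equal increments until 100% is reached (the last step is capped).
        return list(range(step_percent, 100, step_percent)) + [100]
    raise ValueError(f"unknown mode: {mode}")


def simulate_rollout(steps: list, alarm_fired: list) -> int:
    """Return the final green-fleet traffic percentage.

    alarm_fired[i] says whether an alarm entered ALARM during the baking
    period that follows shift i. Any alarm sends all traffic back to blue.
    """
    if any(alarm_fired[: len(steps)]):
        return 0  # rollback: 100% of traffic returns to the blue fleet
    return steps[-1]


canary = traffic_steps("CANARY", 30)
print(canary, simulate_rollout(canary, [False, False]))
print(canary, simulate_rollout(canary, [True, False]))
```

The Linear schedule exposes more baking periods than Canary, which is why it offers more chances to catch a problem before all traffic has moved.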

Visual Anchors

Traffic Shifting Comparison


Linear Progression Visualization

[Figure: TikZ diagram of the gradual traffic increase in a Linear strategy.]

Definition-Example Pairs

All At Once: Shifting all traffic simultaneously.
- Example: An internal-only experimental tool where 10-minute downtime or bugs are acceptable for the sake of speed.
Canary Deployment: Sending a small "canary" group of users to the new version first.
- Example: A retail site directing 5% of users to a new recommendation engine to ensure it doesn't crash before moving 100% of users over.
Linear Deployment: Shifting traffic in equal increments at fixed intervals (for example, 10% every 5 minutes).
- Example: A high-traffic banking app where stability is critical and gradual load testing is required during the rollout.

Worked Examples

Scenario: Configuring a Canary Deployment in SageMaker

Goal: Deploy a new model version using a 30% Canary shift with a 10-minute baking period.

Define the Canary Size: Set Type to CAPACITY_PERCENT and Value to 30.
- Note: The canary size must be $\le 50\%$ of the fleet's capacity, as listed in the Concept Box.
Define the Wait Interval: Set WaitIntervalInSeconds to 600 (10 minutes).
Configure Alarms: Attach a CloudWatch Alarm to the endpoint that monitors Invocation5XXErrors.
Execution:
- SageMaker shifts 30% of traffic to the Green fleet.
- For 10 minutes, both fleets run while SageMaker monitors the alarm.
- If Invocation5XXErrors exceeds its threshold and the alarm enters the ALARM state, SageMaker automatically returns all traffic to Blue and terminates the Green fleet.
- If 10 minutes pass without an alarm, 100% of traffic moves to Green.
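
The scenario above maps onto a single `update_endpoint` call in boto3. The sketch below is one way to write it, under stated assumptions: the endpoint, endpoint configuration, and alarm names are hypothetical placeholders, the new endpoint configuration and the CloudWatch alarm are assumed to exist already, and `build_canary_update_request` is a helper invented for this example.

```python
def build_canary_update_request(endpoint_name, new_config_name, alarm_name):
    """Build keyword arguments for SageMaker's update_endpoint call."""
    return {
        "EndpointName": endpoint_name,
        "EndpointConfigName": new_config_name,  # config that defines the green fleet
        "DeploymentConfig": {
            "BlueGreenUpdatePolicy": {
                "TrafficRoutingConfiguration": {
                    "Type": "CANARY",
                    # Step 1: shift 30% of capacity to the green fleet.
                    "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 30},
                    # Step 2: bake for 10 minutes before shifting the rest.
                    "WaitIntervalInSeconds": 600,
                },
            },
            # Step 3: roll back if this alarm enters the ALARM state.
            "AutoRollbackConfiguration": {"Alarms": [{"AlarmName": alarm_name}]},
        },
    }


def deploy(request):
    """Send the request. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request can be built offline

    return boto3.client("sagemaker").update_endpoint(**request)


# Hypothetical names, for illustration only.
request = build_canary_update_request(
    "my-endpoint", "my-endpoint-config-v2", "my-endpoint-5xx-alarm"
)
```

Separating the request builder from the call keeps the configuration reviewable and testable without touching a live endpoint.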

Checkpoint Questions

1. What is the maximum CanarySize percentage for a SageMaker deployment?
2. Which AWS service is strictly required to implement automatic rollbacks in a Blue/Green strategy?
3. True or False: In an "All At Once" deployment, the old (Blue) fleet is deleted immediately after the traffic shifts to the Green fleet.
4. What is the primary advantage of a Linear deployment over a Canary deployment?

Answers

1. 50%.
2. Amazon CloudWatch.
3. False. The Blue fleet is kept during the baking period to allow for a potential rollback.
4. Granularity; it allows for more incremental testing and load monitoring rather than just a two-step shift.

Muddy Points & Cross-Refs

Capacity vs. Instance Count: You can size each traffic shift as a percentage of the fleet's capacity (CAPACITY_PERCENT) or as a specific number of instances (INSTANCE_COUNT). Use percentage for auto-scaling environments, where the instance count changes over time.
Baking Period vs. Training: Do not confuse the "baking period" (deployment monitoring) with model "training time."
Cross-Reference: See Domain 4 (Monitoring) for specific CloudWatch metrics that make good rollback triggers (e.g., ModelLatency, Invocation5XXErrors).
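
A rollback trigger of this kind is an ordinary CloudWatch alarm on an endpoint metric. The following is a hedged sketch using boto3's `put_metric_alarm`; the alarm name, endpoint name, variant name, threshold, and period are illustrative assumptions rather than recommended values, and `build_5xx_alarm` is a hypothetical helper.

```python
def build_5xx_alarm(alarm_name, endpoint_name, variant_name, threshold=1.0):
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocation5XXErrors",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 60,  # evaluate one-minute windows (assumed value)
        "EvaluationPeriods": 1,
        "Threshold": threshold,  # assumed value; tune per workload
        "ComparisonOperator": "GreaterThanThreshold",
        # Assumption: no data means no errors, so do not trip the alarm.
        "TreatMissingData": "notBreaching",
    }


def create_alarm(alarm_kwargs):
    """Create the alarm. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the arguments can be built offline

    return boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)


# Hypothetical names, for illustration only.
alarm = build_5xx_alarm("my-endpoint-5xx-alarm", "my-endpoint", "AllTraffic")
```

The alarm name created here is what a deployment's auto-rollback configuration would reference.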

Comparison Tables

| Strategy | Shift Steps | Risk Level | Speed | Best For... |
| --- | --- | --- | --- | --- |
| All At Once | 1 Step (100%) | High | Fast | Non-critical apps / Test environments |
| Canary | 2 Steps (X% then 100%) | Medium | Medium | Standard production model updates |
| Linear | Multiple Steps (Incremental) | Low | Slow | Mission-critical, high-traffic applications |