Study Guide: Deployment Strategies and Rollback Actions in AWS ML
Deployment strategies and rollback actions (for example, blue/green, canary, linear)
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between All At Once, Canary, and Linear deployment strategies.
- Identify the role of Amazon CloudWatch alarms in triggering automatic rollbacks.
- Configure traffic-shifting parameters such as `CanarySize` and `WaitIntervalInSeconds`.
- Evaluate the trade-offs between deployment speed and risk management for ML models.
Key Terms & Glossary
- Blue Fleet: The stable version of the model endpoint currently serving production traffic.
- Green Fleet: The new version of the model endpoint being deployed to replace the old one.
- Baking Period: A predetermined interval during which CloudWatch alarms monitor the new fleet for errors or performance issues.
- Traffic Shifting: The process of redirecting incoming requests from the blue fleet to the green fleet.
- Rollback: The automated process of reverting all traffic back to the blue fleet if the green fleet fails its health checks.
The "Big Idea"
In machine learning, deploying a new model is high-risk because a model's behavior on production traffic can differ from what was measured during training. Blue/green deployment strategies make the transition safer. With automated traffic shifting and CloudWatch alarm monitoring, SageMaker automatically reverts to the previous stable version (Blue) if the new model version (Green) performs poorly, minimizing user impact and downtime.
Formula / Concept Box
| Parameter | Description | Valid Values / Constraints |
|---|---|---|
| `Type` | Method of measuring traffic size | `CAPACITY_PERCENT` or `INSTANCE_COUNT` |
| `CanarySize` | Portion of traffic sent to Canary | Must be ≤ 50% of green fleet capacity |
| `WaitIntervalInSeconds` | Time to monitor before final shift | Measured in seconds (e.g., 600 s = 10 min) |
| Rollback Trigger | Condition to revert traffic | CloudWatch alarm in `ALARM` state |
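The constraints in this table can be checked before a deployment is submitted. The following is a minimal Python sketch, not an AWS API: `validate_canary_size` and its `green_instance_count` argument are hypothetical names, and the 50% cap is the limit this guide cites for canary size. For `INSTANCE_COUNT`, the sketch assumes the cap means half of the green fleet's instances, rounded down.

```python
import math

VALID_SIZE_TYPES = {"CAPACITY_PERCENT", "INSTANCE_COUNT"}
MAX_CANARY_FRACTION = 0.5  # the 50% cap cited in this guide


def validate_canary_size(size_type: str, value: int, green_instance_count: int) -> dict:
    """Return a {'Type', 'Value'} dict, or raise ValueError if the size is invalid.

    Hypothetical helper for illustration; it does not call AWS.
    """
    if size_type not in VALID_SIZE_TYPES:
        raise ValueError(f"Type must be one of {sorted(VALID_SIZE_TYPES)}")
    if value <= 0:
        raise ValueError("Value must be positive")

    if size_type == "CAPACITY_PERCENT":
        limit = int(MAX_CANARY_FRACTION * 100)  # 50 percent
    else:
        # Assumption: at most half of the green fleet's instances, rounded down.
        limit = math.floor(MAX_CANARY_FRACTION * green_instance_count)

    if value > limit:
        raise ValueError(f"{size_type} value {value} exceeds the limit of {limit}")
    return {"Type": size_type, "Value": value}
```

Calling `validate_canary_size("CAPACITY_PERCENT", 30, 4)` passes, while a value of `60` raises `ValueError` because it exceeds the cap.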
Hierarchical Outline
- Blue/Green Deployment Overview
  - Requires Amazon CloudWatch for health monitoring.
  - Objective: Zero-downtime updates with automated recovery.
- Traffic-Shifting Modes
  - All At Once
    - Instant 100% transition to Green fleet.
    - Highest speed, highest risk.
  - Canary
    - Two-step process: subset first, then the remainder.
    - Balanced risk management.
  - Linear
    - Most granular control.
    - Incremental steps until 100% is reached.
- Rollback Mechanisms
  - Triggered automatically by alarms during the baking period.
  - Immediate redirection of 100% traffic back to Blue.
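The three traffic-shifting modes and the rollback rule in this outline can be simulated in a few lines of Python. This is an illustrative model only: `green_traffic_schedule` and `run_rollout` are hypothetical names, and the simulation assumes one baking period after every shift, with an alarm in any baking period sending all traffic back to Blue.

```python
def green_traffic_schedule(mode: str, step_percent: int = 100) -> list[int]:
    """Return the cumulative percentage of traffic on Green after each shift."""
    if mode == "ALL_AT_ONCE":
        return [100]
    if not 0 < step_percent < 100:
        raise ValueError("step_percent must be between 1 and 99")
    if mode == "CANARY":
        # Two steps: the canary subset first, then the remainder.
        return [step_percent, 100]
    if mode == "LINEAR":
        # Equal increments; the final step is capped at 100%.
        return list(range(step_percent, 100, step_percent)) + [100]
    raise ValueError(f"Unknown mode: {mode}")


def run_rollout(schedule: list[int], alarm_fired: list[bool]) -> list[int]:
    """Replay a schedule; an alarm during any baking period rolls back to 0%."""
    if len(schedule) != len(alarm_fired):
        raise ValueError("Provide one alarm observation per shift")
    history = []
    for green_percent, fired in zip(schedule, alarm_fired):
        history.append(green_percent)
        if fired:
            history.append(0)  # rollback: all traffic returns to Blue
            break
    return history


print(green_traffic_schedule("CANARY", 30))
print(green_traffic_schedule("LINEAR", 25))
print(run_rollout(green_traffic_schedule("CANARY", 30), [True, False]))
```

The last call models a canary whose alarm fires during the first baking period, so the rollout stops before the remaining traffic is shifted.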
Visual Anchors
Traffic Shifting Comparison (diagram not reproduced)
Linear Progression Visualization
This TikZ diagram represents the gradual traffic increase in a Linear strategy.
\begin{tikzpicture}[scale=0.8]
  % Axes
  \draw[->] (0,0) -- (6,0) node[right] {Time};
  \draw[->] (0,0) -- (0,4) node[above] {Traffic to Green (\%)};
  % Step function: traffic rises in four equal increments
  \draw[thick, blue] (0,0) -- (1,0) -- (1,1) -- (2,1) -- (2,2) -- (3,2) -- (3,3) -- (4,3) -- (4,4) -- (5,4);
  \node at (0.5,-0.3) {Start};
  \node at (4.5,-0.3) {End};
  % Reference line and tick labels
  \draw[dashed] (0,4) -- (4,4);
  \node at (-0.5,4) {100};
  \node at (-0.5,2) {50};
\end{tikzpicture}
Definition-Example Pairs
- All At Once: Shifting all traffic simultaneously.
  - Example: An internal-only experimental tool where 10 minutes of downtime or bugs are acceptable for the sake of speed.
- Canary Deployment: Sending a small "canary" group of users to the new version first.
  - Example: A retail site directing 5% of users to a new recommendation engine to ensure it doesn't crash before moving 100% of users over.
- Linear Deployment: Shifting traffic in equal increments at fixed intervals, for example 10% every 5 minutes.
  - Example: A high-traffic banking app where stability is critical and gradual load testing is required during the rollout.
Worked Examples
Scenario: Configuring a Canary Deployment in SageMaker
Goal: Deploy a new model version using a 30% Canary shift with a 10-minute baking period.
- Define the Canary Size: Set `Type` to `CAPACITY_PERCENT` and `Value` to `30`.
  - Note: Ensure this is no more than 50%, as per AWS best practices.
- Define the Wait Interval: Set `WaitIntervalInSeconds` to `600` (10 minutes).
- Configure Alarms: Attach a CloudWatch Alarm to the endpoint that monitors `Invocation5XXErrors`.
- Execution:
  - SageMaker shifts 30% of traffic to the Green fleet.
  - For 10 minutes, both fleets run. SageMaker monitors the alarm.
  - If `Invocation5XXErrors` exceeds the alarm threshold, SageMaker automatically returns all traffic to Blue and terminates the Green fleet.
  - If 10 minutes pass without an alarm, 100% of traffic moves to Green.
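The scenario's settings can be written out as the deployment configuration that would accompany an endpoint update. This is a sketch under assumptions: the wrapper keys `BlueGreenUpdatePolicy`, `TrafficRoutingConfiguration`, and `AutoRollbackConfiguration` do not appear in this guide and reflect the shape of the `DeploymentConfig` argument to boto3's SageMaker `update_endpoint` call as commonly documented, so verify them against the current API reference. The alarm, endpoint, and endpoint-config names are hypothetical.

```python
ALARM_NAME = "green-fleet-5xx-alarm"  # hypothetical CloudWatch alarm name

deployment_config = {
    "BlueGreenUpdatePolicy": {
        "TrafficRoutingConfiguration": {
            "Type": "CANARY",
            # 30% canary shift, expressed as a share of capacity
            "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 30},
            # 10-minute baking period before the remaining traffic moves
            "WaitIntervalInSeconds": 600,
        },
    },
    "AutoRollbackConfiguration": {
        # Rollback is triggered if this alarm enters the ALARM state
        "Alarms": [{"AlarmName": ALARM_NAME}],
    },
}

# Assumed usage (requires boto3 and AWS credentials; names are placeholders):
# import boto3
# boto3.client("sagemaker").update_endpoint(
#     EndpointName="my-endpoint",
#     EndpointConfigName="my-endpoint-config-v2",
#     DeploymentConfig=deployment_config,
# )
```

Keeping the configuration as plain data makes it easy to review the canary size and wait interval before anything is sent to AWS.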
Checkpoint Questions
- What is the maximum recommended `CanarySize` percentage for a SageMaker deployment?
- Which AWS service is strictly required to implement automatic rollbacks in a Blue/Green strategy?
- True or False: In an "All At Once" deployment, the old (Blue) fleet is deleted immediately after the traffic shifts to the Green fleet.
- What is the primary advantage of a Linear deployment over a Canary deployment?
Answers
- 50%.
- Amazon CloudWatch.
- False. The Blue fleet is kept during the baking period to allow for a potential rollback.
- Granularity; it allows for more incremental testing and load monitoring rather than just a two-step shift.
Muddy Points & Cross-Refs
- Capacity vs. Instance Count: You can size each traffic shift as a percentage of fleet capacity (`CAPACITY_PERCENT`) or as a specific number of server instances (`INSTANCE_COUNT`). Use percentage for auto-scaling environments.
- Baking Period vs. Training: Do not confuse the "baking period" (deployment monitoring) with model "training time."
- Cross-Reference: See Domain 4 (Monitoring) for specific CloudWatch metrics that make good rollback triggers (e.g., Latency, 5XX Errors).
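As an illustration of a 5XX-based rollback trigger, the sketch below builds the keyword arguments for boto3's CloudWatch `put_metric_alarm` call without invoking AWS. The helper name `invocation_5xx_alarm`, the alarm naming scheme, the one-minute period, and the default threshold of 5 errors are all assumptions chosen for the example, not recommended values.

```python
def invocation_5xx_alarm(endpoint_name: str, variant_name: str, threshold: float = 5.0) -> dict:
    """Build put_metric_alarm keyword arguments for a 5XX-error alarm (hypothetical helper)."""
    return {
        "AlarmName": f"{endpoint_name}-{variant_name}-5xx",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocation5XXErrors",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 60,            # evaluate one-minute windows (assumed)
        "EvaluationPeriods": 1,  # alarm on the first breaching window (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no requests is not treated as a failure
    }


alarm_kwargs = invocation_5xx_alarm("my-endpoint", "AllTraffic")

# Assumed usage (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
```

A latency-based trigger would follow the same pattern with a different metric name, statistic, and threshold.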
Comparison Tables
| Strategy | Shift Steps | Risk Level | Speed | Best For... |
|---|---|---|---|---|
| All At Once | 1 Step (100%) | High | Fast | Non-critical apps / Test environments |
| Canary | 2 Steps (X% then 100%) | Medium | Medium | Standard production model updates |
| Linear | Multiple Steps (Incremental) | Low | Slow | Mission-critical, high-traffic applications |
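The Speed column can be made concrete with a rough estimate of how long each strategy takes. The sketch below rests on a simplifying assumption, not on SageMaker's actual timing: every shift is followed by one baking period of the same length, so duration is the number of steps multiplied by the wait interval. `min_rollout_seconds` is a hypothetical name.

```python
import math


def min_rollout_seconds(strategy: str, wait_interval_seconds: int, step_percent: int = 100) -> int:
    """Estimate rollout time as (number of shifts) x (wait interval). Simplified model."""
    if strategy == "ALL_AT_ONCE":
        steps = 1
    elif strategy == "CANARY":
        steps = 2  # canary subset, then the remainder
    elif strategy == "LINEAR":
        if not 0 < step_percent <= 100:
            raise ValueError("step_percent must be between 1 and 100")
        steps = math.ceil(100 / step_percent)
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return steps * wait_interval_seconds


for name, kwargs in [
    ("ALL_AT_ONCE", {"wait_interval_seconds": 600}),
    ("CANARY", {"wait_interval_seconds": 600}),
    ("LINEAR", {"wait_interval_seconds": 300, "step_percent": 10}),
]:
    print(name, min_rollout_seconds(name, **kwargs), "seconds")
```

Under this model, a linear rollout of 10% every 5 minutes needs ten baking periods, which is why it sits at the slow, low-risk end of the table.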