Study Guide: Deployment Strategies and Rollback Actions in AWS ML

Deployment strategies and rollback actions (for example, blue/green, canary, linear)

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between All At Once, Canary, and Linear deployment strategies.
  • Identify the role of Amazon CloudWatch alarms in triggering automatic rollbacks.
  • Configure traffic-shifting parameters such as CanarySize and WaitIntervalInSeconds.
  • Evaluate the trade-offs between deployment speed and risk management for ML models.

Key Terms & Glossary

  • Blue Fleet: The current, stable version of the model endpoint currently serving production traffic.
  • Green Fleet: The new version of the model endpoint being deployed to replace the old one.
  • Baking Period: A predetermined interval during which CloudWatch alarms monitor the new fleet for errors or performance issues.
  • Traffic Shifting: The process of redirecting incoming requests from the blue fleet to the green fleet.
  • Rollback: The automated process of reverting all traffic back to the blue fleet if the green fleet fails its health checks.

The "Big Idea"

In machine learning, deploying a new model is risky because a model's behavior in production can differ from what was measured during training and evaluation. Blue/Green deployment reduces that risk. With automated traffic shifting and CloudWatch monitoring, SageMaker shifts traffic back to the previous stable version (Blue) whenever the new model version (Green) trips a configured alarm, minimizing user impact and downtime.

Formula / Concept Box

Parameter | Description | Valid Values / Constraints
--- | --- | ---
Type | Method of measuring traffic size | CAPACITY_PERCENT or INSTANCE_COUNT
CanarySize | Portion of traffic sent to Canary | Must be ≤ 50% of green fleet capacity
WaitIntervalInSeconds | Time to monitor before final shift | Measured in seconds (e.g., 600 s = 10 min)
Rollback Trigger | Condition to revert traffic | CloudWatch alarm in the ALARM state
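
The constraints in the box above can be checked before a deployment is requested. The following is a minimal sketch, not SageMaker's own validation: the function name check_canary_settings and the fleet_instances argument are hypothetical, and applying the 50% limit to INSTANCE_COUNT as "at most half of the fleet" is an assumption made for illustration.

```python
# Hypothetical pre-flight check for the parameters in the concept box.
VALID_SIZE_TYPES = {"CAPACITY_PERCENT", "INSTANCE_COUNT"}


def check_canary_settings(size_type, value, wait_seconds, fleet_instances=None):
    """Return a list of problems; an empty list means the settings pass."""
    problems = []
    if size_type not in VALID_SIZE_TYPES:
        problems.append("Type must be CAPACITY_PERCENT or INSTANCE_COUNT")
    elif size_type == "CAPACITY_PERCENT":
        if not 0 < value <= 50:
            problems.append("CanarySize must be above 0 and at most 50 percent")
    else:  # INSTANCE_COUNT: assumed to mean at most half of the fleet
        if fleet_instances is None:
            problems.append("fleet_instances is needed to check an INSTANCE_COUNT size")
        elif not 0 < value <= fleet_instances / 2:
            problems.append("CanarySize must be at most half of the fleet's instances")
    if wait_seconds < 0:
        problems.append("WaitIntervalInSeconds cannot be negative")
    return problems


print(check_canary_settings("CAPACITY_PERCENT", 30, 600))  # within the limit
print(check_canary_settings("CAPACITY_PERCENT", 60, 600))  # exceeds the 50% limit
```

Returning a list of problems rather than raising on the first one lets a reviewer see every misconfigured parameter at once.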

Hierarchical Outline

  • Blue/Green Deployment Overview
    • Requires Amazon CloudWatch for health monitoring.
    • Objective: Zero-downtime updates with automated recovery.
  • Traffic-Shifting Modes
    • All At Once
      • Instant 100% transition to Green fleet.
      • Highest speed, highest risk.
    • Canary
      • Two-step process: subset first, then the remainder.
      • Balanced risk management.
    • Linear
      • Most granular control.
      • Incremental steps until 100% is reached.
  • Rollback Mechanisms
    • Triggered automatically by alarms during the baking period.
    • Immediate redirection of 100% traffic back to Blue.
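
The three traffic-shifting modes and the rollback rule in the outline above can be modeled as a short simulation. This is an illustrative sketch only: simulate_rollout and the example step lists are hypothetical, and the real service makes its decision from CloudWatch alarm state rather than from an index passed in by the caller.

```python
# Cumulative share of traffic on Green after each step (illustrative values).
ALL_AT_ONCE = [100]
CANARY = [10, 100]
LINEAR = [25, 50, 75, 100]


def simulate_rollout(step_targets, alarm_after_step=None):
    """Walk through the steps; roll back if an alarm fires while a step bakes.

    step_targets: cumulative percent of traffic on Green after each step.
    alarm_after_step: index of the step whose baking period trips the alarm,
        or None if no alarm fires.
    Returns (history of Green traffic percentages, final status).
    """
    history = [0]  # all traffic starts on Blue
    for index, target in enumerate(step_targets):
        history.append(target)
        if alarm_after_step == index:
            history.append(0)  # rollback: 100% of traffic returns to Blue
            return history, "ROLLED_BACK"
    return history, "COMPLETED"


print(simulate_rollout(ALL_AT_ONCE))
print(simulate_rollout(CANARY))
print(simulate_rollout(LINEAR, alarm_after_step=1))
```

Comparing the histories shows the trade-off from the outline: an alarm during an All At Once shift affects all traffic, while a Canary or Linear shift exposes only the share moved so far.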

Visual Anchors

Traffic Shifting Comparison

[Diagram: traffic-shifting comparison; not rendered in this text version]

Linear Progression Visualization

This TikZ diagram represents the gradual traffic increase in a Linear strategy.

\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Time};
  \draw[->] (0,0) -- (0,4) node[above] {Traffic to Green (\%)};
  \draw[thick, blue] (0,0) -- (1,0) -- (1,1) -- (2,1) -- (2,2) -- (3,2) -- (3,3) -- (4,3) -- (4,4) -- (5,4);
  \node at (0.5,-0.3) {Start};
  \node at (4.5,-0.3) {End};
  \draw[dashed] (0,4) -- (4,4);
  \node at (-0.5,4) {100};
  \node at (-0.5,2) {50};
\end{tikzpicture}

Definition-Example Pairs

  • All At Once: Shifting all traffic simultaneously.
    • Example: An internal-only experimental tool where 10-minute downtime or bugs are acceptable for the sake of speed.
  • Canary Deployment: Sending a small "canary" group of users to the new version first.
    • Example: A retail site directing 5% of users to a new recommendation engine to ensure it doesn't crash before moving 100% of users over.
  • Linear Deployment: Increasing traffic to the new version in equal increments at fixed intervals, for example by 10% every 5 minutes.
    • Example: A high-traffic banking app where stability is critical and gradual load testing is required during the rollout.
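
To make the linear example concrete, the timing of "10% every 5 minutes" can be worked out with a small helper. This is a sketch under one stated assumption: the first increment is applied at minute 0. The function name linear_schedule is hypothetical.

```python
def linear_schedule(step_percent, interval_minutes):
    """Return (minute, cumulative percent on Green) for each linear step.

    Assumes the first increment is applied at minute 0 and that the final
    step is capped at 100 percent.
    """
    schedule = []
    percent, minute = 0, 0
    while percent < 100:
        percent = min(100, percent + step_percent)
        schedule.append((minute, percent))
        minute += interval_minutes
    return schedule


for minute, percent in linear_schedule(step_percent=10, interval_minutes=5):
    print(f"minute {minute:>2}: {percent}% of traffic on Green")
```

Under that assumption, a 10% step every 5 minutes takes ten steps, which is why Linear is the slowest of the three strategies.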

Worked Examples

Scenario: Configuring a Canary Deployment in SageMaker

Goal: Deploy a new model version using a 30% Canary shift with a 10-minute baking period.

  1. Define the Canary Size: Set Type to CAPACITY_PERCENT and Value to 30.
    • Note: Ensure this is ≤ 50%, the upper limit for CanarySize.
  2. Define the Wait Interval: Set WaitIntervalInSeconds to 600 (10 minutes).
  3. Configure Alarms: Attach a CloudWatch Alarm to the endpoint that monitors Invocation5XXErrors.
  4. Execution:
    • SageMaker shifts 30% of traffic to the Green fleet.
    • For 10 minutes, both fleets run. SageMaker monitors the alarm.
    • If Invocation5XXErrors exceeds its threshold and the alarm enters the ALARM state, SageMaker automatically returns all traffic to Blue and terminates the Green fleet.
    • If 10 minutes pass without an alarm, 100% of traffic moves to Green.
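
The scenario above maps onto the DeploymentConfig structure accepted by the SageMaker UpdateEndpoint API. The following is a minimal sketch that only builds the dictionary. The helper name canary_deployment_config, the alarm name green-fleet-5xx-errors, and the endpoint names in the closing comment are hypothetical placeholders, and no AWS call is made.

```python
import json


def canary_deployment_config(canary_percent, wait_seconds, alarm_names):
    """Build a blue/green canary DeploymentConfig dictionary."""
    if not 0 < canary_percent <= 50:
        raise ValueError("CanarySize must be above 0 and at most 50 percent")
    return {
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": canary_percent},
                "WaitIntervalInSeconds": wait_seconds,
            },
        },
        # Any listed alarm entering the ALARM state triggers a rollback.
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": name} for name in alarm_names],
        },
    }


# 30% canary, 10-minute baking period, one hypothetical alarm name.
config = canary_deployment_config(30, 600, ["green-fleet-5xx-errors"])
print(json.dumps(config, indent=2))

# With boto3 and credentials available, the dictionary would be passed as:
#   boto3.client("sagemaker").update_endpoint(
#       EndpointName="my-endpoint",                # hypothetical name
#       EndpointConfigName="my-endpoint-config-v2",  # hypothetical name
#       DeploymentConfig=config,
#   )
```

Keeping the dictionary construction separate from the API call makes the configuration easy to review and unit test before anything is deployed.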

Checkpoint Questions

  1. What is the maximum allowed CanarySize percentage for a SageMaker deployment?
  2. Which AWS service is strictly required to implement automatic rollbacks in a Blue/Green strategy?
  3. True or False: In an "All At Once" deployment, the old (Blue) fleet is deleted immediately after the traffic shifts to the Green fleet.
  4. What is the primary advantage of a Linear deployment over a Canary deployment?
Answers
  1. 50%.
  2. Amazon CloudWatch.
  3. False. The Blue fleet is kept during the baking period to allow for a potential rollback.
  4. Granularity; it allows for more incremental testing and load monitoring rather than just a two-step shift.

Muddy Points & Cross-Refs

  • Capacity vs. Instance Count: You can size each traffic shift as a percentage of the fleet's capacity (CAPACITY_PERCENT) or as a specific number of instances (INSTANCE_COUNT). Use percentage for auto-scaling environments, where the number of instances can change between deployments.
  • Baking Period vs. Training: Do not confuse the "baking period" (deployment monitoring) with model "training time."
  • Cross-Reference: See Domain 4 (Monitoring) for specific CloudWatch metrics that make good rollback triggers (e.g., Latency, 5XX Errors).
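
As an illustration of a rollback trigger built on the 5XX error metric mentioned above, the sketch below assembles the arguments for a CloudWatch alarm. The helper name five_xx_alarm_kwargs, the alarm naming scheme, the threshold, and the one-minute period are assumptions chosen for the example, not recommended values, and no AWS call is made.

```python
def five_xx_alarm_kwargs(endpoint_name, variant_name, threshold):
    """Build keyword arguments for a CloudWatch alarm on endpoint 5XX errors."""
    return {
        "AlarmName": f"{endpoint_name}-5xx-errors",  # hypothetical naming scheme
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocation5XXErrors",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 60,            # assumed: evaluate one-minute windows
        "EvaluationPeriods": 1,  # assumed: a single breaching window is enough
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic is not treated as an error
    }


kwargs = five_xx_alarm_kwargs("my-endpoint", "AllTraffic", threshold=5)
print(kwargs["AlarmName"], kwargs["MetricName"], kwargs["Threshold"])

# With boto3 and credentials available, the alarm would be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

The alarm name produced here is what a deployment's rollback configuration would reference.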

Comparison Tables

Strategy | Shift Steps | Risk Level | Speed | Best For...
--- | --- | --- | --- | ---
All At Once | 1 Step (100%) | High | Fast | Non-critical apps / Test environments
Canary | 2 Steps (X% then 100%) | Medium | Medium | Standard production model updates
Linear | Multiple Steps (Incremental) | Low | Slow | Mission-critical, high-traffic applications
