AWS Disaster Recovery Strategies: A Comprehensive Study Guide
Disaster recovery scenarios (for example, backup and restore, pilot light, warm standby, multi-site)
This guide covers the four primary Disaster Recovery (DR) strategies on AWS as defined in the AWS Certified Solutions Architect - Professional (SAP-C02) curriculum. It explores the trade-offs between cost, complexity, and recovery metrics.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Identify the four AWS DR strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site (Active-Active).
- Evaluate business requirements to select the most cost-effective DR strategy.
- Understand how AWS global infrastructure (Regions and AZs) supports business continuity.
Key Terms & Glossary
- RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service. (How long can we be down?)
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time. (How much data can we lose?)
- Failover: The process of switching to a redundant or standby computer server, system, hardware component, or network upon the failure or abnormal termination of the previously active application.
- Pilot Light: A DR strategy where a minimal version of an environment is always running in the cloud (usually just the database/data) to keep costs low.
- Warm Standby: A DR strategy where a scaled-down but functional version of the full environment is always running.
The "Big Idea"
Disaster Recovery is not a one-size-fits-all solution; it is a spectrum of trade-offs. On one end, you have Backup & Restore, which is inexpensive but slow to recover. On the other end, you have Multi-Site Active-Active, which provides near-instant recovery but at a significant cost. The goal of a Solutions Architect is to align the technical strategy with the organization's Business Continuity Plan (BCP) by balancing the cost of downtime against the cost of the DR solution.
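The balance described above can be sketched as a small calculation: expected yearly cost is the DR spend plus the expected cost of downtime. This is a minimal illustration, not an AWS pricing model. The class `DrOption`, the functions, and every dollar and hour figure below are hypothetical placeholders chosen only to show the shape of the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrOption:
    name: str
    annual_dr_cost: float  # yearly spend on the DR environment (made-up figure)
    rto_hours: float       # expected hours of downtime per disaster (made-up figure)


def expected_annual_cost(option: DrOption,
                         downtime_cost_per_hour: float,
                         disasters_per_year: float) -> float:
    """Return DR spend plus the expected downtime cost for one year."""
    expected_downtime_cost = (
        disasters_per_year * option.rto_hours * downtime_cost_per_hour
    )
    return option.annual_dr_cost + expected_downtime_cost


def cheapest(options, downtime_cost_per_hour: float, disasters_per_year: float) -> DrOption:
    """Pick the option with the lowest expected yearly cost."""
    return min(
        options,
        key=lambda o: expected_annual_cost(o, downtime_cost_per_hour, disasters_per_year),
    )


# Illustrative numbers only; replace them with figures from your own BCP.
OPTIONS = [
    DrOption("Backup & Restore", annual_dr_cost=5_000, rto_hours=12.0),
    DrOption("Pilot Light", annual_dr_cost=20_000, rto_hours=1.0),
    DrOption("Warm Standby", annual_dr_cost=60_000, rto_hours=0.25),
    DrOption("Multi-Site", annual_dr_cost=200_000, rto_hours=0.0),
]

if __name__ == "__main__":
    # Cheap downtime favors the cheap strategy; expensive downtime does not.
    print(cheapest(OPTIONS, downtime_cost_per_hour=1_000, disasters_per_year=0.1).name)
    print(cheapest(OPTIONS, downtime_cost_per_hour=2_000_000, disasters_per_year=0.5).name)
```

With these assumed inputs, the preferred strategy moves along the spectrum as the hourly cost of downtime rises, which is the point of the "Big Idea".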
Formula / Concept Box
| Metric | Definition | Focus Area |
|---|---|---|
| RPO | Time since last backup/sync | Data Integrity (Avoid losing work) |
| RTO | Time taken to bring the system back online | Availability (Minimize downtime) |
[!IMPORTANT] RPO = "Back in time" (Data loss limit)
RTO = "Forward in time" (Restoration speed limit)
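As a quick check on the two definitions, the sketch below computes the RPO and RTO actually achieved in an incident from three timestamps and compares them with the targets. The function names and the sample times are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster: datetime) -> timedelta:
    """Data written after the last recovery point is lost (looks back in time)."""
    return disaster - last_recovery_point


def achieved_rto(disaster: datetime, restored: datetime) -> timedelta:
    """Service is unavailable from the disaster until restoration (looks forward)."""
    return restored - disaster


def meets_objectives(last_recovery_point: datetime,
                     disaster: datetime,
                     restored: datetime,
                     rpo_target: timedelta,
                     rto_target: timedelta) -> bool:
    """True only if both the data-loss and the downtime targets were met."""
    return (
        achieved_rpo(last_recovery_point, disaster) <= rpo_target
        and achieved_rto(disaster, restored) <= rto_target
    )


if __name__ == "__main__":
    # Hypothetical incident: backup at 02:00, disaster at 05:00, restored at 08:00.
    backup = datetime(2024, 1, 1, 2, 0)
    disaster = datetime(2024, 1, 1, 5, 0)
    restored = datetime(2024, 1, 1, 8, 0)
    print(achieved_rpo(backup, disaster), achieved_rto(disaster, restored))
    print(meets_objectives(backup, disaster, restored,
                           rpo_target=timedelta(hours=4),
                           rto_target=timedelta(hours=12)))
```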
Hierarchical Outline
- DR Fundamentals
- Risk Assessment (Single AZ vs. Multi-Region failure)
- Impact of separation (AZs are physically separated by kilometers to prevent localized disaster impact)
- The Four DR Strategies
- Backup & Restore (Lowest cost, hours-to-days RTO/RPO)
- Pilot Light (Data live, compute off/template-based)
- Warm Standby (Scaled-down compute always running)
- Multi-Site (Active-Active) (Near-zero downtime, highest cost)
- Selection Criteria
- Business criticality
- Budgetary constraints
- Complexity of implementation
Visual Anchors
DR Strategy Spectrum
RPO and RTO Visualized
\begin{tikzpicture}[node distance=2cm, font=\small]
  % Timeline and disaster event
  \draw[thick, ->] (0,0) -- (10,0) node[anchor=north] {Time};
  \filldraw[red] (5,0) circle (2pt) node[anchor=south] {Disaster Event};
  % RPO
  \draw[dashed] (2,0) -- (2,1);
  \draw[<->] (2,0.5) -- (5,0.5) node[midway, above] {RPO (Data Loss)};
  \node[below] at (2,0) {Last Backup};
  % RTO
  \draw[dashed] (8,0) -- (8,1);
  \draw[<->] (5,0.5) -- (8,0.5) node[midway, above] {RTO (Downtime)};
  \node[below] at (8,0) {Restored};
\end{tikzpicture}
Definition-Example Pairs
- Backup & Restore
- Definition: Data is backed up to Amazon S3; infrastructure is redeployed via CloudFormation only when a disaster occurs.
- Example: A company's internal payroll system where losing 24 hours of data is acceptable and employees can wait a day for the system to return.
- Pilot Light
- Definition: The "heart" of the application (usually the database) is kept running and up-to-date, while other layers are kept as idle templates (AMIs/Snapshots).
- Example: A retail site that replicates its RDS database to another region but only starts EC2 instances when the primary region fails.
- Warm Standby
- Definition: A functional, smaller-scale version of the application is always running in the DR region (e.g., 2 small instances instead of 10 large ones).
- Example: A critical SaaS application that must be back at full capacity within 15 minutes.
Worked Examples
Scenario: The Budget-Conscious Startup
Problem: A startup has a web application. They can afford to lose 4 hours of data, and they need to be back online within 12 hours of a regional failure. They have a very tight budget.
Solution:
- Use Backup & Restore.
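- Schedule Amazon RDS snapshots at least every 4 hours (for example, with AWS Backup) and copy each snapshot to the DR Region, so the 4-hour RPO still holds if the primary Region is lost.
- Use AWS CloudFormation or CDK to define the infrastructure as code. In a disaster, deploying the template creates the VPC, load balancers, and EC2 instances from the latest AMIs, which must also have been copied to the DR Region.
- Restore the most recent copied RDS snapshot in the DR Region.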
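One detail worth checking in this solution is whether a snapshot schedule really meets the RPO once snapshot and cross-Region copy times are included. The sketch below works through that arithmetic. The function names and the durations in the usage example are assumptions, and real snapshot and copy times vary with database size and change rate.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta,
                         snapshot_duration: timedelta,
                         copy_duration: timedelta) -> timedelta:
    """Worst case: the Region fails just before the next snapshot copy
    becomes usable in the DR Region, so recovery falls back to the
    previous snapshot."""
    return snapshot_interval + snapshot_duration + copy_duration


def max_snapshot_interval(rpo_target: timedelta,
                          snapshot_duration: timedelta,
                          copy_duration: timedelta) -> timedelta:
    """Longest interval between snapshots that still meets the RPO."""
    interval = rpo_target - snapshot_duration - copy_duration
    if interval <= timedelta(0):
        raise ValueError("Snapshot and copy time alone exceed the RPO target")
    return interval


if __name__ == "__main__":
    # Hypothetical timings: 10 minutes to snapshot, 20 minutes to copy.
    snap = timedelta(minutes=10)
    copy = timedelta(minutes=20)
    print(worst_case_data_loss(timedelta(hours=4), snap, copy))
    print(max_snapshot_interval(timedelta(hours=4), snap, copy))
```

Under these assumed timings, a strict 4-hour RPO calls for snapshots somewhat more often than every 4 hours.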
Scenario: The Zero-Downtime Financial App
Problem: A global banking app cannot afford any downtime. If a region goes offline, users shouldn't even notice.
Solution:
- Use Multi-Site (Active-Active).
- Deploy full capacity in Region A and Region B.
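- Use Amazon Route 53 with a latency-based (or weighted) routing policy and health checks, so traffic is spread across both Regions and shifts away from an unhealthy one. A Failover routing policy is active-passive and is a better fit for Pilot Light or Warm Standby.
- Use Amazon Aurora Global Database for cross-Region replication, typically with under one second of lag.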
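The routing behavior in this solution can be pictured with a toy model: among the Regions that pass their health check, send the user to the one with the lowest latency. This is a simplified simulation written for this guide, not the Route 53 API. `RegionEndpoint`, `route`, and the latency values are all hypothetical, and returning `None` when nothing is healthy is a simplification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RegionEndpoint:
    region: str
    latency_ms: float  # latency from the user's location (made-up value)
    healthy: bool      # result of the most recent health check


def route(endpoints: Iterable[RegionEndpoint]) -> Optional[str]:
    """Return the healthy Region with the lowest latency, or None if
    no Region is healthy."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms).region


if __name__ == "__main__":
    normal = [
        RegionEndpoint("region-a", latency_ms=20.0, healthy=True),
        RegionEndpoint("region-b", latency_ms=85.0, healthy=True),
    ]
    # Same user, but Region A now fails its health check.
    outage = [
        RegionEndpoint("region-a", latency_ms=20.0, healthy=False),
        RegionEndpoint("region-b", latency_ms=85.0, healthy=True),
    ]
    print(route(normal))
    print(route(outage))
```

Because both Regions already run at full capacity, the surviving Region can absorb the redirected users without a scale-up step, which is what separates Multi-Site from Warm Standby.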
Checkpoint Questions
- What is the main difference between Pilot Light and Warm Standby?
- Which AWS service is primarily used to route traffic between regions during a failover?
- If a business requires an RPO of 0 (no data loss), which DR strategy is most appropriate?
- Why does AWS recommend conducting a risk assessment before choosing a DR strategy?
Muddy Points & Cross-Refs
- Pilot Light vs. Warm Standby: This is the most common area of confusion. Think of Pilot Light as "Data is on, Compute is off." Think of Warm Standby as "Data is on, Compute is on but small."
- High Availability (HA) vs. Disaster Recovery (DR): HA typically means failing over within a Region (between AZs). DR typically means recovering in a different Region. See Section 7: Ensuring Business Continuity for the deep dive on this distinction.
Comparison Tables
| Strategy | Cost | RTO | RPO | Complexity |
|---|---|---|---|---|
| Backup & Restore | $ | Hours | Hours | Low |
| Pilot Light | $$ | Minutes/Hours | Minutes | Medium |
| Warm Standby | $$$ | Minutes | Seconds/Minutes | High |
| Multi-Site | $$$$ | Near Zero | Near Zero | Very High |
[!TIP] For the SAP-C02 exam, always look for keywords like "minimal cost" (Backup/Restore) vs. "minimal downtime" (Multi-Site).
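The comparison table can also be read as a lookup: walk the strategies from cheapest to most expensive and stop at the first one whose typical RTO and RPO fit the requirement. The sketch below does that. The table gives only qualitative ranges ("Hours", "Minutes"), so the numeric bounds here are assumptions picked for illustration, not AWS-published figures.

```python
from datetime import timedelta
from typing import NamedTuple


class Strategy(NamedTuple):
    name: str
    typical_rto: timedelta  # assumed upper bound, not an AWS guarantee
    typical_rpo: timedelta  # assumed upper bound, not an AWS guarantee


# Ordered cheapest first, matching the rows of the comparison table.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=8), timedelta(hours=4)),
    Strategy("Pilot Light", timedelta(hours=1), timedelta(minutes=15)),
    Strategy("Warm Standby", timedelta(minutes=15), timedelta(minutes=1)),
    Strategy("Multi-Site", timedelta(0), timedelta(0)),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> str:
    """Return the first (cheapest) strategy that satisfies both targets."""
    if rto_target < timedelta(0) or rpo_target < timedelta(0):
        raise ValueError("Targets must be non-negative durations")
    for strategy in STRATEGIES:
        if strategy.typical_rto <= rto_target and strategy.typical_rpo <= rpo_target:
            return strategy.name
    # Not reached for non-negative targets: Multi-Site is modeled with zero bounds.
    return STRATEGIES[-1].name


if __name__ == "__main__":
    # The budget-conscious startup: 12-hour RTO, 4-hour RPO.
    print(cheapest_strategy(timedelta(hours=12), timedelta(hours=4)))
    # A requirement of 15 minutes of downtime and 1 minute of data loss.
    print(cheapest_strategy(timedelta(minutes=15), timedelta(minutes=1)))
```

Ordering the list by cost is the design choice that matters: the function never recommends a more expensive strategy when a cheaper one already meets the targets.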