AWS Disaster Recovery Strategies: A Comprehensive Study Guide

This guide covers the four primary Disaster Recovery (DR) strategies on AWS as defined in the AWS Certified Solutions Architect - Professional (SAP-C02) curriculum. It explores the trade-offs between cost, complexity, and recovery metrics.

Learning Objectives

After studying this guide, you should be able to:

Differentiate between Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Identify the four AWS DR strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site (Active-Active).
Evaluate business requirements to select the most cost-effective DR strategy.
Understand how AWS global infrastructure (Regions and AZs) supports business continuity.

Key Terms & Glossary

RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service. (How long can we be down?)
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time. (How much data can we lose?)
Failover: The process of switching to a redundant or standby computer server, system, hardware component, or network upon the failure or abnormal termination of the previously active application.
Pilot Light: A DR strategy where a minimal version of an environment is always running in the cloud (usually just the database/data) to keep costs low.
Warm Standby: A DR strategy where a scaled-down but functional version of the full environment is always running.

The "Big Idea"

Disaster Recovery is not a one-size-fits-all solution; it is a spectrum of trade-offs. On one end, you have Backup & Restore, which is inexpensive but slow to recover. On the other end, you have Multi-Site Active-Active, which provides near-instant recovery but at a significant cost. The goal of a Solutions Architect is to align the technical strategy with the organization's Business Continuity Plan (BCP) by balancing the cost of downtime against the cost of the DR solution.
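
The balance between the cost of downtime and the cost of the DR solution can be sketched numerically. The following is a minimal illustration, not a pricing model: the `Strategy` class, the dollar figures, and the recovery times are all hypothetical assumptions chosen only to show how the cheapest overall option shifts as downtime becomes more expensive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    annual_dr_cost: float      # hypothetical yearly spend on the DR environment (USD)
    expected_rto_hours: float  # hypothetical time to restore service after a disaster


# Illustrative numbers only; real costs depend entirely on the workload.
STRATEGIES = [
    Strategy("Backup & Restore", 5_000, 24.0),
    Strategy("Pilot Light", 20_000, 1.0),
    Strategy("Warm Standby", 60_000, 0.25),
    Strategy("Multi-Site (Active-Active)", 200_000, 0.0),
]


def expected_annual_cost(strategy, downtime_cost_per_hour, disasters_per_year):
    """DR spend plus the expected loss from downtime over one year."""
    downtime_loss = (
        disasters_per_year * strategy.expected_rto_hours * downtime_cost_per_hour
    )
    return strategy.annual_dr_cost + downtime_loss


def cheapest_overall(downtime_cost_per_hour, disasters_per_year):
    """Pick the strategy with the lowest combined cost."""
    return min(
        STRATEGIES,
        key=lambda s: expected_annual_cost(
            s, downtime_cost_per_hour, disasters_per_year
        ),
    )


# Cheap downtime favors the inexpensive end of the spectrum.
print(cheapest_overall(100, 0.1).name)          # Backup & Restore
# Expensive downtime pushes the choice toward the always-on end.
print(cheapest_overall(1_000_000, 0.5).name)    # Warm Standby
```

In practice the inputs would come from a business impact analysis; the point of the sketch is only that the "right" strategy depends on both sides of the trade-off.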

Formula / Concept Box

Metric	Definition	Focus Area
RPO	Maximum acceptable time between the last recovery point (backup/sync) and the failure	Data Integrity (Avoid losing work)
RTO	Maximum acceptable time to bring the system back online	Availability (Minimize downtime)

[!IMPORTANT] RPO = "Back in time" (Data loss limit)
RTO = "Forward in time" (Restoration speed limit)
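
To make the "back in time" and "forward in time" idea concrete, here is a small sketch that measures both windows for a single incident and compares them with the objectives. The function names and all timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def measure_recovery(last_recovery_point, failure_time, service_restored):
    """Return (data_loss_window, downtime) for one incident."""
    if not last_recovery_point <= failure_time <= service_restored:
        raise ValueError(
            "expected last_recovery_point <= failure_time <= service_restored"
        )
    # Looks back in time from the failure: compare against the RPO.
    data_loss_window = failure_time - last_recovery_point
    # Looks forward in time from the failure: compare against the RTO.
    downtime = service_restored - failure_time
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True only if both the RPO and the RTO were met."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical incident: last backup at 08:00, failure at 11:30, restored at 17:30.
last_backup = datetime(2024, 1, 1, 8, 0)
failure = datetime(2024, 1, 1, 11, 30)
restored = datetime(2024, 1, 1, 17, 30)

loss, down = measure_recovery(last_backup, failure, restored)
print(loss, down)  # 3:30:00 6:00:00
print(meets_objectives(loss, down, rpo=timedelta(hours=4), rto=timedelta(hours=12)))  # True
```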

Hierarchical Outline

DR Fundamentals
- Risk Assessment (Single AZ vs. Multi-Region failure)
- Impact of separation (AZs are physically separated by kilometers to prevent localized disaster impact)
The Four DR Strategies
- Backup & Restore (Lowest cost, hours-to-days RTO/RPO)
- Pilot Light (Data live, compute off/template-based)
- Warm Standby (Scaled-down but functional compute always running)
- Multi-Site (Active-Active) (Near-zero downtime, highest cost)
Selection Criteria
- Business criticality
- Budgetary constraints
- Complexity of implementation

Visual Anchors

DR Strategy Spectrum

RPO and RTO Visualized

Definition-Example Pairs

Backup & Restore
- Definition: Data is backed up to Amazon S3; infrastructure is redeployed via CloudFormation only when a disaster occurs.
- Example: A company's internal payroll system where losing 24 hours of data is acceptable and employees can wait a day for the system to return.
Pilot Light
- Definition: The "heart" of the application (usually the database) is kept running and up-to-date, while other layers are kept as idle templates (AMIs/Snapshots).
- Example: A retail site that replicates its RDS database to another region but only starts EC2 instances when the primary region fails.
Warm Standby
- Definition: A functional, smaller-scale version of the application is always running in the DR region (e.g., 2 small instances instead of 10 large ones).
- Example: A critical SaaS application that must be back at full capacity within 15 minutes.
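
The three definitions above differ mainly in what is already running in the DR Region, which in turn determines how much work remains at failover time. The sketch below encodes that idea; the `StandbyFootprint` class, the `Compute` enum, and the step wording are illustrative assumptions, not AWS terminology.

```python
from dataclasses import dataclass
from enum import Enum


class Compute(Enum):
    NOT_PROVISIONED = "not provisioned"  # Backup & Restore: nothing deployed yet
    OFF = "off"                          # Pilot Light: AMIs/templates only
    SCALED_DOWN = "scaled down"          # Warm Standby: small fleet running


@dataclass(frozen=True)
class StandbyFootprint:
    strategy: str
    data_live: bool   # True if an up-to-date data store runs in the DR Region
    compute: Compute


FOOTPRINTS = [
    StandbyFootprint("Backup & Restore", data_live=False, compute=Compute.NOT_PROVISIONED),
    StandbyFootprint("Pilot Light", data_live=True, compute=Compute.OFF),
    StandbyFootprint("Warm Standby", data_live=True, compute=Compute.SCALED_DOWN),
]


def failover_steps(fp):
    """List the work left to do once a disaster is declared."""
    steps = []
    if not fp.data_live:
        steps.append("restore data from backups")
    if fp.compute is Compute.NOT_PROVISIONED:
        steps.append("deploy infrastructure from templates")
    elif fp.compute is Compute.OFF:
        steps.append("start compute from AMIs/templates")
    else:
        steps.append("scale compute up to full capacity")
    steps.append("redirect traffic to the DR Region")
    return steps


for fp in FOOTPRINTS:
    print(f"{fp.strategy}: {' -> '.join(failover_steps(fp))}")
```

Fewer remaining steps generally means a shorter RTO, which is why the strategies with more running infrastructure recover faster and cost more.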

Worked Examples

Scenario: The Budget-Conscious Startup

Problem: A startup has a web application. They can afford to lose 4 hours of data, and they need to be back online within 12 hours of a regional failure. They have a very tight budget.

Solution:

Use Backup & Restore.
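Schedule Amazon RDS snapshots at least every 4 hours (for example, with an AWS Backup plan) and copy each snapshot to the DR Region, so that the snapshot interval plus copy time stays within the 4-hour RPO.
Use AWS CloudFormation or CDK to define the infrastructure as code, and keep current AMIs copied to the DR Region. In a disaster, deploying the template creates the VPC, load balancers, and EC2 instances from the latest AMIs.
Restore the most recent copied RDS snapshot in the DR Region.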
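
A quick sanity check of this plan against the stated objectives might look like the following. It is a sketch under stated assumptions: the copy lag and every step duration are hypothetical placeholders that would need to be measured in a real DR drill.

```python
from datetime import timedelta

RPO = timedelta(hours=4)
RTO = timedelta(hours=12)


def worst_case_data_loss(snapshot_interval, copy_lag):
    """Longest gap between the newest usable snapshot in the DR Region and a failure."""
    return snapshot_interval + copy_lag


def total_recovery_time(step_durations):
    """Sum sequential recovery steps (running steps in parallel would shorten this)."""
    return sum(step_durations.values(), timedelta())


# Hypothetical durations for each recovery step.
recovery_steps = {
    "detect failure and declare disaster": timedelta(minutes=30),
    "deploy VPC, load balancers, EC2 from templates": timedelta(minutes=45),
    "restore RDS snapshot": timedelta(hours=2),
    "validate and switch DNS": timedelta(minutes=30),
}

# A snapshot exactly every 4 hours plus an assumed 20-minute copy lag.
loss = worst_case_data_loss(timedelta(hours=4), copy_lag=timedelta(minutes=20))
print(loss <= RPO)  # False: the interval alone already uses the whole RPO

# Tightening the interval leaves room for the copy.
loss = worst_case_data_loss(timedelta(hours=3), copy_lag=timedelta(minutes=20))
print(loss <= RPO)  # True

print(total_recovery_time(recovery_steps) <= RTO)  # True: 3 h 45 min in this example
```

The takeaway is that a snapshot interval equal to the RPO leaves no margin for the time it takes a snapshot to finish and reach the DR Region.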

Scenario: The Zero-Downtime Financial App

Problem: A global banking app cannot afford any downtime. If a region goes offline, users shouldn't even notice.

Solution:

Use Multi-Site (Active-Active).
Deploy full capacity in Region A and Region B.
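Use Amazon Route 53 with a latency-based (or weighted) routing policy and health checks, so that both Regions serve traffic and an unhealthy Region is taken out of rotation automatically. (A Failover routing policy is intended for active-passive designs.)
Use Amazon Aurora Global Database for cross-Region replication with typical lag under one second. Note that writes still go to a single primary Region.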
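
The routing behavior in this scenario can be illustrated with a small simulation. This is not the Route 53 API; it is a plain-Python sketch of the idea "send each user to the lowest-latency healthy Region", with a hypothetical `RegionEndpoint` class and made-up latencies. Returning `None` when nothing is healthy is a simplification chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegionEndpoint:
    region: str
    latency_ms: float  # latency from the user's location (hypothetical values)
    healthy: bool      # result of the most recent health check


def route(endpoints: List[RegionEndpoint]) -> Optional[RegionEndpoint]:
    """Pick the lowest-latency endpoint among those passing health checks."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None  # simplification: no healthy Region to answer with
    return min(healthy, key=lambda e: e.latency_ms)


# Normal operation: both Regions are healthy, the closer one wins.
normal = [
    RegionEndpoint("us-east-1", 20.0, True),
    RegionEndpoint("eu-west-1", 95.0, True),
]
print(route(normal).region)  # us-east-1

# Regional failure: the health check fails, traffic shifts to the other Region.
degraded = [
    RegionEndpoint("us-east-1", 20.0, False),
    RegionEndpoint("eu-west-1", 95.0, True),
]
print(route(degraded).region)  # eu-west-1
```

Because the surviving Region is already running at full capacity, the shift requires no provisioning step, which is what keeps the RTO near zero.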

Checkpoint Questions

What is the main difference between Pilot Light and Warm Standby?
Which AWS service is primarily used to route traffic between regions during a failover?
If a business requires a near-zero RPO (virtually no data loss), which DR strategy is most appropriate?
Why does AWS recommend conducting a risk assessment before choosing a DR strategy?

Muddy Points & Cross-Refs

Pilot Light vs. Warm Standby: This is the most common area of confusion. Think of Pilot Light as "Data is on, Compute is off." Think of Warm Standby as "Data is on, Compute is on but small."
High Availability (HA) vs. Disaster Recovery (DR): In this guide, HA means failing over within a Region (between AZs), while DR means failing over between Regions. See Section 7: Ensuring Business Continuity for the deep dive on this distinction.

Comparison Tables

Strategy	Cost	RTO	RPO	Complexity
Backup & Restore	$	Hours	Hours	Low
Pilot Light	$$	Minutes/Hours	Minutes	Medium
Warm Standby	$$$	Minutes	Seconds/Minutes	High
Multi-Site	$$$$	Real-time	Near Zero	Very High
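
One way to use the table is to walk it from cheapest to most expensive and stop at the first strategy whose typical recovery metrics satisfy the requirement. The sketch below does exactly that. The specific thresholds are hypothetical stand-ins for the table's "Hours", "Minutes", and "Seconds" bands, not AWS-published figures.

```python
from datetime import timedelta

# Ordered cheapest first: (name, achievable RTO, achievable RPO).
# The numbers are illustrative assumptions for each band in the table.
PROFILES = [
    ("Backup & Restore", timedelta(hours=8), timedelta(hours=4)),
    ("Pilot Light", timedelta(hours=1), timedelta(minutes=10)),
    ("Warm Standby", timedelta(minutes=10), timedelta(minutes=1)),
]


def select_strategy(required_rto, required_rpo):
    """Return the cheapest strategy whose assumed RTO and RPO meet the requirement."""
    for name, rto, rpo in PROFILES:
        if rto <= required_rto and rpo <= required_rpo:
            return name
    # Nothing cheaper is fast enough, so fall through to the top of the spectrum.
    return "Multi-Site (Active-Active)"


print(select_strategy(timedelta(hours=12), timedelta(hours=4)))       # Backup & Restore
print(select_strategy(timedelta(hours=2), timedelta(minutes=15)))     # Pilot Light
print(select_strategy(timedelta(minutes=15), timedelta(minutes=5)))   # Warm Standby
print(select_strategy(timedelta(0), timedelta(0)))                    # Multi-Site (Active-Active)
```

Ordering the candidates by cost is the design choice that matters here: it guarantees the first match is also the most cost-effective one.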

[!TIP] For the SAP-C02 exam, always look for keywords like "minimal cost" (Backup & Restore) vs. "minimal downtime" (Multi-Site).