Disaster Recovery (DR) Strategies & Resilience

This guide covers the core architectural patterns and metrics required to ensure business continuity on AWS, specifically focusing on the trade-offs between cost, complexity, and recovery speed.

Learning Objectives

After studying this guide, you should be able to:

Define and differentiate between Recovery Point Objective (RPO) and Recovery Time Objective (RTO).
Categorize the four primary AWS DR strategies based on their cost and recovery speed.
Identify appropriate AWS services (RDS, S3, Aurora, Route 53) used to implement specific DR patterns.
Select the optimal DR strategy based on specific business requirements for downtime and data loss.

Key Terms & Glossary

RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "We can lose 5 minutes of data").
RTO (Recovery Time Objective): The maximum acceptable delay between service interruption and restoration (e.g., "The system must be up in 4 hours").
Failover: The process of automatically or manually switching to a redundant or standby IT system upon the failure of the primary system.
Pilot Light: A DR strategy where a minimal version of an environment is always running (usually just the data layer) to keep costs low.
Warm Standby: A scaled-down but functional version of the full environment that runs in a second region.

The "Big Idea"

In traditional IT, disaster recovery often meant shipping tapes to a mountain vault. In the cloud, DR is about shifting from static recovery to automated orchestration. The "Big Idea" is that high availability (HA) protects against local failures (like a single instance or AZ), while Disaster Recovery (DR) protects against large-scale disasters (like an entire AWS Region going offline). You must balance the cost of the solution against the cost of the downtime.

Formula / Concept Box

Strategy	RPO	RTO	Cost	Complexity
Backup & Restore	Hours	24h+	$ (Low)	Simple
Pilot Light	Minutes	Hours	$$	Moderate
Warm Standby	Seconds	Minutes	$$$	High
Multi-Site (Active-Active)	Zero / Near-Zero	Real-time	$$$$	Very High
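
The table can be read as a lookup: pick the cheapest row whose RPO and RTO both fit the business requirement. The sketch below illustrates that reading. The numeric thresholds are assumptions chosen only to match the orders of magnitude in the table (for example, "Hours" is taken as 24 hours), and `Strategy` and `cheapest_strategy` are hypothetical names, not part of any AWS tooling.

```python
from dataclasses import dataclass
from typing import Optional

HOUR = 3600  # seconds


@dataclass(frozen=True)
class Strategy:
    name: str
    rpo_seconds: float  # assumed typical worst-case data loss
    rto_seconds: float  # assumed typical worst-case recovery time
    cost_rank: int      # 1 = cheapest ($), 4 = most expensive ($$$$)


# Illustrative figures only: the table gives orders of magnitude, not exact numbers.
STRATEGIES = [
    Strategy("Backup & Restore", rpo_seconds=24 * HOUR, rto_seconds=24 * HOUR, cost_rank=1),
    Strategy("Pilot Light", rpo_seconds=15 * 60, rto_seconds=2 * HOUR, cost_rank=2),
    Strategy("Warm Standby", rpo_seconds=30, rto_seconds=10 * 60, cost_rank=3),
    Strategy("Multi-Site (Active-Active)", rpo_seconds=1, rto_seconds=1, cost_rank=4),
]


def cheapest_strategy(required_rpo_seconds: float,
                      required_rto_seconds: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rpo_seconds <= required_rpo_seconds
        and s.rto_seconds <= required_rto_seconds
    ]
    return min(candidates, key=lambda s: s.cost_rank, default=None)


# "Back online within 10 minutes", tolerating up to an hour of data loss.
choice = cheapest_strategy(required_rpo_seconds=1 * HOUR, required_rto_seconds=10 * 60)
print(choice.name if choice else "no strategy fits")
```

Under these assumed thresholds, a 10-minute RTO rules out Backup & Restore and Pilot Light, so the cheapest fit is Warm Standby.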

Hierarchical Outline

Foundational Metrics
- RPO: Focuses on Data. Determines backup frequency.
- RTO: Focuses on Time. Determines automation level.
The DR Spectrum
- Backup & Restore: Data is in S3; infrastructure is redeployed via CloudFormation during a disaster.
- Pilot Light: Database is live (e.g., RDS Read Replica); App Servers are "off" (AMIs ready to boot).
- Warm Standby: Small "fleet" of instances always running; scales up via Auto Scaling during failover.
- Multi-Site: Full traffic handled by two regions simultaneously; Route 53 manages the steering.
Service-Specific Resilience
- RDS: Multi-AZ (synchronous) for HA within a region; Cross-Region Read Replicas (asynchronous) for DR.
- Aurora: Global Database, with cross-region replication lag typically under one second.
- DynamoDB: Global Tables for multi-region active-active workloads.
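
The link between RPO and backup frequency in the outline above can be made concrete: the data lost in a disaster is the time between the last completed backup and the failure, so the backup interval sets the worst case. The following is a minimal sketch of that check; the function names and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Sequence


def data_loss_window(backup_times: Sequence[datetime],
                     failure_time: datetime) -> timedelta:
    """Time between the last backup taken before the failure and the failure."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(earlier)


def meets_rpo(backup_times: Sequence[datetime],
              failure_time: datetime,
              rpo: timedelta) -> bool:
    """True if the data lost at failure_time stays within the RPO."""
    return data_loss_window(backup_times, failure_time) <= rpo


# Hypothetical schedule: a backup every 6 hours, failure at 15:30.
backups = [datetime(2024, 1, 1, h) for h in (0, 6, 12)]
failure = datetime(2024, 1, 1, 15, 30)

print(data_loss_window(backups, failure))                 # 3:30:00
print(meets_rpo(backups, failure, timedelta(hours=4)))    # True
print(meets_rpo(backups, failure, timedelta(hours=1)))    # False
```

With a 6-hour schedule the loss can approach 6 hours just before the next backup runs, so a 1-hour RPO would call for more frequent backups or continuous replication.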

Visual Anchors

Understanding RPO vs. RTO

The DR Strategy Spectrum

Definition-Example Pairs

Backup & Restore
- Definition: Keeping copies of data and configuration to recreate the environment after a failure.
- Example: An internal HR system that can afford to be down for 24 hours. Daily EBS snapshots are taken and copied to a second region. If the primary region fails, an admin uses Terraform to rebuild the VPC in the second region and restores the database from the latest snapshot copy.
Pilot Light
- Definition: Keeping the most critical data components live while keeping compute resources idle.
- Example: An e-commerce site keeps an RDS Read Replica in a secondary region. Web servers are NOT running, but their AMIs are ready. If the primary region fails, the replica is promoted to primary, and the web fleet is launched.
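
The two failover steps in the Pilot Light example (promote the replica, then launch the web fleet) could be scripted roughly as follows. This is a sketch under stated assumptions, not a complete runbook: it assumes the web tier is defined as an Auto Scaling group in the secondary region that is kept at zero instances, with a maximum size at least as large as the requested capacity, and that the two arguments are boto3 `rds` and `autoscaling` clients for that region. The function name and the identifiers in the usage comment are hypothetical.

```python
from typing import Any, Dict


def pilot_light_failover(rds_client: Any,
                         autoscaling_client: Any,
                         replica_id: str,
                         web_asg_name: str,
                         desired_capacity: int) -> Dict[str, Any]:
    """Promote the standby replica, then start the idle web fleet."""
    # Step 1: promote the cross-region read replica to a standalone primary.
    rds_client.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Step 2: bring the web tier up from zero by raising the desired capacity
    # of the (assumed) Auto Scaling group that launches the prepared AMIs.
    autoscaling_client.set_desired_capacity(
        AutoScalingGroupName=web_asg_name,
        DesiredCapacity=desired_capacity,
    )

    return {
        "promoted_replica": replica_id,
        "web_asg": web_asg_name,
        "desired_capacity": desired_capacity,
    }


# Usage with real clients (hypothetical identifiers, secondary region assumed):
#   import boto3
#   rds = boto3.client("rds", region_name="us-west-2")
#   asg = boto3.client("autoscaling", region_name="us-west-2")
#   pilot_light_failover(rds, asg, "shop-db-replica", "shop-web-asg", 4)
```

Passing the clients in as arguments keeps the function easy to rehearse against stubs. Promotion is asynchronous, so a real procedure would also wait for the promoted instance to become available and redirect traffic (for example, through Route 53) before declaring recovery complete.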

Worked Examples

Example 1: Calculating Potential Downtime

Scenario: A company updates its application 6 times a year, with each update requiring 4 hours of downtime. Additionally, they suffer one failure per quarter that takes 70 minutes to fix. What is their approximate annual availability?

Step-by-Step Breakdown:

Update Downtime: $6 \text{ updates} \times 4 \text{ hours} = 24 \text{ hours/year}$.
Failure Downtime: $4 \text{ quarters} \times 70 \text{ minutes} = 280 \text{ minutes} \approx 4.7 \text{ hours/year}$.
Total Downtime: $24 + 4.7 = 28.7 \text{ hours/year}$.
Availability Calculation: There are 8,760 hours in a year, so $(8760 - 28.7) / 8760 \approx 99.67\%$.
Rounding Rule: In DR planning, round down to be conservative. Because 99.67% falls short of 99.9% (which allows only about 8.76 hours of downtime per year), the expected availability is reported as 99%.
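
The arithmetic in this worked example can be reproduced in a few lines, which makes it easy to try other update schedules or failure rates. The function name is hypothetical; the inputs are the figures from the scenario.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def annual_availability(planned_hours: float, unplanned_hours: float) -> float:
    """Availability as a percentage of the year, given total downtime in hours."""
    downtime = planned_hours + unplanned_hours
    return (HOURS_PER_YEAR - downtime) / HOURS_PER_YEAR * 100


update_downtime = 6 * 4            # 6 updates x 4 hours each
failure_downtime = 4 * 70 / 60     # 4 failures x 70 minutes, converted to hours

availability = annual_availability(update_downtime, failure_downtime)
print(f"{availability:.2f}%")      # 99.67%
```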

Checkpoint Questions

Which DR strategy offers the lowest RTO but has the highest cost?
In a Pilot Light strategy, what happens to the database when a disaster occurs?
How does an RDS Multi-AZ deployment differ from a Cross-Region Read Replica in terms of RPO?
If a business requirement states "We must be back online within 10 minutes of a failure," which strategy is the minimum required?

Answers

Multi-Site (Active-Active).
The database (often a read replica) is promoted to a primary standalone instance.
Multi-AZ is synchronous (Zero RPO/Data Loss) but only within one region. Cross-Region Read Replicas are asynchronous (Low RPO/Some Data Loss) but protect against regional failure.
Warm Standby (Pilot Light typically takes 30+ minutes to spin up resources, while Warm Standby is already running at low capacity).
