AWS Disaster Recovery: Architecting for Business Continuity

This guide explores the four primary disaster recovery (DR) strategies on AWS, ranging from low-cost archival methods to high-availability multi-site configurations. Understanding the trade-offs between cost, complexity, and recovery speed is essential for the AWS Certified Solutions Architect - Professional exam.

Learning Objectives

Define and differentiate Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Compare the four AWS DR strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-site.
Identify key AWS services that enable rapid failover, such as AWS Elastic Disaster Recovery (DRS), Route 53, and Global Accelerator.
Select the appropriate DR strategy based on specific business requirements and budget constraints.

Key Terms & Glossary

Recovery Time Objective (RTO): The maximum acceptable delay between the service interruption and restoration of service. Example: A bank needing to be back online within 15 minutes.
Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. Example: A retail site that can afford to lose 5 minutes of transaction data.
Failover: The process of automatically or manually switching to a redundant or standby system upon the failure of the primary system. Example: Switching traffic from us-east-1 to us-west-2 when a Region goes dark.
AWS Elastic Disaster Recovery (DRS): A service that minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications. Example: Replicating on-premises VMware VMs to Amazon EC2 in a ready-to-launch state.

The "Big Idea"

Disaster Recovery is not a "one-size-fits-all" solution; it is a spectrum of trade-offs. On one end, you minimize cost but accept high downtime (Backup & Restore). On the other, you minimize downtime but pay for redundant, active infrastructure (Multi-site). Effective DR planning involves aligning technical architecture with the organization's Business Continuity Plan (BCP).

Formula / Concept Box

Metric	Definition	Impact of Lower Value
RTO	"How long until we are back up?"	Higher cost; faster business resumption.
RPO	"How much data can we lose?"	Higher cost; more frequent/continuous replication.

[!TIP] As RTO and RPO approach zero, the cost of the architecture rises steeply, because more infrastructure must stay running and data must replicate continuously.
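
To make the two metrics concrete, the following minimal sketch computes the RPO and RTO actually achieved in an incident and compares them with the stated objectives. The function names and the sample timestamps are illustrative assumptions, not part of any AWS API.

```python
from datetime import datetime, timedelta

def achieved_rpo(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last usable recovery point and the failure."""
    return failure_time - last_recovery_point

def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the failure and the restoration of service."""
    return service_restored - failure_time

def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss window and the downtime are within target."""
    return rpo <= rpo_target and rto <= rto_target

# Hypothetical incident: nightly backup at 02:00, failure at 14:30, restored at 18:00.
last_backup = datetime(2024, 1, 10, 2, 0)
failure = datetime(2024, 1, 10, 14, 30)
restored = datetime(2024, 1, 10, 18, 0)

rpo = achieved_rpo(last_backup, failure)   # 12 hours 30 minutes of data at risk
rto = achieved_rto(failure, restored)      # 3 hours 30 minutes of downtime
ok = meets_objectives(rpo, rto, timedelta(hours=24), timedelta(hours=4))
print(rpo, rto, ok)
```

In this hypothetical case the nightly backup satisfies a 24-hour RPO and a 4-hour RTO, but the same incident would miss a 1-hour RPO by a wide margin, which is what pushes a workload toward continuous replication.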

Hierarchical Outline

Disaster Recovery Foundations
- Risk Assessment: Evaluating impact of AZ vs. Regional failures.
- Multi-AZ vs. Multi-Region: Multi-AZ protects against local events (floods, power); Multi-Region protects against massive geographic disasters.
The Four DR Strategies
- Backup & Restore: Cheapest; involves restoring data from backups (S3, AWS Backup).
- Pilot Light: Core data is live (replicated); compute is "dark" (scaled to zero) until needed.
- Warm Standby: A scaled-down version of the environment is always running.
- Multi-site (Active-Active): Full capacity running in two or more regions simultaneously.
AWS Recovery Tools
- AWS DRS: Continuous block-level replication of source servers (physical, virtual, or cloud-based) into AWS.
- Global Accelerator: Provides faster traffic failover than standard DNS TTL-based failover.
- Route 53 Health Checks: Detect endpoint failure and, combined with failover routing, automate traffic redirection.
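
As an illustration of how Route 53 health checks pair with failover routing, the sketch below builds the change batch for a PRIMARY/SECONDARY record pair in the shape accepted by the Route 53 `ChangeResourceRecordSets` operation. The domain name, IP addresses (from documentation ranges), and health check ID are placeholders, and the helper names are assumptions made for this example.

```python
def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover-routed A record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,   # distinguishes records that share a name and type
        "Failover": role,
        "TTL": ttl,                # a low TTL shortens how long resolvers cache the old answer
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

def failover_change_batch(name, primary_ip, secondary_ip, primary_health_check_id, ttl=60):
    """Change batch: health-checked primary plus a standby secondary."""
    return {
        "Comment": "Active-passive failover pair",
        "Changes": [
            failover_record(name, primary_ip, "PRIMARY", "primary",
                            primary_health_check_id, ttl),
            failover_record(name, secondary_ip, "SECONDARY", "secondary", None, ttl),
        ],
    }

# Placeholder values for illustration only.
batch = failover_change_batch("app.example.com.", "192.0.2.10", "198.51.100.10",
                              "hc-primary-placeholder")
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of `change_resource_record_sets` along with the hosted zone ID. While the primary's health check passes, Route 53 answers with the primary record; once it fails, queries receive the secondary.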

Visual Anchors

[Figure: The DR Strategy Spectrum]

[Figure: Pilot Light Architecture Overview]

Definition-Example Pairs

Backup & Restore: The process of copying data to a secondary location to be restored later.
- Example: Nightly snapshots of EBS volumes stored in S3 and copied to another region.
Pilot Light: Keeping the "fire" (database) going while the rest of the house (app servers) is dark.
- Example: An Aurora Global Database replicating to a second region with an empty Auto Scaling Group that only launches instances during a disaster.
Warm Standby: Maintaining a "small version" of the production environment in a secondary region.
- Example: A 2-node cluster in Region A and a 1-node, small-instance cluster in Region B that is already serving health checks and can scale up to full capacity within minutes.
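
The Pilot Light example above relies on an empty Auto Scaling group that is scaled up during a disaster. A minimal sketch of that activation step follows. It takes the client as a parameter so that it runs here against a recording stand-in. `RecordingClient`, `activate_pilot_light`, and the group name `dr-app-asg` are hypothetical names for this example, while `update_auto_scaling_group` and its parameters follow the boto3 Auto Scaling client.

```python
class RecordingClient:
    """Stand-in for an Auto Scaling client; records calls instead of calling AWS."""
    def __init__(self):
        self.calls = []

    def update_auto_scaling_group(self, **kwargs):
        self.calls.append(kwargs)

def activate_pilot_light(autoscaling_client, group_name, desired_capacity, max_size=None):
    """Scale a zero-capacity Auto Scaling group up so it can serve traffic."""
    if desired_capacity < 1:
        raise ValueError("desired_capacity must be at least 1")
    # MaxSize must never be lower than the capacity being requested.
    ceiling = max(desired_capacity, max_size or desired_capacity)
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=desired_capacity,
        MaxSize=ceiling,
        DesiredCapacity=desired_capacity,
    )

# Dry run with the stand-in client (no AWS credentials needed).
client = RecordingClient()
activate_pilot_light(client, "dr-app-asg", desired_capacity=4)
print(client.calls[0])
```

Against a real account, the same function could be given `boto3.client("autoscaling", region_name="us-west-2")`. Promoting the replicated database and redirecting traffic are separate steps that this sketch does not cover.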

Comparison Tables

Strategy	RTO (Downtime)	RPO (Data Loss)	Relative Cost	Main AWS Services
Backup & Restore	Hours	Hours (up to 24 with nightly backups)	$	S3, AWS Backup, Glacier
Pilot Light	Tens of minutes	Minutes	$$	AWS DRS, Aurora Global DB, Route 53
Warm Standby	Minutes	Seconds	$$$	ASG, RDS Read Replicas, ELB
Multi-site	Near-zero	Near-zero	$$$$	Route 53, Global Accelerator, DynamoDB Global Tables
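
The table can be read as a lookup: pick the cheapest strategy whose typical RTO and RPO both fit the requirement. The sketch below encodes that reading. The numeric bounds are assumptions chosen to stand in for the table's qualitative ranges ("hours", "tens of minutes", and so on), not AWS-published figures, and should be replaced with values measured for a real workload.

```python
from datetime import timedelta
from typing import NamedTuple

class Strategy(NamedTuple):
    name: str
    rto: timedelta   # assumed worst-case recovery time
    rpo: timedelta   # assumed worst-case data-loss window

# Ordered cheapest first. All bounds are illustrative assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", timedelta(minutes=60), timedelta(minutes=10)),
    Strategy("Warm Standby", timedelta(minutes=10), timedelta(seconds=60)),
    Strategy("Multi-site", timedelta(minutes=1), timedelta(seconds=5)),
]

def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> str:
    """Return the first (cheapest) strategy that satisfies both objectives."""
    for strategy in STRATEGIES:
        if strategy.rto <= rto_target and strategy.rpo <= rpo_target:
            return strategy.name
    raise ValueError("No listed strategy meets these objectives")

# A retailer needing RTO of 5 minutes and RPO of 1 minute.
print(cheapest_strategy(timedelta(minutes=5), timedelta(minutes=1)))
# An internal portal that can tolerate a day of downtime and data loss.
print(cheapest_strategy(timedelta(hours=24), timedelta(hours=24)))
```

Under these assumed bounds, the two calls select Multi-site and Backup & Restore respectively, matching the reasoning in the worked examples.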

Worked Examples

Scenario 1: The E-Commerce Giant

Requirement: A global retailer loses $1M per hour of downtime. They need an RPO of < 1 minute and an RTO of < 5 minutes. Solution: Multi-site Active-Active. With DynamoDB Global Tables (multi-Region, active-active) and AWS Global Accelerator, traffic can be shifted away from a failing Region in seconds. Because Global Tables replicate asynchronously by default (typically within about a second), data loss is limited to writes that had not yet replicated, which stays well inside the 1-minute RPO.

Scenario 2: The Internal HR Portal

Requirement: A company portal used for employee benefits. It must be recoverable, but employees can wait a day if a regional disaster occurs. Solution: Backup & Restore. AWS Backup automates cross-Region copies of EBS snapshots and RDS backups, which minimizes cost while keeping the data safe in a separate geographic area.

Checkpoint Questions

What is the primary difference between Pilot Light and Warm Standby regarding compute resources?
Why does AWS Global Accelerator provide faster failover than Amazon Route 53?
Which AWS service is the recommended successor to CloudEndure for block-level replication?
How does Aurora Global Database reduce RTO compared to standard RDS Read Replicas?

Muddy Points & Cross-Refs

DNS TTL Issues: Even with a low TTL, Route 53 failover can be delayed by client-side DNS caching. Cross-ref: Look into AWS Global Accelerator for IP-based failover that bypasses DNS caching.
Aurora vs. RDS DR: A standard RDS cross-Region read replica must be promoted to a standalone instance, which takes minutes, whereas an Aurora Global Database secondary cluster can typically be promoted in under a minute. Cross-ref: See Database Selection for more on Aurora vs. RDS overhead.
Warm Standby Scale: The primary difference between Warm Standby and Multi-site is that Warm Standby is under-provisioned and requires scaling time, whereas Multi-site is full-capacity and requires zero scaling.
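
The DNS TTL point can be made concrete with simple arithmetic: DNS-based failover takes roughly the time needed to detect the failure plus the time for cached answers to expire. The sketch below is a rough estimate under that assumption. The function name is hypothetical, and the model ignores clients that cache beyond the TTL and the time the standby itself needs to become ready.

```python
def dns_failover_estimate(check_interval_s: int, failure_threshold: int,
                          record_ttl_s: int) -> int:
    """Rough upper bound, in seconds, for DNS-based failover to take effect."""
    # Detection: the endpoint must fail this many consecutive health checks.
    detection_s = check_interval_s * failure_threshold
    # Propagation: resolvers may keep serving the old answer until the TTL expires.
    return detection_s + record_ttl_s

# Standard 30-second checks, threshold of 3, 60-second TTL.
standard = dns_failover_estimate(30, 3, 60)
# Fast 10-second checks with the same threshold and TTL.
fast = dns_failover_estimate(10, 3, 60)
print(standard, fast)  # 150 90
```

Even with fast health checks, the TTL term remains, which is why an IP-based approach such as Global Accelerator, whose static addresses do not change during failover, avoids the caching delay.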