AWS Disaster Recovery: Architecting for Business Continuity
Disaster recovery strategies (for example, using AWS Elastic Disaster Recovery, pilot light, warm standby, and multi-site)
This guide explores the four primary disaster recovery (DR) strategies on AWS, ranging from low-cost archival methods to high-availability multi-site configurations. Understanding the trade-offs between cost, complexity, and recovery speed is essential for the AWS Certified Solutions Architect - Professional exam.
Learning Objectives
- Define and differentiate Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Compare the four AWS DR strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-site.
- Identify key AWS services that enable rapid failover, such as AWS Elastic Disaster Recovery (DRS), Route 53, and Global Accelerator.
- Select the appropriate DR strategy based on specific business requirements and budget constraints.
Key Terms & Glossary
- Recovery Time Objective (RTO): The maximum acceptable delay between the service interruption and restoration of service. Example: A bank needing to be back online within 15 minutes.
- Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. Example: A retail site that can afford to lose 5 minutes of transaction data.
- Failover: The process of automatically or manually switching to a redundant or standby system upon the failure of the primary system. Example: Switching traffic from us-east-1 to us-west-2 when the primary region becomes unavailable.
- AWS Elastic Disaster Recovery (DRS): A service that minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications. Example: Continuously replicating on-premises VMware VMs to AWS so they can be launched as Amazon EC2 instances during recovery.
The "Big Idea"
Disaster Recovery is not a "one-size-fits-all" solution; it is a spectrum of trade-offs. On one end, you minimize cost but accept high downtime (Backup & Restore). On the other, you minimize downtime but pay for redundant, active infrastructure (Multi-site). Effective DR planning involves aligning technical architecture with the organization's Business Continuity Plan (BCP).
Formula / Concept Box
| Metric | Definition | Impact of Lower Value |
|---|---|---|
| RTO | "How long until we are back up?" | Higher cost; faster business resumption. |
| RPO | "How much data can we lose?" | Higher cost; more frequent/continuous replication. |
[!TIP] As RTO and RPO approach zero, the cost of the architecture rises steeply, because more infrastructure must run continuously in the recovery region.
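To make the two metrics concrete, here is a minimal sketch that measures the RPO and RTO actually achieved in an incident and compares them with the targets. The timeline values and function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_copy: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: writes made after the last recoverable copy are lost."""
    return failure_time - last_recoverable_copy


def achieved_rto(failure_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime window: time from the interruption until service is restored."""
    return service_restored - failure_time


def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """Both measured values must be within their targets."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident: nightly backup at 02:00, failure at 09:30, restored at 13:00.
last_backup = datetime(2024, 1, 1, 2, 0)
failure = datetime(2024, 1, 1, 9, 30)
restored = datetime(2024, 1, 1, 13, 0)

rpo = achieved_rpo(last_backup, failure)   # 7 hours 30 minutes of data lost
rto = achieved_rto(failure, restored)      # 3 hours 30 minutes of downtime
```

With a 24-hour RPO target and a 4-hour RTO target this incident passes; with the 5-minute RPO from the glossary example it would not.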
Hierarchical Outline
- Disaster Recovery Foundations
  - Risk Assessment: Evaluating impact of AZ vs. Regional failures.
  - Multi-AZ vs. Multi-Region: Multi-AZ protects against local events (floods, power); Multi-Region protects against massive geographic disasters.
- The Four DR Strategies
  - Backup & Restore: Cheapest; involves restoring data from backups (S3, AWS Backup).
  - Pilot Light: Core data is live (replicated); compute is "dark" (scaled to zero) until needed.
  - Warm Standby: A scaled-down version of the environment is always running.
  - Multi-site (Active-Active): Full capacity running in two or more regions simultaneously.
- AWS Recovery Tools
  - AWS DRS: Block-level replication for physical, virtual, and cloud-based servers.
  - Global Accelerator: Provides faster traffic failover than standard DNS TTL-based failover.
  - Route 53 Health Checks: Automates traffic redirection.
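As an illustration of how Route 53 health checks drive traffic redirection, the sketch below builds a failover record pair as a plain dictionary. The shape follows the `ChangeBatch` argument accepted by boto3's `route53.change_resource_record_sets`; the domain name, IP addresses (documentation ranges), and health check ID are placeholders, and the helper function is hypothetical.

```python
def failover_record(name, ip, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a Route 53 failover routing record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,      # distinguishes records sharing a name
        "Failover": role,             # selects failover routing
        "TTL": ttl,                   # a low TTL shortens client-side caching
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        # Route 53 stops answering with this record when the check fails.
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Primary/secondary failover pair (illustrative)",
    "Changes": [
        failover_record("app.example.com.", "203.0.113.10", "PRIMARY",
                        "primary-region",
                        health_check_id="00000000-0000-0000-0000-000000000000"),
        failover_record("app.example.com.", "198.51.100.10", "SECONDARY",
                        "dr-region"),
    ],
}
```

In a real account this dictionary would be passed as `ChangeBatch` together with a `HostedZoneId`; the sketch stops short of calling AWS so it can be read and run without credentials.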
Visual Anchors
[Figure: The DR Strategy Spectrum]
[Figure: Pilot Light Architecture Overview]
Definition-Example Pairs
- Backup & Restore: The process of copying data to a secondary location to be restored later.
  - Example: Nightly snapshots of EBS volumes stored in S3 and copied to another region.
- Pilot Light: Keeping the "fire" (database) going while the rest of the house (app servers) is dark.
  - Example: An Aurora Global Database replicating to a second region with an empty Auto Scaling Group that only launches instances during a disaster.
- Warm Standby: Maintaining a "small version" of the production environment in a secondary region.
  - Example: A 2-node production cluster in Region A and a 1-node, small-instance cluster in Region B that already serves requests and can be scaled up to full capacity within minutes.
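The difference between these strategies shows up mostly in how the recovery region's Auto Scaling Group is sized before and after failover. The sketch below computes those sizes; the keys match the `MinSize`, `MaxSize`, and `DesiredCapacity` parameters of boto3's `autoscaling.update_auto_scaling_group`, while the group name and the 25% warm-standby fraction are assumptions made for illustration.

```python
import math


def asg_settings(group_name: str, desired: int, max_size: int) -> dict:
    """Keyword arguments in the shape expected by update_auto_scaling_group."""
    return {
        "AutoScalingGroupName": group_name,
        "MinSize": desired,
        "MaxSize": max_size,
        "DesiredCapacity": desired,
    }


def standby_settings(strategy: str, group_name: str, prod_capacity: int,
                     warm_fraction: float = 0.25) -> dict:
    """Capacity kept running in the recovery region during normal operation."""
    if strategy == "pilot_light":
        desired = 0                                   # compute stays dark
    elif strategy == "warm_standby":
        desired = max(1, math.ceil(prod_capacity * warm_fraction))
    elif strategy == "multi_site":
        desired = prod_capacity                       # full capacity, always on
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return asg_settings(group_name, desired, prod_capacity)


def failover_settings(group_name: str, prod_capacity: int) -> dict:
    """Capacity requested in the recovery region once a disaster is declared."""
    return asg_settings(group_name, prod_capacity, prod_capacity)


pilot = standby_settings("pilot_light", "dr-app-asg", prod_capacity=8)
warm = standby_settings("warm_standby", "dr-app-asg", prod_capacity=8)
after_failover = failover_settings("dr-app-asg", prod_capacity=8)
```

Pilot Light starts from zero instances and must launch everything during failover, whereas Warm Standby only has to grow an environment that is already serving requests.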
Comparison Tables
| Strategy | RTO (Time) | RPO (Data) | Relative Cost | Main AWS Services |
|---|---|---|---|---|
| Backup & Restore | Hours | Hours (up to 24 with nightly backups) | $ | S3, AWS Backup, S3 Glacier |
| Pilot Light | 10s of Minutes | Minutes | $$ | AWS DRS, Aurora Global DB, Route 53 |
| Warm Standby | Minutes | Seconds | $$$ | ASG, RDS Read Replicas, ELB |
| Multi-site | Real-time | Near-zero | $$$$ | Route 53, Global Accelerator, DynamoDB Global Tables |
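The table can be read as a lookup: pick the cheapest strategy whose typical RTO and RPO both fit the requirement. The sketch below encodes that rule. The numeric values are rough assumptions chosen to represent "hours", "10s of minutes", and so on; they are not AWS-published guarantees.

```python
from typing import Optional

# (strategy, typical RTO in minutes, typical RPO in minutes), cheapest first.
# All numbers are illustrative assumptions derived from the table's wording.
STRATEGIES = [
    ("Backup & Restore", 480, 1440),
    ("Pilot Light", 30, 10),
    ("Warm Standby", 10, 1),
    ("Multi-site", 1, 0.05),
]


def cheapest_strategy(required_rto_min: float,
                      required_rpo_min: float) -> Optional[str]:
    """Return the first (cheapest) strategy that satisfies both objectives."""
    for name, typical_rto, typical_rpo in STRATEGIES:
        if typical_rto <= required_rto_min and typical_rpo <= required_rpo_min:
            return name
    return None  # no listed strategy is fast enough


# An RTO of 15 minutes with an RPO of 2 minutes rules out the two cheapest options.
choice = cheapest_strategy(required_rto_min=15, required_rpo_min=2)
```

Ordering the list by cost means the function never recommends more infrastructure than the stated objectives require.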
Worked Examples
Scenario 1: The E-Commerce Giant
Requirement: A global retailer loses $1M per hour of downtime. They need an RPO of < 1 minute and an RTO of < 5 minutes.
Solution: Multi-site Active-Active. With DynamoDB Global Tables (multi-region, active-active) and AWS Global Accelerator, traffic can be shifted away from a failing region in seconds. Because Global Tables replicate asynchronously (typically within about a second), data loss is near-zero rather than guaranteed zero: writes committed just before the failure may not have reached the other region.
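The business case for this scenario is simple arithmetic: multiply the hourly loss by the length of the outage. The sketch below applies the $1M per hour figure from the requirement to two outage durations; the 4-hour duration is a hypothetical Backup & Restore recovery used only for comparison.

```python
def downtime_cost(loss_per_hour: float, outage_minutes: float) -> float:
    """Revenue lost during an outage, assuming losses accrue linearly."""
    return loss_per_hour * outage_minutes / 60


LOSS_PER_HOUR = 1_000_000  # from the scenario's requirement

cost_at_target_rto = downtime_cost(LOSS_PER_HOUR, 5)       # about $83,333
cost_at_four_hours = downtime_cost(LOSS_PER_HOUR, 4 * 60)  # $4,000,000
```

An outage at the 5-minute RTO costs roughly $83,000, while a 4-hour restore would cost $4M, which is why the always-on expense of Multi-site can be justified here.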
Scenario 2: The Internal HR Portal
Requirement: A company portal used for employee benefits. It must be recoverable, but employees can wait a day if a regional disaster occurs.
Solution: Backup & Restore. Use AWS Backup to automate cross-region copies of EBS snapshots and RDS backups. This minimizes cost while keeping the data safe in a separate geographic area.
Checkpoint Questions
- What is the primary difference between Pilot Light and Warm Standby regarding compute resources?
- Why can AWS Global Accelerator fail over faster than DNS-based failover with Amazon Route 53?
- Which AWS service is the recommended successor to CloudEndure Disaster Recovery for block-level replication?
- How does Aurora Global Database reduce RTO compared to standard RDS Read Replicas?
Muddy Points & Cross-Refs
- DNS TTL Issues: Even with a low TTL, Route 53 failover can be delayed because resolvers and clients may cache DNS answers longer than the TTL. Cross-ref: Look into AWS Global Accelerator, whose static anycast IP addresses stay the same during failover, so no DNS change has to propagate.
- Aurora vs. RDS DR: Cross-region RDS read replicas must be promoted to standalone instances, which takes minutes, whereas an Aurora Global Database secondary cluster can typically be promoted in under a minute. Cross-ref: See Database Selection for more on Aurora vs. RDS overhead.
- Warm Standby Scale: The primary difference between Warm Standby and Multi-site is that Warm Standby is under-provisioned and requires scaling time, whereas Multi-site is full-capacity and requires zero scaling.
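The DNS TTL point above can be turned into a rough estimate: a DNS-based failover takes at least the time to detect the failure (health check interval multiplied by the failure threshold) plus the record's TTL. The sketch below uses Route 53's standard 30-second and fast 10-second check intervals and its default failure threshold of 3; the formula is a simplification that ignores clients that cache beyond the TTL.

```python
def dns_failover_estimate_seconds(check_interval_s: int = 30,
                                  failure_threshold: int = 3,
                                  record_ttl_s: int = 60) -> int:
    """Approximate time until clients resolve to the standby endpoint."""
    detection = check_interval_s * failure_threshold  # consecutive failed checks
    return detection + record_ttl_s                   # plus cached answers expiring


standard = dns_failover_estimate_seconds()                        # 30 * 3 + 60
fast = dns_failover_estimate_seconds(check_interval_s=10)         # 10 * 3 + 60
long_ttl = dns_failover_estimate_seconds(record_ttl_s=300)        # 30 * 3 + 300
```

Under these assumptions the estimate ranges from 90 to 390 seconds, which is why the TTL matters as much as the health check settings when the RTO is only a few minutes.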