Comprehensive Study Guide: Configuring Disaster Recovery Solutions on AWS
This study guide covers the architectural principles and implementation strategies for ensuring business continuity and disaster recovery (DR) on AWS, aligned with the AWS Certified Solutions Architect - Professional (SAP-C02) curriculum.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between High Availability (HA) and Disaster Recovery (DR).
- Define and calculate Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Select the appropriate DR strategy (Backup & Restore, Pilot Light, Warm Standby, or Multi-site) based on business requirements.
- Implement data and database replication across AWS Regions.
- Configure automated failover using Amazon Route 53 and AWS Global Accelerator.
Key Terms & Glossary
- Disaster Recovery (DR): The process of preparing for and recovering from a disaster that impacts a broad geographical area (e.g., a regional AWS outage).
- High Availability (HA): A design approach that ensures a system remains operational during local component failures (e.g., an Availability Zone outage).
- Failover: The automatic or manual process of switching to a redundant or standby IT system upon the failure of the primary system.
- Failback: The process of restoring operations to the primary site after they have been shifted to a DR site during failover.
- RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can lose up to 15 minutes of data").
The "Big Idea"
Disaster Recovery is a trade-off between cost and recovery time. As the required RTO and RPO approach zero (zero downtime and zero data loss), the complexity and cost of the solution rise steeply. An architect's primary job is to find the "sweet spot": the point at which the cost of the DR solution does not exceed the potential business loss caused by a disaster.
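The trade-off can be illustrated with a small calculation. The sketch below is only a rough model: every figure in it (annual costs, downtime hours, loss per hour, disaster frequency) is a hypothetical value chosen for illustration, not AWS pricing or guidance. It adds the yearly cost of each DR option to the expected yearly loss from downtime and picks the lowest total.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DrOption:
    name: str
    annual_cost: float              # yearly cost of running the DR solution
    expected_downtime_hours: float  # downtime per disaster (roughly the RTO)


def total_expected_cost(option: DrOption, loss_per_hour: float,
                        disasters_per_year: float) -> float:
    """Yearly DR spend plus the expected yearly business loss from downtime."""
    expected_loss = (disasters_per_year
                     * option.expected_downtime_hours
                     * loss_per_hour)
    return option.annual_cost + expected_loss


def cheapest_option(options: List[DrOption], loss_per_hour: float,
                    disasters_per_year: float) -> DrOption:
    """Return the option with the lowest combined cost (the 'sweet spot')."""
    return min(
        options,
        key=lambda o: total_expected_cost(o, loss_per_hour, disasters_per_year),
    )


# Hypothetical numbers, for illustration only.
OPTIONS = [
    DrOption("Backup & Restore", 5_000, 24.0),
    DrOption("Pilot Light", 20_000, 0.5),
    DrOption("Warm Standby", 60_000, 0.1),
    DrOption("Multi-site", 200_000, 0.0),
]

if __name__ == "__main__":
    best = cheapest_option(OPTIONS, loss_per_hour=10_000, disasters_per_year=0.2)
    print(best.name)
```

Under these assumed inputs, a low hourly loss favors Backup & Restore, while a very high hourly loss pushes the choice toward Warm Standby or Multi-site.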
Formula / Concept Box
Core DR Metrics
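| Metric | Definition | Focus Area |
|---|---|---|
| RPO | Maximum acceptable amount of data loss, measured in time | Data Integrity / Loss |
| RTO | Maximum acceptable delay between service interruption and restoration | Service Availability / Downtime |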
[!TIP] Pro Tip: In the SAP-C02 exam, look for keywords like "near-zero RTO" to immediately identify that a Multi-site Active-Active strategy is required, regardless of the higher cost.
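To practice calculating the two metrics, the following minimal sketch derives the achieved RPO and RTO of a single incident from three timestamps and compares them with the targets. The function names and the example timestamps are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, disaster_time: datetime) -> timedelta:
    """Data lost: time between the last recoverable copy and the disaster."""
    return disaster_time - last_recovery_point


def achieved_rto(disaster_time: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the disaster and restoration of service."""
    return service_restored - disaster_time


def meets_objectives(last_recovery_point: datetime, disaster_time: datetime,
                     service_restored: datetime, rpo_target: timedelta,
                     rto_target: timedelta) -> bool:
    """True when both the data loss and the downtime are within their targets."""
    return (achieved_rpo(last_recovery_point, disaster_time) <= rpo_target
            and achieved_rto(disaster_time, service_restored) <= rto_target)


# Example incident (hypothetical timestamps).
last_backup = datetime(2024, 1, 1, 11, 45)
disaster = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 40)

if __name__ == "__main__":
    print(achieved_rpo(last_backup, disaster))  # 0:15:00
    print(achieved_rto(disaster, restored))     # 0:40:00
```

Here the last backup was 15 minutes before the disaster, so 15 minutes of data is lost, and service returned 40 minutes after the disaster. The incident would satisfy an RPO of 15 minutes and an RTO of 1 hour, but not an RTO of 30 minutes.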
Hierarchical Outline
- Foundations of Business Continuity
  - AWS Global Infrastructure: Utilizing Regions and Availability Zones.
  - Automation: Using Infrastructure as Code (IaC) like AWS CloudFormation to recreate environments.
- Disaster Recovery Strategies
  - Backup & Restore: Traditional tape/disk-to-cloud backups (High RTO/RPO).
  - Pilot Light: Data is live; infrastructure is "dark" (stopped/minimal).
  - Warm Standby: Small-scale version of the environment is always running.
  - Multi-site (Active-Active): Fully functional identical environments in multiple regions.
- Implementation Tools
  - Data Replication: Aurora Global Database, DynamoDB Global Tables, S3 Cross-Region Replication (CRR).
  - Compute Recovery: AWS Elastic Disaster Recovery (DRS) for block-level replication.
  - Traffic Management: Amazon Route 53 Health Checks and Failover Routing.
Visual Anchors
[Figure: RPO and RTO Timeline]
[Figure: DR Strategy Spectrum]
Definition-Example Pairs
- Pilot Light Strategy
  - Definition: Keeping a minimal version of your environment in which the database is live and replicating, but application servers are provisioned (or started) only during a disaster.
  - Example: An enterprise keeps an RDS instance replicating to another region via a cross-region Read Replica, while the EC2 application fleet is stored as an Amazon Machine Image (AMI) and launched via Auto Scaling only after Route 53 health checks detect a failure.
- Warm Standby Strategy
  - Definition: A scaled-down but fully functional version of your environment is always running in a secondary region.
  - Example: A web application runs on 10 instances in Region A (Primary) and 2 instances in Region B (Secondary). During a disaster, the Auto Scaling Group in Region B scales up from 2 to 10 instances to handle the full load.
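The Warm Standby scale-up from 2 to 10 instances could be scripted roughly as follows. This is a sketch, not a complete runbook: it assumes a boto3 Auto Scaling client is passed in and uses that client's `update_auto_scaling_group` call; the helper name `scale_up_standby` and the group name used in the usage note are hypothetical.

```python
from typing import Any, Dict, Optional


def scale_up_standby(
    autoscaling_client: Any,
    group_name: str,
    full_capacity: int,
    max_size: Optional[int] = None,
) -> Dict[str, Any]:
    """Raise a standby Auto Scaling Group to full production capacity.

    `autoscaling_client` is expected to behave like
    boto3.client("autoscaling") for the secondary region.
    """
    if full_capacity < 1:
        raise ValueError("full_capacity must be at least 1")

    params = {
        "AutoScalingGroupName": group_name,
        "MinSize": full_capacity,
        "DesiredCapacity": full_capacity,
        # MaxSize must not be smaller than the desired capacity.
        "MaxSize": max(full_capacity, max_size or full_capacity),
    }
    autoscaling_client.update_auto_scaling_group(**params)
    return params
```

With boto3 installed and credentials configured, a call might look like `scale_up_standby(boto3.client("autoscaling", region_name="us-west-2"), "web-asg-region-b", 10)`, where the region and group name are placeholders. Passing the client in as a parameter keeps the function easy to test with a stand-in object.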
Worked Examples
Scenario: High-Volume E-Commerce Site
Requirement: The business requires an RPO of 1 minute and an RTO of less than 15 minutes. The application uses an RDS MySQL database and EC2 instances.
Solution Steps:
- Database: Migrate from RDS MySQL to Amazon Aurora MySQL and use Amazon Aurora Global Database. Its cross-region replication lag is typically under one second, well within the 1-minute RPO.
- Compute: Implement a Pilot Light or Warm Standby. Pre-configure an Auto Scaling Group in the secondary region with a minimum capacity of 0 (Pilot Light) or 1 (Warm Standby).
- Deployment: Use AWS CloudFormation to ensure the VPC and networking are identical in both regions.
- Traffic: Configure Amazon Route 53 with failover routing and health checks pointing to the primary region's Load Balancer.
- Recovery: Upon failure, promote the Aurora secondary cluster to primary and scale the ASG in the secondary region.
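The traffic step can be sketched as a pair of Route 53 failover alias records. The code below only builds the `ChangeBatch` dictionary in the shape accepted by Route 53's `ChangeResourceRecordSets` API; the helper names, domain, load balancer DNS names, and hosted zone IDs are hypothetical placeholders.

```python
from typing import Any, Dict, Optional


def failover_alias_record(
    name: str,
    role: str,
    lb_dns_name: str,
    lb_hosted_zone_id: str,
    health_check_id: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one failover alias record pointing at a regional load balancer."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record: Dict[str, Any] = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-region",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": lb_hosted_zone_id,
            "DNSName": lb_dns_name,
            # Let Route 53 consider the health of the alias target itself.
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(primary: Dict[str, Any],
                          secondary: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap both records in an UPSERT change batch."""
    return {
        "Comment": "Failover routing between primary and DR regions",
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": record}
            for record in (primary, secondary)
        ],
    }


# Placeholder values, for illustration only.
batch = failover_change_batch(
    failover_alias_record("app.example.com.", "PRIMARY",
                          "primary-alb.example.amazonaws.com.", "ZPRIMARYEXAMPLE"),
    failover_alias_record("app.example.com.", "SECONDARY",
                          "dr-alb.example.amazonaws.com.", "ZSECONDARYEXAMPLE"),
)
```

With boto3, the batch would be submitted via `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, using the real hosted zone ID of the domain.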
Checkpoint Questions
- Which DR strategy offers the lowest cost but the highest RTO?
- What is the main difference between Route 53 and AWS Global Accelerator regarding failover speed?
- Which AWS service is the successor to CloudEndure for block-level replication of VMs?
- True/False: In a Pilot Light scenario, application servers are always running at full capacity.
Answers:
- Backup & Restore.
- Global Accelerator is typically faster because it avoids DNS caching/TTL issues by using static IP addresses.
- AWS Elastic Disaster Recovery (DRS).
- False. (They are usually stopped or not provisioned at all; only the database/data is live).
Muddy Points & Cross-Refs
- HA vs. DR: Students often confuse them. Remember: HA is within a region (multi-AZ). DR is across regions (multi-region).
- Route 53 TTL: Even with automated failover, DNS TTL can delay users from reaching the new region. To mitigate this, consider AWS Global Accelerator, which uses the AWS backbone and does not rely on client-side DNS caching.
- Read Replicas vs. Multi-AZ: Multi-AZ is for HA (synchronous, same region). Read Replicas are for scaling and DR (asynchronous, can be cross-region).
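The Route 53 TTL point can be made concrete with a rough estimate. The sketch below assumes a simplified model in which the time for a client to reach the new region is the health check detection time (request interval multiplied by failure threshold) plus the record TTL that a resolver may still be caching. It ignores propagation and application warm-up, and the function name is hypothetical.

```python
def worst_case_dns_failover_seconds(
    check_interval_s: int,
    failure_threshold: int,
    record_ttl_s: int,
) -> int:
    """Rough upper bound on the time before clients resolve the DR endpoint."""
    if check_interval_s <= 0 or failure_threshold <= 0 or record_ttl_s < 0:
        raise ValueError("interval and threshold must be positive, TTL non-negative")
    # Consecutive failed checks needed before the endpoint is marked unhealthy.
    detection_s = check_interval_s * failure_threshold
    # A resolver that cached the old answer just before failover keeps it for one TTL.
    return detection_s + record_ttl_s


if __name__ == "__main__":
    # 30-second checks, threshold of 3, 60-second TTL.
    print(worst_case_dns_failover_seconds(30, 3, 60))  # 150
    # 10-second checks with the same threshold and TTL.
    print(worst_case_dns_failover_seconds(10, 3, 60))  # 90
```

Shortening the check interval or the TTL reduces the estimate. Setting the TTL term to 0 models a design that does not depend on client-side DNS caching, which is the property the Global Accelerator point refers to.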
Comparison Tables
Disaster Recovery Strategies Comparison
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours | 24h+ | $ | Low |
| Pilot Light | Minutes | 10s of Minutes | $$ | Medium |
| Warm Standby | Seconds | Minutes | $$$ | High |
| Multi-Site | Near Zero | Near Zero | $$$$ | Very High |
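The comparison table can be turned into a simple selection rule: choose the cheapest strategy whose typical RPO and RTO both fit the requirement. The sketch below is illustrative only. The table gives qualitative ranges, so the numeric values in minutes are assumptions picked to represent each row, not AWS-published figures.

```python
from typing import List, NamedTuple


class Strategy(NamedTuple):
    name: str
    typical_rpo_min: float  # assumed representative RPO, in minutes
    typical_rto_min: float  # assumed representative RTO, in minutes


# Ordered from cheapest to most expensive, as in the table.
STRATEGIES: List[Strategy] = [
    Strategy("Backup & Restore", 240, 1440),  # "Hours" / "24h+"
    Strategy("Pilot Light", 5, 30),           # "Minutes" / "10s of Minutes"
    Strategy("Warm Standby", 0.5, 5),         # "Seconds" / "Minutes"
    Strategy("Multi-Site", 0, 0),             # "Near Zero" / "Near Zero"
]


def select_strategy(required_rpo_min: float, required_rto_min: float) -> Strategy:
    """Return the cheapest strategy that satisfies both requirements."""
    for strategy in STRATEGIES:
        if (strategy.typical_rpo_min <= required_rpo_min
                and strategy.typical_rto_min <= required_rto_min):
            return strategy
    # Fall back to the most capable strategy if nothing else fits.
    return STRATEGIES[-1]


if __name__ == "__main__":
    print(select_strategy(required_rpo_min=10, required_rto_min=60).name)
```

Because the list is ordered by cost, the first match is also the cheapest. Changing the assumed numbers shifts the boundaries between strategies, so real decisions should use measured or contractually agreed values.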