Mastering Disaster Recovery Planning: AWS SAP-C02 Study Guide

Learning Objectives

After studying this guide, you should be able to:

Distinguish between High Availability (HA) and Disaster Recovery (DR) architectures.
Define and calculate Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Identify the four primary AWS DR strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-site Active-Active.
Align business continuity requirements with the appropriate cost and complexity tier.

Key Terms & Glossary

Disaster: A large-scale event (natural or man-made) that impacts a broad geographical area, potentially impairing an Availability Zone (AZ) or an entire Region.
RTO (Recovery Time Objective): The maximum acceptable time between a disaster and the restoration of service for a workload.
RPO (Recovery Point Objective): The maximum allowable data loss measured in time (e.g., "we can afford to lose 15 minutes of data").
Failover: The process of switching to a redundant or standby IT system upon the failure of the primary system.
Failback: The process of restoring operations to the primary machine or facility after they have been shifted to a secondary during failover.

The "Big Idea"

Disaster Recovery is a trade-off between cost and time. While High Availability (HA) handles local component failures (like a single EC2 instance dying), DR handles regional catastrophes. A good DR plan keeps the business operational, but driving data loss and downtime toward zero costs far more than accepting a few hours of recovery time. Your goal as an architect is to find the "sweet spot" where the cost of the DR solution does not exceed the cost of the potential downtime.
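The "sweet spot" idea can be illustrated with a small calculation. The following is a minimal sketch, not a pricing model: the option costs, recovery times, downtime cost per hour, and disaster frequency are all hypothetical figures chosen for illustration, and the model ignores the cost of lost data.

```python
from dataclasses import dataclass


@dataclass
class DrOption:
    name: str
    annual_cost: float  # hypothetical yearly cost of running this DR option
    rto_hours: float    # hypothetical downtime per disaster with this option


def expected_annual_total(option, downtime_cost_per_hour, disasters_per_year):
    """DR spend plus the expected downtime loss over one year."""
    expected_loss = option.rto_hours * downtime_cost_per_hour * disasters_per_year
    return option.annual_cost + expected_loss


def cheapest_option(options, downtime_cost_per_hour, disasters_per_year):
    """Return the option with the lowest expected total yearly cost."""
    return min(
        options,
        key=lambda o: expected_annual_total(
            o, downtime_cost_per_hour, disasters_per_year
        ),
    )


# Illustrative numbers only; they are not taken from AWS pricing.
OPTIONS = [
    DrOption("Backup & Restore", 10_000, 12.0),
    DrOption("Pilot Light", 40_000, 0.5),
    DrOption("Warm Standby", 120_000, 0.15),
    DrOption("Multi-site Active-Active", 400_000, 0.0),
]

# A business losing 50,000 per hour of downtime, one disaster every five years.
best_for_costly_downtime = cheapest_option(OPTIONS, 50_000, 0.2)
# A business losing only 1,000 per hour of downtime.
best_for_cheap_downtime = cheapest_option(OPTIONS, 1_000, 0.2)
```

Under these assumed figures, the expensive-downtime business lands on Pilot Light, while the cheap-downtime business is better served by Backup & Restore. Changing the inputs moves the sweet spot, which is the point of the exercise.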

Formula / Concept Box

KPI	Focus	Metric	Impact of Lower Values
RTO	Downtime	Time (Mins/Hours)	Faster recovery; higher infrastructure costs.
RPO	Data Loss	Time (Mins/Hours)	Less data lost; higher sync/backup costs.

[!IMPORTANT] The solution's costs and complexity rise as the RTO and/or RPO values decrease. Zero RTO/RPO requires synchronous replication and active-active setups.
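Both metrics are time intervals measured from the moment of the disaster, so the values actually achieved in an incident can be computed from three timestamps. The sketch below assumes hypothetical function names and a made-up incident timeline; the 5-minute and 15-minute targets are example values only.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recoverable_copy: datetime, disaster: datetime) -> timedelta:
    """Data written after the last recoverable copy and before the disaster is lost."""
    return disaster - last_recoverable_copy


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Downtime runs from the disaster until service is restored."""
    return service_restored - disaster


def meets_objectives(last_copy, disaster, restored, rpo_target, rto_target) -> bool:
    """True when both the data-loss window and the downtime are within target."""
    return (
        achieved_rpo(last_copy, disaster) <= rpo_target
        and achieved_rto(disaster, restored) <= rto_target
    )


# Hypothetical incident timeline.
last_copy = datetime(2024, 1, 1, 11, 57)  # last replicated or backed-up state
disaster = datetime(2024, 1, 1, 12, 0)    # primary Region becomes unavailable
restored = datetime(2024, 1, 1, 12, 12)   # service is back in the DR Region

within_targets = meets_objectives(
    last_copy,
    disaster,
    restored,
    rpo_target=timedelta(minutes=5),
    rto_target=timedelta(minutes=15),
)
```

Here the achieved RPO is 3 minutes and the achieved RTO is 12 minutes, so both example targets are met.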

Hierarchical Outline

Core Definitions
- HA vs. DR: HA addresses local failures; DR addresses large-scale/regional failures.
- Business Continuity Plan (BCP): The overarching organizational strategy that DR plans support.
Disaster Recovery Strategies
- Backup & Restore: Lowest cost, highest RTO/RPO. Relies on backups such as tape or Amazon S3.
- Pilot Light: Core data is live; application servers are "off" (AMIs ready).
- Warm Standby: A scaled-down version of the environment is always running.
- Multi-site (Active-Active): Full capacity running in two or more regions; near-zero RTO.
Implementation Tools
- AWS Elastic Disaster Recovery (DRS): Continuously replicates source servers into AWS and automates their recovery.
- Amazon Route 53: Global traffic routing and health checks.

Visual Anchors

The DR Strategy Spectrum

RPO vs RTO Timeline

Definition-Example Pairs

Pilot Light: Maintaining a "dormant" version of your environment where only critical data-providing services (like databases) are running and kept up to date.
- Example: Keeping an Amazon RDS instance replicating to a second Region, but having no EC2 instances running there. If a disaster occurs, you use an Auto Scaling group to quickly provision the web servers from pre-configured AMIs.
Warm Standby: Keeping a "scaled-down" version of your environment always running in a second region.
- Example: Running 1 small EC2 instance in Region B to mirror a 10-instance cluster in Region A. In a disaster, the Region B fleet is scaled out from 1 to 10 instances via an Auto Scaling policy.
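The Warm Standby scale-out in the example above could be triggered with a single Auto Scaling API call. The following is a rough sketch under stated assumptions: the group name `standby-web-asg`, the Region `us-west-2`, and the helper names are hypothetical, and it pins minimum, maximum, and desired capacity to the same value for simplicity. The boto3 import is deferred so the parameter builder can be exercised without AWS access.

```python
def scale_up_params(asg_name: str, full_capacity: int) -> dict:
    """Build keyword arguments for the Auto Scaling UpdateAutoScalingGroup call."""
    if full_capacity < 1:
        raise ValueError("full_capacity must be at least 1")
    return {
        "AutoScalingGroupName": asg_name,
        "MinSize": full_capacity,          # keep the fleet at full size during the DR event
        "MaxSize": full_capacity,          # simplification: no extra headroom
        "DesiredCapacity": full_capacity,  # e.g. 10 instead of the standby's 1
    }


def scale_up_standby(asg_name: str, full_capacity: int, region: str) -> None:
    """Apply the scale-out in the DR Region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so scale_up_params works without boto3 installed

    client = boto3.client("autoscaling", region_name=region)
    client.update_auto_scaling_group(**scale_up_params(asg_name, full_capacity))


# Hypothetical names; this only builds the request and does not call AWS.
params = scale_up_params("standby-web-asg", 10)
# During a real failover one might run:
# scale_up_standby("standby-web-asg", 10, "us-west-2")
```

A Pilot Light setup would use the same call, but starting from a desired capacity of 0 rather than 1, which is why its RTO tends to be longer.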

Worked Examples

Scenario: The Critical Financial App

Requirement: A banking application requires that in the event of a regional failure, the service must be back online within 15 minutes (RTO) and must not lose more than 5 minutes of data (RPO).

Step-by-Step Breakdown:

Assess Backup & Restore: Restoration from S3/EBS snapshots usually takes hours for large datasets. Rejected (RTO too high).
Assess Pilot Light: Provisioning a full fleet of EC2 instances and updating DNS usually takes 10-20 minutes. Borderline (Risky RTO).
Assess Warm Standby: Since a small fleet is already running, scaling out and switching DNS via Route 53 can happen in under 10 minutes. An RDS cross-Region read replica replicates asynchronously and can typically provide an RPO of seconds to minutes. Selected.
Final Architecture: Route 53 Health Checks + RDS Cross-Region Read Replica + Low-capacity ASG in Region B.
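The DNS part of this architecture can be expressed as a pair of Route 53 failover records: a PRIMARY record tied to a health check and a SECONDARY record pointing at Region B. The sketch below only builds the change batch as a plain dictionary; the domain names, set identifiers, and health check ID are hypothetical placeholders, and CNAME records with a 60-second TTL are an assumption made for brevity.

```python
def failover_record(name, target, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a Route 53 failover record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "CNAME",
        "SetIdentifier": set_id,  # distinguishes records that share a name and type
        "Failover": role,
        "TTL": ttl,               # a short TTL lets clients pick up the switch sooner
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def failover_change_batch(name, primary_target, secondary_target, primary_health_check_id):
    """Build a change batch with a health-checked primary and a standby secondary."""
    return {
        "Comment": "Failover routing between Region A and Region B",
        "Changes": [
            failover_record(name, primary_target, "PRIMARY", "region-a",
                            primary_health_check_id),
            failover_record(name, secondary_target, "SECONDARY", "region-b"),
        ],
    }


# Hypothetical values for illustration only.
batch = failover_change_batch(
    "app.example.com",
    "app-region-a.example.com",
    "app-region-b.example.com",
    "hypothetical-health-check-id",
)
# With boto3 and credentials available, the batch could be submitted with:
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="<your hosted zone ID>", ChangeBatch=batch)
```

When the health check on the primary record fails, Route 53 starts answering queries with the secondary record, which is what redirects users to the low-capacity fleet in Region B.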

Checkpoint Questions

Which DR strategy involves keeping a scaled-down but fully functional version of your application always running?
What is the main difference between HA and DR in an AWS context?
If an organization has an RPO of 24 hours, which AWS DR strategy is most cost-effective?
Why does the cost of a solution increase as the RTO decreases?

Muddy Points & Cross-Refs

Pilot Light vs. Warm Standby: This is a common exam confusion point. Think of Pilot Light as the small flame in a gas furnace (the data layer is live, but the application servers are off until you ignite them). Think of Warm Standby as an "idling engine" (everything is running, just at low RPMs).
HA isn't DR: Students often think Multi-AZ is DR. It is NOT. Multi-AZ protects against a data center failure, but a regional flood or earthquake requires Multi-Region (DR).
Cross-Ref: For more on setting up the underlying networking for these strategies, see Chapter 4: Global Networking with Route 53 and CloudFront.

Comparison Tables

Disaster Recovery Strategy Comparison

Strategy	RTO (Time)	RPO (Data)	Cost	Complexity
Backup & Restore	Hours	24 Hours+	$	Low
Pilot Light	10s of Mins	Minutes	$$	Medium
Warm Standby	Minutes	Seconds/Mins	$$$	High
Multi-site	Near Zero	Near Zero	$$$$	Very High
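The table can be read as a lookup: pick the cheapest strategy whose typical RTO and RPO both fit the requirement. The sketch below is one way to encode that reasoning. The numeric thresholds are assumptions that turn the table's rough magnitudes ("Hours", "10s of Mins", "Minutes") into concrete minutes; they are not AWS guarantees, and real figures depend on the workload.

```python
# (name, assumed typical RTO in minutes, assumed typical RPO in minutes),
# ordered from cheapest to most expensive.
STRATEGIES = [
    ("Backup & Restore", 8 * 60, 24 * 60),
    ("Pilot Light", 30, 10),
    ("Warm Standby", 10, 1),
    ("Multi-site Active-Active", 0, 0),
]


def pick_strategy(required_rto_min: float, required_rpo_min: float) -> str:
    """Return the cheapest strategy whose assumed RTO and RPO meet the requirement."""
    if required_rto_min < 0 or required_rpo_min < 0:
        raise ValueError("RTO and RPO requirements must be non-negative")
    for name, typical_rto, typical_rpo in STRATEGIES:
        if typical_rto <= required_rto_min and typical_rpo <= required_rpo_min:
            return name
    # Not reached for non-negative inputs, because the last entry is (0, 0).
    return STRATEGIES[-1][0]


banking_choice = pick_strategy(required_rto_min=15, required_rpo_min=5)
archive_choice = pick_strategy(required_rto_min=24 * 60, required_rpo_min=24 * 60)
```

With these assumed thresholds, the banking scenario (RTO 15 minutes, RPO 5 minutes) resolves to Warm Standby, and a workload that tolerates a full day of downtime and data loss resolves to Backup & Restore.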