AWS SAP-C02: Designing for Business Continuity
Designing a Solution for Business Continuity
This guide focuses on the strategies and architectural patterns required to ensure business continuity on AWS, specifically for the SAP-C02 (Solutions Architect Professional) exam. It explores the transition from local high availability to geographic disaster recovery.
Learning Objectives
After studying this guide, you will be able to:
- Differentiate between High Availability (HA) and Disaster Recovery (DR).
- Define and calculate Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Evaluate and select among the four primary AWS DR strategies based on business requirements.
- Design a business continuity plan that aligns technical solutions with organizational risk assessments.
Key Terms & Glossary
- Business Continuity Plan (BCP): A comprehensive document outlining how a business will continue to operate during an unplanned disruption.
- RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., losing 15 minutes of transactions).
- Failover: The process of switching to a redundant or standby computer server, system, or network upon the failure of the previously active application.
- Pilot Light: A DR strategy where a minimal version of the environment is always running in another region, primarily the data and core configuration.
The "Big Idea"
[!IMPORTANT] Business Continuity is the art of geographic decoupling. While High Availability (HA) protects you against a failing server or the loss of a single Availability Zone (AZ), Disaster Recovery (DR) protects you against a regional catastrophe. The "Big Idea" is that resilience is a spectrum: as you move toward zero data loss and zero downtime, the cost and complexity of your architecture rise steeply.
Formula / Concept Box
| Metric | Definition | User Perspective |
|---|---|---|
| RTO | Time to restore service | "How long until I can log back in?" |
| RPO | Max data loss (time) | "How much of my work was lost since the last save?" |
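The two metrics can be estimated with simple arithmetic. The sketch below is illustrative only: the function names, the runbook steps, and their durations are assumptions made for this example, not values from AWS. It treats worst-case RPO as the backup interval plus any replication lag, and RTO as the sum of sequential recovery steps.

```python
from datetime import timedelta
from typing import Mapping


def worst_case_rpo(backup_interval: timedelta,
                   replication_lag: timedelta = timedelta(0)) -> timedelta:
    """Worst-case data loss window.

    A failure just before the next backup loses a full interval of data,
    plus however long it takes to copy that backup to the recovery site.
    """
    return backup_interval + replication_lag


def estimated_rto(recovery_steps: Mapping[str, timedelta]) -> timedelta:
    """Time to restore service, assuming the steps run one after another."""
    return sum(recovery_steps.values(), timedelta(0))


# Hypothetical recovery runbook; the durations are made up for illustration.
steps = {
    "detect_and_declare_disaster": timedelta(minutes=15),
    "restore_database": timedelta(minutes=45),
    "deploy_application_tier": timedelta(minutes=20),
    "switch_dns": timedelta(minutes=5),
}

print("Worst-case RPO:", worst_case_rpo(timedelta(hours=12)))  # 12:00:00
print("Estimated RTO:", estimated_rto(steps))                  # 1:25:00
```

If the estimated RTO exceeds the target set by the business, the runbook needs faster steps or a warmer DR strategy.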
Hierarchical Outline
- HA vs. DR Fundamentals
  - High Availability (HA): Focuses on component-level redundancy within a region (Multi-AZ).
  - Disaster Recovery (DR): Focuses on site-level redundancy across regions (Cross-Region).
- The Planning Process
  - Risk Assessment: Evaluating the impact of AZ vs. Regional failures.
  - Business Impact Analysis (BIA): Determining the financial cost of downtime to set RTO/RPO.
- AWS Disaster Recovery Strategies
  - Backup and Restore: Low cost, high RTO/RPO (hours).
  - Pilot Light: Core data live; application tier "dark" (minutes/hours).
  - Warm Standby: Scaled-down version of full environment always running (minutes).
  - Multi-Site Active-Active: Zero downtime; traffic split across regions (seconds/real-time).
Visual Anchors
[Figure: DR Strategy Spectrum]
[Figure: Regional Failover Architecture]
Definition-Example Pairs
- Warm Standby: A DR strategy where a "scaled down" but fully functional version of the environment is always running in the DR region.
  - Example: An e-commerce site running on 2 small EC2 instances in the DR region, while the primary region runs on 20 large instances. If primary fails, the DR instances scale up automatically.
- Pilot Light: A strategy where only the most critical data is replicated (like a database), while application servers are stopped or only exist as AMIs.
  - Example: Keeping an RDS Read Replica in a second region. The web server layer is deployed via CloudFormation only after a disaster is declared.
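The Pilot Light example can be sketched as a short failover routine. This is a minimal sketch under stated assumptions, not a production runbook: the function name and all identifiers are hypothetical, and the clients are passed in so the logic can be exercised without AWS access. With boto3, `rds` and `cloudformation` would be `boto3.client("rds", ...)` and `boto3.client("cloudformation", ...)` created in the DR region. A real runbook would also need more careful status checks, since an instance can briefly still report as available right after promotion starts.

```python
def pilot_light_failover(rds, cloudformation, replica_id, stack_name, template_url):
    """Bring a Pilot Light environment to life in the DR region.

    `rds` and `cloudformation` are expected to behave like boto3 clients.
    All argument values are supplied by the caller; none are real resources.
    """
    # 1. Promote the cross-region read replica to a standalone, writable instance.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Block until the promoted instance reports as available.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

    # 3. Only now deploy the "dark" application tier from its template.
    response = cloudformation.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
    )
    return response["StackId"]


# Example call (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   stack_id = pilot_light_failover(
#       boto3.client("rds", region_name="us-west-2"),
#       boto3.client("cloudformation", region_name="us-west-2"),
#       replica_id="example-replica",
#       stack_name="example-dr-web-tier",
#       template_url="https://example.com/web-tier.yaml",
#   )
```

The ordering matters: the data tier is made writable before the application tier is created, so the new servers never start against a read-only database.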
Worked Examples
Scenario: Choosing a DR Strategy
Company X has a mission-critical banking application. Their Business Impact Analysis shows that 1 hour of downtime costs $1,000,000, and they cannot lose more than 5 minutes of transaction data.
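- Requirement: RTO < 1 hour; RPO < 5 minutes.
- Elimination:
  - Backup & Restore is out (RTO is usually hours/days).
  - Pilot Light is risky (provisioning app servers might take > 1 hour depending on complexity).
- Solution: Warm Standby or Multi-Site.
- Selection: Both candidates can meet the targets, so the choice comes down to cost. At $1M per hour of downtime, an always-running standby is easy to justify. Warm Standby is the more cost-effective of the two: the environment is already "warm" and only needs to scale up, which fits well within the 1-hour RTO, and continuous cross-region data replication keeps the RPO under 5 minutes. Multi-Site would be warranted only if the business required near-zero downtime.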
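The cost side of this reasoning is plain arithmetic: loss per incident is the hourly downtime cost multiplied by the time to recover. In the sketch below, the $1,000,000 per hour figure comes from the scenario, while the per-strategy recovery times are assumptions chosen only to illustrate the order of magnitude.

```python
COST_PER_HOUR = 1_000_000  # from the Company X scenario


def downtime_loss(rto_minutes: float, cost_per_hour: float = COST_PER_HOUR) -> float:
    """Money lost in one incident if recovery takes `rto_minutes`."""
    return cost_per_hour * rto_minutes / 60


# Assumed recovery times in minutes. These are illustrative, not AWS figures.
assumed_rto_minutes = {
    "Backup & Restore": 480,
    "Pilot Light": 90,
    "Warm Standby": 15,
    "Multi-Site": 1,
}

for strategy, rto in assumed_rto_minutes.items():
    print(f"{strategy:<18} loses about ${downtime_loss(rto):>12,.0f} per incident")
```

Comparing these per-incident losses against the yearly running cost of each standby environment shows where extra spending stops paying for itself.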
Checkpoint Questions
- What is the main difference between HA and DR in an AWS context?
- If an organization uses Snapshot Replication every 12 hours, what is their RPO?
- Which DR strategy involves having a scaled-down version of the full environment always running?
- How does Route 53 support business continuity?
Answers
- HA handles local failures (AZ); DR handles large-scale/regional failures.
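- 12 hours in the worst case (a failure just before the next snapshot loses everything written since the previous one).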
- Warm Standby.
- Through health checks and DNS failover routing policies.
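To make the Route 53 answer concrete, the sketch below builds a pair of failover records as plain Python data. The dictionary layout follows the `ChangeBatch` structure accepted by Route 53's `ChangeResourceRecordSets` API. The helper name, domain, IP addresses (taken from documentation-reserved ranges), and health check ID are all placeholders assumed for illustration.

```python
def failover_record(name, ip_address, role, set_id, health_check_id=None, ttl=60):
    """Build one UPSERT change for a failover-routed A record.

    `role` must be "PRIMARY" or "SECONDARY". A health check on the primary
    is what lets Route 53 decide when to answer with the secondary instead.
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be 'PRIMARY' or 'SECONDARY'")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


change_batch = {
    "Comment": "Illustrative failover pair; every value is a placeholder",
    "Changes": [
        failover_record("app.example.com.", "192.0.2.10", "PRIMARY",
                        "primary-region", health_check_id="placeholder-health-check-id"),
        failover_record("app.example.com.", "198.51.100.20", "SECONDARY",
                        "dr-region"),
    ],
}

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your hosted zone id>", ChangeBatch=change_batch)
```

A short TTL matters here: clients cache DNS answers, so the TTL adds directly to the effective RTO after Route 53 switches to the secondary record.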
Muddy Points & Cross-Refs
- HA vs DR Confusion: Students often think Multi-AZ is DR. Correction: Multi-AZ is HA. DR must involve separate geographic regions to protect against a regional event.
- RPO vs RTO: Remember P in RPO stands for Past (how far back do we go in the data?). T in RTO stands for Time (how long does it take to get back up?).
- Cross-Ref: See Chapter 6: Meeting Reliability Requirements for details on Auto Scaling and Self-Healing systems which form the foundation of HA.
Comparison Tables
| Strategy | Typical RTO / RPO | Relative Cost | Complexity | Description |
|---|---|---|---|---|
| Backup & Restore | Hours/Days | $ | Low | Restore snapshots after a disaster. |
| Pilot Light | Minutes/Hours | $$ | Medium | Live data, idle/stopped app servers. |
| Warm Standby | Minutes | $$$ | High | Scaled-down but active environment. |
| Multi-Site | Seconds | $$$$ | Very High | Full active-active in two regions. |
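The table can be turned into a simple selection rule: pick the cheapest strategy whose typical recovery figures fit the business targets. This is a rough sketch only. The numeric thresholds are assumptions that put concrete values on the table's qualitative ranges, and real figures depend heavily on the workload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto_minutes: float   # assumed typical recovery time
    rpo_minutes: float   # assumed typical data-loss window
    relative_cost: int   # number of "$" signs in the table


# Illustrative numbers only; they are not published AWS figures.
STRATEGIES = [
    Strategy("Backup & Restore", rto_minutes=24 * 60, rpo_minutes=24 * 60, relative_cost=1),
    Strategy("Pilot Light", rto_minutes=90, rpo_minutes=5, relative_cost=2),
    Strategy("Warm Standby", rto_minutes=15, rpo_minutes=1, relative_cost=3),
    Strategy("Multi-Site", rto_minutes=0.5, rpo_minutes=0.5, relative_cost=4),
]


def cheapest_strategy(max_rto_minutes: float, max_rpo_minutes: float) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, or None if none does."""
    candidates = [
        s for s in STRATEGIES
        if s.rto_minutes <= max_rto_minutes and s.rpo_minutes <= max_rpo_minutes
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


# Company X from the worked example: RTO under 1 hour, RPO under 5 minutes.
choice = cheapest_strategy(max_rto_minutes=60, max_rpo_minutes=5)
print(choice.name if choice else "No strategy meets the targets")  # Warm Standby
```

Under these assumed thresholds the rule reproduces the worked example's conclusion; changing the thresholds changes the answer, which is why the Business Impact Analysis has to come first.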