Resilience and Availability: Designing for Disruption in AWS
Designing an architecture that provides application and infrastructure availability in the event of a disruption
This guide explores the architectural patterns and AWS services required to maintain application and infrastructure availability during disruptions, ranging from single component failures to full regional outages.
Learning Objectives
By the end of this study guide, you should be able to:
- Differentiate between High Availability (HA) and Disaster Recovery (DR).
- Define and apply RTO and RPO metrics to business requirements.
- Evaluate and select appropriate DR strategies: Backup and Restore, Pilot Light, Warm Standby, and Multi-site.
- Implement cross-region data replication and traffic routing using AWS services like S3, RDS, and Route 53.
- Leverage Infrastructure as Code (IaC) to ensure consistent recovery environments.
Key Terms & Glossary
- Recovery Time Objective (RTO): The maximum acceptable delay between the interruption of service and restoration of service.
- Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time (e.g., "we can lose 15 minutes of data").
- Regional Service: A service where AWS manages availability across multiple AZs automatically (e.g., S3, DynamoDB). Route 53, by contrast, is a global service that is not tied to any single Region.
- Zonal Service: A service where resources are tied to a specific Availability Zone (e.g., EC2, EBS).
- Failover: The process of automatically or manually switching to a redundant or standby computer server, system, or network upon the failure of the previously active one.
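To make RTO and RPO concrete, here is a minimal Python sketch that measures the data-loss window and the downtime of a single incident and compares them with the targets. The names `assess_recovery` and `RecoveryAssessment`, and all timestamps, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryAssessment:
    data_loss: timedelta   # gap between the last recoverable copy and the outage
    downtime: timedelta    # gap between the outage and service restoration
    meets_rpo: bool
    meets_rto: bool


def assess_recovery(last_recovery_point, outage_start, service_restored,
                    rpo_target, rto_target):
    """Compare what an incident actually cost against the RPO/RTO targets."""
    data_loss = outage_start - last_recovery_point
    downtime = service_restored - outage_start
    return RecoveryAssessment(
        data_loss=data_loss,
        downtime=downtime,
        meets_rpo=data_loss <= rpo_target,
        meets_rto=downtime <= rto_target,
    )


# Hypothetical incident: last backup at 11:50, outage at 12:00, restored at 12:45.
result = assess_recovery(
    last_recovery_point=datetime(2024, 1, 1, 11, 50),
    outage_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 45),
    rpo_target=timedelta(minutes=15),
    rto_target=timedelta(minutes=30),
)
print(result.meets_rpo, result.meets_rto)  # True False
```

In this made-up incident, 10 minutes of data were lost (within the 15-minute RPO), but the 45-minute outage exceeded the 30-minute RTO.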
The "Big Idea"
Availability is a spectrum of cost versus speed. Designing for disruption isn't just about "never going down"; it's about matching the technical architecture to the business's tolerance for downtime. While High Availability handles local, frequent failures (like a single server dying), Disaster Recovery handles rare, large-scale events (like a hurricane hitting a data center region).
Formula / Concept Box
| Metric | Definition | Goal |
|---|---|---|
| RPO | Amount of data loss (Time) | "How much work can we afford to lose?" |
| RTO | Duration of downtime (Time) | "How fast must we be back up?" |
| Availability % | Uptime / (Uptime + Downtime) × 100 | Standard measure of system reliability |
[!IMPORTANT] As RTO and RPO decrease (approach zero), the cost and complexity of the architecture rise steeply.
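The availability percentage translates directly into a downtime budget. The sketch below, with hypothetical function names, computes availability from uptime and downtime and converts a target percentage into allowed minutes of downtime per year (assuming a 365-day year).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_percent(uptime_minutes, downtime_minutes):
    """Availability as uptime divided by total time, expressed as a percentage."""
    total = uptime_minutes + downtime_minutes
    return 100.0 * uptime_minutes / total


def allowed_downtime_minutes_per_year(availability_pct):
    """Downtime budget per year implied by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100.0)


for target in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes_per_year(target)
    print(f"{target}% -> {budget:.1f} minutes of downtime per year")
```

Each additional "nine" cuts the budget by a factor of ten: roughly 526 minutes per year at 99.9%, about 53 at 99.99%, and about 5 at 99.999%.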
Hierarchical Outline
- High Availability (HA) vs. Disaster Recovery (DR)
  - HA: Focuses on "local" resilience (Multi-AZ). Aim is 99.9% to 99.999% uptime.
  - DR: Focuses on "regional" resilience (Multi-Region). Aim is business continuity after a catastrophe.
- Disaster Recovery Strategies
  - Backup and Restore: Low cost, high RTO (restore from S3/Glacier).
  - Pilot Light: Critical data is live; core infrastructure (like DBs) is running, but app servers are off/scaled to zero.
  - Warm Standby: A scaled-down version of the full environment is always running in another region.
  - Multi-site (Active-Active): Traffic is balanced across two or more regions simultaneously. Near-zero RTO/RPO.
- Implementation Tools
  - Data: S3 Cross-Region Replication (CRR), RDS cross-Region read replicas, DynamoDB Global Tables.
  - Traffic: Route 53 (Failover, Latency, Geoproximity routing), AWS Global Accelerator.
  - Automation: AWS CloudFormation (IaC) and AWS Systems Manager for configuration.
Visual Anchors
[Figure: DR Strategy Decision Flow]
[Figure: Cross-Region Architecture]
Definition-Example Pairs
- Pilot Light: Only the most critical "core" elements are always on (like a furnace's pilot light).
  - Example: An RDS database is replicated to a second region, but EC2 instances are only created via an Auto Scaling Group triggered during a disaster.
- Warm Standby: A "shadow" environment that is always running but at a lower capacity.
  - Example: A fleet of 2 small EC2 instances is always running in Region B, while Region A has 20 large instances. On failover, Region B scales up to 20.
- Fate Sharing: When a resource's availability is tied to the availability of its parent container.
  - Example: An EC2 instance shares the fate of its specific Availability Zone; if the AZ fails, the instance fails.
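Fate sharing is easy to quantify. The following sketch, using a hypothetical `surviving_capacity` helper and example AZ names, shows how much of a fleet remains after one Availability Zone fails, first with every instance in one AZ and then with the same fleet spread across three.

```python
from collections import Counter


def surviving_capacity(instance_azs, failed_az):
    """Return the fraction of instances still running after one AZ fails."""
    if not instance_azs:
        return 0.0
    per_az = Counter(instance_azs)          # instances placed in each AZ
    lost = per_az.get(failed_az, 0)         # instances sharing the failed AZ's fate
    return (len(instance_azs) - lost) / len(instance_azs)


# Six instances in a single AZ versus six spread evenly over three AZs.
single_az = ["us-east-1a"] * 6
multi_az = ["us-east-1a", "us-east-1b", "us-east-1c"] * 2

print(surviving_capacity(single_az, "us-east-1a"))  # 0.0
print(surviving_capacity(multi_az, "us-east-1a"))   # about 0.67
```

The single-AZ fleet loses everything, while the spread fleet keeps two thirds of its capacity, which is the basic argument for Multi-AZ deployments.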
Worked Examples
Scenario: The High-Stakes Financial Portal
Requirement: An application requires an RTO of less than 10 minutes and an RPO of less than 1 minute.
Step-by-Step Solution:
- Data Layer: Use Amazon Aurora Global Database. Its cross-region replication lag is typically under one second, which comfortably meets the < 1 min RPO.
- Application Layer: Deploy a Warm Standby in a second region. Keep a minimal number of EC2 instances running behind an Application Load Balancer.
- Network Layer: Use Amazon Route 53 with a Failover Routing Policy. Configure a health check on the primary region's ALB.
- Recovery Process: When the health check fails, Route 53 points DNS to the secondary ALB. The secondary Aurora cluster is promoted so that it can accept writes, and an Auto Scaling Group in the secondary region scales the EC2 fleet out to handle full production load.
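The network-layer step can be sketched as a pair of Route 53 failover alias records. The Python below only builds the change batch as a plain dictionary and makes no AWS calls. The domain name, ALB DNS names, and hosted zone IDs are hypothetical placeholders, and the sketch relies on `EvaluateTargetHealth` rather than a separate health check (a dedicated check would instead be attached through the record's `HealthCheckId` field).

```python
import json


def failover_record(name, role, alb_dns_name, alb_hosted_zone_id):
    """Build one failover alias record change; role is PRIMARY or SECONDARY."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-alb",  # must be unique per record
            "Failover": role,
            "AliasTarget": {
                "HostedZoneId": alb_hosted_zone_id,  # placeholder value
                "DNSName": alb_dns_name,             # placeholder value
                "EvaluateTargetHealth": True,        # follow the ALB's health
            },
        },
    }


change_batch = {
    "Comment": "Failover routing between two regions (illustrative only)",
    "Changes": [
        failover_record("portal.example.com.", "PRIMARY",
                        "primary-alb.example.com.", "ZPRIMARYEXAMPLE"),
        failover_record("portal.example.com.", "SECONDARY",
                        "secondary-alb.example.com.", "ZSECONDARYEXAMPLE"),
    ],
}

print(json.dumps(change_batch, indent=2))
```

A dictionary of this shape is what the Route 53 `ChangeResourceRecordSets` API expects as its change batch; real values would come from the two load balancers.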
Checkpoint Questions
- What is the primary difference between a Zonal service and a Regional service regarding failure impact?
- Which DR strategy offers the lowest cost but the highest RTO?
- How does AWS CloudFormation support a Disaster Recovery plan?
- If a business can afford to lose 4 hours of data, what is their RPO?
Answers
- A zonal resource (such as an EC2 instance) fails if its specific AZ fails. Regional services (such as S3) are built to survive AZ failures automatically.
- Backup and Restore.
- It allows for "Infrastructure as Code," enabling the rapid, consistent deployment of identical infrastructure in a different region during a disaster.
- 4 hours.
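As a rough illustration of the Infrastructure as Code answer, the sketch below builds one minimal CloudFormation template in Python and reuses it unchanged for two regions. The resource name `AppDataBucket`, the bucket naming pattern, and the region list are assumptions made for this example, and the loop only prints what would be deployed rather than calling AWS.

```python
import json


def build_template():
    """Return a minimal CloudFormation template as a Python dictionary."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical minimal stack, deployable in any region",
        "Resources": {
            "AppDataBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Pseudo parameters keep the name unique per account and region.
                    "BucketName": {
                        "Fn::Sub": "app-data-${AWS::AccountId}-${AWS::Region}"
                    },
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


regions = ["us-east-1", "us-west-2"]  # example primary and recovery regions
template_body = json.dumps(build_template(), indent=2)

for region in regions:
    # The same template body is used everywhere; only the target region differs.
    print(f"Would deploy {len(template_body)} characters of template to {region}")
```

Because the template carries no region-specific values, the recovery environment is built from exactly the same definition as the primary one.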
Muddy Points & Cross-Refs
- The "Split Brain" Problem: In Active-Active setups, if the network between regions fails but both regions remain online, they might both accept conflicting writes to the same data. Services built for multi-region writes define how such conflicts are reconciled; DynamoDB Global Tables, for example, resolve concurrent updates to the same item with a last-writer-wins rule.
- Cross-Region Latency: Remember that data replication across regions is typically asynchronous, because waiting for a remote region to acknowledge every write would add the cross-region round-trip time to each transaction. This is why RPO is rarely truly "zero" in multi-region setups without specialized software.
- Deep Dive: For more on service limits, refer to the Service Quotas documentation to ensure your secondary region has enough capacity to scale during a failover.
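One common way to reconcile the conflicting writes described in the split-brain point is a last-writer-wins rule. The sketch below is a simplified illustration, not the internals of any AWS service: the `Version` type, the millisecond timestamps, and the tie-break on region name are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # time of the write, in milliseconds
    region: str        # region that accepted the write


def last_writer_wins(a, b):
    """Keep the later write; break exact timestamp ties by region name."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two regions accepted different writes to the same item during a partition.
east = Version("balance=100", 1_700_000_000_500, "us-east-1")
west = Version("balance=90", 1_700_000_000_900, "us-west-2")

winner = last_writer_wins(east, west)
print(winner.region, winner.value)  # us-west-2 balance=90
```

The trade-off is that the losing write is silently discarded, so this rule suits data where the most recent value is all that matters.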
Comparison Tables
Comparison of DR Strategies
| Strategy | RTO | RPO | Cost | Complexity |
|---|---|---|---|---|
| Backup & Restore | Hours+ | Hours+ | $ | Low |
| Pilot Light | Minutes/Hours | Minutes | $$ | Medium |
| Warm Standby | Minutes | Seconds/Minutes | $$$ | High |
| Multi-site | Near Zero | Near Zero | $$$$ | Very High |
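The table can be read as a lookup from targets to the cheapest strategy that satisfies them. The sketch below encodes that idea in Python. The numeric floors attached to each strategy are illustrative assumptions, not AWS guidance, and `suggest_dr_strategy` is a hypothetical helper.

```python
from datetime import timedelta

# (strategy, smallest RTO it can plausibly meet, smallest RPO it can plausibly meet)
# Ordered from cheapest to most expensive; the floors are assumed for illustration.
STRATEGIES = [
    ("Backup & Restore", timedelta(hours=4), timedelta(hours=1)),
    ("Pilot Light", timedelta(minutes=30), timedelta(minutes=5)),
    ("Warm Standby", timedelta(minutes=5), timedelta(seconds=10)),
    ("Multi-site", timedelta(0), timedelta(0)),
]


def suggest_dr_strategy(rto_target, rpo_target):
    """Return the cheapest strategy whose floors fit within both targets."""
    for name, rto_floor, rpo_floor in STRATEGIES:
        if rto_target >= rto_floor and rpo_target >= rpo_floor:
            return name
    return "Multi-site"


# Targets from the financial portal scenario: RTO 10 minutes, RPO 1 minute.
print(suggest_dr_strategy(timedelta(minutes=10), timedelta(minutes=1)))
# An internal reporting tool that tolerates a day of downtime and 12 hours of loss.
print(suggest_dr_strategy(timedelta(hours=24), timedelta(hours=12)))
```

With these assumed floors, the first call selects Warm Standby and the second selects Backup & Restore; real thresholds depend on the workload and should be validated with recovery drills.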