Comprehensive Study Guide: Multi-AZ and Multi-Region Architectures
This study guide covers the architectural principles of designing for high availability and disaster recovery on AWS, focusing on the trade-offs between Multi-AZ and Multi-Region deployments.
Learning Objectives
After studying this guide, you should be able to:
- Distinguish between Availability Zones (AZs) and Regions in terms of infrastructure and fault isolation.
- Evaluate when to use Multi-Region architectures versus Multi-AZ strategies based on business requirements.
- Define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) and apply them to DR scenarios.
- Identify zonal vs. regional AWS services and their impact on workload reliability.
- Compare write patterns (Global, Local, Partitioned) for multi-region data consistency.
Key Terms & Glossary
- Availability Zone (AZ): One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region.
- Region: A physical location around the world where AWS clusters data centers.
- Fault Isolation: The practice of limiting the impact of a failure to a specific set of components (limiting the "blast radius").
- RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can lose up to 5 minutes of data").
- Local Zone: Infrastructure deployment that places compute, storage, and other services closer to end users for single-digit millisecond latency.
The "Big Idea"
The core philosophy of AWS reliability is Fault Isolation. By distributing resources across physically separated Availability Zones, you protect against local failures (power, fire, etc.). A Multi-Region architecture extends that protection to large-scale disasters and Region-wide service outages, but it adds significant complexity and cost. Most workloads meet their availability goals with a Multi-AZ strategy; Multi-Region is for workloads with extreme availability or recovery requirements.
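To make the benefit of spreading across AZs concrete, here is a minimal sketch of the usual redundancy arithmetic. It assumes AZ failures are independent and that the workload stays up as long as at least one AZ is up. Both are simplifications, and the 99% per-AZ figure is an illustrative number, not an AWS commitment. The function name `combined_availability` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single_az: float, az_count: int) -> float:
    """Availability of a workload that survives while at least one AZ is up.

    Assumes independent AZ failures (a simplification for illustration).
    """
    if not 0.0 <= single_az <= 1.0:
        raise ValueError("single_az must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The workload is down only when every AZ is down at the same time.
    return 1.0 - (1.0 - single_az) ** az_count


if __name__ == "__main__":
    for azs in (1, 2, 3):
        availability = combined_availability(0.99, azs)  # 0.99 is illustrative
        downtime = (1.0 - availability) * MINUTES_PER_YEAR
        print(f"{azs} AZ(s): availability={availability:.6f}, "
              f"expected downtime={downtime:.1f} min/year")
```

Each added AZ multiplies the probability of total failure by the single-AZ failure probability, which is why the step from one AZ to two gives the largest improvement.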
Formula / Concept Box
| Concept | Definition | Metric |
|---|---|---|
| RTO | "How quickly must I recover?" | Time (Seconds/Minutes/Hours) |
| RPO | "How much data can I afford to lose?" | Time (representing data age) |
| Blast Radius | The scope of impact when a component fails. | Zonal, Regional, or Global |
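As a rough illustration of how the two time-based metrics are measured, the sketch below computes the achieved recovery point and recovery time for a single incident and compares them with the objectives. The `Incident` type and the function names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    last_recoverable_point: datetime  # newest data that survived (backup or replica)
    outage_start: datetime            # moment the service was interrupted
    service_restored: datetime        # moment the service was usable again


def achieved_rpo(incident: Incident) -> timedelta:
    """Data loss window: everything written after the last recoverable point."""
    return incident.outage_start - incident.last_recoverable_point


def achieved_rto(incident: Incident) -> timedelta:
    """Downtime: delay between interruption and restoration."""
    return incident.service_restored - incident.outage_start


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True when both the downtime and the data loss stay within the objectives."""
    return achieved_rto(incident) <= rto and achieved_rpo(incident) <= rpo


if __name__ == "__main__":
    incident = Incident(
        last_recoverable_point=datetime(2024, 1, 1, 11, 55),
        outage_start=datetime(2024, 1, 1, 12, 0),
        service_restored=datetime(2024, 1, 1, 12, 30),
    )
    print("Data loss window:", achieved_rpo(incident))
    print("Downtime:", achieved_rto(incident))
    print("Within RTO=1h, RPO=5min:",
          meets_objectives(incident, timedelta(hours=1), timedelta(minutes=5)))
```

Note that RPO looks backward from the interruption (how old is the newest surviving data), while RTO looks forward from it (how long until service returns).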
Hierarchical Outline
- AWS Global Infrastructure
- Availability Zones (AZs): Located < 100km apart; redundant power/fiber.
- Regions: Clusters of at least 3 AZs; geographically isolated.
- Local Zones: Zonal placement near industry centers for ultra-low latency.
- Edge Network: 300+ locations for CloudFront and Global Accelerator.
- Service Scopes
- Zonal Services: EC2, EBS (Fate-shared with the specific AZ).
- Regional Services: DynamoDB, S3 (Built-in Multi-AZ replication).
- Disaster Recovery (DR) Strategies
- Pilot Light: Core data is replicated; resources are off until needed.
- Warm Standby: A scaled-down but functional version of the environment.
- Multi-Site (Active-Active): Fully functional and scaled environment in 2+ regions.
- Data Replication & Routing
- Data: S3 Cross-Region Replication (CRR), Aurora Global Database, DynamoDB Global Tables.
- Traffic: Route 53 (Latency/Geo routing), AWS Global Accelerator.
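The traffic-routing idea in the outline can be sketched as a toy model: send each user to the healthy Region with the lowest measured latency, and fail over when a Region is marked unhealthy. This is not how Route 53 or Global Accelerator is implemented. The function `pick_region` and the latency figures are assumptions made for illustration.

```python
from typing import Mapping, Optional


def pick_region(latency_ms: Mapping[str, float],
                healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the healthy Region with the lowest latency, or None if none is healthy."""
    # Regions with no health record are treated as unhealthy.
    candidates = {
        region: ms for region, ms in latency_ms.items() if healthy.get(region, False)
    }
    if not candidates:
        return None
    return min(candidates, key=lambda region: candidates[region])


if __name__ == "__main__":
    latencies = {"us-east-1": 20.0, "eu-west-1": 90.0}  # illustrative values
    print(pick_region(latencies, {"us-east-1": True, "eu-west-1": True}))
    # Simulate a regional outage: traffic shifts to the remaining healthy Region.
    print(pick_region(latencies, {"us-east-1": False, "eu-west-1": True}))
```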
Visual Anchors
Infrastructure Hierarchy
Regional vs Zonal Fault Isolation
\begin{tikzpicture}
  \draw[thick, dashed] (0,0) rectangle (6,4) node[pos=0.5, yshift=1.8cm] {AWS Region};
  \draw[fill=blue!10] (0.5,0.5) rectangle (2,3) node[pos=0.5] {AZ A};
  \draw[fill=blue!10] (2.5,0.5) rectangle (4,3) node[pos=0.5] {AZ B};
  \draw[fill=blue!10] (4.5,0.5) rectangle (5.5,3) node[pos=0.5] {AZ C};
  \node[draw, fill=red!20] at (1.25, 1.5) {EC2 (Zonal)};
  \node[draw, fill=green!20, minimum width=4cm] at (3, 3.5) {S3 / DynamoDB (Regional)};
\end{tikzpicture}
Definition-Example Pairs
- Zonal Service: A service where resources share the fate of the specific AZ.
- Example: An Amazon EC2 instance resides in `us-east-1a`. If `us-east-1a` fails, that specific instance becomes unavailable.
- Regional Service: A service that automatically spreads data/load across multiple AZs.
- Example: Amazon DynamoDB stores data across multiple AZs by default. A single AZ failure does not interrupt the service.
- Cross-Region Replication: Continuous, asynchronous copying of data to a different geographic region.
- Example: Using S3 CRR to copy objects from a bucket in `us-east-1` to `eu-west-1` for compliance and DR.
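The zonal versus regional distinction above can be modelled with a small sketch that reports which resources go down when one AZ fails. The `Resource` type, the resource names, and the AZ placements are hypothetical; a resource is treated as zonal when it lives in exactly one AZ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Resource:
    name: str
    azs: Tuple[str, ...]  # AZs the resource runs in; one entry means zonal


def impacted_by_az_failure(resources: Iterable[Resource], failed_az: str) -> List[str]:
    """Names of resources left with no surviving AZ after failed_az goes down."""
    impacted = []
    for resource in resources:
        surviving = [az for az in resource.azs if az != failed_az]
        if not surviving:
            impacted.append(resource.name)
    return impacted


if __name__ == "__main__":
    fleet = [
        Resource("web-ec2", ("us-east-1a",)),                                   # zonal
        Resource("orders-table", ("us-east-1a", "us-east-1b", "us-east-1c")),   # regional
    ]
    print("Lost when us-east-1a fails:", impacted_by_az_failure(fleet, "us-east-1a"))
    print("Lost when us-east-1b fails:", impacted_by_az_failure(fleet, "us-east-1b"))
```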
Worked Examples
Scenario: The "Zero-Downtime" Requirement
Question: A financial application requires an RTO of zero and an RPO of zero. Which architecture should be chosen, and what is the cost implication?
Step-by-Step Breakdown:
- Analyze RTO=0: This implies the system must be "Hot" in two locations simultaneously (Active-Active).
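- Analyze RPO=0: A strict RPO of zero requires synchronous replication. Asynchronous cross-Region replication (like DynamoDB Global Tables) typically lags by about a second, so it delivers a near-zero RPO rather than exactly zero; confirm with the business whether near-zero is acceptable.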
- Architecture Selection: Multi-Region Active-Active. Traffic is split using Route 53 or Global Accelerator.
- Database Choice: DynamoDB Global Tables (Write Local) or Aurora Global Database (Write Global) depending on conflict requirements.
- Cost: This is the most expensive option because you pay for 100% capacity in two or more regions at all times.
Checkpoint Questions
- What is the maximum distance typically between Availability Zones in a Region? (Answer: Less than 100 km)
- Which service would you use to route traffic based on the lowest network latency for an end user? (Answer: Route 53 with latency-based routing, or AWS Global Accelerator)
- If a service is "Regional," do you need to manually configure it to be Multi-AZ? (Answer: No, it uses multiple AZs out of the box)
- What is the main difference between Pilot Light and Warm Standby? (Answer: Pilot Light keeps resources off; Warm Standby keeps a scaled-down version running)
Muddy Points & Cross-Refs
- Confusion over Local Zones vs AZs: Remember that Local Zones are zonal extensions. They are not a separate Region, but they are physically distant from the main Region's AZs to serve a specific metro area.
- Write Global vs. Write Local: In Multi-Region databases, "Write Global" (Aurora) sends all writes to a single primary Region. "Write Local" (DynamoDB) accepts writes in any Region, so concurrent writes to the same item need a conflict resolution strategy.
- Deep Study Pointers: Review the AWS Well-Architected Framework: Reliability Pillar for detailed availability math.
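For the "Write Local" pattern above, one common conflict resolution strategy is last writer wins, which is the approach DynamoDB Global Tables use for concurrent updates. The sketch below is a simplified model of that idea, not the service's implementation: the `Version` type is hypothetical, and the tie-break on Region name is an assumption added so that every replica picks the same winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time as recorded by the accepting Region
    region: str        # Region that accepted the write


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the version with the latest timestamp.

    Ties are broken by Region name (an assumption of this sketch) so the
    result does not depend on the order in which replicas see the writes.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


if __name__ == "__main__":
    from_virginia = Version("shipped", timestamp_ms=1_700_000_000_500, region="us-east-1")
    from_ireland = Version("cancelled", timestamp_ms=1_700_000_000_900, region="eu-west-1")
    winner = last_writer_wins(from_virginia, from_ireland)
    print(f"Winning value: {winner.value} (written in {winner.region})")
```

The losing write is silently discarded, which is why workloads that cannot tolerate lost updates often prefer the Write Global or Partitioned patterns.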
Comparison Tables
Multi-AZ vs. Multi-Region
| Feature | Multi-AZ | Multi-Region |
|---|---|---|
| Complexity | Low (Often native) | High (Manual sync/routing) |
| Latency | Single-digit ms | Tens to hundreds of ms |
| Cost | Standard | High (Double infrastructure + Data transfer) |
| Protection | Local disasters (Power/Fire) | Regional disasters / Massive outages |
| Typical Use Case | Standard HA requirements | Business Continuity / Compliance |
Disaster Recovery Strategies
| Strategy | RTO / RPO | Cost | Complexity |
|---|---|---|---|
| Backup & Restore | Hours/Days | $ | Simple |
| Pilot Light | Minutes/Hours | $$ | Moderate |
| Warm Standby | Minutes | $$$ | Moderate/High |
| Multi-Site (Active-Active) | Near zero (real-time) | $$$$ | High |
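The table suggests a simple selection rule: choose the cheapest strategy whose typical recovery time still fits the required RTO. A minimal sketch of that rule follows. The concrete durations are hypothetical stand-ins for the table's "Hours/Days", "Minutes/Hours", and "Minutes" ranges, and real values depend on the workload and on how well recovery is automated and tested.

```python
from datetime import timedelta
from typing import List, NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    typical_rto: timedelta  # illustrative assumption, not an AWS figure
    cost_rank: int          # 1 = cheapest ($), 4 = most expensive ($$$$)


STRATEGIES: List[Strategy] = [
    Strategy("Backup & Restore", timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=1), 2),
    Strategy("Warm Standby", timedelta(minutes=10), 3),
    Strategy("Multi-Site (Active-Active)", timedelta(0), 4),
]


def cheapest_strategy(required_rto: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy whose typical RTO meets the requirement."""
    viable = [s for s in STRATEGIES if s.typical_rto <= required_rto]
    if not viable:
        return None
    return min(viable, key=lambda s: s.cost_rank)


if __name__ == "__main__":
    for requirement in (timedelta(days=2), timedelta(minutes=30), timedelta(0)):
        choice = cheapest_strategy(requirement)
        print(f"Required RTO {requirement}: {choice.name if choice else 'none'}")
```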