AWS Global Infrastructure: Architecture for High Availability and Resilience
This guide explores the physical and logical structure of the Amazon Web Services (AWS) cloud, focusing on how to leverage Regions, Availability Zones, and the Global Edge Network to build reliable, high-performance applications.
Learning Objectives
After studying this guide, you should be able to:
- Distinguish between AWS Regions, Availability Zones (AZs), and Local Zones.
- Explain the concept of fault isolation and blast radius reduction.
- Differentiate between the use cases for Amazon CloudFront and AWS Global Accelerator.
- Evaluate when to use a Multi-AZ strategy versus a Multi-Region strategy.
- Identify zonal versus regional services and their impact on architectural fate-sharing.
Key Terms & Glossary
- Region: A physical geographic location (e.g., us-east-1) consisting of multiple isolated, physically separate Availability Zones.
- Availability Zone (AZ): One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region.
- Local Zone: An extension of an AWS Region that places compute, storage, and other services closer to end-users in a specific geographic area (e.g., for sub-10ms latency).
- Edge Location: Sites used by services like CloudFront to cache content closer to users to reduce latency.
- Blast Radius: The maximum impact area of a single failure event within an architecture.
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time.
- RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration.
The "Big Idea"
The fundamental philosophy of AWS Global Infrastructure is redundancy at every layer. By providing multiple layers of isolation (from physical data centers to entire geographical regions), AWS enables architects to design systems where a failure in one component (a disk, a server, or even a whole data center) does not result in a total system outage. The goal is to move from "preventing failure" to "surviving failure."
Formula / Concept Box
| Concept | Metric / Scope | Rule of Thumb |
|---|---|---|
| AZ Distance | < 100 km apart | Far enough for independent power/grids; close enough for low-latency sync. |
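| Minimum AZs | 3 per Region | Each Region has at least 3 AZs, so a workload can keep a quorum and stay available if one AZ fails. |
| Availability SLA | 99.99% | Published SLA figure for Elastic Load Balancing (ALB, NLB) and a number of other regional services; check each service's own SLA. |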
| S3 Durability | 99.999999999% | Eleven 9s of durability across multiple AZs. |
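The percentages in the table are easier to reason about once converted into time. The following is a minimal sketch of that arithmetic, not an AWS formula: the function names are illustrative, and the redundancy calculation assumes failures are fully independent, which real Availability Zones only approximate. Under those assumptions, 99.99% availability leaves roughly 52.6 minutes of downtime per year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def annual_downtime_minutes(availability: float) -> float:
    """Downtime per year allowed by an availability fraction such as 0.9999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumption: failures are fully independent of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    minutes = annual_downtime_minutes(0.9999)
    print(f"99.99% allows about {minutes:.1f} minutes of downtime per year")
    print(f"Two independent 99% deployments: {combined_availability(0.99, 2):.4%}")
```

The second function shows why redundancy across isolated locations pays off: each extra independent copy multiplies the remaining failure probability by the failure probability of a single copy.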
Hierarchical Outline
- I. Physical Infrastructure Hierarchy
- AWS Regions (Global scope; contains multiple AZs)
- Availability Zones (Regional scope; contains 1+ data centers)
- Data Centers (Physical facilities with independent power/cooling)
- II. Specialized Infrastructure
- Local Zones (Low-latency extensions near metro areas)
- Wavelength Zones (Embedded in 5G networks for mobile edge computing)
- Edge Network (300+ locations for CDN and Global Accelerator)
- III. Service Scoping
- Zonal Services: Resources like EC2 or EBS that live in a specific AZ.
- Regional Services: Services like S3 or DynamoDB that manage data across AZs automatically.
- Global Services: IAM, Route 53, and CloudFront, which operate across all Regions.
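Because zonal resources live in a single AZ, how they are placed determines how much capacity one AZ failure can remove. The sketch below illustrates that idea with plain Python rather than any AWS API. The AZ names and both function names are hypothetical.

```python
from collections import Counter


def spread_across_azs(instance_count: int, azs: list[str]) -> list[str]:
    """Assign instances to AZs round-robin; returns one AZ name per instance."""
    if not azs:
        raise ValueError("at least one AZ is required")
    return [azs[i % len(azs)] for i in range(instance_count)]


def blast_radius(placement: list[str]) -> float:
    """Worst-case fraction of instances lost if any single AZ fails."""
    if not placement:
        raise ValueError("placement must not be empty")
    counts = Counter(placement)
    return max(counts.values()) / len(placement)


if __name__ == "__main__":
    # Hypothetical AZ names, for illustration only.
    spread = spread_across_azs(6, ["az-a", "az-b", "az-c"])
    single = spread_across_azs(6, ["az-a"])
    print("spread over three AZs:", blast_radius(spread))
    print("all in one AZ:", blast_radius(single))
```

Spreading six instances over three AZs caps the loss from one AZ failure at a third of capacity, while a single-AZ placement can lose everything.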
Visual Anchors
[Figure: AWS Infrastructure Hierarchy]
[Figure: Fault Isolation Model]
Definition-Example Pairs
- Fault Isolation Boundary: A logical or physical partition that prevents a failure from spreading.
- Example: Placing EC2 instances in different Availability Zones ensures that a power grid failure in one facility doesn't take down the entire application.
- Cross-Region Replication (CRR): Automatically copying data from a bucket in one region to a bucket in another.
- Example: A financial firm replicates S3 data from us-east-1 (N. Virginia) to us-west-2 (Oregon) to meet compliance requirements for geographic data separation.
- Asynchronous Replication: Data is written to the primary site first, then copied to the secondary site with a slight delay.
- Example: Using RDS Read Replicas in a different region to provide a Disaster Recovery (DR) target without impacting the performance of the primary database.
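Asynchronous replication ties directly to RPO: whatever was written to the primary after the last replicated change is lost if the primary site fails. The following is a minimal sketch of that check. The function names and timestamps are illustrative assumptions, and it does not query any AWS replication metric.

```python
from datetime import datetime, timedelta


def data_at_risk(last_replicated: datetime, failure_time: datetime) -> timedelta:
    """Window of writes that would be lost if the primary failed at `failure_time`."""
    return max(failure_time - last_replicated, timedelta(0))


def meets_rpo(lag: timedelta, rpo: timedelta) -> bool:
    """True when the potential data loss stays within the RPO."""
    return lag <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    last_copy = datetime(2024, 1, 1, 12, 0, 0)
    outage = datetime(2024, 1, 1, 12, 0, 45)
    lag = data_at_risk(last_copy, outage)
    print("data at risk:", lag)
    print("within a 1-minute RPO:", meets_rpo(lag, timedelta(minutes=1)))
```

With synchronous replication the lag is effectively zero, which is why it suits nearby AZs but becomes costly in latency across distant Regions.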
Worked Examples
Scenario: Low-Latency Video Streaming
The Challenge: A company in Los Angeles needs sub-10ms latency for a real-time video editing application, but there is no full AWS Region in the Los Angeles metro area.
The Solution:
- Deploy compute resources (EC2) into an AWS Local Zone located in Los Angeles.
- This keeps the workload physically close to its users while the Local Zone remains an extension of its parent Region in Oregon (us-west-2).
- Use Amazon CloudFront to cache static assets at Edge Locations globally for viewers outside of LA.
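The placement decision in this scenario comes down to comparing measured latency against a budget. The sketch below shows one way to express that comparison. The function name is hypothetical, and the latency figures in the usage example are placeholders, not measurements of any AWS location.

```python
def locations_within_budget(
    latency_ms_by_location: dict[str, float], budget_ms: float
) -> list[str]:
    """Return locations whose round-trip latency is under the budget, fastest first."""
    fits = [
        (ms, name)
        for name, ms in latency_ms_by_location.items()
        if ms < budget_ms
    ]
    return [name for ms, name in sorted(fits)]


if __name__ == "__main__":
    # Placeholder values for illustration; measure real latency from your users.
    measured = {
        "los-angeles-local-zone": 4.0,
        "oregon-region": 30.0,
    }
    print(locations_within_budget(measured, budget_ms=10.0))
```

In practice the inputs would come from latency tests run from the users' own networks, since the result depends on their location and provider.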
Scenario: Global API Failover
The Challenge: An API must remain available even if an entire AWS Region goes offline.
The Solution:
- Deploy the application in two Regions (e.g., Ireland and Sydney).
- Use AWS Global Accelerator to provide a set of static IP addresses.
- Global Accelerator will monitor the health of the endpoints in both regions and automatically route users to the healthy region with the lowest latency.
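The routing behavior described above can be modeled as "filter out unhealthy endpoints, then pick the lowest latency." The sketch below is a simplified illustration of that rule, not Global Accelerator's actual algorithm or API. The `Endpoint` type, the function name, and the latency values are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    region: str
    healthy: bool
    latency_ms: float  # latency as seen from one particular user


def pick_endpoint(endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest latency, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)


if __name__ == "__main__":
    # Hypothetical latencies for a user located in Europe.
    ireland = Endpoint("Ireland", healthy=True, latency_ms=20.0)
    sydney = Endpoint("Sydney", healthy=True, latency_ms=280.0)
    print(pick_endpoint([ireland, sydney]).region)

    # Simulate a Regional outage: traffic moves to the remaining healthy Region.
    ireland_down = Endpoint("Ireland", healthy=False, latency_ms=20.0)
    print(pick_endpoint([ireland_down, sydney]).region)
```

Because clients keep connecting to the same static IP addresses, the failover happens without waiting for DNS caches to expire.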
Checkpoint Questions
- What is the maximum distance typically maintained between Availability Zones within a region?
- Which service would you use to provide static IP addresses for a non-HTTP application requiring global failover?
- True or False: A "Regional Service" like Amazon DynamoDB requires you to manually configure replication across AZs.
- How do Local Zones differ from standard Availability Zones regarding their physical location?
Answers
- Within 100 kilometers of each other, while still being many kilometers apart.
- AWS Global Accelerator.
- False. Regional services like DynamoDB handle multi-AZ replication automatically.
- Local Zones are located near large industry centers outside of the primary AWS Region geographic area.
Muddy Points & Cross-Refs
- Edge Locations vs. Local Zones: This is a common point of confusion. Edge Locations are primarily for caching (CloudFront) and traffic routing (Global Accelerator). You cannot run a full VPC or an EC2 instance inside a standard Edge Location. Local Zones are extensions of a Region where you can create subnets and run EC2 instances and EBS volumes.
- Multi-AZ vs. Multi-Region: Multi-AZ is the default for high availability (HA). Multi-Region is for Disaster Recovery (DR) and extreme availability. Do not use Multi-Region unless business requirements for RTO/RPO justify the added cost and complexity.
Comparison Tables
| Feature | Amazon CloudFront | AWS Global Accelerator |
|---|---|---|
| Primary Goal | Content Caching (CDN) | Network Path Optimization |
| Supported Protocols | HTTP / HTTPS / WebSockets | TCP / UDP (non-HTTP support) |
| How it Works | Caches data at the edge | Routes traffic from the nearest edge location over the AWS global network |
| Best For | Videos, Images, Static Web | Gaming, IoT, Multi-region failover |
| IP Addresses | Dynamic (DNS-based) | Static Anycast IPs |
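The table can be condensed into a simple decision rule. The following sketch encodes only the distinctions listed above (protocol and the need for static IP addresses). The function name and parameters are hypothetical, and a real decision would also weigh cost, caching benefit, and failover requirements.

```python
def choose_edge_service(protocol: str, needs_static_ips: bool = False) -> str:
    """Suggest an edge service based on the comparison table's criteria."""
    proto = protocol.strip().upper()
    http_family = {"HTTP", "HTTPS", "WEBSOCKET"}
    raw_transport = {"TCP", "UDP"}

    if proto not in http_family | raw_transport:
        raise ValueError(f"unsupported protocol: {protocol!r}")

    # Static anycast IPs and non-HTTP traffic both point to Global Accelerator.
    if needs_static_ips or proto in raw_transport:
        return "AWS Global Accelerator"
    return "Amazon CloudFront"


if __name__ == "__main__":
    print(choose_edge_service("https"))
    print(choose_edge_service("udp"))
    print(choose_edge_service("https", needs_static_ips=True))
```

The two services are not mutually exclusive: an architecture can cache static assets with CloudFront while sending latency-sensitive TCP or UDP traffic through Global Accelerator.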