Designing Highly Available and Fault-Tolerant Architectures

This guide covers the critical strategies and AWS services required to build resilient cloud architectures that can withstand the failure of a component, an Availability Zone (AZ), or an entire Region.

Learning Objectives

By the end of this study guide, you should be able to:

Differentiate between High Availability (HA) and Fault Tolerance (FT).
Select appropriate AWS services to eliminate single points of failure (SPOF).
Explain the trade-offs between Multi-AZ and Multi-Region deployments.
Define and apply Disaster Recovery (DR) strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active.
Calculate workload availability for redundant components.

Key Terms & Glossary

Availability Zone (AZ): One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region.
Region: A physical location around the world where AWS clusters data centers.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time.
RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
Fault Tolerance: The ability of a system to continue operating without interruption during the failure of one or more of its components.
High Availability: A design approach that keeps a system operational for a high percentage of time (usually measured as uptime) over a given period, typically accepting a brief interruption while failover takes place.
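As a rough illustration of how RPO and RTO are measured, the sketch below derives the achieved data loss and downtime from three timestamps and compares them with the targets. The function name `evaluate_recovery` and the sample incident times are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_recovery_point, failure_time, restored_time, rpo, rto):
    """Compare achieved data loss and downtime against RPO/RTO targets."""
    # Data written after the last backup or replica sync is lost.
    data_loss = failure_time - last_recovery_point
    # Downtime runs from the interruption until service is restored.
    downtime = restored_time - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "meets_rpo": data_loss <= rpo,
        "meets_rto": downtime <= rto,
    }


# Hypothetical incident: last backup at 09:50, failure at 10:00, restored at 11:30.
result = evaluate_recovery(
    last_recovery_point=datetime(2024, 1, 1, 9, 50),
    failure_time=datetime(2024, 1, 1, 10, 0),
    restored_time=datetime(2024, 1, 1, 11, 30),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=2),
)
print(result["meets_rpo"], result["meets_rto"])
```

With these sample values, 10 minutes of data is lost and the service is down for 90 minutes, so both a 15-minute RPO and a 2-hour RTO are met.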

The "Big Idea"

The core philosophy of AWS resilience is "Everything fails, all the time." Instead of trying to build a single "indestructible" server, we build systems out of multiple smaller, independent components distributed across isolated infrastructure (AZs and Regions). True resilience is achieved when the system can automatically detect, route around, and recover from failures without human intervention.

Formula / Concept Box

Availability Calculation

Concept	Formula / Description	Example
Availability in Series	$A_{total} = A_1 \times A_2$	Two 99.9% services = 99.8% total
Availability in Parallel	$A_{total} = 1 - (1 - A)^n$	Two 99% instances = 99.99% total
RPO	Focuses on Data Loss	"We can lose 15 minutes of data."
RTO	Focuses on Downtime	"We must be back up in 2 hours."
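The two availability formulas can be checked with a few lines of Python. This is a minimal sketch that assumes component failures are independent; the function names are illustrative and not part of any AWS tooling.

```python
from math import prod


def series_availability(*availabilities):
    """All components must be up, so multiply the individual availabilities."""
    return prod(availabilities)


def parallel_availability(availability, n):
    """At least one of n identical, independent components must be up."""
    return 1 - (1 - availability) ** n


print(f"{series_availability(0.999, 0.999):.4%}")  # two 99.9% services in series
print(f"{parallel_availability(0.99, 2):.4%}")     # two 99% instances in parallel
```

The results come out at roughly 99.80% and 99.99%, in line with the examples in the table. Correlated failures (for example, a bad deployment pushed to every instance) would make the real figure for the parallel case lower.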

Hierarchical Outline

Global Infrastructure Fundamentals
- Availability Zones: Low-latency links; isolated from local disasters.
- Regions: Geographic separation; used for ultimate disaster recovery.
Compute & Networking Resilience
- Elastic Load Balancing (ELB): Distributes traffic across healthy instances; performs health checks.
- Auto Scaling Groups (ASG): Automatically replaces failed instances and scales based on demand.
- Amazon Route 53: Global DNS that can perform health checks and failover between Regions.
Data Persistence Strategies
- RDS Multi-AZ: Synchronous replication to a standby in a different AZ (Automatic Failover).
- RDS Read Replicas: Asynchronous replication for scaling reads; can be cross-region for DR.
- Amazon S3: Inherently highly available; supports Cross-Region Replication (CRR).
Disaster Recovery (DR) Spectrum
- Backup and Restore: Lowest cost, highest RTO (manual).
- Pilot Light: Minimal version of environment always running (Core data).
- Warm Standby: Functional but scaled-down version of the environment.
- Multi-Site Active-Active: Near-zero downtime; traffic served from multiple Regions simultaneously.
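The health-check-driven failover mentioned for ELB and Route 53 in the outline can be modeled in plain Python. The sketch below is a simplified model, not the service's actual algorithm: it assumes a single checker and flips the endpoint's status only after a number of consecutive results that disagree with the current status (Route 53 exposes a comparable failure threshold setting). The function name `pick_region` and the sample check sequence are hypothetical.

```python
def pick_region(check_results, failure_threshold=3):
    """Return which Region serves traffic after each health check.

    check_results: booleans for successive checks of the primary endpoint.
    The primary is treated as unhealthy after `failure_threshold`
    consecutive failures and healthy again after the same number of passes.
    """
    healthy = True
    streak = 0  # consecutive results that disagree with the current status
    routing = []
    for passed in check_results:
        if passed == healthy:
            streak = 0
        else:
            streak += 1
            if streak >= failure_threshold:
                healthy = passed
                streak = 0
        routing.append("primary" if healthy else "secondary")
    return routing


# Hypothetical sequence: one blip, then a sustained outage, then recovery.
checks = [True, False, True, False, False, False, True, True, True]
print(pick_region(checks))
```

The single failed check early in the sequence does not trigger a failover; only the run of three consecutive failures does. Requiring a streak is a common way to avoid flapping between Regions on transient errors.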

Visual Anchors

Standard Multi-AZ Architecture

[Diagram: Standard Multi-AZ Architecture]

DR Strategy Spectrum

[Diagram: DR Strategy Spectrum]

Definition-Example Pairs

Loose Coupling: Designing components so they do not have a hard dependency on each other.
- Example: Using Amazon SQS between a web server and a video processor. If the processor fails, messages stay in the queue until it recovers.
Immutable Infrastructure: Components are never updated in place; they are replaced with new versions.
- Example: When updating an application, instead of SSHing into an EC2 to pull code, you bake a new AMI and update the Auto Scaling Group.
Self-Healing: The system's ability to detect and fix its own issues.
- Example: An ASG terminating an instance that fails its ALB health check and launching a fresh one automatically.
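The loose-coupling example above can be illustrated with a toy queue. This sketch uses an in-memory `deque` as a stand-in for a durable queue; `ToyQueue` and `drain` are hypothetical names, and the code does not call Amazon SQS or model its visibility timeout and explicit message deletion.

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a durable message queue (not a real SQS client)."""

    def __init__(self):
        self._messages = deque()

    def send(self, body):
        self._messages.append(body)

    def receive(self):
        # Return the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


def drain(queue):
    """Consumer: process everything currently waiting in the queue."""
    processed = []
    while (message := queue.receive()) is not None:
        processed.append(f"processed {message}")
    return processed


queue = ToyQueue()

# The web server keeps accepting uploads while the video processor is down.
for video in ["a.mp4", "b.mp4", "c.mp4"]:
    queue.send(video)
backlog = len(queue)  # nothing is lost; the work simply waits

# The processor recovers and works through the backlog in order.
results = drain(queue)
print(backlog, results)
```

The producer never needs to know whether the consumer is running, which is the property that lets either side fail or scale independently.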

Worked Examples

Example 1: Calculating Multi-AZ Uptime

Scenario: A solution uses two independent web servers in two different AZs. Each AZ has an availability of 99.9%. What is the theoretical availability of the web tier?

Solution:

Probability of AZ A failing = $1 - 0.999 = 0.001$
Probability of AZ B failing = $1 - 0.999 = 0.001$
Probability of BOTH failing simultaneously = $0.001 \times 0.001 = 0.000001$
Overall Availability = $1 - 0.000001 = 0.999999$ (99.9999% or "six nines").
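To make these percentages more concrete, the sketch below converts availability into expected downtime per year. It assumes a 365-day year and independent AZ failures; the helper name `downtime_per_year` is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignores leap years


def downtime_per_year(availability):
    """Expected unavailable seconds per year for a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR


single_az = 0.999
two_az = 1 - (1 - single_az) ** 2  # parallel formula with n = 2

print(f"single AZ: {downtime_per_year(single_az) / 3600:.2f} hours/year")
print(f"two AZs:   {downtime_per_year(two_az):.1f} seconds/year")
```

Under these assumptions a single 99.9% AZ allows about 8.76 hours of downtime per year, while the two-AZ web tier allows about 31.5 seconds. This is a theoretical upper bound on the benefit, since real failures are not perfectly independent.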

Example 2: Designing for Regional Failure

Scenario: A company needs an RTO of less than 15 minutes for a complete Regional outage but wants to keep costs low.

Strategy: Pilot Light

Database: Use RDS with a cross-region read replica. In a disaster, promote the replica to primary.
App Tier: Keep an AMI ready in the secondary region. Keep no EC2 instances running (to save cost).
Execution: Use Route 53 to point to the new region's ALB once the ASG has scaled up the instances from the AMI.
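One way to sanity-check a plan like this is to add up the expected duration of each failover step and compare the total with the RTO. The sketch below does that; every duration in it is a hypothetical placeholder (real values would come from DR drills), and `fits_rto` is an illustrative helper, not an AWS API.

```python
from datetime import timedelta

# Hypothetical step durations; measure your own in DR drills.
FAILOVER_STEPS = [
    ("detect outage via health checks", timedelta(minutes=2)),
    ("promote cross-region read replica", timedelta(minutes=5)),
    ("scale up ASG instances from the AMI", timedelta(minutes=4)),
    ("point Route 53 at the new Region's ALB", timedelta(minutes=2)),
]


def fits_rto(steps, rto):
    """Sum sequential step durations and report whether they fit the RTO."""
    total = sum((duration for _, duration in steps), timedelta())
    return total, total <= rto


total, ok = fits_rto(FAILOVER_STEPS, rto=timedelta(minutes=15))
print(total, ok)
```

With these placeholder numbers the steps total 13 minutes, leaving little margin under a 15-minute RTO. If measured durations exceed the budget, the usual options are to overlap steps or to move toward Warm Standby, where instances are already running.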

Checkpoint Questions

What is the main difference between RDS Multi-AZ and RDS Read Replicas regarding data consistency?
A company requires an RPO of 0 (no data loss). Which DR strategies are NOT suitable?
How does Route 53 determine if it should fail over to a secondary region?
What AWS service provides a highly available buffer to ensure loose coupling between microservices?

Answers

RDS Multi-AZ uses synchronous replication (strong consistency for failover), while Read Replicas use asynchronous replication (eventual consistency for scaling/DR).
Backup and Restore and Pilot Light are unsuitable because they usually involve some asynchronous data lag (RPO > 0).
Through health checks, which can monitor an endpoint directly or track the state of a CloudWatch alarm.
Amazon SQS (Simple Queue Service).
