Designing Highly Available and Fault-Tolerant Architectures

Learning Objectives

After studying this guide, you should be able to:

Differentiate between High Availability (HA) and Fault Tolerance (FT).
Define and calculate Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
Select appropriate Disaster Recovery (DR) strategies (Backup & Restore, Pilot Light, Warm Standby, Multi-site).
Design Multi-AZ and Multi-Region architectures using AWS Global Infrastructure.
Leverage Load Balancing and Auto Scaling to mitigate component failure.

Key Terms & Glossary

High Availability (HA): A system design that ensures an agreed-upon level of operational performance (uptime) during a contractual measurement period.
Fault Tolerance (FT): The ability of a system to continue operating without interruption despite the failure of one or more components.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can lose 4 hours of data").
RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
Immutable Infrastructure: A strategy where servers are never modified after deployment. If a change is needed, new servers are built from a common image (AMI).
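
To make the RPO and RTO definitions concrete, the following minimal sketch measures the data-loss window and the downtime of a single incident and compares them with the stated objectives. The function names and the incident timestamps are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_backup, failure, restored):
    """Return (data_loss_window, downtime) for a single incident."""
    # Everything written after the last recovery point is lost.
    data_loss_window = failure - last_backup
    # Downtime runs from the interruption until service is restored.
    downtime = restored - failure
    return data_loss_window, downtime


def meets_objectives(data_loss_window, downtime, rpo, rto):
    """True if the incident stayed within both the RPO and the RTO."""
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical incident: last backup 08:00, failure 11:30, restored 12:15.
last_backup = datetime(2024, 1, 1, 8, 0)
failure = datetime(2024, 1, 1, 11, 30)
restored = datetime(2024, 1, 1, 12, 15)

data_loss_window, downtime = recovery_metrics(last_backup, failure, restored)
print(data_loss_window)  # 3:30:00
print(downtime)          # 0:45:00
print(meets_objectives(data_loss_window, downtime,
                       rpo=timedelta(hours=4), rto=timedelta(hours=1)))  # True
```

With a 4-hour RPO and a 1-hour RTO this incident is within both objectives; tightening the RPO to 1 hour would make the same incident a violation.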

The "Big Idea"

In the cloud, failure is inevitable. Resilient architecture isn't about preventing failure entirely, but about designing for failure. By using AWS's distributed global infrastructure—Regions and Availability Zones—we can decouple components so that the failure of a single rack, data center, or even an entire geographic area does not result in a total service outage.
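
The benefit of spreading a workload across Availability Zones can be illustrated with a simple probability sketch. It assumes each AZ deployment fails independently and that the service is up as long as at least one deployment is up. Both are simplifying assumptions, and the 99% single-AZ figure is purely illustrative, not an AWS number.

```python
def combined_availability(single_az_availability: float, az_count: int) -> float:
    """Probability that at least one of `az_count` independent deployments is up."""
    if not 0.0 <= single_az_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # The service is down only if every deployment is down at the same time.
    return 1.0 - (1.0 - single_az_availability) ** az_count


for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
# 1 0.990000
# 2 0.999900
# 3 0.999999
```

Real failures are not fully independent (a bad deployment or a regional event can affect every AZ), so these figures are an upper bound rather than a prediction.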

Formula / Concept Box

Concept	Metric Focus	Business Goal
RPO	Data Loss	"How much data can we afford to recreate?"
RTO	Downtime	"How quickly must we be back online?"
Availability %	Uptime	99.99% uptime ≈ 52.6 minutes of downtime per year

[!IMPORTANT] High Availability $\neq$ Fault Tolerance. HA targets a high uptime percentage (e.g., 99.9% or 99.99%) and tolerates a brief interruption during failover. FT targets zero downtime, which requires fully redundant active components and is significantly more expensive.
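
The availability row in the table above follows from simple arithmetic. As a minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def annual_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year allowed by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {annual_downtime_minutes(pct):.1f} min/year")
```

For 99.99% this gives about 52.6 minutes per year, matching the table; each additional "nine" cuts the allowed downtime by a factor of ten.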

Hierarchical Outline

AWS Global Infrastructure
- Availability Zones (AZs): One or more physically separate data centers within a Region, each with redundant power and networking.
- Regions: Geographic areas containing 3+ AZs; used for Multi-Region DR.
Compute & Network Resilience
- Elastic Load Balancing (ELB): Distributes traffic; performs health checks to skip failing instances.
- Auto Scaling Groups (ASG): Automatically replaces failed instances to maintain desired capacity.
- Amazon Route 53: DNS-level failover and health checks.
Database Resilience
- RDS Multi-AZ: Synchronous replication to a standby instance in a different AZ for failover.
- RDS Read Replicas: Asynchronous replication to offload read traffic (can be promoted to primary for DR).
Disaster Recovery Strategies
- Backup & Restore: Cheapest; highest RTO/RPO.
- Pilot Light: Core data is live; other resources are "off" until needed.
- Warm Standby: A scaled-down version of the environment is always running.
- Multi-Site (Active-Active): Near-zero RTO/RPO; most expensive.

Visual Anchors

Disaster Recovery Spectrum

Multi-AZ Architecture

Definition-Example Pairs

Pilot Light Strategy: Keeping only the most critical data and core elements of an architecture running.
- Example: An application keeps an RDS database continuously replicated to a second Region, while the EC2 application servers exist only as prebuilt AMIs or as a skeleton Auto Scaling Group whose desired capacity stays at 0 until a disaster is declared.
Health Checks: A mechanism used by Load Balancers to determine if an instance is capable of handling requests.
- Example: An Application Load Balancer (ALB) sends an HTTP GET request to /health every 30 seconds. If the server returns a 500 error on enough consecutive checks to reach the unhealthy threshold, the ALB stops routing traffic to that server until it passes its health checks again.
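
The health-check behavior described above can be sketched as a small state tracker. This is a simplified model, not the ALB implementation: the class name, the threshold values, and the rule that any 2xx response counts as a success are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TargetHealth:
    unhealthy_threshold: int = 2  # consecutive failures before removal (assumed value)
    healthy_threshold: int = 3    # consecutive successes before re-admission (assumed value)
    healthy: bool = True
    _failures: int = 0
    _successes: int = 0

    def record(self, status_code: int) -> bool:
        """Record one health-check response and return the current health state."""
        if 200 <= status_code < 300:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


def routable(targets: Dict[str, TargetHealth]) -> List[str]:
    """Names of the targets that should currently receive traffic."""
    return [name for name, target in targets.items() if target.healthy]


targets = {"web-a": TargetHealth(), "web-b": TargetHealth()}
for code in (200, 500, 500):  # web-a fails two checks in a row
    targets["web-a"].record(code)
targets["web-b"].record(200)
print(routable(targets))  # ['web-b']
```

Requiring several consecutive results before changing state keeps a single slow or failed response from removing an otherwise healthy server.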

Worked Examples

Problem: Selecting a DR Strategy

Scenario: A financial institution requires a Disaster Recovery plan where they can be back online within 15 minutes of a regional failure (RTO) and can afford to lose no more than 5 minutes of data (RPO). Cost should be minimized.

Step-by-Step Analysis:

Backup & Restore: RTO is usually hours to days, and RPO is limited by the backup interval. Discarded.
Pilot Light: Data is replicated continuously, so the 5-minute RPO is achievable, but the application fleet must be launched from scratch, which usually takes ~20-30 minutes. Likely too slow.
Warm Standby: Data is replicated continuously and a small fleet is already running; scaling up takes ~5-10 minutes. This meets the 15-minute RTO.
Multi-Site: Meets all requirements but is the most expensive option.

Conclusion: The Warm Standby strategy is the most cost-effective choice that meets both the 15-minute RTO and the 5-minute RPO.
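
The elimination process above can be sketched as a search for the cheapest strategy whose typical RTO and RPO fit the requirement. The RTO figures are the upper ends of the rough estimates in this guide; the RPO figures and all names are illustrative assumptions, not AWS guarantees.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto_min: float  # rough, illustrative estimate
    typical_rpo_min: float  # assumed value for illustration


# Ordered from cheapest to most expensive.
STRATEGIES: List[Strategy] = [
    Strategy("Backup & Restore", typical_rto_min=24 * 60, typical_rpo_min=24 * 60),
    Strategy("Pilot Light", typical_rto_min=30, typical_rpo_min=1),
    Strategy("Warm Standby", typical_rto_min=10, typical_rpo_min=1),
    Strategy("Multi-Site (Active-Active)", typical_rto_min=0, typical_rpo_min=0),
]


def cheapest_strategy(rto_min: float, rpo_min: float,
                      strategies: List[Strategy] = STRATEGIES) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both objectives."""
    for strategy in strategies:
        if strategy.typical_rto_min <= rto_min and strategy.typical_rpo_min <= rpo_min:
            return strategy
    return None


# The scenario above: RTO of 15 minutes, RPO of 5 minutes.
print(cheapest_strategy(rto_min=15, rpo_min=5).name)  # Warm Standby
```

Because the list is ordered by cost, the first match is also the cheapest one, which mirrors the step-by-step elimination in the analysis.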

Checkpoint Questions

What is the main difference between Multi-AZ and Read Replicas for Amazon RDS?
If a business needs an RTO of near zero, which DR strategy should they implement?
How does an Application Load Balancer (ALB) contribute to High Availability?
What happens to the DNS record during an RDS Multi-AZ failover?

Answers

Multi-AZ is for HA/Failover (Synchronous); Read Replicas are for performance/scaling (Asynchronous).
Multi-site (Active-Active).
By automatically routing traffic away from unhealthy EC2 instances to healthy ones across multiple AZs.
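Amazon RDS automatically updates the DNS record (CNAME) of the DB instance endpoint to point to the standby, which is promoted to primary. The endpoint name itself does not change, so applications reconnect using the same connection string.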