Operating and Maintaining High-Availability Architectures
Operating and maintaining high-availability architectures (for example, application failovers, database failovers)
High Availability (HA) is the ability of a system to remain operational and accessible even during the failure of one or more of its components. In the context of AWS, this involves leveraging Multi-AZ and Multi-Region deployments, automated failover mechanisms, and loose coupling.
Learning Objectives
- Evaluate existing architectures to identify and remediate single points of failure (SPOFs).
- Differentiate between High Availability (HA) and Disaster Recovery (DR) strategies.
- Implement automated failover for application and database layers.
- Configure DNS and network routing policies to ensure seamless traffic redirection during outages.
- Optimize RTO (Recovery Time Objective) and RPO (Recovery Point Objective) using AWS-managed services.
Key Terms & Glossary
- Failover: A backup operational mode in which the functions of a system component are assumed by secondary system components when the primary becomes unavailable.
- RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can afford to lose 5 minutes of data").
- Multi-AZ: Deploying resources across multiple Availability Zones within a single Region to protect against data center failures.
- Multi-Region: Deploying resources across geographically separate AWS Regions to protect against regional outages.
- Self-Healing: The ability of a system to detect failure and automatically take corrective action (e.g., Auto Scaling replacing a failed instance).
The "Big Idea"
The core of high availability is Designing for Failure. In a distributed system, failures are inevitable. Instead of trying to prevent every possible failure, AWS architects build systems that detect failures immediately and route around them automatically. This shift from "prevention" to "automated recovery" allows for 99.99% (four nines) or higher availability by minimizing the human intervention required during an incident.
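The "detect and route around" loop can be sketched in a few lines of Python. This is illustrative logic only, not an AWS API: the endpoint names and the `is_healthy` probe are hypothetical stand-ins for what an ELB target group or a Route 53 health check does in a real deployment.

```python
# Minimal sketch of "detect failure, route around it automatically".
# Endpoint names and the health probe are hypothetical stand-ins for
# an ELB target group or Route 53 health check.

def first_healthy(endpoints, is_healthy):
    """Return the first endpoint that passes its health check.

    Mirrors failover routing: the primary is preferred, and secondaries
    are used only when everything ahead of them in the list is down.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint: total outage")

# Example: the primary in AZ-A is down, so traffic shifts to AZ-B.
status = {"app-az-a": False, "app-az-b": True}
target = first_healthy(["app-az-a", "app-az-b"], lambda e: status[e])
print(target)  # app-az-b
```

The key design point is that no human is in the loop: the routing decision is a pure function of health-check state, which is what lets recovery happen in seconds rather than the minutes an on-call page would take.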
Formula / Concept Box
| Concept | Metric | Goal |
|---|---|---|
| Availability | Uptime Percentage | Aim for "nines" (e.g., 99.99%) |
| RTO | Time Duration | Minimize time to restore service |
| RPO | Time Duration | Minimize data loss from last backup/sync |
| Loose Coupling | Dependency Type | Use SQS/SNS to prevent cascading failures |
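The "nines" in the table map directly to a downtime budget, and redundancy compounds availability multiplicatively. A quick sketch of both calculations (the formulas are standard availability arithmetic, not AWS-specific):

```python
# Downtime budget implied by an availability target, and the combined
# availability of n independent redundant copies (e.g., Multi-AZ).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability):
    """Annual downtime allowed by an availability fraction (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR

def parallel_availability(single, n):
    """Availability of n independent copies in parallel.

    The system is down only if ALL n copies fail simultaneously.
    """
    return 1 - (1 - single) ** n

# 99.99% ("four nines") allows roughly 52.6 minutes of downtime a year.
print(round(downtime_minutes_per_year(0.9999), 1))   # 52.6

# Two independent 99% components in parallel reach four nines.
print(round(parallel_availability(0.99, 2), 4))      # 0.9999
```

The second function is the mathematical justification for Multi-AZ: two merely "two nines" components, if their failures are independent, combine into a four-nines system.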
Hierarchical Outline
- I. Infrastructure Resilience
- Multi-AZ Deployments: Synchronous replication, automatic failover (RDS, Aurora).
- Multi-Region Deployments: Asynchronous replication, global traffic management.
- Auto Scaling: Health checks (EC2, ELB) and automatic replacement.
- II. Database High Availability
- Amazon RDS: Multi-AZ standby (synchronous).
- Amazon Aurora: Multi-AZ by default, Global Database for cross-region.
- Amazon DynamoDB: Global Tables (active-active replication).
- III. Network & Routing
- Amazon Route 53: Health checks, Failover routing, Latency-based routing.
- AWS Global Accelerator: Static IP entry points, rapid failover via AWS backbone.
- IV. Disaster Recovery (DR) Patterns
- Backup & Restore: High RTO/RPO, lowest cost.
- Pilot Light: Core data is kept live; compute is provisioned and scaled up only at failover.
- Warm Standby: Scaled-down version of production is always running.
- Multi-Site/Active-Active: Zero RTO, highest cost.
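The four DR patterns above trade recovery speed against cost. A toy selector makes that decision rule explicit; the RTO thresholds are illustrative examples for study purposes, not official AWS guidance.

```python
# Illustrative decision rule mapping a target RTO (in minutes) to the
# cheapest viable DR pattern from the outline above. The thresholds are
# example values, not official AWS guidance.

def choose_dr_strategy(rto_minutes):
    if rto_minutes >= 8 * 60:          # hours of downtime acceptable
        return "Backup & Restore"
    if rto_minutes >= 60:              # roughly an hour or more
        return "Pilot Light"
    if rto_minutes >= 5:               # minutes
        return "Warm Standby"
    return "Multi-Site/Active-Active"  # near-zero RTO

print(choose_dr_strategy(480))  # Backup & Restore
print(choose_dr_strategy(90))   # Pilot Light
print(choose_dr_strategy(2))    # Multi-Site/Active-Active
```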
Visual Anchors
Application Failover Logic
Multi-AZ Database Architecture
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, fill=blue!10, text width=2.5cm, align=center}]
  \node (Primary) {Primary DB (AZ-A)};
  \node (Standby) [right of=Primary, xshift=3cm] {Standby DB (AZ-B)};
  \node (App) [above of=Primary, xshift=2.5cm] {Application Server};
  \draw[<->, thick] (Primary) -- node[above] {Sync Replication} (Standby);
  \draw[->, thick] (App) -- node[left] {Writes} (Primary);
  \draw[dashed, ->, thick] (App) -- node[right] {Failover Link} (Standby);
  \node[draw=none, fill=none, below of=Primary, yshift=1cm] {\textbf{Region A}};
\end{tikzpicture}
Definition-Example Pairs
- Statelessness: Designing applications so that any server can handle any request without local session data.
- Example: Storing user session tokens in Amazon ElastiCache instead of on the EC2 instance's local RAM.
- Read Replicas: Copies of a database used to offload read traffic and provide a manual or automated failover target.
- Example: Using an Aurora Read Replica in a different region to provide local low-latency reads for global users.
- Circuit Breaker Pattern: A design pattern used to detect failures and encapsulate the logic of preventing a failure from constantly recurring.
- Example: An application stops trying to call a failing third-party API for 30 seconds after 5 consecutive timeouts, returning a cached response instead.
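The circuit breaker example above can be written as a small class. This is a minimal sketch, not a production library: the failure threshold (5) and open window (30 s) mirror the example, and the clock is injectable so the behavior can be exercised without real timeouts.

```python
import time

# Minimal circuit breaker matching the example above: after 5 consecutive
# failures the circuit "opens" for 30 seconds, during which a cached
# fallback is returned instead of calling the failing dependency.

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # timestamp when the circuit opened

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback     # open: skip the failing dependency
            self.opened_at = None   # window expired: allow a trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            return fallback
        self.failures = 0           # success resets the failure count
        return result
```

The design choice worth noting is that the breaker fails *fast*: while open, it returns the fallback without touching the dependency at all, which is what prevents a struggling downstream service from being hammered into a cascading failure.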
Worked Examples
Scenario: Minimizing RTO for a Global Database Failover
Problem: A financial application uses Amazon RDS for MySQL. The business requires an RTO of less than 1 minute for regional failures.
Step-by-Step Solution:
- Analyze Current State: Standard RDS Cross-Region Read Replicas take several minutes to promote and require DNS updates.
- Select Service: Switch to Amazon Aurora Global Database.
- Implement: Deploy a primary Aurora cluster in `us-east-1` and a secondary cluster in `eu-west-1`.
- Failover Process:
- Use the Global Database Failover feature.
- Promotion of the secondary cluster typically completes in under 1 minute.
- Update application connection strings or use a Route 53 CNAME with a low TTL (e.g., 60s) to point to the new cluster endpoint.
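The promotion step itself is a single API call. A hedged sketch: the function below only assembles the parameters for the RDS `FailoverGlobalCluster` operation (exposed in boto3 as `rds.failover_global_cluster`); the cluster identifiers are placeholders, and the live call is left commented out so the sketch stays runnable without AWS credentials.

```python
# Sketch of the failover step: promoting the secondary Aurora cluster
# via the FailoverGlobalCluster API. Identifiers below are placeholders;
# the boto3 call is commented out so this runs offline.

def build_global_failover_request(global_cluster_id, target_cluster_arn):
    """Parameters for rds.failover_global_cluster (boto3 RDS client)."""
    return {
        "GlobalClusterIdentifier": global_cluster_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }

request = build_global_failover_request(
    "payments-global",                                          # placeholder
    "arn:aws:rds:eu-west-1:111122223333:cluster:payments-eu",   # placeholder
)

# With credentials configured, the real promotion would be:
# import boto3
# boto3.client("rds", region_name="eu-west-1").failover_global_cluster(**request)
```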
Checkpoint Questions
- What is the primary difference between Route 53 and AWS Global Accelerator regarding failover speed?
- In a "Pilot Light" DR strategy, which resources are typically kept running at all times?
- Why is DynamoDB Global Tables considered an "active-active" architecture?
- How does an ELB (Elastic Load Balancer) contribute to High Availability within a single region?
Muddy Points & Cross-Refs
- RTO vs. RPO: People often confuse these. Remember: RPO is about Data Loss (think "Restore Point"), and RTO is about Downtime (think "Return To Operations").
- Route 53 TTL Issues: Even with a low TTL, some client-side DNS resolvers ignore these values and cache IPs longer than intended. AWS Global Accelerator solves this by providing static Anycast IPs that do not rely on DNS propagation for failover.
- Multi-AZ vs. Read Replicas: Multi-AZ is for high availability (automatic failover, synchronous). Read Replicas are primarily for scaling (manual failover, asynchronous).
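The TTL caveat above can be put in rough numbers: even a well-behaved client keeps hitting the dead endpoint until the health check trips *and* its cached DNS answer expires. A back-of-the-envelope helper, using Route 53's default 30-second request interval and failure threshold of 3 (the figures are worst-case estimates, not guarantees):

```python
# Back-of-the-envelope worst case for DNS-based failover: traffic keeps
# flowing to the dead endpoint until the health check marks it unhealthy
# AND the client's cached DNS answer expires. Illustrative, not a guarantee.

def worst_case_dns_failover_seconds(check_interval, failure_threshold, ttl):
    detection = check_interval * failure_threshold  # time to mark unhealthy
    return detection + ttl                          # plus DNS cache expiry

# Route 53 defaults (30s interval x 3 failures) with a 60s record TTL:
print(worst_case_dns_failover_seconds(30, 3, 60))  # 150
```

This is why Global Accelerator's static Anycast IPs matter: they remove the TTL term entirely, leaving only the health-check detection time.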
Comparison Tables
Disaster Recovery Strategy Comparison
| Strategy | RTO / RPO | Cost | Complexity |
|---|---|---|---|
| Backup & Restore | Hours | $ | Low |
| Pilot Light | Minutes/Hours | $$ | Medium |
| Warm Standby | Minutes | $$$ | High |
| Multi-Site | Seconds (Real-time) | $$$$ | Very High |
AWS Database HA Comparison
| Feature | RDS Multi-AZ | Aurora Global | DynamoDB Global Tables |
|---|---|---|---|
| Replication Type | Synchronous | Asynchronous (Cross-Region) | Asynchronous (Active-Active) |
| Failover Scope | AZ-Level (Auto) | Region-Level (Manual/Auto) | Region-Level (Automatic) |
| Read Access | Standby is Passive | Replicas are Active | All Replicas are Active |