Operating and Maintaining High-Availability Architectures
Operating and maintaining high-availability architectures (for example, application failovers, database failovers)
High Availability (HA) is the ability of a system to remain operational and accessible even during the failure of one or more of its components. In the context of AWS, this involves leveraging Multi-AZ and Multi-Region deployments, automated failover mechanisms, and loose coupling.
Learning Objectives
- Evaluate existing architectures to identify and remediate single points of failure (SPOFs).
- Differentiate between High Availability (HA) and Disaster Recovery (DR) strategies.
- Implement automated failover for application and database layers.
- Configure DNS and network routing policies to ensure seamless traffic redirection during outages.
- Optimize RTO (Recovery Time Objective) and RPO (Recovery Point Objective) using AWS-managed services.
Key Terms & Glossary
- Failover: A backup operational mode in which the functions of a system component are assumed by secondary system components when the primary becomes unavailable.
- RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can afford to lose 5 minutes of data").
- Multi-AZ: Deploying resources across multiple Availability Zones within a single Region to protect against data center failures.
- Multi-Region: Deploying resources across geographically separate AWS Regions to protect against regional outages.
- Self-Healing: The ability of a system to detect failure and automatically take corrective action (e.g., Auto Scaling replacing a failed instance).
The "Big Idea"
The core of high availability is Designing for Failure. In a distributed system, failures are inevitable. Instead of trying to prevent every possible failure, AWS architects build systems that detect failures immediately and route around them automatically. This shift from "prevention" to "automated recovery" allows for 99.99% (four nines) or higher availability by minimizing the human intervention required during an incident.
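The "detect and route around" loop can be sketched in a few lines of Python. This is illustrative logic only, not an AWS API: the endpoint names and the `is_healthy` probe are hypothetical stand-ins for what an ELB target group or a Route 53 health check does in a real deployment.

```python
# Minimal sketch of "detect failure, route around it automatically".
# Endpoint names and the health probe are hypothetical stand-ins for
# an ELB target group or Route 53 health check.

def first_healthy(endpoints, is_healthy):
    """Return the first endpoint that passes its health check.

    Mirrors failover routing: the primary is preferred, and secondaries
    are used only when everything ahead of them in the list is down.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint: total outage")

# Example: the primary in AZ-A is down, so traffic shifts to AZ-B.
status = {"app-az-a": False, "app-az-b": True}
target = first_healthy(["app-az-a", "app-az-b"], lambda e: status[e])
print(target)  # app-az-b
```

The key design point is that no human is in the loop: the routing decision is a pure function of health-check state, which is what lets recovery happen in seconds rather than the minutes an on-call page would take.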
Formula / Concept Box
| Concept | Metric | Goal |
|---|---|---|
| Availability | Uptime Percentage | Aim for "nines" (e.g., 99.99%) |
| RTO | Time Duration | Minimize time to restore service |
| RPO | Time Duration | Minimize data loss from last backup/sync |
| Loose Coupling | Dependency Type | Use SQS/SNS to prevent cascading failures |
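The "nines" in the table map directly to a downtime budget, and redundancy compounds availability multiplicatively. A quick sketch of both calculations (the formulas are standard availability arithmetic, not AWS-specific):

```python
# Downtime budget implied by an availability target, and the combined
# availability of n independent redundant copies (e.g., Multi-AZ).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability):
    """Annual downtime allowed by an availability fraction (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR

def parallel_availability(single, n):
    """Availability of n independent copies in parallel.

    The system is down only if ALL n copies fail simultaneously.
    """
    return 1 - (1 - single) ** n

# 99.99% ("four nines") allows roughly 52.6 minutes of downtime a year.
print(round(downtime_minutes_per_year(0.9999), 1))   # 52.6

# Two independent 99% components in parallel reach four nines.
print(round(parallel_availability(0.99, 2), 4))      # 0.9999
```

The second function is the mathematical justification for Multi-AZ: two merely "two nines" components, if their failures are independent, combine into a four-nines system.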
Hierarchical Outline
- I. Infrastructure Resilience
- Multi-AZ Deployments: Synchronous replication, automatic failover (RDS, Aurora).
- Multi-Region Deployments: Asynchronous replication, global traffic management.
- Auto Scaling: Health checks (EC2, ELB) and automatic replacement.
- II. Database High Availability
- Amazon RDS: Multi-AZ standby (synchronous).
- Amazon Aurora: Multi-AZ by default, Global Database for cross-region.
- Amazon DynamoDB: Global Tables (active-active replication).
- III. Network & Routing
- Amazon Route 53: Health checks, Failover routing, Latency-based routing.
- AWS Global Accelerator: Static IP entry points, rapid failover via AWS backbone.
- IV. Disaster Recovery (DR) Patterns
- Backup & Restore: High RTO/RPO, lowest cost.
- Pilot Light: Core data is kept live; compute is provisioned and scaled up only at failover.
- Warm Standby: Scaled-down version of production is always running.
- Multi-Site/Active-Active: Zero RTO, highest cost.
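The four DR patterns above trade recovery speed against cost. A toy selector makes that decision rule explicit; the RTO thresholds are illustrative examples for study purposes, not official AWS guidance.

```python
# Illustrative decision rule mapping a target RTO (in minutes) to the
# cheapest viable DR pattern from the outline above. The thresholds are
# example values, not official AWS guidance.

def choose_dr_strategy(rto_minutes):
    if rto_minutes >= 8 * 60:          # hours of downtime acceptable
        return "Backup & Restore"
    if rto_minutes >= 60:              # roughly an hour or more
        return "Pilot Light"
    if rto_minutes >= 5:               # minutes
        return "Warm Standby"
    return "Multi-Site/Active-Active"  # near-zero RTO

print(choose_dr_strategy(480))  # Backup & Restore
print(choose_dr_strategy(90))   # Pilot Light
print(choose_dr_strategy(2))    # Multi-Site/Active-Active
```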
Visual Anchors
Application Failover Logic
Multi-AZ Database Architecture
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, fill=blue!10, text width=2.5cm, align=center}]
  \node (Primary) {Primary DB (AZ-A)};
  \node (Standby) [right of=Primary, xshift=3cm] {Standby DB (AZ-B)};
  \node (App) [above of=Primary, xshift=2.5cm] {Application Server};
  \draw[<->, thick] (Primary) -- node[above] {Sync Replication} (Standby);
  \draw[->, thick] (App) -- node[left] {Writes} (Primary);
  \draw[dashed, ->, thick] (App) -- node[right] {Failover Link} (Standby);
  \node[draw=none, fill=none, below of=Primary, yshift=1cm] {\textbf{Region A}};
\end{tikzpicture}
Definition-Example Pairs
- Statelessness: Designing applications so that any server can handle any request without local session data.
- Example: Storing user session tokens in Amazon ElastiCache instead of on the EC2 instance's local RAM.
- Read Replicas: Copies of a database used to offload read traffic and provide a manual or automated failover target.
- Example: Using an Aurora Read Replica in a different region to provide local low-latency reads for global users.
- Circuit Breaker Pattern: A design pattern used to detect failures and encapsulate the logic of preventing a failure from constantly recurring.
- Example: An application stops trying to call a failing third-party API for 30 seconds after 5 consecutive timeouts, returning a cached response instead.
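The circuit breaker example above can be written as a small class. This is a minimal sketch, not a production library: the failure threshold (5) and open window (30 s) mirror the example, and the clock is injectable so the behavior can be exercised without real timeouts.

```python
import time

# Minimal circuit breaker matching the example above: after 5 consecutive
# failures the circuit "opens" for 30 seconds, during which a cached
# fallback is returned instead of calling the failing dependency.

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # timestamp when the circuit opened

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback     # open: skip the failing dependency
            self.opened_at = None   # window expired: allow a trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            return fallback
        self.failures = 0           # success resets the failure count
        return result
```

The design choice worth noting is that the breaker fails *fast*: while open, it returns the fallback without touching the dependency at all, which is what prevents a struggling downstream service from being hammered into a cascading failure.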
Worked Examples
Scenario: Minimizing RTO for a Global Database Failover
Problem: A financial application uses Amazon RDS for MySQL. The business requires an RTO of less than 1 minute for regional failures.
Step-by-Step Solution:
- Analyze Current State: Standard RDS Cross-Region Read Replicas take several minutes to promote and require DNS updates.
- Select Service: Switch to Amazon Aurora Global Database.
- Implement: Deploy a primary Aurora cluster in `us-east-1` and a secondary cluster in `eu-west-1`.
- Failover Process:
- Use the Global Database Failover feature.
- Promotion of the secondary cluster typically completes in under 1 minute.
- Update application connection strings or use a Route 53 CNAME with a low TTL (e.g., 60s) to point to the new cluster endpoint.
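The promotion step itself is a single API call. A hedged sketch: the function below only assembles the parameters for the RDS `FailoverGlobalCluster` operation (exposed in boto3 as `rds.failover_global_cluster`); the cluster identifiers are placeholders, and the live call is left commented out so the sketch stays runnable without AWS credentials.

```python
# Sketch of the failover step: promoting the secondary Aurora cluster
# via the FailoverGlobalCluster API. Identifiers below are placeholders;
# the boto3 call is commented out so this runs offline.

def build_global_failover_request(global_cluster_id, target_cluster_arn):
    """Parameters for rds.failover_global_cluster (boto3 RDS client)."""
    return {
        "GlobalClusterIdentifier": global_cluster_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }

request = build_global_failover_request(
    "payments-global",                                          # placeholder
    "arn:aws:rds:eu-west-1:111122223333:cluster:payments-eu",   # placeholder
)

# With credentials configured, the real promotion would be:
# import boto3
# boto3.client("rds", region_name="eu-west-1").failover_global_cluster(**request)
```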
Checkpoint Questions
- What is the primary difference between Route 53 and AWS Global Accelerator regarding failover speed?
- In a "Pilot Light" DR strategy, which resources are typically kept running at all times?
- Why is DynamoDB Global Tables considered an "active-active" architecture?
- How does an ELB (Elastic Load Balancer) contribute to High Availability within a single region?
Muddy Points & Cross-Refs
- RTO vs. RPO: People often confuse these. Remember: RPO is about Data Loss (think "Restore Point"), and RTO is about Downtime (think "Return To Operations").
- Route 53 TTL Issues: Even with a low TTL, some client-side DNS resolvers ignore these values and cache IPs longer than intended. AWS Global Accelerator solves this by providing static Anycast IPs that do not rely on DNS propagation for failover.
- Multi-AZ vs. Read Replicas: Multi-AZ is for high availability (automatic failover, synchronous). Read Replicas are primarily for scaling (manual failover, asynchronous).
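The TTL caveat above can be put in rough numbers: even a well-behaved client keeps hitting the dead endpoint until the health check trips *and* its cached DNS answer expires. A back-of-the-envelope helper, using Route 53's default 30-second request interval and failure threshold of 3 (the figures are worst-case estimates, not guarantees):

```python
# Back-of-the-envelope worst case for DNS-based failover: traffic keeps
# flowing to the dead endpoint until the health check marks it unhealthy
# AND the client's cached DNS answer expires. Illustrative, not a guarantee.

def worst_case_dns_failover_seconds(check_interval, failure_threshold, ttl):
    detection = check_interval * failure_threshold  # time to mark unhealthy
    return detection + ttl                          # plus DNS cache expiry

# Route 53 defaults (30s interval x 3 failures) with a 60s record TTL:
print(worst_case_dns_failover_seconds(30, 3, 60))  # 150
```

This is why Global Accelerator's static Anycast IPs matter: they remove the TTL term entirely, leaving only the health-check detection time.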
Comparison Tables
Disaster Recovery Strategy Comparison
| Strategy | RTO / RPO | Cost | Complexity |
|---|---|---|---|
| Backup & Restore | Hours | $ | Low |
| Pilot Light | Minutes/Hours | $$ | Medium |
| Warm Standby | Minutes | $$$ | High |
| Multi-Site | Seconds (Real-time) | $$$$ | Very High |
AWS Database HA Comparison
| Feature | RDS Multi-AZ | Aurora Global | DynamoDB Global Tables |
|---|---|---|---|
| Replication Type | Synchronous | Asynchronous (Cross-Region) | Asynchronous (Active-Active) |
| Failover Scope | AZ-Level (Auto) | Region-Level (Manual/Auto) | Region-Level (Automatic) |
| Read Access | Standby is Passive | Replicas are Active | All Replicas are Active |