Operating and Maintaining High-Availability Architectures

Operating and maintaining high-availability architectures (for example, application failovers, database failovers)

High Availability (HA) is the ability of a system to remain operational and accessible even during the failure of one or more of its components. In the context of AWS, this involves leveraging Multi-AZ and Multi-Region deployments, automated failover mechanisms, and loose coupling.

Learning Objectives

  • Evaluate existing architectures to identify and remediate single points of failure (SPOFs).
  • Differentiate between High Availability (HA) and Disaster Recovery (DR) strategies.
  • Implement automated failover for application and database layers.
  • Configure DNS and network routing policies to ensure seamless traffic redirection during outages.
  • Optimize RTO (Recovery Time Objective) and RPO (Recovery Point Objective) using AWS-managed services.

Key Terms & Glossary

  • Failover: A backup operational mode in which the functions of a system component are assumed by secondary system components when the primary becomes unavailable.
  • RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
  • RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can afford to lose 5 minutes of data").
  • Multi-AZ: Deploying resources across multiple Availability Zones within a single Region to protect against data center failures.
  • Multi-Region: Deploying resources across geographically separate AWS Regions to protect against regional outages.
  • Self-Healing: The ability of a system to detect failure and automatically take corrective action (e.g., Auto Scaling replacing a failed instance).
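The RTO and RPO definitions above can be made concrete by measuring what an incident actually cost. The following is a minimal sketch: the function names and the incident timeline are hypothetical, chosen only to illustrate the two measurements.

```python
from datetime import datetime, timedelta


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime actually experienced: from interruption to restoration."""
    return service_restored - outage_start


def achieved_rpo(last_recoverable_write: datetime, outage_start: datetime) -> timedelta:
    """Window of lost data: from the last write that survived to the failure."""
    return outage_start - last_recoverable_write


# Hypothetical incident timeline (made-up values for illustration only).
last_backup = datetime(2024, 1, 1, 11, 55)
outage = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 12, 20)

rto = achieved_rto(outage, restored)      # 20 minutes of downtime
rpo = achieved_rpo(last_backup, outage)   # 5 minutes of lost data

# Compare against assumed business targets: RTO 30 minutes, RPO 5 minutes.
meets_rto = rto <= timedelta(minutes=30)
meets_rpo = rpo <= timedelta(minutes=5)
print(rto, rpo, meets_rto, meets_rpo)
```

Both values are durations, but they are measured on opposite sides of the failure: RPO looks backward from the outage, RTO looks forward from it.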

The "Big Idea"

The core of high availability is Designing for Failure. In a distributed system, failures are inevitable. Instead of trying to prevent every possible failure, AWS architects build systems that detect failures immediately and route around them automatically. This shift from "prevention" to "automated recovery" allows for 99.99% (four nines) or higher availability by minimizing the human intervention required during an incident.

Formula / Concept Box

| Concept | Metric | Goal |
|---|---|---|
| Availability | (Uptime / Total Time) × 100 | Aim for "nines" (e.g., 99.99%) |
| RTO | Time Duration | Minimize time to restore service |
| RPO | Time Duration | Minimize data loss from last backup/sync |
| Loose Coupling | Dependency Type | Use SQS/SNS to prevent cascading failures |
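To see what the availability formula means in practice, the sketch below converts an availability percentage into allowed downtime per year and shows how redundancy raises availability. It assumes a 365-day year and that redundant components fail independently, which real deployments only approximate; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability_percent(uptime: float, total_time: float) -> float:
    """Availability = (Uptime / Total Time) x 100."""
    return uptime / total_time * 100


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability (as a fraction) of N copies when any single copy suffices.

    Assumes the copies fail independently of one another.
    """
    return 1 - (1 - component_availability) ** copies


for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {allowed_downtime_minutes(pct):.2f} minutes/year")

# Two independent copies at 99% each behave like roughly four nines.
print(redundant_availability(0.99, 2))
```

Four nines leaves roughly 52.56 minutes of downtime per year, which is why manual recovery steps rarely fit inside that budget.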

Hierarchical Outline

  • I. Infrastructure Resilience
    • Multi-AZ Deployments: Synchronous replication, automatic failover (RDS, Aurora).
    • Multi-Region Deployments: Asynchronous replication, global traffic management.
    • Auto Scaling: Health checks (EC2, ELB) and automatic replacement.
  • II. Database High Availability
    • Amazon RDS: Multi-AZ standby (synchronous).
    • Amazon Aurora: Multi-AZ by default, Global Database for cross-region.
    • Amazon DynamoDB: Global Tables (active-active replication).
  • III. Network & Routing
    • Amazon Route 53: Health checks, Failover routing, Latency-based routing.
    • AWS Global Accelerator: Static IP entry points, rapid failover via AWS backbone.
  • IV. Disaster Recovery (DR) Patterns
    • Backup & Restore: High RTO/RPO, lowest cost.
    • Pilot Light: Core data is live, compute is scaled at failover.
    • Warm Standby: Scaled-down version of production is always running.
    • Multi-Site/Active-Active: Near-zero RTO, highest cost.
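The health checks and failover routing listed in the outline share one idea: send traffic to the primary until it fails several checks in a row, then switch to the secondary. The sketch below simulates that decision locally. It is not the Route 53 or ELB API; the class, the function, the hostnames, and the threshold of 3 are assumptions for illustration, and failing back after a single successful check is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> None:
        """Reset the counter on success, increment it on failure."""
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1


def pick_endpoint(primary: Endpoint, secondary: Endpoint,
                  failure_threshold: int = 3) -> Endpoint:
    """Return the primary unless it has failed `failure_threshold` checks in a row."""
    if primary.consecutive_failures >= failure_threshold:
        return secondary
    return primary


# Hypothetical endpoints and a made-up sequence of health-check results.
primary = Endpoint("primary.example.com")
secondary = Endpoint("standby.example.com")

routed = []
for healthy in [True, False, False, False, True]:
    primary.record(healthy)
    routed.append(pick_endpoint(primary, secondary).name)

print(routed)
```

Requiring several consecutive failures trades a slower failover for protection against flapping on a single dropped check.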

Visual Anchors

Application Failover Logic

[Diagram: Application Failover Logic]

Multi-AZ Database Architecture

\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, fill=blue!10, text width=2.5cm, align=center}]
  \node (Primary) {Primary DB (AZ-A)};
  \node (Standby) [right of=Primary, xshift=3cm] {Standby DB (AZ-B)};
  \node (App) [above of=Primary, xshift=2.5cm] {Application Server};

  % Synchronous replication flows one way, from the primary to the standby.
  \draw[->, thick] (Primary) -- node[above] {Sync Replication} (Standby);
  \draw[->, thick] (App) -- node[left] {Writes} (Primary);
  \draw[dashed, ->, thick] (App) -- node[right] {Failover Link} (Standby);

  \node[draw=none, fill=none, below of=Primary, yshift=1cm] {\textbf{Region A}};
\end{tikzpicture}

Definition-Example Pairs

  • Statelessness: Designing applications so that any server can handle any request without local session data.
    • Example: Storing user session tokens in Amazon ElastiCache instead of on the EC2 instance's local RAM.
  • Read Replicas: Copies of a database used to offload read traffic and provide a manual or automated failover target.
    • Example: Using an Aurora Read Replica in a different region to provide local low-latency reads for global users.
  • Circuit Breaker Pattern: A design pattern that detects repeated failures of a dependency and temporarily stops calling it, so the failing component is not flooded with requests and the failure does not cascade to callers.
    • Example: An application stops trying to call a failing third-party API for 30 seconds after 5 consecutive timeouts, returning a cached response instead.
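The circuit breaker example above (5 consecutive timeouts, a 30-second pause, a cached fallback) can be sketched as follows. This is a minimal, single-threaded illustration rather than a production implementation; the class name, the `flaky_api` and `cached` functions, and the injectable clock are all hypothetical.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the example can fake time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the remote call until the timeout has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None   # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # a success closes the circuit fully
        return result


# Usage with a fake clock and an API stub that always times out.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
calls = []


def flaky_api():
    calls.append(now[0])
    raise TimeoutError("upstream timed out")


def cached():
    return "cached response"


results = [breaker.call(flaky_api, cached) for _ in range(7)]
calls_while_open = len(calls)      # only the first 5 attempts reached the API

now[0] = 31.0                      # 31 seconds later the breaker permits a trial call
breaker.call(flaky_api, cached)
print(calls_while_open, len(calls))
```

Injecting the clock keeps the example deterministic; in real code the default `time.monotonic` would be used.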

Worked Examples

Scenario: Minimizing RTO for a Global Database Failover

Problem: A financial application uses Amazon RDS for MySQL. The business requires an RTO of less than 1 minute for regional failures.

Step-by-Step Solution:

  1. Analyze Current State: Standard RDS Cross-Region Read Replicas take several minutes to promote and require DNS updates.
  2. Select Service: Migrate from RDS for MySQL to Amazon Aurora (MySQL-compatible) and use Aurora Global Database.
  3. Implement: Deploy a primary Aurora cluster in us-east-1 and a secondary cluster in eu-west-1.
  4. Failover Process:
    • Use the Global Database Failover feature.
    • Promotion of the secondary cluster typically completes in under 1 minute.
    • Update application connection strings or use a Route 53 CNAME with a low TTL (e.g., 60s) to point to the new cluster endpoint.
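The DNS step in the failover process can be sketched as a Route 53 change batch that repoints a CNAME at the promoted cluster. The helper below only builds the request payload, so it runs without AWS credentials; the function name, record name, and endpoint placeholder are hypothetical, and the commented-out call assumes boto3 is installed and configured.

```python
def cname_failover_change(record_name: str, new_target: str, ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch that upserts a low-TTL CNAME record."""
    return {
        "Comment": "Point the application at the promoted cluster",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_target}],
                },
            }
        ],
    }


# Placeholder values; substitute the real record name and cluster endpoint.
batch = cname_failover_change("db.example.internal.", "<secondary-cluster-endpoint>")
print(batch["Changes"][0]["ResourceRecordSet"])

# With boto3 and credentials available, the payload would be submitted as:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<hosted-zone-id>", ChangeBatch=batch)
```

Keeping the application pointed at a stable CNAME means the failover changes one DNS record instead of every connection string.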

Checkpoint Questions

  1. What is the primary difference between Route 53 and AWS Global Accelerator regarding failover speed?
  2. In a "Pilot Light" DR strategy, which resources are typically kept running at all times?
  3. Why is DynamoDB Global Tables considered an "active-active" architecture?
  4. How does an ELB (Elastic Load Balancer) contribute to High Availability within a single region?

Muddy Points & Cross-Refs

  • RTO vs. RPO: People often confuse these. Remember: RPO is about Data Loss (think "Restore Point"), and RTO is about Downtime (think "Return To Operations").
  • Route 53 TTL Issues: Even with a low TTL, some client-side DNS resolvers ignore these values and cache IPs longer than intended. AWS Global Accelerator solves this by providing static Anycast IPs that do not rely on DNS propagation for failover.
  • Multi-AZ vs. Read Replicas: Multi-AZ is for high availability (automatic failover, synchronous). Read Replicas are primarily for scaling (manual failover, asynchronous).
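The Route 53 TTL point can be quantified with a rough estimate: a client may keep using the failed endpoint until the health check has detected the failure and the cached DNS answer has expired. The sketch below assumes resolvers honor the TTL and ignores the time needed to propagate the health status; the function name and the input values are illustrative.

```python
def dns_failover_seconds(check_interval: int, failure_threshold: int, ttl: int) -> int:
    """Rough estimate: time to detect the failure plus time for caches to expire."""
    detection = check_interval * failure_threshold
    return detection + ttl


# Assumed settings: 30 s check interval, 3 failed checks, 60 s TTL.
standard = dns_failover_seconds(30, 3, 60)

# Assumed faster settings: 10 s check interval, same threshold and TTL.
fast = dns_failover_seconds(10, 3, 60)

print(standard, fast)
```

Under these assumptions, shortening the check interval helps, but the TTL remains a floor that only a non-DNS mechanism such as static Anycast IPs removes.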

Comparison Tables

Disaster Recovery Strategy Comparison

| Strategy | RTO / RPO | Cost | Complexity |
|---|---|---|---|
| Backup & Restore | Hours | $ | Low |
| Pilot Light | Minutes/Hours | $$ | Medium |
| Warm Standby | Minutes | $$$ | High |
| Multi-Site | Seconds (Real-time) | $$$$ | Very High |
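One way to read the table is as a selection rule: pick the cheapest strategy whose recovery time still meets the business target. The sketch below encodes that rule. The minute values are illustrative assumptions standing in for "Hours", "Minutes", and "Seconds", not AWS-published figures, and the function name is hypothetical.

```python
# (strategy, assumed worst-case RTO in minutes), ordered cheapest to most expensive.
# The numbers are placeholders for illustration only.
STRATEGIES = [
    ("Backup & Restore", 24 * 60),
    ("Pilot Light", 60),
    ("Warm Standby", 10),
    ("Multi-Site", 1),
]


def cheapest_strategy(rto_target_minutes: float) -> str:
    """Return the first (cheapest) strategy whose assumed RTO meets the target."""
    for name, assumed_rto in STRATEGIES:
        if assumed_rto <= rto_target_minutes:
            return name
    # Nothing cheaper qualifies, so fall back to the most resilient option.
    return STRATEGIES[-1][0]


for target in (2000, 240, 15, 0.5):
    print(target, "->", cheapest_strategy(target))
```

In practice the RPO target, budget, and operational complexity would also enter the decision; this sketch considers RTO alone.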

AWS Database HA Comparison

| Feature | RDS Multi-AZ | Aurora Global | DynamoDB Global Tables |
|---|---|---|---|
| Replication Type | Synchronous | Asynchronous (Cross-Region) | Asynchronous (Active-Active) |
| Failover Scope | AZ-Level (Auto) | Region-Level (Manual/Auto) | Region-Level (Automatic) |
| Read Access | Standby is Passive | Replicas are Active | All Replicas are Active |
