AWS Study Guide: Designing Highly Available and Fault-Tolerant Architectures

Determining the AWS services required to provide a highly available and/or fault-tolerant architecture across AWS Regions or Availability Zones

This guide covers the strategies and AWS services needed to build resilient cloud architectures that can withstand the failure of a component, an Availability Zone (AZ), or an entire Region.

Learning Objectives

By the end of this study guide, you should be able to:

  • Differentiate between High Availability (HA) and Fault Tolerance (FT).
  • Select appropriate AWS services to eliminate single points of failure (SPOF).
  • Explain the trade-offs between Multi-AZ and Multi-Region deployments.
  • Define and apply Disaster Recovery (DR) strategies: Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active-Active.
  • Calculate workload availability for redundant components.

Key Terms & Glossary

  • Availability Zone (AZ): One or more discrete data centers with redundant power, networking, and connectivity in an AWS Region.
  • Region: A physical location around the world where AWS clusters data centers.
  • RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time.
  • RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
  • Fault Tolerance: The ability of a system to continue operating without interruption during the failure of one or more of its components.
  • High Availability: A design approach that keeps a system operational for a high percentage of time (usually measured as uptime) over a given period. Unlike Fault Tolerance, brief interruptions during failover are acceptable.
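
To make RPO and RTO concrete, the sketch below measures a single incident against both objectives. The function name `assess_incident`, its parameters, and the timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def assess_incident(last_recovery_point, failure_time, restored_time, rpo, rto):
    """Compare one incident against RPO and RTO targets (illustrative only)."""
    # Data written after the last recovery point (backup or replicated state) is lost.
    data_loss = failure_time - last_recovery_point
    # Downtime runs from the interruption until service is restored.
    downtime = restored_time - failure_time
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Assumed scenario: last backup at 11:50, failure at 12:00, service back at 14:30.
result = assess_incident(
    last_recovery_point=datetime(2024, 1, 1, 11, 50),
    failure_time=datetime(2024, 1, 1, 12, 0),
    restored_time=datetime(2024, 1, 1, 14, 30),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=2),
)
print(result["rpo_met"], result["rto_met"])
```

In this assumed scenario, 10 minutes of data is lost, which is within the 15-minute RPO, but the 2.5 hours of downtime misses the 2-hour RTO. The two objectives are measured independently.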

The "Big Idea"

The core philosophy of AWS resilience is "Everything fails, all the time." Instead of trying to build a single "indestructible" server, we build systems out of multiple smaller, independent components distributed across isolated infrastructure (AZs and Regions). True resilience is achieved when the system can automatically detect, route around, and recover from failures without human intervention.

Formula / Concept Box

Availability Calculation

| Concept | Formula / Description | Example |
| --- | --- | --- |
| Availability in Series | $A_{total} = A_1 \times A_2$ | Two 99.9% services = 99.8% total |
| Availability in Parallel | $A_{total} = 1 - (1 - A)^n$ | Two 99% instances = 99.99% total |
| RPO | Focuses on Data Loss | "We can lose 15 minutes of data." |
| RTO | Focuses on Downtime | "We must be back up in 2 hours." |
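
The series and parallel formulas can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and the parallel formula assumes the redundant components fail independently of each other.

```python
import math


def series_availability(availabilities):
    """Availability of components that must ALL work (a dependency chain)."""
    return math.prod(availabilities)


def parallel_availability(availability, n):
    """Availability of n identical, independent redundant components."""
    return 1 - (1 - availability) ** n


# Two 99.9% services chained together, then two redundant 99% instances.
print(f"series:   {series_availability([0.999, 0.999]):.4%}")
print(f"parallel: {parallel_availability(0.99, 2):.4%}")
```

Chaining dependencies lowers availability (about 99.8%), while adding redundancy raises it (about 99.99%). This is why resilient designs add parallel copies of each tier.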

Hierarchical Outline

  1. Global Infrastructure Fundamentals
    • Availability Zones: Low-latency links; isolated from local disasters.
    • Regions: Geographic separation; used for ultimate disaster recovery.
  2. Compute & Networking Resilience
    • Elastic Load Balancing (ELB): Distributes traffic across healthy instances; performs health checks.
    • Auto Scaling Groups (ASG): Automatically replaces failed instances and scales based on demand.
    • Amazon Route 53: Global DNS that can perform health checks and failover between Regions.
  3. Data Persistence Strategies
    • RDS Multi-AZ: Synchronous replication to a standby in a different AZ (Automatic Failover).
    • RDS Read Replicas: Asynchronous replication for scaling reads; can be cross-region for DR.
    • Amazon S3: Inherently highly available; supports Cross-Region Replication (CRR).
  4. Disaster Recovery (DR) Spectrum
    • Backup and Restore: Lowest cost, highest RTO (manual).
    • Pilot Light: Core data is replicated and core infrastructure is provisioned, but application servers stay switched off until failover.
    • Warm Standby: Functional but scaled-down version of the environment.
    • Multi-Site Active-Active: Near-zero RTO at the highest cost; traffic is served from multiple Regions simultaneously.
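
The Route 53 failover mentioned in the outline above is configured as a pair of failover records, one PRIMARY and one SECONDARY. The sketch below builds the change batch as a plain dictionary in the shape that boto3's `change_resource_record_sets` accepts. The function name, the record name, the DNS names, and the hosted zone IDs are all placeholders.

```python
def failover_change_batch(record_name, primary, secondary):
    """Build a Route 53 ChangeBatch with PRIMARY/SECONDARY alias records.

    `primary` and `secondary` are dicts holding each load balancer's
    `dns_name` and `hosted_zone_id` (placeholder values in this sketch).
    """
    def record(role, target):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                "SetIdentifier": f"{role.lower()}-region",
                "Failover": role,  # "PRIMARY" or "SECONDARY"
                "AliasTarget": {
                    "HostedZoneId": target["hosted_zone_id"],
                    "DNSName": target["dns_name"],
                    # Let Route 53 use the load balancer's own health status.
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {
        "Comment": "Failover routing between two Regions",
        "Changes": [record("PRIMARY", primary), record("SECONDARY", secondary)],
    }


batch = failover_change_batch(
    "app.example.com.",
    {"dns_name": "primary-alb.example.com.", "hosted_zone_id": "ZPRIMARYEXAMPLE"},
    {"dns_name": "secondary-alb.example.com.", "hosted_zone_id": "ZSECONDARYEXAMPLE"},
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

With real values, the dictionary would be passed as `ChangeBatch` to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`. While the primary target is healthy, Route 53 answers with the primary record, and it switches to the secondary when the primary is reported unhealthy.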

Visual Anchors

Standard Multi-AZ Architecture

[Diagram not available: standard Multi-AZ architecture]

DR Strategy Spectrum

\begin{tikzpicture}[scale=0.8]
  % Axes: cost and complexity rise to the right; RPO/RTO rise to the left.
  \draw[thick, ->] (0,0) -- (12,0) node[anchor=north] {Cost \& Complexity};
  \draw[thick, <-] (0,0.5) -- (12,0.5) node[anchor=south] {RPO / RTO (Speed of Recovery)};

  % The four DR strategies, ordered from cheapest to most expensive.
  \node[draw, rectangle, fill=blue!10] at (1.5, -1) {Backup};
  \node[draw, rectangle, fill=blue!20] at (4.5, -1) {Pilot Light};
  \node[draw, rectangle, fill=blue!30] at (7.5, -1) {Warm Standby};
  \node[draw, rectangle, fill=blue!40] at (10.5, -1) {Active-Active};

  % Typical recovery time at each end of the spectrum.
  \node[below] at (1.5, -1.5) {Hours/Days};
  \node[below] at (10.5, -1.5) {Real-time};
\end{tikzpicture}

Definition-Example Pairs

  • Loose Coupling: Designing components so they do not have a hard dependency on each other.
    • Example: Using Amazon SQS between a web server and a video processor. If the processor fails, messages stay in the queue until it recovers.
  • Immutable Infrastructure: Components are never updated in place; they are replaced with new versions.
    • Example: When updating an application, instead of connecting over SSH to an EC2 instance to pull code, you bake a new AMI and update the Auto Scaling group's launch template so that replacement instances use it.
  • Self-Healing: The system's ability to detect and fix its own issues.
    • Example: An ASG terminating an instance that fails its ALB health check and launching a fresh one automatically.
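
The loose-coupling example above can be simulated locally. In this sketch Python's in-process `queue.Queue` stands in for Amazon SQS. It is not durable the way SQS is and is used only to show the decoupling. The function names and video IDs are hypothetical.

```python
import queue

# Stand-in for an SQS queue: the producer and consumer only share this buffer.
jobs = queue.Queue()


def web_server_upload(video_id):
    """Producer: enqueue work and return immediately."""
    jobs.put(video_id)


def video_processor_drain():
    """Consumer: process whatever has accumulated in the queue."""
    processed = []
    while True:
        try:
            processed.append(jobs.get_nowait())
        except queue.Empty:
            return processed


# The processor is "down" while three uploads arrive; the web server is unaffected.
for video_id in ("v1", "v2", "v3"):
    web_server_upload(video_id)
backlog = jobs.qsize()

# The processor recovers and works through the backlog in arrival order.
recovered = video_processor_drain()
print(backlog, recovered)
```

The web server never calls the processor directly, so a processor outage only delays work instead of failing the upload request.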

Worked Examples

Example 1: Calculating Multi-AZ Uptime

Scenario: A solution uses two independent web servers in two different AZs. Each AZ has an availability of 99.9%. What is the theoretical availability of the web tier?

Solution:

  1. Probability of AZ A failing = $1 - 0.999 = 0.001$
  2. Probability of AZ B failing = $1 - 0.999 = 0.001$
  3. Probability of BOTH failing simultaneously = $0.001 \times 0.001 = 0.000001$
  4. Overall Availability = $1 - 0.000001 = 0.999999$ (99.9999% or "six nines").
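
To see what the extra nines mean in practice, the availability figures above can be converted into expected downtime per year. This is a minimal sketch that assumes a 365-day year and, as in the example, that the two AZs fail independently. The names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def annual_downtime_seconds(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1 - availability) * SECONDS_PER_YEAR


single_az = annual_downtime_seconds(0.999)
two_az = annual_downtime_seconds(1 - (1 - 0.999) ** 2)

print(f"single AZ: {single_az / 3600:.2f} hours/year")
print(f"two AZs:   {two_az:.1f} seconds/year")
```

Under these assumptions, one AZ at 99.9% allows roughly 8.76 hours of downtime per year, while the two-AZ web tier allows roughly half a minute.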

Example 2: Designing for Regional Failure

Scenario: A company needs an RTO of less than 15 minutes for a complete Regional outage but wants to keep costs low.

Strategy: Pilot Light

  • Database: Use RDS with a cross-region read replica. In a disaster, promote the replica to primary.
  • App Tier: Keep an AMI ready in the secondary region. Keep no EC2 instances running (to save cost).
  • Execution: Use Route 53 to point to the new region's ALB once the ASG has scaled up the instances from the AMI.
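
The failover steps above could be scripted roughly as follows. This is a sketch under stated assumptions, not a complete runbook. The function name, the identifiers, and the capacity values are hypothetical. The clients are passed in so the logic can be exercised without AWS access. Waiting for the promotion to finish and the final Route 53 change are left out.

```python
def fail_over_pilot_light(rds, autoscaling, replica_id, asg_name, capacity, max_size):
    """Run the pilot-light failover steps using injected AWS clients."""
    # 1. Promote the cross-Region read replica to a standalone primary.
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # 2. Scale the application tier up from zero. New instances launch from
    #    the pre-baked AMI referenced by the group's launch template.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        MinSize=capacity,
        MaxSize=max_size,
        DesiredCapacity=capacity,
    )
    return ["promote_replica", "scale_app_tier"]
```

With boto3 installed, the clients would be created in the recovery Region, for example `boto3.client("rds", region_name=...)` and `boto3.client("autoscaling", region_name=...)`. Injecting them keeps the function easy to rehearse against stub clients before a real disaster.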

Checkpoint Questions

  1. What is the main difference between RDS Multi-AZ and RDS Read Replicas regarding data consistency?
  2. A company requires an RPO of 0 (no data loss). Which DR strategy is NOT suitable?
  3. How does Route 53 determine if it should fail over to a secondary region?
  4. What AWS service provides a highly available buffer to ensure loose coupling between microservices?

Answers

  1. RDS Multi-AZ uses synchronous replication (strong consistency for failover), while Read Replicas use asynchronous replication (eventual consistency for scaling/DR).
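  2. Backup and Restore is the clearest mismatch, because its RPO equals the time since the last backup. Pilot Light is also unsuitable when it relies on asynchronous cross-Region replication, since some replication lag remains (RPO > 0).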
  3. Through Route 53 health checks, which can monitor an endpoint directly or track the state of a CloudWatch alarm.
  4. Amazon SQS (Simple Queue Service).
