Mastering High Availability and Resiliency on AWS

This study guide focuses on the architectural principles required to design and maintain highly available and resilient systems on AWS, specifically tailored for the Solutions Architect Professional (SAP-C02) exam.

Learning Objectives

By the end of this guide, you should be able to:

Differentiate between High Availability (HA) and Disaster Recovery (DR).
Design architectures that remediate Single Points of Failure (SPOF).
Implement loosely coupled dependencies using messaging services.
Configure resilient hybrid connectivity using AWS Direct Connect and VPN.
Evaluate and manage service quotas and IP address allocation for scaling.

Key Terms & Glossary

High Availability (HA): The ability of a workload to remain functional despite component-level failures. Focuses on "local" failures (e.g., a single instance or AZ).
Resiliency: The ability of a system to recover from interruptions and dynamically handle changes in demand.
Disaster Recovery (DR): The process of preparing for and recovering from large-scale events (e.g., regional outages).
Single Point of Failure (SPOF): Any part of a system that, if it fails, will stop the entire system from working.
Loose Coupling: An approach where components have little or no knowledge of the internal workings of other components, typically achieved via SQS or SNS.

The "Big Idea"

In a distributed cloud environment, failure is inevitable. The "Big Idea" of HA and Resiliency is to move from Reactive Recovery (fixing things when they break) to Proactive Design (building systems that are "self-healing"). This means distributing resources across multiple Availability Zones, automating scaling, and replicating data continuously so that the loss of any single component goes unnoticed by the end user.

Formula / Concept Box

Concept	Metric / Rule	Description
Availability %	$(Up / (Up + Down)) \times 100$	The percentage of time a system is operational.
SLA Serial	$A \times B = Total$	For sequential components, total availability is the product of both.
SLA Parallel	$1 - (1-A) \times (1-B)$	For redundant components, total availability is 1 minus the probability both fail.
Subnet Reservation	5 IP Addresses	AWS reserves the first four IP addresses and the last one in every subnet CIDR block.
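
The availability rules in the box above can be checked numerically. The following is a minimal Python sketch; the function names and the sample figures (99% per component, 8.76 hours of downtime in a year) are illustrative assumptions, not values from any AWS SLA.

```python
def availability_pct(up_hours: float, down_hours: float) -> float:
    """Percentage of time the system was operational."""
    return up_hours / (up_hours + down_hours) * 100


def serial(*availabilities: float) -> float:
    """Sequential components: all must be up, so multiply the availabilities."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def parallel(*availabilities: float) -> float:
    """Redundant components: the system is down only if every one fails."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1 - a)
    return 1 - all_fail


if __name__ == "__main__":
    # Hypothetical year: 8.76 hours down out of 8760 hours.
    print(round(availability_pct(8751.24, 8.76), 3))
    # Two 99% components in series versus in parallel.
    print(round(serial(0.99, 0.99), 4))
    print(round(parallel(0.99, 0.99), 4))
```

Chaining two 99% components in series lowers total availability to about 98.01%, while running them redundantly raises it to about 99.99%, which is why redundancy is applied to each tier in a serial chain.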

Hierarchical Outline

I. Foundational HA Components
- Compute Scaling: Using Auto Scaling Groups (ASG) to maintain instance counts and Elastic Load Balancing (ELB) to distribute traffic.
- Storage Replication: Utilizing Amazon RDS Multi-AZ for synchronous replication to a standby in another AZ, and Amazon S3 Cross-Region Replication (CRR) for asynchronous copies in another Region.
II. Advanced Resiliency Patterns
- Decoupling: Using Amazon SQS as a buffer to prevent system-wide failure during traffic spikes.
- Route 53 Policies: Implementing failover routing with health checks to send users only to healthy endpoints, optionally combined with Latency-based or Geolocation routing to choose among the healthy ones.
III. Hybrid & Connectivity Resiliency
- Direct Connect (DX): Deploying redundant connections at separate DX locations.
- Failover Mechanisms: Setting up DX to VPN failover as a cost-effective backup strategy.
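
The decoupling item in the outline can be illustrated without any AWS calls. The sketch below is a toy simulation only: an in-memory `deque` stands in for an SQS queue, and `simulate`, `arrivals`, and `worker_rate` are hypothetical names. It shows how a queue absorbs a traffic spike so that a slower consumer drains the backlog over time instead of failing.

```python
from collections import deque


def simulate(arrivals, worker_rate):
    """Simulate a producer and a rate-limited consumer joined by a queue.

    arrivals: number of messages produced on each tick.
    worker_rate: maximum messages the consumer can process per tick.
    Returns (backlog after each tick, total messages processed).
    """
    queue = deque()
    backlog = []
    processed = 0
    next_id = 0
    for produced in arrivals:
        # The producer enqueues and moves on; it never waits for the consumer.
        for _ in range(produced):
            queue.append(next_id)
            next_id += 1
        # The consumer takes at most worker_rate messages this tick.
        for _ in range(min(worker_rate, len(queue))):
            queue.popleft()
            processed += 1
        backlog.append(len(queue))
    return backlog, processed


if __name__ == "__main__":
    # A spike of 10 messages hits a consumer that handles 4 per tick.
    backlog, processed = simulate([2, 10, 2, 0, 0], worker_rate=4)
    print(backlog)    # backlog grows during the spike, then drains
    print(processed)  # every message is eventually processed
```

In a tightly coupled design the spike tick would exceed the consumer's capacity and requests would be dropped; here the excess simply waits in the queue.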

Visual Anchors

High Availability Multi-AZ Architecture

Redundant Hybrid Connectivity (DX + VPN)

Definition-Example Pairs

Self-Healing: The ability of a system to detect failure and automatically remediate it.
- Example: An EC2 Auto Scaling Group detecting a failed health check and terminating the instance, then launching a fresh one to replace it.
Single-AZ Constraint: Some workloads cannot be distributed across AZs due to latency requirements.
- Example: Amazon EMR clusters are often constrained to a single AZ for job performance. Resiliency here requires automated redeployment scripts to recreate the cluster in a new AZ if the original fails.
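
The self-healing behavior described above amounts to a reconciliation loop: compare the actual fleet with the desired capacity and correct the difference. The following is a simplified Python model, not the Auto Scaling implementation; `Instance`, `launch`, and `reconcile` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from itertools import count

_ids = count(1)  # source of fake instance IDs for this simulation


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


def launch() -> Instance:
    """Create a fresh, healthy simulated instance."""
    return Instance(f"i-{next(_ids):04d}")


def reconcile(fleet, desired):
    """One health-check pass: drop unhealthy instances, then launch
    replacements until the fleet is back at the desired capacity."""
    survivors = [inst for inst in fleet if inst.healthy]
    while len(survivors) < desired:
        survivors.append(launch())
    return survivors


if __name__ == "__main__":
    fleet = [launch(), launch()]
    fleet[0].healthy = False          # simulate a failed health check
    fleet = reconcile(fleet, desired=2)
    print([inst.instance_id for inst in fleet])
```

The same idea applies to the single-AZ EMR case: the "desired state" is one running cluster, and the automated redeployment script plays the role of `reconcile`.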

Worked Examples

Scenario: Remediating a Single Point of Failure

Problem: A legacy application runs on a single EC2 instance with a local MySQL database. If the instance fails, the business loses all data and access.

Step-by-Step Remediation:

1. Extract Data: Move the database from the EC2 instance to Amazon RDS Multi-AZ. This provides automated synchronous replication to a standby instance in another AZ.
2. Stateless Compute: Modify the application to store session data in Amazon ElastiCache or DynamoDB so the EC2 instances are stateless.
3. Implement ASG: Wrap the EC2 instances in an Auto Scaling Group with a minimum capacity of 2 across two different AZs.
4. Load Balance: Place an Application Load Balancer (ALB) in front of the ASG to handle traffic distribution and health checks.
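
The effect of the remediation steps above can be sanity-checked with a small script. This is a sketch under stated assumptions: the architecture is described as a plain dictionary, the AZ labels `az-a` and `az-b` are placeholders, and the rule "fewer than two nodes or fewer than two AZs means SPOF" is a simplification for study purposes.

```python
def find_spofs(tiers):
    """Return the names of tiers that have fewer than 2 nodes
    or that are confined to fewer than 2 Availability Zones."""
    spofs = []
    for name, tier in tiers.items():
        if tier["instances"] < 2 or len(set(tier["azs"])) < 2:
            spofs.append(name)
    return spofs


# Before: everything lives on one instance in one AZ.
legacy = {
    "app + MySQL on one EC2 instance": {"instances": 1, "azs": ["az-a"]},
}

# After: each tier from the remediation steps is redundant across two AZs.
remediated = {
    "ALB": {"instances": 2, "azs": ["az-a", "az-b"]},
    "ASG (min capacity 2)": {"instances": 2, "azs": ["az-a", "az-b"]},
    "RDS Multi-AZ": {"instances": 2, "azs": ["az-a", "az-b"]},
    "Session store": {"instances": 2, "azs": ["az-a", "az-b"]},
}

if __name__ == "__main__":
    print(find_spofs(legacy))
    print(find_spofs(remediated))
```

Note that two instances in the same AZ still count as a SPOF under this rule, which mirrors the requirement in the ASG step to spread capacity across two different AZs.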

Checkpoint Questions

1. What is the main difference between HA and DR in terms of scope?
2. Why must an architect account for 5 reserved IP addresses when planning VPC subnets?
3. In a hybrid environment, what is a cost-effective alternative to having two separate Direct Connect locations?
4. Which AWS service would you use to decouple a front-end web tier from a back-end processing tier to ensure high resiliency?

[!TIP] Answer Hints: 1. HA = Local/Component failure; DR = Regional/Large-scale failure. 2. AWS reserves the first 4 and the last 1 IP address of each subnet for internal networking, so fewer addresses are usable than the CIDR size suggests. 3. DX for primary and VPN for backup. 4. Amazon SQS.
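
The impact of the five reserved addresses on subnet planning can be worked out with Python's standard `ipaddress` module. This is a minimal sketch for IPv4 subnets; the CIDR ranges are example values, and `usable_ips` and `reserved_ips` are hypothetical helper names.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # first four addresses plus the last one


def usable_ips(cidr: str) -> int:
    """Number of addresses left for resources in an IPv4 subnet."""
    net = ipaddress.ip_network(cidr)
    return net.num_addresses - AWS_RESERVED_PER_SUBNET


def reserved_ips(cidr: str):
    """The five addresses that cannot be assigned in the subnet."""
    net = ipaddress.ip_network(cidr)
    first_four = [str(net[i]) for i in range(4)]
    return first_four + [str(net[-1])]


if __name__ == "__main__":
    print(usable_ips("10.0.1.0/24"))
    print(usable_ips("10.0.1.0/28"))
    print(reserved_ips("10.0.1.0/24"))
```

The overhead matters most for small subnets: a /28 has 16 addresses but only 11 are usable, which can limit how far an Auto Scaling Group can scale out in that subnet.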

Muddy Points & Cross-Refs

EMR Resiliency: Students often struggle with why EMR is Single-AZ. It is for performance (reduced inter-node latency). Cross-ref: Data Engineering domain for EMR automation.
Sync vs Async Replication: RDS Multi-AZ is Synchronous (HA), while RDS Read Replicas are usually Asynchronous (Scalability/DR). Mixing these up is a common exam pitfall.

Comparison Tables

Feature	High Availability (HA)	Disaster Recovery (DR)
Primary Goal	Minimize downtime (Uptime)	Business Continuity (Recovery)
Scope	Availability Zones / Instances	Regions / Data Centers
Implementation	ELB, ASG, Multi-AZ RDS	Cross-Region Replication, Backup/Restore
Cost	Usually higher (Always-on)	Variable (Pilot Light vs. Warm Standby)

Connectivity	Performance	Cost	Resiliency
Direct Connect	High / Consistent	High	Medium (if single location)
Site-to-Site VPN	Variable (Internet)	Low	High (via multiple tunnels)
DX + VPN Failover	High (Primary)	Balanced	Very High
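
The DX + VPN failover row can be pictured as a preference-ordered choice among healthy paths. The following toy model is an illustration only and does not represent BGP or any AWS API; `Path`, `preference`, and `select_path` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Path:
    name: str
    preference: int  # lower value = preferred path
    healthy: bool


def select_path(paths) -> Optional[Path]:
    """Pick the most preferred healthy path, or None if all are down."""
    candidates = [p for p in paths if p.healthy]
    return min(candidates, key=lambda p: p.preference, default=None)


if __name__ == "__main__":
    dx = Path("Direct Connect", preference=1, healthy=True)
    vpn = Path("Site-to-Site VPN", preference=2, healthy=True)

    print(select_path([dx, vpn]).name)  # DX carries traffic while healthy
    dx.healthy = False                  # simulate a DX outage
    print(select_path([dx, vpn]).name)  # traffic fails over to the VPN
```

The VPN costs little while idle, which is why this pairing appears as the balanced-cost option in the table.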