Lab: Building a Resilient Multi-AZ Architecture on AWS

This hands-on lab focuses on the Reliability Pillar of the AWS Well-Architected Framework. You will design and implement a self-healing architecture that leverages Multi-AZ deployments for both compute and database layers to meet high-availability requirements.

Prerequisites

  • AWS CLI: Installed and configured with aws configure.
  • IAM Permissions: AdministratorAccess or PowerUserAccess, which covers the VPC, EC2, Auto Scaling, and RDS actions used in this lab.
  • Network: A default VPC in your region with at least two public subnets in different Availability Zones.
  • Region: Use us-east-1 (N. Virginia) for consistency with this lab guide.
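
The later steps need two subnet IDs from different Availability Zones. A minimal sketch for finding them, assuming the default VPC still has its default subnets (the function names list_default_subnets and pick_two_azs are illustrative, not part of the lab):

```bash
#!/usr/bin/env bash
# Print "subnet-id<TAB>availability-zone" for each default subnet in a region.
list_default_subnets() {
  local region="${1:-us-east-1}"
  aws ec2 describe-subnets \
    --region "$region" \
    --filters "Name=default-for-az,Values=true" \
    --query "Subnets[].[SubnetId,AvailabilityZone]" \
    --output text
}

# Read "subnet-id az" lines on stdin and print two subnet IDs
# that sit in different AZs (one subnet per AZ, first two AZs).
pick_two_azs() {
  sort -k2,2 -u | head -n 2 | awk '{print $1}'
}

# Example (hypothetical variable names):
#   read -r SUBNET_ID_1 SUBNET_ID_2 < <(list_default_subnets | pick_two_azs | xargs)
```

The resulting values can be substituted for the subnet ID placeholders used in the steps of this lab.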

[!WARNING] Remember to run the teardown commands at the end to avoid ongoing charges for RDS and EC2 instances.

Learning Objectives

  • Configure a Multi-AZ RDS instance for automated failover.
  • Implement an Auto Scaling Group (ASG) across multiple Availability Zones.
  • Simulate infrastructure failure to verify self-healing capabilities.
  • Understand the relationship between RTO/RPO and architectural choices.

Architecture Overview

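(Diagram not rendered.) An Auto Scaling group of EC2 web servers spans two Availability Zones in one VPC, backed by a Multi-AZ RDS MySQL instance whose standby runs in a second AZ.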

Step-by-Step Instructions

Step 1: Create a DB Subnet Group

RDS requires a subnet group that spans at least two Availability Zones to enable Multi-AZ.

bash
# Replace <SUBNET_ID_1> and <SUBNET_ID_2> with your actual subnet IDs
aws rds create-db-subnet-group \
  --db-subnet-group-name "brainybee-db-group" \
  --db-subnet-group-description "Subnet group for resilient lab" \
  --subnet-ids "<SUBNET_ID_1>" "<SUBNET_ID_2>"

Console alternative

Navigate to RDS > Subnet groups > Create DB subnet group. Select your VPC and add subnets from at least two different AZs.

Step 2: Provision a Multi-AZ RDS Instance

We will deploy a MySQL instance with high availability enabled. This creates a synchronous standby in a different AZ.

bash
# Lab-only credentials: choose your own password and do not reuse it elsewhere
aws rds create-db-instance \
  --db-instance-identifier "brainybee-resilient-db" \
  --db-instance-class "db.t3.micro" \
  --engine "mysql" \
  --master-username "admin" \
  --master-user-password 'BrainyBee123!' \
  --allocated-storage 20 \
  --db-subnet-group-name "brainybee-db-group" \
  --multi-az \
  --no-publicly-accessible

[!NOTE] The --multi-az flag is the key differentiator here. It ensures that if the primary AZ fails, RDS automatically updates the DNS record to point to the standby instance.
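
To see the failover behavior rather than take it on trust, you can inspect where the primary and standby live and then force a failover. The following is a sketch under stated assumptions: the function names are illustrative, and the wait call may return early if the instance has not yet left the available state when it is first polled.

```bash
#!/usr/bin/env bash
# Print "MultiAZ  primary-AZ  standby-AZ" for a DB instance.
db_az_layout() {
  local db_id="$1"
  aws rds describe-db-instances \
    --db-instance-identifier "$db_id" \
    --query "DBInstances[0].[MultiAZ,AvailabilityZone,SecondaryAvailabilityZone]" \
    --output text
}

# Reboot with failover so the standby is promoted, then block
# until RDS reports the instance as available again.
force_failover() {
  local db_id="$1"
  aws rds reboot-db-instance \
    --db-instance-identifier "$db_id" \
    --force-failover > /dev/null
  aws rds wait db-instance-available --db-instance-identifier "$db_id"
}

# Example:
#   db_az_layout brainybee-resilient-db    # note the two AZs
#   force_failover brainybee-resilient-db
#   db_az_layout brainybee-resilient-db    # the AZs should have swapped
```

The endpoint hostname stays the same across the failover; only the AZ behind it changes, which is why applications should connect by endpoint name rather than IP address.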

Step 3: Launch an Auto Scaling Group (ASG)

First, create a Launch Template for your web servers.

bash
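# AMI IDs are region-specific, so a hardcoded ID may not exist in us-east-1.
# Resolve the latest Amazon Linux 2023 AMI for the current region instead.
AMI_ID=$(aws ssm get-parameter \
  --name "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64" \
  --query "Parameter.Value" \
  --output text)

# Create a simple launch template
aws ec2 create-launch-template \
  --launch-template-name "resilient-web-template" \
  --launch-template-data "{\"ImageId\":\"$AMI_ID\",\"InstanceType\":\"t2.micro\"}"

# Create the ASG spanning two subnets
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name "resilient-asg" \
  --launch-template "LaunchTemplateName=resilient-web-template" \
  --min-size 2 --max-size 4 --desired-capacity 2 \
  --vpc-zone-identifier "<SUBNET_ID_1>,<SUBNET_ID_2>"

Console alternative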

Navigate to EC2 > Auto Scaling Groups > Create Auto Scaling group. Define a launch template first, then select your VPC and two subnets during the ASG wizard.

Step 4: Simulate a Failure

To test reliability, we will terminate one instance and observe the ASG behavior.

bash
# Find a running instance that belongs to the ASG
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=resilient-asg" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].InstanceId" \
  --output text)

# Terminate the instance
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

Checkpoints

  1. ASG Recovery: Run aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names resilient-asg. Within 2-3 minutes, you should see a new instance being launched to replace the terminated one.
  2. RDS Status: Run aws rds describe-db-instances --db-instance-identifier brainybee-resilient-db. Verify that MultiAZ is set to true and the status is available.
  3. Cross-AZ Distribution: Ensure your EC2 instances are running in different Availability Zones (e.g., one in us-east-1a and one in us-east-1b).
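
Checkpoints 1 and 3 can be checked from one query. A minimal sketch, with illustrative function names, that lists each instance in the group and counts the distinct AZs holding an InService instance:

```bash
#!/usr/bin/env bash
# Print "instance-id  AZ  lifecycle-state  health" for each instance in the group.
asg_instances() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query "AutoScalingGroups[0].Instances[].[InstanceId,AvailabilityZone,LifecycleState,HealthStatus]" \
    --output text
}

# Count the distinct AZs that currently hold an InService instance.
in_service_az_count() {
  asg_instances "$1" | awk '$3 == "InService" {print $2}' | sort -u | wc -l | tr -d ' '
}

# Example:
#   asg_instances resilient-asg
#   in_service_az_count resilient-asg   # 2 once the replacement is InService
```

While the replacement instance is still Pending, the count can briefly drop to 1; re-run the check after a few minutes.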

Teardown

To avoid costs, delete the resources created in this lab.

bash
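# 1. Delete the Auto Scaling group (--force-delete also terminates its instances)
aws autoscaling delete-auto-scaling-group \
  --auto-scaling-group-name "resilient-asg" \
  --force-delete

# 2. Delete the launch template created in Step 3
aws ec2 delete-launch-template --launch-template-name "resilient-web-template"

# 3. Delete the RDS instance and wait until the deletion completes
aws rds delete-db-instance \
  --db-instance-identifier "brainybee-resilient-db" \
  --skip-final-snapshot
aws rds wait db-instance-deleted --db-instance-identifier "brainybee-resilient-db"

# 4. Delete the DB subnet group (this fails while the DB instance still exists)
aws rds delete-db-subnet-group --db-subnet-group-name "brainybee-db-group"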

Troubleshooting

| Error | Cause | Fix |
| --- | --- | --- |
| InvalidParameterValue for RDS | The DB subnet group has only one subnet or covers a single AZ. | Ensure the DB subnet group contains at least two subnets in different AZs. |
| ASG not launching instances | IAM permissions or AMI ID issues. | Check the Activity history on the ASG's Activity tab in the console to see the failure reason. |
| Cannot connect to RDS | Security group rules. | Ensure the RDS security group allows inbound traffic from your EC2 security group on port 3306. |
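
For the last row, the usual fix is an ingress rule on the database's security group that references the web servers' security group. A sketch, assuming you have looked up both group IDs yourself (the function name and the sg- placeholders are hypothetical; this lab does not create dedicated security groups):

```bash
#!/usr/bin/env bash
# Allow MySQL (TCP 3306) into the DB security group from the web security group.
allow_mysql_from() {
  local db_sg="$1" web_sg="$2"
  aws ec2 authorize-security-group-ingress \
    --group-id "$db_sg" \
    --protocol tcp \
    --port 3306 \
    --source-group "$web_sg"
}

# Example (placeholder IDs):
#   allow_mysql_from sg-0123456789abcdef0 sg-0fedcba9876543210
```

Referencing a security group instead of a CIDR range keeps the rule valid as the ASG replaces instances and their IP addresses change.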

Challenge

Pilot Light Implementation: How would you modify this architecture to reduce cost if you could accept a higher RTO?

  • Goal: Create an Amazon Machine Image (AMI) of your web server and store a database snapshot in a secondary region (us-west-2). Write a script that can provision the ASG and RDS instance from these assets only when a disaster occurs.
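
As a starting point only, the replication half of the challenge might look like the sketch below. It assumes you have already created a manual DB snapshot and an AMI in us-east-1; the function name and the target names brainybee-dr-snapshot and resilient-web-dr are hypothetical. The provisioning script that restores from these assets is left to you.

```bash
#!/usr/bin/env bash
# Copy a DB snapshot and an AMI from the source region to the DR region.
copy_dr_assets() {
  local snapshot_arn="$1" ami_id="$2"
  local src="${3:-us-east-1}" dst="${4:-us-west-2}"

  # Cross-region snapshot copies are requested in the destination region
  # and identify the source snapshot by its ARN.
  aws rds copy-db-snapshot \
    --region "$dst" \
    --source-region "$src" \
    --source-db-snapshot-identifier "$snapshot_arn" \
    --target-db-snapshot-identifier "brainybee-dr-snapshot"

  # AMI copies are likewise requested in the destination region.
  aws ec2 copy-image \
    --region "$dst" \
    --source-region "$src" \
    --source-image-id "$ami_id" \
    --name "resilient-web-dr"
}

# Example (placeholder values):
#   copy_dr_assets "<SNAPSHOT_ARN>" "<AMI_ID>"
```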

Cost Estimate

| Service | Estimated Hourly Cost | Free Tier Eligible? |
| --- | --- | --- |
| EC2 (2x t2.micro) | $0.0232 | Yes (750 hrs/mo) |
| RDS (db.t3.micro Multi-AZ) | $0.0360 | No (Free Tier covers Single-AZ only) |
| ALB | $0.0225 | Yes (limited) |
| Total | ~$0.08 / hour | - |
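
The total is the sum of the three hourly rates. A small sketch for projecting it over a longer lab session, using the rates from the table (the helper name lab_cost is illustrative, and actual prices vary by region and over time):

```bash
#!/usr/bin/env bash
# Usage: lab_cost HOURS RATE [RATE...]
# Sums the hourly rates and multiplies by the number of hours.
lab_cost() {
  local hours="$1"; shift
  printf '%s\n' "$@" | awk -v h="$hours" '{ total += $1 } END { printf "%.4f\n", total * h }'
}

# Example:
#   lab_cost 1 0.0232 0.0360 0.0225    # 0.0817, about $0.08 per hour
#   lab_cost 24 0.0232 0.0360 0.0225   # 1.9608 if left running for a day
```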

Concept Review

As discussed in the AWS SAP-C02 Exam Guide, reliability is about the ability of a system to recover from infrastructure or service disruptions.

RTO vs. RPO Visualization

\begin{tikzpicture}
  \draw[->, thick] (0,0) -- (10,0) node[right] {Time};
  \draw[red, ultra thick] (5,-0.5) -- (5,2) node[above] {Disaster Event};
  \draw[blue, <->] (2,1) -- (5,1) node[midway, above] {RPO (Data Loss)};
  \draw[green!60!black, <->] (5,1) -- (8,1) node[midway, above] {RTO (Downtime)};
  \node at (2,-0.5) {Last Backup};
  \node at (8,-0.5) {Service Restored};
\end{tikzpicture}

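| Strategy | RTO (Time) | RPO (Data) | Cost |
| --- | --- | --- | --- |
| Backup & Restore | Hours | Up to 24 hours | $ |
| Pilot Light | Tens of minutes | Tens of minutes | $$ |
| Warm Standby | Minutes | Minutes | $$$ |
| Multi-Site (Active-Active) | Near zero | Near zero | $$$$ |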
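
To make the two metrics concrete: RPO is measured backward from the disaster to the last recoverable copy of the data, and RTO is measured forward from the disaster to restored service. A minimal sketch that computes both from three timestamps in epoch seconds (the function name is illustrative):

```bash
#!/usr/bin/env bash
# Usage: rpo_rto_minutes LAST_BACKUP DISASTER RESTORED   (epoch seconds)
rpo_rto_minutes() {
  local last_backup="$1" disaster="$2" restored="$3"
  local rpo=$(( (disaster - last_backup) / 60 ))   # data written in this window is lost
  local rto=$(( (restored - disaster) / 60 ))      # service is down for this window
  echo "RPO=${rpo}m RTO=${rto}m"
}

# Example: backup at 02:00, disaster at 05:00, service restored at 08:00
#   rpo_rto_minutes 7200 18000 28800   # RPO=180m RTO=180m
```

Each strategy in the table shrinks one or both windows by keeping more infrastructure running ahead of time, which is where the added cost comes from.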
