Lab: Building a Resilient Multi-AZ Architecture on AWS

This hands-on lab focuses on the Reliability Pillar of the AWS Well-Architected Framework. You will design and implement a self-healing architecture that leverages Multi-AZ deployments for both compute and database layers to meet high-availability requirements.

Prerequisites

AWS CLI: Installed and configured with aws configure.
IAM Permissions: AdministratorAccess, or PowerUserAccess, which covers the VPC, EC2, and RDS actions used in this lab but cannot manage IAM users or roles.
Network: A default VPC in your region with at least two public subnets in different Availability Zones.
Region: Use us-east-1 (N. Virginia) for consistency with this lab guide.
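
Before starting, it can help to confirm that the CLI is pointed at the expected account and region. The following is a minimal sketch, not part of the original lab: the helper names check_region and who_am_i are hypothetical, and it assumes the region was set through aws configure rather than an environment variable.

```shell
# Expected region for this lab (from the prerequisites above)
LAB_REGION="us-east-1"

# Hypothetical helper: succeed only if the CLI's configured region matches
check_region() {
    configured=$(aws configure get region)
    if [ "$configured" = "$LAB_REGION" ]; then
        echo "OK: region is $configured"
    else
        echo "WARN: region is '$configured', expected $LAB_REGION" >&2
        return 1
    fi
}

# Hypothetical helper: print the account ID and ARN the CLI will act as
who_am_i() {
    aws sts get-caller-identity --query "[Account,Arn]" --output text
}
```

Running check_region && who_am_i before Step 1 should surface a wrong profile or region before any billable resource is created.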

[!WARNING] Remember to run the teardown commands at the end to avoid ongoing charges for RDS and EC2 instances.

Learning Objectives

Configure a Multi-AZ RDS instance for automated failover.
Implement an Auto Scaling Group (ASG) across multiple Availability Zones.
Simulate infrastructure failure to verify self-healing capabilities.
Understand the relationship between RTO/RPO and architectural choices.

Architecture Overview

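[Architecture diagram: an Auto Scaling group of EC2 web servers spanning two Availability Zones, backed by a Multi-AZ RDS MySQL instance with a standby in the second AZ.]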

Step-by-Step Instructions

Step 1: Create a DB Subnet Group

RDS requires a subnet group that spans at least two Availability Zones to enable Multi-AZ.

bash

# Replace <SUBNET_ID_1> and <SUBNET_ID_2> with your actual subnet IDs
aws rds create-db-subnet-group \
    --db-subnet-group-name "brainybee-db-group" \
    --db-subnet-group-description "Subnet group for resilient lab" \
    --subnet-ids "<SUBNET_ID_1>" "<SUBNET_ID_2>"

Console alternative

Navigate to RDS > Subnet groups > Create DB subnet group. Select your VPC and add subnets from at least two different AZs.
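
If you do not know your subnet IDs, the sketch below shows one way to look them up. It assumes the default VPC from the prerequisites, so it filters on default subnets only; the helper names list_default_subnets and pick_two_subnets are hypothetical and not part of the lab.

```shell
# Hypothetical helper: print "<subnet-id> <az>" for each default subnet
list_default_subnets() {
    aws ec2 describe-subnets \
        --filters "Name=default-for-az,Values=true" \
        --query "Subnets[].[SubnetId,AvailabilityZone]" \
        --output text
}

# Hypothetical helper: keep the first subnet seen in each AZ, then take two,
# so the result always spans two different Availability Zones
pick_two_subnets() {
    awk '!seen[$2]++ { print $1 }' | head -n 2
}
```

Calling list_default_subnets | pick_two_subnets should print two subnet IDs that can stand in for <SUBNET_ID_1> and <SUBNET_ID_2>.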

Step 2: Provision a Multi-AZ RDS Instance

We will deploy a MySQL instance with high availability enabled. This creates a standby in a different AZ that RDS keeps current through synchronous replication.

bash

aws rds create-db-instance \
    --db-instance-identifier "brainybee-resilient-db" \
    --db-instance-class "db.t3.micro" \
    --engine "mysql" \
    --master-username "admin" \
    --master-user-password 'BrainyBee123!' \
    --allocated-storage 20 \
    --db-subnet-group-name "brainybee-db-group" \
    --multi-az \
    --no-publicly-accessible

[!NOTE] The --multi-az flag is the key differentiator here. If the primary instance or its AZ fails, RDS promotes the standby and updates the endpoint's DNS record to point to it, so applications can reconnect without changing their connection string.
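
The failover described in the note can be exercised by hand once the instance is available. This is a sketch under assumptions, not a step from the original lab: the helper names and the FAILOVER_SETTLE_SECONDS delay are hypothetical, and the delay exists only because the instance can still report available for a moment after the reboot request.

```shell
DB_ID="brainybee-resilient-db"
# Assumed pause so the status has time to leave "available" before we wait on it
FAILOVER_SETTLE_SECONDS="${FAILOVER_SETTLE_SECONDS:-30}"

# Print "<status> <multi-az> <primary-az> <standby-az>" for the instance
db_summary() {
    aws rds describe-db-instances \
        --db-instance-identifier "$DB_ID" \
        --query "DBInstances[0].[DBInstanceStatus,MultiAZ,AvailabilityZone,SecondaryAvailabilityZone]" \
        --output text
}

# Reboot with failover (causes a brief outage), then wait for recovery
force_failover() {
    aws rds reboot-db-instance \
        --db-instance-identifier "$DB_ID" --force-failover >/dev/null
    sleep "$FAILOVER_SETTLE_SECONDS"
    aws rds wait db-instance-available --db-instance-identifier "$DB_ID"
}
```

Running db_summary before and after force_failover should show the primary and standby AZs swapped while the endpoint name stays the same.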

Step 3: Launch an Auto Scaling Group (ASG)

First, create a Launch Template for your web servers.

bash

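# Create a simple launch template.
# AMI IDs are region-specific: confirm this ID exists in your region,
# or substitute a current Amazon Linux AMI ID before running.
aws ec2 create-launch-template \
    --launch-template-name "resilient-web-template" \
    --launch-template-data '{"ImageId":"ami-0c55b159cbfafe1f0","InstanceType":"t2.micro"}'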

# Create the ASG spanning two subnets
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name "resilient-asg" \
    --launch-template "LaunchTemplateName=resilient-web-template" \
    --min-size 2 --max-size 4 --desired-capacity 2 \
    --vpc-zone-identifier "<SUBNET_ID_1>,<SUBNET_ID_2>"

Console alternative

Navigate to EC2 > Auto Scaling Groups > Create Auto Scaling group. Define a launch template first, then select your VPC and two subnets during the ASG wizard.
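
Because AMI IDs differ per region, one option is to look up a current image instead of hard-coding one. The sketch below assumes Amazon Linux 2023 is acceptable for the web servers and reads its ID from the public SSM parameter that AWS publishes; the helper names are hypothetical and the lab itself does not prescribe this approach.

```shell
# Hypothetical helper: print the latest Amazon Linux 2023 x86_64 AMI ID
# for the CLI's current region, read from AWS's public SSM parameter
latest_al2023_ami() {
    aws ssm get-parameter \
        --name "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64" \
        --query "Parameter.Value" --output text
}

# Hypothetical helper: build the launch template JSON
# $1 = AMI ID, $2 = instance type (defaults to t2.micro as in the lab)
launch_template_data() {
    printf '{"ImageId":"%s","InstanceType":"%s"}' "$1" "${2:-t2.micro}"
}
```

The result can be passed to the lab's command as --launch-template-data "$(launch_template_data "$(latest_al2023_ami)")".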

Step 4: Simulate a Failure

To test reliability, we will terminate one instance and observe the ASG behavior.

bash

# Find the ID of a running instance in the ASG (skips already-terminated ones)
INSTANCE_ID=$(aws ec2 describe-instances \
    --filters "Name=tag:aws:autoscaling:groupName,Values=resilient-asg" \
              "Name=instance-state-name,Values=running" \
    --query "Reservations[0].Instances[0].InstanceId" --output text)

# Terminate the instance
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
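
Rather than re-running the describe command by hand, the recovery can be watched with a small polling loop. This is a minimal sketch: the function names, the 15-second interval, and the 20-attempt limit are assumptions chosen for illustration, not values from the lab.

```shell
ASG_NAME="resilient-asg"
POLL_SECONDS="${POLL_SECONDS:-15}"   # assumed polling interval
MAX_POLLS="${MAX_POLLS:-20}"         # assumed upper bound (about 5 minutes)

# Print the number of instances the ASG currently reports as InService
in_service_count() {
    aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "$ASG_NAME" \
        --query "length(AutoScalingGroups[0].Instances[?LifecycleState=='InService'])" \
        --output text
}

# Poll until at least $1 instances are InService; return 1 on timeout
wait_for_capacity() {
    want=$1
    polls=0
    while [ "$polls" -lt "$MAX_POLLS" ]; do
        have=$(in_service_count)
        echo "InService: $have / $want"
        [ "$have" -ge "$want" ] && return 0
        polls=$((polls + 1))
        sleep "$POLL_SECONDS"
    done
    return 1
}
```

Calling wait_for_capacity 2 after the termination should print the count dropping and then returning to 2. The ASG may briefly still report the old count right after the terminate call, so an immediate success is worth re-checking a minute later.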

Checkpoints

ASG Recovery: Run aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names resilient-asg. Within 2-3 minutes, you should see a new instance being launched to replace the terminated one.
RDS Status: Run aws rds describe-db-instances --db-instance-identifier brainybee-resilient-db. Verify that MultiAZ is set to true and the status is available.
Cross-AZ Distribution: Ensure your EC2 instances are running in different Availability Zones (e.g., one in us-east-1a and one in us-east-1b).
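
The cross-AZ checkpoint can also be verified from the command line. The sketch below is an illustration with hypothetical helper names; it counts how many distinct Availability Zones the ASG's instances occupy.

```shell
# Hypothetical helper: print the AZ of every instance in the ASG
asg_instance_azs() {
    aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "resilient-asg" \
        --query "AutoScalingGroups[0].Instances[].AvailabilityZone" \
        --output text
}

# Hypothetical helper: count distinct AZ names read from standard input
distinct_az_count() {
    tr -s '\t ' '\n\n' | sort -u | grep -c .
}
```

With two healthy instances, asg_instance_azs | distinct_az_count should print 2; a result of 1 means both instances landed in the same AZ.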

Teardown

To avoid costs, delete the resources created in this lab.

bash

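# 1. Delete Auto Scaling Group (--force-delete also terminates its instances)
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name "resilient-asg" --force-delete

# 2. Delete the launch template created in Step 3
aws ec2 delete-launch-template --launch-template-name "resilient-web-template"

# 3. Delete RDS Instance and wait until it is gone
aws rds delete-db-instance --db-instance-identifier "brainybee-resilient-db" --skip-final-snapshot
aws rds wait db-instance-deleted --db-instance-identifier "brainybee-resilient-db"

# 4. Delete DB Subnet Group (this fails while the instance still exists)
aws rds delete-db-subnet-group --db-subnet-group-name "brainybee-db-group"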

Troubleshooting

Error	Cause	Fix
`InvalidParameterValue` for RDS	Only 1 subnet provided.	Ensure the DB Subnet Group contains at least two subnets in different AZs.
ASG not launching instances	IAM permissions or AMI ID issues.	Check the `Activity` tab in the ASG console to see the failure reason.
Cannot connect to RDS	Security Group rules.	Ensure your EC2 Security Group is allowed to connect to RDS on port 3306.
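
The last two fixes in the table can be approached from the CLI as well. This is a sketch with hypothetical helper names; the security group IDs are placeholders you would supply, since the lab does not create dedicated security groups.

```shell
# Hypothetical helper: show the five most recent scaling activities,
# including the status message that explains a failed launch
recent_asg_activities() {
    aws autoscaling describe-scaling-activities \
        --auto-scaling-group-name "resilient-asg" \
        --query "Activities[:5].[StartTime,StatusCode,StatusMessage]" \
        --output text
}

# Hypothetical helper: allow MySQL (TCP 3306) into the DB security group ($1)
# from instances that belong to the web security group ($2)
allow_mysql_from_web() {
    aws ec2 authorize-security-group-ingress \
        --group-id "$1" \
        --protocol tcp \
        --port 3306 \
        --source-group "$2"
}
```

For example, allow_mysql_from_web <DB_SG_ID> <WEB_SG_ID> opens port 3306 only to the web tier instead of to a CIDR range.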

Challenge

Pilot Light Implementation: How would you modify this architecture to lower its cost in exchange for a higher RTO?

Goal: Create an Amazon Machine Image (AMI) of your web server and store a database snapshot in a secondary region (us-west-2). Write a script that can provision the ASG and RDS instance from these assets only when a disaster occurs.
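
As a possible starting point for the challenge, the sketch below copies a database snapshot and an AMI to the secondary region. It is one approach under assumptions, not a reference solution: the function names and the "-dr" suffix are hypothetical, and it covers only the asset copy, not the recovery script itself.

```shell
PRIMARY_REGION="us-east-1"
DR_REGION="us-west-2"

# Build the snapshot ARN that a cross-region copy needs as its source
# $1 = account ID, $2 = snapshot name
snapshot_arn() {
    printf 'arn:aws:rds:%s:%s:snapshot:%s' "$PRIMARY_REGION" "$1" "$2"
}

# Snapshot the lab database, then copy the snapshot to the DR region
# $1 = snapshot name (the copy is named "<name>-dr", an assumed convention)
copy_db_snapshot_to_dr() {
    account=$(aws sts get-caller-identity --query Account --output text)
    aws rds create-db-snapshot --region "$PRIMARY_REGION" \
        --db-instance-identifier "brainybee-resilient-db" \
        --db-snapshot-identifier "$1"
    aws rds wait db-snapshot-available --region "$PRIMARY_REGION" \
        --db-snapshot-identifier "$1"
    aws rds copy-db-snapshot --region "$DR_REGION" \
        --source-region "$PRIMARY_REGION" \
        --source-db-snapshot-identifier "$(snapshot_arn "$account" "$1")" \
        --target-db-snapshot-identifier "$1-dr"
}

# Copy a web server AMI to the DR region
# $1 = source AMI ID, $2 = name for the copy
copy_ami_to_dr() {
    aws ec2 copy-image --region "$DR_REGION" \
        --source-region "$PRIMARY_REGION" \
        --source-image-id "$1" --name "$2"
}
```

The copied snapshot and AMI continue to incur storage charges in us-west-2 until deleted, so they belong in your teardown as well.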

Cost Estimate

Service	Estimated Hourly Cost	Free Tier Eligible?
EC2 (2x t2.micro)	$0.0232	Yes (750 hrs/mo)
RDS (db.t3.micro Multi-AZ)	$0.0360	Yes (Single AZ only)
ALB	$0.0225	Yes (Limited)
Total	~$0.08 / hour	-
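
To turn the hourly figures into a budget for a lab session, the arithmetic can be scripted. This is a small sketch using the rates exactly as listed in the table; the function name lab_cost is hypothetical, and the ALB line is included only because the table lists it, even though the CLI steps above do not create one.

```shell
# Hourly rates copied from the table above (USD)
EC2_HOURLY=0.0232   # 2x t2.micro
RDS_HOURLY=0.0360   # db.t3.micro Multi-AZ
ALB_HOURLY=0.0225   # ALB

# Print the estimated cost of running the lab for $1 hours
lab_cost() {
    awk -v h="$1" -v e="$EC2_HOURLY" -v r="$RDS_HOURLY" -v a="$ALB_HOURLY" \
        'BEGIN { printf "%.4f\n", (e + r + a) * h }'
}
```

For example, lab_cost 1 prints 0.0817 and lab_cost 2 prints 0.1634, which matches the table's rounded total of about $0.08 per hour.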

Concept Review

As discussed in the AWS SAP-C02 Exam Guide, reliability is about the ability of a system to recover from infrastructure or service disruptions.

RTO vs. RPO Visualization

Strategy	RTO (Time)	RPO (Data)	Cost
Backup & Restore	Hours	Up to 24 hours	$
Pilot Light	Minutes	Minutes	$$
Warm Standby	Seconds	Seconds	$$$
Multi-Site (Active-Active)	Near zero	Near zero	$$$$
