Lab: Building a Resilient Multi-AZ Architecture on AWS
Design reliable and resilient architectures
This hands-on lab focuses on the Reliability Pillar of the AWS Well-Architected Framework. You will design and implement a self-healing architecture that leverages Multi-AZ deployments for both compute and database layers to meet high-availability requirements.
Prerequisites
- AWS CLI: Installed and configured with `aws configure`.
- IAM Permissions: AdministratorAccess or PowerUserAccess to manage VPC, EC2, RDS, and IAM.
- Network: A default VPC in your region with at least two public subnets in different Availability Zones.
- Region: Use `us-east-1` (N. Virginia) for consistency with this lab guide.
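The subnet prerequisite can be checked from the CLI before starting. The following is a minimal sketch, not part of the original lab: `pick_two_subnets` is a hypothetical helper name, and it assumes the default VPC still has its default (public) subnets. It prints one subnet ID for each of two Availability Zones, which can stand in for `<SUBNET_ID_1>` and `<SUBNET_ID_2>` in later steps.

```shell
# Hypothetical helper: print two subnet IDs from the default VPC,
# each in a different Availability Zone.
pick_two_subnets() {
  local vpc_id
  vpc_id=$(aws ec2 describe-vpcs \
    --filters "Name=is-default,Values=true" \
    --query "Vpcs[0].VpcId" --output text) || return 1
  # Each output row is "<AZ><TAB><SubnetId>"; keep the first subnet seen per AZ.
  aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=${vpc_id}" \
    --query "Subnets[].[AvailabilityZone,SubnetId]" --output text |
    sort | awk '!seen[$1]++ { print $2 }' | head -n 2
}
```

If the function prints fewer than two lines, the VPC does not span two AZs and the Multi-AZ steps will fail.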
[!WARNING] Remember to run the teardown commands at the end to avoid ongoing charges for RDS and EC2 instances.
Learning Objectives
- Configure a Multi-AZ RDS instance for automated failover.
- Implement an Auto Scaling Group (ASG) across multiple Availability Zones.
- Simulate infrastructure failure to verify self-healing capabilities.
- Understand the relationship between RTO/RPO and architectural choices.
Architecture Overview
Step-by-Step Instructions
Step 1: Create a DB Subnet Group
RDS requires a subnet group that spans at least two Availability Zones to enable Multi-AZ.
# Replace <SUBNET_ID_1> and <SUBNET_ID_2> with your actual subnet IDs
aws rds create-db-subnet-group \
--db-subnet-group-name "brainybee-db-group" \
--db-subnet-group-description "Subnet group for resilient lab" \
--subnet-ids "<SUBNET_ID_1>" "<SUBNET_ID_2>"
Console alternative: Navigate to RDS > Subnet groups > Create DB subnet group. Select your VPC and add subnets from at least two different AZs.
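To confirm that the subnet group really spans two Availability Zones, you can count the distinct AZs it reports. This is a sketch under the assumption that the group name from Step 1 is used; `subnet_group_az_count` is a hypothetical helper name.

```shell
# Hypothetical helper: count the distinct AZs covered by a DB subnet group.
subnet_group_az_count() {
  aws rds describe-db-subnet-groups \
    --db-subnet-group-name "$1" \
    --query "DBSubnetGroups[0].Subnets[].SubnetAvailabilityZone.Name" \
    --output text | tr '\t' '\n' | sort -u | grep -c .
}
```

Running `subnet_group_az_count brainybee-db-group` should print 2 or more before you continue to Step 2.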
Step 2: Provision a Multi-AZ RDS Instance
We will deploy a MySQL instance with high availability enabled. This creates a synchronous standby in a different AZ.
aws rds create-db-instance \
--db-instance-identifier "brainybee-resilient-db" \
--db-instance-class "db.t3.micro" \
--engine "mysql" \
--master-username "admin" \
--master-user-password "BrainyBee123!" \
--allocated-storage 20 \
--db-subnet-group-name "brainybee-db-group" \
--multi-az \
--no-publicly-accessible
[!NOTE] The `--multi-az` flag is the key differentiator here. If the primary AZ fails, RDS promotes the standby and automatically updates the endpoint's DNS record to point to it, so applications keep using the same endpoint name.
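The failover behavior described in the note can be exercised on purpose with a forced-failover reboot. The sketch below is an optional addition, not a step from the original lab: `rds_az` and `rds_failover_drill` are hypothetical helper names, and the drill assumes the instance was created with Multi-AZ enabled. Both waits can take several minutes.

```shell
# Hypothetical helper: print the AZ that currently hosts the primary.
rds_az() {
  aws rds describe-db-instances --db-instance-identifier "$1" \
    --query "DBInstances[0].AvailabilityZone" --output text
}

# Hypothetical helper: force a failover and report the primary AZ before and after.
rds_failover_drill() {
  local db_id="$1" before after
  aws rds wait db-instance-available --db-instance-identifier "$db_id" || return 1
  before=$(rds_az "$db_id")
  # --force-failover reboots through a failover to the standby (Multi-AZ only).
  aws rds reboot-db-instance --db-instance-identifier "$db_id" \
    --force-failover >/dev/null || return 1
  aws rds wait db-instance-available --db-instance-identifier "$db_id" || return 1
  after=$(rds_az "$db_id")
  echo "primary AZ: ${before} -> ${after}"
  # Succeed only if the primary actually moved to another AZ.
  [ "$before" != "$after" ]
}
```

Call it as `rds_failover_drill brainybee-resilient-db`. The endpoint name stays the same throughout; only the AZ behind it changes.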
Step 3: Launch an Auto Scaling Group (ASG)
First, create a Launch Template for your web servers.
# Create a simple launch template
# Replace <AMI_ID> with a current AMI ID for your region (AMI IDs are region-specific)
aws ec2 create-launch-template \
--launch-template-name "resilient-web-template" \
--launch-template-data '{"ImageId":"<AMI_ID>","InstanceType":"t2.micro"}'
# Create the ASG spanning two subnets
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name "resilient-asg" \
--launch-template "LaunchTemplateName=resilient-web-template" \
--min-size 2 --max-size 4 --desired-capacity 2 \
--vpc-zone-identifier "<SUBNET_ID_1>,<SUBNET_ID_2>"
Console alternative: Navigate to EC2 > Auto Scaling Groups > Create Auto Scaling group. Define a launch template first, then select your VPC and two subnets during the ASG wizard.
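The ImageId in the launch template must exist in the region you deploy to, because AMI IDs are region-specific. One way to avoid a stale ID is to resolve it at run time from the public Systems Manager parameter that AWS publishes for Amazon Linux. This is a sketch only: choosing Amazon Linux 2023 is an assumption (the lab does not name an operating system), and `latest_al2023_ami` and `launch_template_data` are hypothetical helper names.

```shell
# Hypothetical helper: resolve the current Amazon Linux 2023 AMI ID
# for the CLI's configured region (assumed OS choice, not from the lab).
latest_al2023_ami() {
  aws ssm get-parameter \
    --name "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64" \
    --query "Parameter.Value" --output text
}

# Hypothetical helper: build the JSON passed to --launch-template-data.
launch_template_data() {
  printf '{"ImageId":"%s","InstanceType":"%s"}\n' "$1" "$2"
}
```

The result can then be passed to the launch template command as `--launch-template-data "$(launch_template_data "$(latest_al2023_ami)" t2.micro)"`.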
Step 4: Simulate a Failure
To test reliability, we will terminate one instance and observe the ASG behavior.
# Find an instance ID
INSTANCE_ID=$(aws ec2 describe-instances --filters "Name=tag:aws:autoscaling:groupName,Values=resilient-asg" "Name=instance-state-name,Values=running" --query "Reservations[0].Instances[0].InstanceId" --output text)
# Terminate the instance
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
Checkpoints
- ASG Recovery: Run `aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names resilient-asg`. Within 2-3 minutes, you should see a new instance being launched to replace the terminated one.
- RDS Status: Run `aws rds describe-db-instances --db-instance-identifier brainybee-resilient-db`. Verify that `MultiAZ` is set to `true` and the status is `available`.
- Cross-AZ Distribution: Ensure your EC2 instances are running in different Availability Zones (e.g., one in `us-east-1a` and one in `us-east-1b`).
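The ASG recovery and cross-AZ checkpoints can also be polled from a script instead of re-running the describe command by hand. The following is a minimal sketch with hypothetical helper names (`asg_inservice_azs`, `wait_for_asg_recovery`). The retry count and delay defaults are arbitrary assumptions, not values from the lab.

```shell
# Hypothetical helper: list the AZ of every InService instance in the ASG, one per line.
asg_inservice_azs() {
  aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names "$1" \
    --query "AutoScalingGroups[0].Instances[?LifecycleState=='InService'].AvailabilityZone" \
    --output text | tr '\t' '\n'
}

# Hypothetical helper: poll until at least <want> instances are InService.
# Arguments: <asg-name> <want> [tries] [delay-seconds]
wait_for_asg_recovery() {
  local asg="$1" want="$2" tries="${3:-30}" delay="${4:-10}" have
  while [ "$tries" -gt 0 ]; do
    have=$(asg_inservice_azs "$asg" | grep -c .)
    if [ "$have" -ge "$want" ]; then
      echo "recovered: ${have} InService"
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  echo "timed out with ${have} InService" >&2
  return 1
}
```

After terminating an instance, `wait_for_asg_recovery resilient-asg 2` blocks until the replacement is in service. Piping `asg_inservice_azs resilient-asg` through `sort -u` shows whether the instances are spread across more than one AZ.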
Teardown
To avoid costs, delete the resources created in this lab.
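# 1. Delete Auto Scaling Group (--force-delete also terminates its instances)
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name "resilient-asg" --force-delete
# 2. Delete the launch template
aws ec2 delete-launch-template --launch-template-name "resilient-web-template"
# 3. Delete RDS Instance
aws rds delete-db-instance --db-instance-identifier "brainybee-resilient-db" --skip-final-snapshot
# 4. Wait until the instance is gone, then delete the DB Subnet Group
aws rds wait db-instance-deleted --db-instance-identifier "brainybee-resilient-db"
aws rds delete-db-subnet-group --db-subnet-group-name "brainybee-db-group"
Troubleshooting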
| Error | Cause | Fix |
|---|---|---|
| InvalidParameterValue for RDS | Only 1 subnet provided. | Ensure the DB Subnet Group contains at least two subnets in different AZs. |
| ASG not launching instances | IAM permissions or AMI ID issues. | Check the Activity history on the ASG's Activity tab in the console to see the failure reason. |
| Cannot connect to RDS | Security Group rules. | Ensure the RDS Security Group allows inbound traffic on port 3306 from your EC2 Security Group. |
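Two of the fixes in the table have direct CLI equivalents. The sketch below uses hypothetical helper names (`allow_mysql_from_web`, `recent_asg_activities`). The security group IDs are placeholders you would look up yourself, since the lab does not create dedicated security groups.

```shell
# Hypothetical helper: allow MySQL (TCP 3306) into the RDS security group
# from instances that carry the web tier's security group.
# Arguments: <rds-security-group-id> <web-security-group-id>
allow_mysql_from_web() {
  local rds_sg="$1" web_sg="$2"
  aws ec2 authorize-security-group-ingress \
    --group-id "$rds_sg" --protocol tcp --port 3306 --source-group "$web_sg"
}

# Hypothetical helper: show recent scaling activities with their status and reason.
recent_asg_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" \
    --query "Activities[:5].[StartTime,StatusCode,StatusMessage]" --output text
}
```

A failed launch shows up in `recent_asg_activities resilient-asg` with a `Failed` status code and a message naming the cause, such as an invalid AMI ID.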
Challenge
Pilot Light Implementation: How would you modify this architecture to lower cost in exchange for a higher RTO?
- Goal: Create an Amazon Machine Image (AMI) of your web server and a database snapshot, and store both in a secondary region (`us-west-2`). Write a script that provisions the ASG and RDS instance from these assets only when a disaster occurs.
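As a starting point for the challenge, the asset-staging half can be sketched as below; the recovery script itself is left to you. `stage_pilot_light_assets` is a hypothetical helper name and all arguments are placeholders. The sketch assumes an unencrypted database, as created in Step 2; an encrypted snapshot would also need a KMS key in the target region.

```shell
# Hypothetical helper: stage pilot-light assets in a secondary region.
# Arguments: <source-region> <target-region> <account-id> <db-id> <ami-id> <label>
stage_pilot_light_assets() {
  local src="$1" dst="$2" account="$3" db_id="$4" ami_id="$5" label="$6"
  # 1. Snapshot the database in the source region and wait for it to finish.
  aws rds create-db-snapshot --region "$src" \
    --db-instance-identifier "$db_id" --db-snapshot-identifier "$label" || return 1
  aws rds wait db-snapshot-available --region "$src" \
    --db-snapshot-identifier "$label" || return 1
  # 2. Copy the snapshot; a cross-region copy references the source by ARN.
  aws rds copy-db-snapshot --region "$dst" --source-region "$src" \
    --source-db-snapshot-identifier "arn:aws:rds:${src}:${account}:snapshot:${label}" \
    --target-db-snapshot-identifier "$label" || return 1
  # 3. Copy the web server AMI into the target region.
  aws ec2 copy-image --region "$dst" --source-region "$src" \
    --source-image-id "$ami_id" --name "$label"
}
```

Because nothing runs in `us-west-2` until a disaster, the standing cost is limited to snapshot and AMI storage, which is the trade the challenge asks about.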
Cost Estimate
| Service | Estimated Hourly Cost | Free Tier Eligible? |
|---|---|---|
| EC2 (2x t2.micro) | $0.0232 | Yes (750 hrs/mo) |
| RDS (db.t3.micro Multi-AZ) | $0.0360 | Yes (Single AZ only) |
| ALB | $0.0225 | Yes (Limited) |
| Total | ~$0.08 / hour | - |
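The total row is the sum of the hourly rates, and the cost of a lab session is that sum multiplied by the hours the resources stay up. A small sketch of the arithmetic, using the rates from the table as inputs (`lab_cost` is a hypothetical helper name):

```shell
# Hypothetical helper: multiply the sum of hourly rates by a duration.
# Arguments: <hours> <rate> [<rate> ...]
lab_cost() {
  local hours="$1"
  shift
  printf '%s\n' "$@" |
    awk -v h="$hours" '{ sum += $1 } END { printf "%.4f\n", sum * h }'
}
```

For example, `lab_cost 1 0.0232 0.0360 0.0225` prints `0.0817`, and forgetting teardown for a full day, `lab_cost 24 0.0232 0.0360 0.0225`, prints `1.9608`.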
Concept Review
As discussed in the AWS SAP-C02 Exam Guide, reliability is about the ability of a system to recover from infrastructure or service disruptions.
RTO vs. RPO Visualization
\begin{tikzpicture}
\draw[->, thick] (0,0) -- (10,0) node[right] {Time};
\draw[red, ultra thick] (5, -0.5) -- (5, 2) node[above] {Disaster Event};
\draw[blue, <->] (2, 1) -- (5, 1) node[midway, above] {RPO (Data Loss)};
\draw[green!60!black, <->] (5, 1) -- (8, 1) node[midway, above] {RTO (Downtime)};
\node at (2, -0.5) {Last Backup};
\node at (8, -0.5) {Service Restored};
\end{tikzpicture}
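The timeline reduces to two subtractions: RPO is the time between the last backup and the disaster, and RTO is the time between the disaster and restored service. A minimal sketch, using the figure's coordinates as sample inputs (`rpo_rto` is a hypothetical helper name; any consistent time unit works, such as epoch seconds):

```shell
# Hypothetical helper: derive RPO and RTO from three points on the timeline.
# Arguments: <last-backup> <disaster> <service-restored>, all in the same unit.
rpo_rto() {
  local last_backup="$1" disaster="$2" restored="$3"
  echo "RPO=$((disaster - last_backup)) RTO=$((restored - disaster))"
}
```

With the figure's values, `rpo_rto 2 5 8` prints `RPO=3 RTO=3`: three units of data lost and three units of downtime.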
| Strategy | RTO (Time) | RPO (Data) | Cost |
|---|---|---|---|
| Backup & Restore | Hours | 24 Hours | $ |
| Pilot Light | Minutes | Minutes | $$ |
| Warm Standby | Seconds | Seconds | $$$ |
| Multi-Site (Active-Active) | Near zero | Near zero | $$$$ |