# Lab: Designing and Testing a Reliable Multi-AZ Web Architecture

*Design a strategy to meet reliability requirements*
This lab guides you through designing a resilient strategy to meet reliability requirements, specifically focusing on the AWS Well-Architected Framework principles of horizontal scaling, automatic recovery from failure, and testing recovery procedures.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges for the Application Load Balancer and EC2 instances.
## Prerequisites
- An active AWS Account.
- AWS CLI configured with administrator-level permissions.
- A default VPC in your region with at least two public subnets.
- Basic knowledge of EC2, VPC, and Load Balancing.
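The steps below use placeholder values such as `<YOUR_VPC_ID>` and `<SUBNET_ID_AZ1>`. A quick way to look them up before starting (the `--query` expressions here are illustrative, not part of the lab):

```shell
# Find the default VPC's ID
aws ec2 describe-vpcs \
  --filters Name=isDefault,Values=true \
  --query 'Vpcs[0].VpcId' \
  --output text

# List that VPC's subnets with their Availability Zones
aws ec2 describe-subnets \
  --filters Name=vpc-id,Values=<YOUR_VPC_ID> \
  --query 'Subnets[].[SubnetId,AvailabilityZone]' \
  --output text
```

Pick two subnets in different Availability Zones for the multi-AZ steps.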
## Learning Objectives
- Deploy a Multi-AZ Application Load Balancer (ALB) to eliminate single points of failure.
- Configure an Auto Scaling Group (ASG) with health checks to enable self-healing.
- Test recovery procedures by simulating an instance failure and observing automatic replacement.
- Implement infrastructure as code (via CLI) to ensure repeatable, automated change management.
## Architecture Overview
We will build a highly available web tier that spans two Availability Zones (AZs). The Load Balancer will distribute traffic, while the Auto Scaling Group ensures the desired number of healthy instances is maintained.
## Step-by-Step Instructions

### Step 1: Create a Security Group
We need a security group that allows HTTP traffic (Port 80) from the internet to our load balancer and instances.
```bash
# Create the Security Group
aws ec2 create-security-group \
  --group-name brainybee-lab-sg \
  --description "Allow HTTP traffic" \
  --vpc-id <YOUR_VPC_ID>

# Authorize Inbound HTTP
aws ec2 authorize-security-group-ingress \
  --group-name brainybee-lab-sg \
  --protocol tcp \
  --port 80 \
  --cidr 0.0.0.0/0
```

**Console alternative:** Navigate to EC2 → Security Groups and create the group there. Add an Inbound rule for HTTP (80) with source 0.0.0.0/0.
### Step 2: Create an Application Load Balancer
The ALB provides the entry point for our application and performs health checks to ensure reliability.
```bash
# Create the ALB
aws elbv2 create-load-balancer \
  --name brainybee-lab-alb \
  --subnets <SUBNET_ID_AZ1> <SUBNET_ID_AZ2> \
  --security-groups <SG_ID_FROM_STEP_1>
```

### Step 3: Create a Launch Template
A launch template defines the exact environment (AMI, instance type, networking) for our workloads, so every instance the Auto Scaling Group launches is identical and the deployment is repeatable.
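The checkpoint later in this lab assumes a web server is running on each instance. If your AMI doesn't ship one, the launch template can carry a user-data script; a minimal sketch, assuming an Amazon Linux AMI (the page content is illustrative):

```shell
#!/bin/bash
# User-data script: install Apache and serve a page identifying the instance
yum install -y httpd
echo "<h1>Served from $(hostname -f)</h1>" > /var/www/html/index.html
systemctl enable --now httpd
```

To use it, add a `UserData` field containing the base64-encoded script to the `--launch-template-data` JSON in the command below.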
```bash
# Create the Launch Template
aws ec2 create-launch-template \
  --launch-template-name ReliabilityLabTemplate \
  --version-description version1 \
  --launch-template-data '{"NetworkInterfaces":[{"AssociatePublicIpAddress":true,"DeviceIndex":0,"Groups":["<SG_ID_FROM_STEP_1>"]}],"ImageId":"ami-0c55b159cbfafe1f0","InstanceType":"t2.micro"}'
```

Note: AMI IDs are Region-specific. Replace `ami-0c55b159cbfafe1f0` with a current Amazon Linux AMI ID for your Region.

### Step 4: Create the Auto Scaling Group (ASG)
This is the core of our "Automatic Recovery" strategy. The ASG will maintain a minimum of 2 instances across 2 AZs.
```bash
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name brainybee-lab-asg \
  --launch-template LaunchTemplateName=ReliabilityLabTemplate \
  --min-size 2 \
  --max-size 4 \
  --desired-capacity 2 \
  --vpc-zone-identifier "<SUBNET_ID_AZ1>,<SUBNET_ID_AZ2>"
```

## Checkpoints
- **Deployment Verification:** Run `aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names brainybee-lab-asg`. Ensure `DesiredCapacity` is 2 and the `Instances` list has 2 entries in the `InService` state.
- **DNS Reachability:** Copy the DNS Name of your ALB from the `describe-load-balancers` output and paste it into your browser. You should see the default web page (ensure your AMI has a web server running, or use a UserData script to install one).
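Note: as the steps above are written, nothing connects the ALB to the ASG's instances, so the DNS checkpoint may return a 503. A sketch of the missing wiring, with an illustrative target group name (substitute your own VPC ID and the ARNs returned by the first two commands):

```shell
# Capture the ALB's ARN for the listener step
aws elbv2 describe-load-balancers \
  --names brainybee-lab-alb \
  --query 'LoadBalancers[0].LoadBalancerArn' \
  --output text

# Create a target group for the web instances
aws elbv2 create-target-group \
  --name brainybee-lab-tg \
  --protocol HTTP \
  --port 80 \
  --vpc-id <YOUR_VPC_ID>

# Forward all port-80 traffic from the ALB to the target group
aws elbv2 create-listener \
  --load-balancer-arn <ALB_ARN> \
  --protocol HTTP \
  --port 80 \
  --default-actions Type=forward,TargetGroupArn=<TARGET_GROUP_ARN>

# Attach the ASG so current and replacement instances auto-register
aws autoscaling attach-load-balancer-target-groups \
  --auto-scaling-group-name brainybee-lab-asg \
  --target-group-arns <TARGET_GROUP_ARN>
```

If you add these, also delete the target group (`aws elbv2 delete-target-group --target-group-arn <TARGET_GROUP_ARN>`) after the ALB during teardown.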
## Testing Recovery (Failure Simulation)
To meet the reliability requirement of "testing recovery procedures," we will manually terminate an instance.
- **Identify Instance:** `aws ec2 describe-instances --filters "Name=tag:aws:autoscaling:groupName,Values=brainybee-lab-asg"`
- **Terminate Instance:** Pick one `InstanceId` and run: `aws ec2 terminate-instances --instance-ids <INSTANCE_ID>`
- **Observe:** Within 1-2 minutes, the ASG will detect the failure (via EC2 status checks) and automatically launch a replacement instance to maintain the desired capacity of 2.
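To watch the replacement happen, a simple polling loop works (the interval and output fields are my own choices):

```shell
# Print each instance's lifecycle state every 15 seconds (Ctrl+C to stop)
while true; do
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names brainybee-lab-asg \
    --query 'AutoScalingGroups[0].Instances[].[InstanceId,LifecycleState,HealthStatus]' \
    --output table
  sleep 15
done
```

You should see the terminated instance disappear and a new one move from `Pending` to `InService`.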
## Clean-Up / Teardown
To avoid costs, you must delete the resources in this specific order:
```bash
# 1. Delete the ASG (This terminates the instances)
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name brainybee-lab-asg --force-delete

# 2. Delete the ALB
aws elbv2 delete-load-balancer --load-balancer-arn <ALB_ARN>

# 3. Delete the Launch Template
aws ec2 delete-launch-template --launch-template-name ReliabilityLabTemplate

# 4. Delete the Security Group
aws ec2 delete-security-group --group-id <SG_ID>
```

## Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| Instance in wrong AZ | Subnets provided to ASG were in a single AZ. | Recreate the ASG with subnets from two different AZs. |
| ALB 503 Service Unavailable | Target Group not yet healthy or registered. | Wait 2-3 minutes for health checks to pass; verify the Security Group allows ALB-to-instance traffic. |
| CLI: Permission Denied | IAM user or role lacks EC2/Auto Scaling permissions. | Attach the AmazonEC2FullAccess and AutoScalingFullAccess policies to your IAM user or role. |
## Stretch Challenge
Implement Dynamic Scaling: Instead of static capacity, use the AWS CLI to attach a Target Tracking scaling policy to your ASG that maintains a target average CPU utilization of 50%. This addresses the principle of scaling to match demand. (AWS also offers a separate Predictive Scaling policy type that forecasts demand from historical patterns.)
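A hedged sketch of such a policy via the CLI (the policy name is illustrative):

```shell
# Keep average CPU across the ASG near 50% by adding/removing instances
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name brainybee-lab-asg \
  --policy-name brainybee-cpu-target-50 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":50.0}'
```

Target tracking creates the required CloudWatch alarms for you; generate CPU load on an instance to watch the group scale out.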
## Cost Estimate
| Service | Usage | Estimated Cost (Monthly/Pro-rated) |
|---|---|---|
| EC2 t2.micro | 2 Instances for 1 hour | $0.00 (Free Tier) or ~$0.02 |
| Application Load Balancer | 1 ALB for 1 hour | ~$0.025 |
| Data Transfer | Minimal | $0.00 |
| **Total** | | <$0.10 |
## Concept Review
This lab implemented several reliability pillars from the AWS SAP-C02 guide:
- **Horizontal Scaling:** We used an ASG and ALB to distribute load across multiple instances rather than relying on one large instance.
- **Self-Healing:** By setting `min-size` and `desired-capacity`, the ASG acts as a control loop that automatically recovers from instance-level failures.
- **Foundation Requirements:** We leveraged the AWS Global Infrastructure (Multi-AZ) to protect against data center-level outages.