Lab: Building Self-Healing Infrastructure for Operational Excellence
Determine a strategy to improve overall operational excellence
Operational Excellence is one of the six pillars of the AWS Well-Architected Framework. In this lab, you will move from manual monitoring to automated remediation: AWS Config detects configuration drift (specifically, missing tags), and AWS Systems Manager (SSM) fixes the issue without human intervention. This reflects the "Operations as Code" and "Refine Operations Procedures Frequently" design principles.
Prerequisites
- AWS Account: You must have an active AWS account with permissions to create IAM Roles, EC2 instances, AWS Config rules, and S3 buckets.
- AWS CLI: Installed and configured with AdministratorAccess credentials.
- Region Selection: This lab is designed for us-east-1 (N. Virginia), though it can be adapted for any region.
Learning Objectives
- Establish a Configuration Recorder using AWS Config to monitor resource state.
- Implement Managed Config Rules to identify non-compliant resources.
- Develop an SSM Automation Document workflow for automated remediation.
- Verify the "Self-Healing" cycle by intentionally creating non-compliant resources.
Architecture Overview
This architecture demonstrates a closed-loop system where detection automatically triggers correction.
Step-by-Step Instructions
Step 1: Create an S3 Bucket for AWS Config
AWS Config requires an S3 bucket to store configuration history files.
# Generate a unique bucket name
BUCKET_NAME=brainybee-config-$(date +%s)
aws s3 mb "s3://$BUCKET_NAME"
Console alternative: In the S3 console, create a bucket with a unique name and keep the default settings.
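AWS Config can only deliver history files if it is allowed to write to the bucket. Depending on the IAM role the recorder uses, the bucket may need a policy for the `config.amazonaws.com` service principal. The sketch below prints such a policy; the helper name `config_bucket_policy` and the output file name are illustrative, and the statements should be checked against the current AWS Config documentation before use.

```bash
# Hypothetical helper: prints a bucket policy that lets AWS Config inspect the
# bucket and deliver objects under AWSLogs/<account-id>/Config/.
config_bucket_policy() {
  local bucket="$1" account_id="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSConfigBucketPermissionsCheck",
      "Effect": "Allow",
      "Principal": {"Service": "config.amazonaws.com"},
      "Action": ["s3:GetBucketAcl", "s3:ListBucket"],
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Sid": "AWSConfigBucketDelivery",
      "Effect": "Allow",
      "Principal": {"Service": "config.amazonaws.com"},
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::${bucket}/AWSLogs/${account_id}/Config/*",
      "Condition": {"StringEquals": {"s3:x-amz-acl": "bucket-owner-full-control"}}
    }
  ]
}
EOF
}

# Usage (requires credentials):
# ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# config_bucket_policy "$BUCKET_NAME" "$ACCOUNT_ID" > config-bucket-policy.json
# aws s3api put-bucket-policy --bucket "$BUCKET_NAME" --policy file://config-bucket-policy.json
```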
Step 2: Initialize the AWS Config Recorder
You must enable the recorder to start tracking resource changes.
# Check if a recorder already exists
aws configservice describe-configuration-recorders
# If none exists, create one (requires an IAM role, simplified here for brevity)
# Note: In a production environment, use a specific service-linked role.
Console alternative: In the AWS Config console, set up the recorder, choose to record all resources in the region, and select the S3 bucket created in Step 1.
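The recorder creation that the step leaves out for brevity could look roughly like the following. This is a sketch under assumptions: the recorder and delivery channel are both named `default`, the Config service-linked role is used, and the function name `setup_config_recorder` is illustrative.

```bash
# Sketch: create and start a recorder named "default" (the name is an assumption).
# Arguments: the bucket from Step 1 and the 12-digit account ID.
setup_config_recorder() {
  local bucket="$1" account_id="$2"
  local role_arn="arn:aws:iam::${account_id}:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig"

  # Create the service-linked role; an "already exists" error is ignored.
  aws iam create-service-linked-role --aws-service-name config.amazonaws.com || true

  # Record all supported resource types in the region.
  aws configservice put-configuration-recorder \
    --configuration-recorder "name=default,roleARN=${role_arn}" \
    --recording-group allSupported=true

  # Deliver configuration history to the bucket, then start recording.
  aws configservice put-delivery-channel \
    --delivery-channel "name=default,s3BucketName=${bucket}"
  aws configservice start-configuration-recorder \
    --configuration-recorder-name default
}

# Usage (requires credentials):
# setup_config_recorder "$BUCKET_NAME" "$(aws sts get-caller-identity --query Account --output text)"
```

The delivery channel is created before the recorder is started because starting a recorder without a delivery channel is rejected by the service.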
Step 3: Deploy a Non-Compliant EC2 Instance
Create a small instance without any tags to serve as our "problem" resource.
aws ec2 run-instances \
  --image-id ami-0c101f26f147fa7fd \
  --count 1 \
  --instance-type t2.micro \
  --region us-east-1
[!TIP] The AMI ID above is for Amazon Linux 2023 in us-east-1. If you are in a different region, find the local AMI ID first.
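One way to find the regional AMI ID is the public SSM parameter that AWS publishes for Amazon Linux 2023. The sketch below resolves it, launches the untagged instance, and prints only the instance ID so it can be reused in later steps. The function name `launch_untagged_instance` is illustrative, and the parameter path assumes the x86_64 default-kernel image.

```bash
# Sketch: resolve the current Amazon Linux 2023 AMI for a region through the
# public SSM parameter, then launch one untagged t2.micro and print its ID.
launch_untagged_instance() {
  local region="${1:-us-east-1}"
  local ami_id
  ami_id=$(aws ssm get-parameters \
    --names /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
    --query 'Parameters[0].Value' --output text --region "$region") || return 1

  aws ec2 run-instances \
    --image-id "$ami_id" \
    --count 1 \
    --instance-type t2.micro \
    --region "$region" \
    --query 'Instances[0].InstanceId' --output text
}

# Usage (requires credentials):
# INSTANCE_ID=$(launch_untagged_instance us-east-1)
```

Capturing the ID in `INSTANCE_ID` avoids copying it by hand into the verification and teardown commands.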
Step 4: Create the Config Rule and Remediation
We will use the managed rule required-tags to flag EC2 instances that lack a CostCenter tag. Step 5 then attaches an SSM Automation document to the rule as its remediation action.
# Create the Config Rule, scoped to EC2 instances
aws configservice put-config-rule \
  --config-rule '{
    "ConfigRuleName": "check-ec2-tags",
    "Scope": {"ComplianceResourceTypes": ["AWS::EC2::Instance"]},
    "Source": {"Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS"},
    "InputParameters": "{\"tag1Key\":\"CostCenter\"}"
  }'
Step 5: Configure Automated Remediation
We will link the rule to the SSM document AWS-CreateTags.
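Console instructions (recommended for visual linking)
1. Open the check-ec2-tags rule in the AWS Config console.
2. Open the rule's remediation settings.
3. Choose automatic remediation.
4. Remediation action: AWS-CreateTags.
5. Resource ID parameter: select the document parameter that should receive the ID of the non-compliant resource.
6. Parameters: supply the tag to apply, that is, the CostCenter key and its value.
7. Save the remediation configuration.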
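The same link can be made from the CLI with `put-remediation-configurations`. The sketch below only builds the JSON payload. The function name `remediation_config` is illustrative, and the name of the document parameter that receives the resource ID is passed in as an argument because it depends on the document. Check the document's parameter list in Systems Manager, and add any tag parameters it expects as further `StaticValue` entries.

```bash
# Sketch: print a remediation configuration for put-remediation-configurations.
# Arguments: rule name, SSM document name, automation role ARN, and the name of
# the document parameter that receives the resource ID (an assumption to verify).
remediation_config() {
  local rule="$1" document="$2" role_arn="$3" resource_param="$4"
  cat <<EOF
[
  {
    "ConfigRuleName": "${rule}",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "${document}",
    "Automatic": true,
    "MaximumAutomaticAttempts": 3,
    "RetryAttemptSeconds": 60,
    "Parameters": {
      "AutomationAssumeRole": {"StaticValue": {"Values": ["${role_arn}"]}},
      "${resource_param}": {"ResourceValue": {"Value": "RESOURCE_ID"}}
    }
  }
]
EOF
}

# Usage (requires credentials; the last two arguments are placeholders):
# remediation_config check-ec2-tags AWS-CreateTags "<AUTOMATION_ROLE_ARN>" "<RESOURCE_ID_PARAMETER>" > remediation.json
# aws configservice put-remediation-configurations --remediation-configurations file://remediation.json
```

`RESOURCE_ID` is the token AWS Config replaces with the ID of the non-compliant resource at execution time.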
Checkpoints
- Compliance Status: In the Config console, check whether check-ec2-tags shows "Non-compliant" for your new instance.
- Remediation Execution: Navigate to Systems Manager > Automation. You should see an execution for AWS-CreateTags that succeeded.
- Final Verification: Run the following CLI command. You should see the CostCenter tag attached to your instance.
aws ec2 describe-tags --filters "Name=resource-id,Values=<YOUR_INSTANCE_ID>"
Concept Review
| Feature | Description | Operational Excellence Value |
|---|---|---|
| AWS Config | Continuous monitoring and assessment service. | Provides visibility and audit trails for "Check" phase. |
| SSM Automation | Executes common maintenance and deployment tasks. | Reduces human error through "Operations as Code." |
| Remediation | Automatic trigger of actions based on compliance drift. | Minimizes Mean Time to Repair (MTTR). |
Troubleshooting
| Issue | Likely Cause | Solution |
|---|---|---|
| Rule stays "Compliant" | Recorder is not active. | Ensure Configuration Recorder is status ON. |
| Remediation Fails | Missing IAM Permissions. | Ensure the SSM service role has ec2:CreateTags permissions. |
| Resource not found | Regional mismatch. | Ensure Config, SSM, and EC2 are all in the same region. |
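The checks in the table can also be run from the CLI. The following is a minimal sketch of three read-only queries, one per row; the function name `diagnose_self_healing` is illustrative, and the defaults assume the rule name and region used in this lab.

```bash
# Sketch: three read-only checks that mirror the troubleshooting table.
diagnose_self_healing() {
  local rule="${1:-check-ec2-tags}" region="${2:-us-east-1}"

  # 1. Is the recorder actually recording?
  aws configservice describe-configuration-recorder-status \
    --query 'ConfigurationRecordersStatus[].{name:name,recording:recording}' \
    --region "$region"

  # 2. Which resources does the rule currently flag as non-compliant?
  aws configservice get-compliance-details-by-config-rule \
    --config-rule-name "$rule" --compliance-types NON_COMPLIANT \
    --region "$region"

  # 3. Which Automation executions ran in this region, and with what status?
  aws ssm describe-automation-executions \
    --query 'AutomationExecutionMetadataList[].{doc:DocumentName,status:AutomationExecutionStatus}' \
    --region "$region"
}

# Usage (requires credentials):
# diagnose_self_healing check-ec2-tags us-east-1
```

Passing the same region to all three calls also guards against the regional mismatch listed in the last row.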
Stretch Challenge
Instead of just tagging the instance, modify the remediation to Stop any EC2 instance that is launched without a Department tag. Use the SSM document AWS-StopEC2Instance. This enforces strict governance policies common in enterprise environments.
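One possible starting point for the challenge is sketched below. The rule name `require-department-tag` and the function name are hypothetical, the retry settings are arbitrary, and the role ARN is a placeholder for an automation role that is allowed to stop instances. The sketch passes the non-compliant resource ID to the `InstanceId` parameter of AWS-StopEC2Instance.

```bash
# Sketch: a required-tags rule for "Department" plus a stop-instance remediation.
RULE_JSON='{
  "ConfigRuleName": "require-department-tag",
  "Scope": {"ComplianceResourceTypes": ["AWS::EC2::Instance"]},
  "Source": {"Owner": "AWS", "SourceIdentifier": "REQUIRED_TAGS"},
  "InputParameters": "{\"tag1Key\":\"Department\"}"
}'

apply_stop_remediation() {
  local role_arn="$1"
  local remediation
  remediation=$(cat <<EOF
[
  {
    "ConfigRuleName": "require-department-tag",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "AWS-StopEC2Instance",
    "Automatic": true,
    "MaximumAutomaticAttempts": 3,
    "RetryAttemptSeconds": 60,
    "Parameters": {
      "AutomationAssumeRole": {"StaticValue": {"Values": ["${role_arn}"]}},
      "InstanceId": {"ResourceValue": {"Value": "RESOURCE_ID"}}
    }
  }
]
EOF
)
  # Create the rule first, then attach the remediation to it.
  aws configservice put-config-rule --config-rule "$RULE_JSON"
  aws configservice put-remediation-configurations \
    --remediation-configurations "$remediation"
}

# Usage (requires credentials):
# apply_stop_remediation "<AUTOMATION_ROLE_ARN>"
```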
Cost Estimate
[!IMPORTANT] Remember to run the teardown commands to avoid ongoing charges.
- AWS Config: $0.003 per configuration item recorded. Estimated: <$0.05.
- EC2: t2.micro is Free Tier eligible. Non-free: ~$0.0116/hr. Estimated: <$0.02.
- S3: Negligible for this volume of data. Estimated: $0.00.
- Total: Under $0.10 USD for 30 minutes of lab time.
Clean-Up / Teardown
- Terminate the EC2 Instance:
  aws ec2 terminate-instances --instance-ids <YOUR_INSTANCE_ID>
- Delete the Config Rule:
  aws configservice delete-config-rule --config-rule-name check-ec2-tags
- Delete the S3 Bucket:
  aws s3 rb "s3://$BUCKET_NAME" --force
- Stop the Configuration Recorder (Optional): If you won't use Config again soon, stop the recorder to avoid costs from other resource changes in your account.
[!WARNING] Failure to terminate the EC2 instance may result in hourly charges if you have exceeded your Free Tier limit.
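Two teardown actions have no command in the list above: detaching the remediation from the rule and stopping the recorder. A minimal sketch follows, assuming the rule name from Step 4 and a recorder named `default` (check `aws configservice describe-configuration-recorders` for the actual name). The function name is illustrative.

```bash
# Sketch: optional teardown steps for the remediation and the recorder.
teardown_config_extras() {
  local rule="${1:-check-ec2-tags}" recorder="${2:-default}"

  # Remove the remediation action attached to the rule, if one was configured.
  aws configservice delete-remediation-configuration --config-rule-name "$rule"

  # Stop recording so other resource changes no longer create configuration items.
  aws configservice stop-configuration-recorder --configuration-recorder-name "$recorder"
}

# Usage (requires credentials):
# teardown_config_extras check-ec2-tags default
```

Running this before deleting the Config rule is the safer order, since a rule with an attached remediation may be reported as still in use.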