Automating AWS Operations: Incident Remediation with Systems Manager and EventBridge
Unit 8: AWS Operational Foundations
This lab provides hands-on experience with the Operational Excellence and Reliability pillars of the AWS Well-Architected Framework. You will configure an automated remediation workflow that detects when an EC2 instance is stopped and restarts it using Amazon EventBridge and AWS Systems Manager (SSM) Automation.
[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for the EC2 resources provisioned.
Prerequisites
- An active AWS Account with Administrator access.
- AWS CLI installed and configured with your credentials.
- A basic understanding of IAM roles and EC2 instances.
- Region: Ensure you are operating in a single region (e.g., us-east-1).
Learning Objectives
- Configure an IAM Role for Systems Manager Automation.
- Implement Event-Driven Remediation using Amazon EventBridge.
- Execute SSM Automation Runbooks to manage resource states.
- Verify automated recovery actions in the AWS Management Console.
Architecture Overview
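This lab follows a closed-loop remediation architecture: EC2 emits a state-change event when the instance stops, an EventBridge rule matches that event, and the rule invokes the SSM Automation runbook AWS-StartEC2Instance, which starts the instance again.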
Step-by-Step Instructions
Step 1: Create an IAM Service Role for SSM
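SSM Automation requires permissions to perform actions (like starting an instance) on your behalf. EventBridge must also be able to assume the role that is passed as the target's RoleArn in Step 3, so the trust policy below allows both service principals.
# 1. Create the trust policy file
# ssm.amazonaws.com lets Automation use the role; events.amazonaws.com lets
# the EventBridge rule assume it when it invokes the runbook.
echo '{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": ["ssm.amazonaws.com", "events.amazonaws.com"] },
"Action": "sts:AssumeRole"
}
]
}' > ssm-trust-policy.json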
# 2. Create the IAM role
aws iam create-role --role-name "BrainyBee-SSM-Automation-Role" --assume-role-policy-document file://ssm-trust-policy.json
# 3. Attach the AmazonSSMAutomationRole managed policy
aws iam attach-role-policy --role-name "BrainyBee-SSM-Automation-Role" --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole
▶ Console alternative
In the IAM console, create a role whose trust relationship matches the trust policy above, attach the AmazonSSMAutomationRole policy, and name it BrainyBee-SSM-Automation-Role.
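Before moving on, it can help to confirm what was actually created. The following is a minimal sketch, not part of the original lab: the function names `show_role_trust` and `show_role_policies` are hypothetical, and they only wrap the standard `aws iam get-role` and `aws iam list-attached-role-policies` commands.

```shell
# Hypothetical helper: print the service principals allowed to assume a role.
show_role_trust() {
  role_name="$1"
  aws iam get-role \
    --role-name "$role_name" \
    --query 'Role.AssumeRolePolicyDocument.Statement[].Principal.Service' \
    --output json
}

# Hypothetical helper: print the names of the managed policies attached to a role.
show_role_policies() {
  role_name="$1"
  aws iam list-attached-role-policies \
    --role-name "$role_name" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text
}

# Example calls (require configured AWS credentials):
# show_role_trust "BrainyBee-SSM-Automation-Role"
# show_role_policies "BrainyBee-SSM-Automation-Role"
```

If the policy list does not include AmazonSSMAutomationRole, re-run the attach command from this step.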
Step 2: Launch a Test EC2 Instance
We need an instance to monitor and remediate. We will use a t3.micro (or t2.micro if t3 is unavailable).
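# Find the most recent standard Amazon Linux 2023 AMI (x86_64).
# Sorting by CreationDate avoids picking an arbitrary (possibly old) image,
# and the name filter excludes the "minimal" variants.
AMI_ID=$(aws ec2 describe-images --owners amazon \
  --filters "Name=name,Values=al2023-ami-2023*-x86_64" "Name=state,Values=available" \
  --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text)
# Launch the instance
aws ec2 run-instances \
  --image-id "$AMI_ID" \
  --count 1 \
  --instance-type t3.micro \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Remediation-Lab-Instance}]'
[!NOTE] Note down the InstanceId from the output; you will need it for testing, verification, and teardown.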
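If the InstanceId scrolls out of view, it can be recovered from the Name tag set at launch. This is a sketch under the assumption that only one instance carries the tag `Remediation-Lab-Instance`; the function name `find_lab_instance` is hypothetical.

```shell
# Hypothetical helper: look up the lab instance ID by its Name tag,
# ignoring instances that are already terminated or shutting down.
find_lab_instance() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=Remediation-Lab-Instance" \
              "Name=instance-state-name,Values=pending,running,stopping,stopped" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}

# Example (requires configured AWS credentials):
# INSTANCE_ID=$(find_lab_instance)
# aws ec2 wait instance-running --instance-ids "$INSTANCE_ID"
```

The `aws ec2 wait instance-running` call blocks until the instance is running, which is a reasonable point to continue with Step 3.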
Step 3: Create the EventBridge Remediation Rule
We will create a rule that triggers when our specific instance enters the "stopped" state. Scoping the pattern to one instance ID matters: without it, the rule would restart every instance that stops in the region.
# 1. Create the rule
# Replace <YOUR_INSTANCE_ID> with the InstanceId from Step 2
aws events put-rule \
  --name "AutoRestartStoppedInstance" \
  --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"state":["stopped"],"instance-id":["<YOUR_INSTANCE_ID>"]}}' \
  --state ENABLED
# 2. Add the SSM Automation target
# Replace <YOUR_REGION> and <YOUR_ACCOUNT_ID> accordingly
aws events put-targets --rule "AutoRestartStoppedInstance" --targets '[{
"Id": "1",
"Arn": "arn:aws:ssm:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:automation-definition/AWS-StartEC2Instance:$DEFAULT",
"RoleArn": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBee-SSM-Automation-Role",
"InputTransformer": {
"InputPathsMap": {"instance":"$.detail.instance-id"},
"InputTemplate": "{\"InstanceId\": [\"<instance>\"]}"
}
}]'
▶ Console alternative
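In the EventBridge console, create a rule whose event pattern matches the EC2 event type EC2 Instance State-change Notification with the state stopped. For the target, choose Systems Manager Automation and the document AWS-StartEC2Instance.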
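The two ARNs in the target definition follow a fixed shape, so they can be assembled from the account ID and region instead of being edited by hand. A minimal sketch follows; the function names are hypothetical, and the role name default is the one used in this lab.

```shell
# Hypothetical helper: build the automation-definition ARN used as the target Arn.
# Arguments: region, account ID, optional document name.
automation_definition_arn() {
  region="$1"
  account_id="$2"
  document="${3:-AWS-StartEC2Instance}"
  # $DEFAULT is a literal version alias, so the format string stays single-quoted.
  printf 'arn:aws:ssm:%s:%s:automation-definition/%s:$DEFAULT\n' \
    "$region" "$account_id" "$document"
}

# Hypothetical helper: build the IAM role ARN used as the target RoleArn.
# Arguments: account ID, optional role name.
automation_role_arn() {
  account_id="$1"
  role_name="${2:-BrainyBee-SSM-Automation-Role}"
  printf 'arn:aws:iam::%s:role/%s\n' "$account_id" "$role_name"
}

# Example (requires configured AWS credentials):
# ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# REGION=$(aws configure get region)
# automation_definition_arn "$REGION" "$ACCOUNT_ID"
# automation_role_arn "$ACCOUNT_ID"
```

After the target is registered, `aws events list-targets-by-rule --rule "AutoRestartStoppedInstance"` shows the stored Arn and RoleArn for comparison.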
Step 4: Test the Remediation
Now, manually stop the instance to see the automation in action.
# Stop the instance
aws ec2 stop-instances --instance-ids <YOUR_INSTANCE_ID>
Wait approximately 60 seconds for the EventBridge rule to trigger and the SSM Automation to execute.
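Instead of waiting a fixed interval, the state transition can be watched with a small polling loop. This is a sketch only: `wait_for_state` is a hypothetical helper, and the default of 20 attempts at 10-second intervals is an arbitrary choice rather than a value from the lab.

```shell
# Hypothetical helper: poll until the instance reaches the wanted state.
# Arguments: instance ID, target state, max attempts (default 20),
# delay between attempts in seconds (default 10).
# Returns 0 once the state matches, 1 if the attempts run out.
wait_for_state() {
  instance_id="$1"
  wanted="$2"
  attempts="${3:-20}"
  delay="${4:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(aws ec2 describe-instances \
      --instance-ids "$instance_id" \
      --query 'Reservations[0].Instances[0].State.Name' \
      --output text)
    echo "attempt $((i + 1)): $state"
    if [ "$state" = "$wanted" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (requires configured AWS credentials):
# wait_for_state "<YOUR_INSTANCE_ID>" stopped && wait_for_state "<YOUR_INSTANCE_ID>" running
```

Printing every observed state makes the stopping, stopped, pending, running sequence visible, which the built-in `aws ec2 wait` commands do not show.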
Checkpoints
| Checkpoint | Action | Expected Result |
|---|---|---|
| Verification 1 | Run aws ec2 describe-instances --instance-ids <ID> | State should transition from stopping to stopped and then back to pending/running automatically. |
| Verification 2 | Check SSM Executions | Navigate to Systems Manager > Automation. You should see a successful execution of AWS-StartEC2Instance. |
| Verification 3 | EventBridge Metrics | Check CloudWatch Metrics for TriggeredRules under the AWS/Events namespace. |
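Verification 2 can also be done from the CLI. The sketch below lists recent executions of the runbook with `aws ssm describe-automation-executions`; the function name `list_start_executions` is hypothetical, and the limit of five items is an arbitrary choice.

```shell
# Hypothetical helper: list recent AWS-StartEC2Instance automation executions
# as rows of execution ID, status, and start time.
list_start_executions() {
  aws ssm describe-automation-executions \
    --filters "Key=DocumentNamePrefix,Values=AWS-StartEC2Instance" \
    --max-items 5 \
    --query 'AutomationExecutionMetadataList[].[AutomationExecutionId,AutomationExecutionStatus,ExecutionStartTime]' \
    --output text
}

# Example (requires configured AWS credentials):
# list_start_executions
```

An execution with status Success and a start time shortly after the stop command indicates that the rule invoked the runbook.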
Troubleshooting
| Error / Issue | Possible Cause | Fix |
|---|---|---|
| Automation Fails | IAM Role missing permissions or trust | Ensure the role has AmazonSSMAutomationRole attached and that its trust policy allows the services that assume it (ssm.amazonaws.com, plus events.amazonaws.com for the EventBridge target). |
| Rule Not Triggering | Incorrect Event Pattern | Check if the EventBridge rule pattern correctly matches the stopped state and the aws.ec2 source. |
| Permission Denied | CLI user lacks iam:PassRole | Ensure your IAM user has permission to pass the role to the EventBridge target. |
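For the "Rule Not Triggering" case, a pattern can be checked offline against a sample event with `aws events test-event-pattern`. This is a sketch under stated assumptions: the function name `check_pattern` is hypothetical, and the event's id, account, time, and region are placeholder values, not data from a real event.

```shell
# Hypothetical helper: test an event pattern against a synthetic
# "instance stopped" event for the given instance ID.
# Arguments: event pattern (JSON string), instance ID.
check_pattern() {
  pattern="$1"
  instance_id="$2"
  # Placeholder envelope fields; only detail-type, source, and detail matter here.
  event=$(printf '{"id":"00000000-0000-0000-0000-000000000000","account":"123456789012","source":"aws.ec2","time":"2024-01-01T00:00:00Z","region":"us-east-1","resources":[],"detail-type":"EC2 Instance State-change Notification","detail":{"instance-id":"%s","state":"stopped"}}' "$instance_id")
  aws events test-event-pattern \
    --event-pattern "$pattern" \
    --event "$event" \
    --query Result \
    --output text
}

# Example (requires configured AWS credentials):
# PATTERN=$(aws events describe-rule --name "AutoRestartStoppedInstance" --query EventPattern --output text)
# check_pattern "$PATTERN" "<YOUR_INSTANCE_ID>"
```

A false result means the stored pattern would not match a stop event for that instance, which points at the pattern rather than at IAM.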
The "Big Idea": Operational Foundations
This lab demonstrates the Operational Excellence pillar by treating operations as code. The diagram below traces the flow from detection through analysis to response, showing how AWS services provide a foundation for reliability.
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, rounded corners, minimum width=3cm, minimum height=1cm, align=center}]
\node (detect) {Detection\\(EventBridge)};
\node (analyze) [right of=detect, xshift=2.5cm] {Analysis\\(Rule Matching)};
\node (respond) [right of=analyze, xshift=2.5cm] {Response\\(SSM Automation)};
\draw[->, thick] (detect) -- (analyze);
\draw[->, thick] (analyze) -- (respond);
\node (loop) [below of=analyze, yshift=-0.5cm, draw=none] {\textit{Continuous Monitoring \& Remediation}};
\end{tikzpicture}
Teardown
To avoid unexpected costs, delete all resources created during this lab.
# 1. Delete the EventBridge Rule and Targets
aws events remove-targets --rule "AutoRestartStoppedInstance" --ids "1"
aws events delete-rule --name "AutoRestartStoppedInstance"
# 2. Terminate the EC2 Instance
aws ec2 terminate-instances --instance-ids <YOUR_INSTANCE_ID>
# 3. Delete the IAM Role
aws iam detach-role-policy --role-name "BrainyBee-SSM-Automation-Role" --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole
aws iam delete-role --role-name "BrainyBee-SSM-Automation-Role"