Hands-On Lab845 words

Automating AWS Operations: Incident Remediation with Systems Manager and EventBridge

Unit 8: AWS Operational Foundations

Automating AWS Operations: Incident Remediation with Systems Manager and EventBridge

This lab provides hands-on experience in implementing Operational Excellence and Reliability pillars of the AWS Well-Architected Framework. You will configure an automated remediation workflow that detects when an EC2 instance is stopped and automatically restarts it using Amazon EventBridge and AWS Systems Manager (SSM) Automation.

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for the EC2 resources provisioned.

Prerequisites

  • An active AWS Account with Administrator access.
  • AWS CLI installed and configured with your credentials.
  • A basic understanding of IAM roles and EC2 instances.
  • Region: Ensure you are operating in a single region (e.g., us-east-1).

Learning Objectives

  • Configure an IAM Role for Systems Manager Automation.
  • Implement Event-Driven Remediation using Amazon EventBridge.
  • Execute SSM Automation Runbooks to manage resource states.
  • Verify automated recovery actions in the AWS Management Console.

Architecture Overview

This lab follows a closed-loop remediation architecture:

Loading Diagram...

Step-by-Step Instructions

Step 1: Create an IAM Service Role for SSM

SSM Automation requires permissions to perform actions (like starting an instance) on your behalf.

bash
# 1. Create the trust policy file echo '{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "ssm.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }' > ssm-trust-policy.json # 2. Create the IAM role aws iam create-role --role-name "BrainyBee-SSM-Automation-Role" --assume-role-policy-document file://ssm-trust-policy.json # 3. Attach the AmazonSSMAutomationRole managed policy aws iam attach-role-policy --role-name "BrainyBee-SSM-Automation-Role" --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole
Console alternative

Navigate to

IAM > Roles > Create Role

. Select

AWS Service

, then choose

Systems Manager

. Under use case, select

Allow SSM to call AWS services on your behalf

. Attach the

AmazonSSMAutomationRole

policy and name it

BrainyBee-SSM-Automation-Role

.

Step 2: Launch a Test EC2 Instance

We need an instance to monitor and remediate. We will use a t3.micro (or t2.micro if t3 is unavailable).

bash
# Find a standard Amazon Linux 2023 AMI AMI_ID=$(aws ec2 describe-images --owners amazon --filters "Name=name,Values=al2023-ami-*-x86_64" --query 'Images[0].ImageId' --output text) # Launch the instance aws ec2 run-instances \ --image-id $AMI_ID \ --count 1 \ --instance-type t3.micro \ --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=Remediation-Lab-Instance}]'

[!NOTE] Note down the InstanceId from the output; you will need it for verification.

Step 3: Create the EventBridge Remediation Rule

We will create a rule that triggers when our specific instance enters the "stopped" state.

bash
# 1. Create the rule aws events put-rule \ --name "AutoRestartStoppedInstance" \ --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"state":["stopped"]}}' \ --state ENABLED # 2. Add the SSM Automation target # Replace <YOUR_INSTANCE_ID> and <YOUR_ACCOUNT_ID> accordingly aws events put-targets --rule "AutoRestartStoppedInstance" --targets '[{ "Id": "1", "Arn": "arn:aws:ssm:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:automation-definition/AWS-StartEC2Instance:$DEFAULT", "RoleArn": "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBee-SSM-Automation-Role", "InputTransformer": { "InputPathsMap": {"instance":"$.detail.instance-id"}, "InputTemplate": "{\"InstanceId\": [\"<instance>\"]}" } }]'
Console alternative

Navigate to

Amazon EventBridge > Rules > Create rule

. Define the pattern as

AWS service: EC2

,

Event type: EC2 Instance State-change Notification

, and select

Specific state(s): stopped

. For the target, choose

Systems Manager Automation

and document

AWS-StartEC2Instance

.

Step 4: Test the Remediation

Now, manually stop the instance to see the automation in action.

bash
# Stop the instance aws ec2 stop-instances --instance-ids <YOUR_INSTANCE_ID>

Wait approximately 60 seconds for the EventBridge rule to trigger and the SSM Automation to execute.

Checkpoints

CheckpointActionExpected Result
Verification 1Run aws ec2 describe-instances --instance-ids <ID>State should transition from stopping to stopped and then back to pending/running automatically.
Verification 2Check SSM ExecutionsNavigate to Systems Manager > Automation. You should see a successful execution of AWS-StartEC2Instance.
Verification 3EventBridge MetricsCheck CloudWatch Metrics for TriggeredRules under the AWS/Events namespace.

Troubleshooting

Error / IssuePossible CauseFix
Automation FailsIAM Role missing permissionsEnsure the role has AmazonSSMAutomationRole and the trust policy is correct for ssm.amazonaws.com.
Rule Not TriggeringIncorrect Event PatternCheck if the EventBridge rule pattern correctly matches the stopped state and the aws.ec2 source.
Permission DeniedCLI user lacks iam:PassRoleEnsure your IAM user has permission to pass the role to the EventBridge target.

The "Big Idea": Operational Foundations

This lab demonstrates the Operational Excellence pillar by treating operations as code. By using TikZ to visualize the flow of operational health, we see how AWS services provide a foundation for reliability.

\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, rounded corners, minimum width=3cm, minimum height=1cm, align=center}] \node (detect) {Detection$EventBridge)}; \node (analyze) [right of=detect, xshift=2.5cm] {Analysis$Rule Matching)}; \node (respond) [right of=analyze, xshift=2.5cm] {Response$SSM Automation)};

code
\draw[->, thick] (detect) -- (analyze); \draw[->, thick] (analyze) -- (respond); \node (loop) [below of=analyze, yshift=-0.5cm, draw=none] {\textit{Continuous Monitoring & Remediation}};

\end{tikzpicture}

Teardown

To avoid unexpected costs, delete all resources created during this lab.

bash
# 1. Delete the EventBridge Rule and Targets aws events remove-targets --rule "AutoRestartStoppedInstance" --ids "1" aws events delete-rule --name "AutoRestartStoppedInstance" # 2. Terminate the EC2 Instance aws ec2 terminate-instances --instance-ids <YOUR_INSTANCE_ID> # 3. Delete the IAM Role aws iam detach-role-policy --role-name "BrainyBee-SSM-Automation-Role" --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole aws iam delete-role --role-name "BrainyBee-SSM-Automation-Role"

Ready to study AWS Certified CloudOps Engineer - Associate (SOA-C03)?

Practice tests, flashcards, and all study notes — free, no sign-up needed.

Start Studying — Free