Mastering SSM Automation for Automated Remediation
Execute SSM Automation runbooks for remediation
Automated remediation is the process of using software tools to identify, investigate, and resolve operational issues without human intervention. In AWS, AWS Systems Manager (SSM) Automation is the primary engine for this "self-healing" capability, allowing you to execute complex workflows called runbooks to fix common configuration issues and system failures.
Learning Objectives
After studying this guide, you should be able to:
- Define the role of SSM Automation documents (runbooks) in operational excellence.
- Distinguish between predefined and custom runbooks.
- Configure EventBridge rules to trigger automated remediation workflows.
- Implement AWS Config auto-remediation using Systems Manager.
- Identify common remediation use cases like instance recovery and security group fixes.
Key Terms & Glossary
- Runbook (Automation Document): A JSON or YAML file that defines the actions Systems Manager performs on your managed instances and other AWS resources.
- Managed Node: Any machine configured for use with Systems Manager (EC2 instances, edge devices, or on-premises servers via Hybrid Activations).
- Remediation: The act of correcting a fault or a non-compliant state (e.g., restarting a stopped service or closing an open port).
- Execution (Manual/Automatic): The method by which a runbook starts: manually via the console or CLI, or automatically in response to events.
- Idempotency: A property of a runbook where running it multiple times leaves the resource in the same end state as running it once, with no additional side effects.
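To make the idempotency definition concrete, here is a minimal sketch in plain Python. The `close_port` helper and the port set are hypothetical stand-ins for a real remediation step (such as revoking a security group rule); the point is only that a second run changes nothing.

```python
from __future__ import annotations


def close_port(open_ports: set[int], port: int) -> set[int]:
    """Return the set of open ports with `port` closed.

    Closing a port that is already closed is a no-op, which is what
    makes this (hypothetical) remediation step idempotent.
    """
    return open_ports - {port}


ports = {22, 80, 3389}
once = close_port(ports, 3389)   # first remediation run
twice = close_port(once, 3389)   # accidental second run

print(sorted(once))
print(once == twice)
```

A runbook built from steps like this can safely be re-triggered by a duplicate event without making things worse.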
The "Big Idea"
In traditional IT, when a server fails, a human receives a page, logs in, and runs a script. In the cloud, we shift to Event-Driven Operations. The "Big Idea" here is to treat operational knowledge as code. By codifying troubleshooting steps into SSM runbooks, you reduce the Mean Time to Repair (MTTR) and reduce the risk of human error during high-pressure incidents.
Formula / Concept Box
| Component | Description | Example |
|---|---|---|
| Trigger | The "When" (Event or Schedule) | CloudWatch Alarm, AWS Config Rule, EventBridge |
| Logic | The "What" (Steps and branching) | SSM Automation Document (YAML/JSON) |
| Target | The "Where" (Affected resources) | Resource ID, Tags, Resource Group |
| Permission | The "How" (IAM capabilities) | SSM Automation Service Role (iam:PassRole) |
Hierarchical Outline
- SSM Automation Overview
  - Automation Documents: Structured as a series of steps (e.g., `aws:executeAwsApi`, `aws:runCommand`).
  - Action Types: Script execution, API calls, branching logic (`aws:branch`), and approvals.
- Triggering Remediation
  - Amazon EventBridge: Routes state changes (e.g., EC2 Instance State Change) to SSM Automation targets.
  - AWS Config: Monitors resource compliance and triggers SSM documents when a resource is "NON_COMPLIANT."
  - Amazon CloudWatch: Alarms on metric thresholds can start runbooks, typically by routing the alarm state change through EventBridge (e.g., high CPU triggers a scale-out or restart).
- Runbook Varieties
  - AWS-Managed (Predefined): Built-in runbooks for common tasks (e.g., `AWS-RestartEC2Instance`, `AWS-PatchInstanceWithRollback`).
  - Custom Runbooks: User-defined documents for specialized application logic.
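As an illustration of the EventBridge trigger in the outline above, the sketch below shows an event pattern that matches EC2 instances entering the `stopped` state. The `matches` function is a deliberately simplified local matcher written for this example (exact-value lists and nested objects only), not EventBridge's real matching engine, and the instance ID is a placeholder.

```python
import json

# Event pattern for an EventBridge rule: EC2 instances that enter "stopped".
EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Toy matcher: every pattern key must be present in the event, and the
    event value must be one of the listed values (or match a nested pattern)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Placeholder event shaped like an EC2 state-change notification.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

In a real setup, the JSON printed here would be supplied as the rule's event pattern, with an SSM Automation runbook configured as the rule's target.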
Visual Anchors
Remediation Workflow
This flowchart illustrates how a system failure moves from detection to resolution.
Logic Structure of a Runbook
This diagram represents the internal flow of an SSM Automation document with branching logic.
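The branching flow can be sketched as a small custom runbook, expressed here as a Python dictionary and serialized to JSON. This is a hypothetical example, not an AWS-managed document: the step names (`GetState`, `ChooseAction`, `StartInstance`, `RebootInstance`) and the "start if stopped, otherwise reboot" logic are assumptions chosen for illustration.

```python
import json

RUNBOOK = {
    "schemaVersion": "0.3",  # schema version used by Automation runbooks
    "description": "Hypothetical sketch: start a stopped instance, otherwise reboot it.",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "InstanceId": {"type": "String"},
        "AutomationAssumeRole": {"type": "String", "default": ""},
    },
    "mainSteps": [
        {   # Step 1: read the current instance state.
            "name": "GetState",
            "action": "aws:executeAwsApi",
            "inputs": {
                "Service": "ec2",
                "Api": "DescribeInstances",
                "InstanceIds": ["{{ InstanceId }}"],
            },
            "outputs": [{
                "Name": "State",
                "Selector": "$.Reservations[0].Instances[0].State.Name",
                "Type": "String",
            }],
        },
        {   # Step 2: branch on the state captured above.
            "name": "ChooseAction",
            "action": "aws:branch",
            "inputs": {
                "Choices": [{
                    "NextStep": "StartInstance",
                    "Variable": "{{ GetState.State }}",
                    "StringEquals": "stopped",
                }],
                "Default": "RebootInstance",
            },
        },
        {   # Branch A: the instance was stopped.
            "name": "StartInstance",
            "action": "aws:executeAwsApi",
            "inputs": {"Service": "ec2", "Api": "StartInstances",
                       "InstanceIds": ["{{ InstanceId }}"]},
            "isEnd": True,
        },
        {   # Branch B: any other state.
            "name": "RebootInstance",
            "action": "aws:executeAwsApi",
            "inputs": {"Service": "ec2", "Api": "RebootInstances",
                       "InstanceIds": ["{{ InstanceId }}"]},
            "isEnd": True,
        },
    ],
}


def branch_targets(runbook: dict) -> set:
    """Collect every step name that an aws:branch step can jump to."""
    targets = set()
    for step in runbook["mainSteps"]:
        if step["action"] == "aws:branch":
            targets.update(c["NextStep"] for c in step["inputs"]["Choices"])
            if "Default" in step["inputs"]:
                targets.add(step["inputs"]["Default"])
    return targets


step_names = {step["name"] for step in RUNBOOK["mainSteps"]}
print(branch_targets(RUNBOOK) <= step_names)  # every jump target is a real step
print(json.dumps(RUNBOOK, indent=2)[:60])
```

Checking that every branch target names an existing step is a cheap local sanity check before uploading a document.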
Definition-Example Pairs
- Step-by-Step Execution: The ability to run runbooks one step at a time for debugging.
  - Example: Running a complex multi-account patching runbook in "Interactive Mode" to verify each account's status before proceeding.
- Assume Role: The IAM role that Systems Manager assumes to perform actions on your behalf.
  - Example: Creating a role with `ec2:StopInstances` permissions so the runbook can remediate a runaway instance.
- Dynamic Target: Using tags to define which resources a runbook should act upon.
  - Example: Targeting all instances with the tag `Environment: Production` for a security patch.
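The dynamic-target idea can be sketched as the request you might pass to the `StartAutomationExecution` API (for example through boto3's `start_automation_execution`). The helper name, role ARN, and concurrency limits below are assumptions for illustration; the code only builds the request and does not call AWS.

```python
def build_tag_targeted_request(document_name: str, tag_key: str,
                               tag_value: str, role_arn: str) -> dict:
    """Build keyword arguments for a tag-targeted Automation execution.

    Hypothetical helper: it assumes the runbook takes an `InstanceId`
    parameter, as AWS-RestartEC2Instance does.
    """
    return {
        "DocumentName": document_name,
        "Parameters": {"AutomationAssumeRole": [role_arn]},
        # The runbook parameter that receives each resolved resource ID.
        "TargetParameterName": "InstanceId",
        # Resolve targets by tag instead of listing instance IDs.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        # Rate controls (assumed values): 10% at a time, stop after one error.
        "MaxConcurrency": "10%",
        "MaxErrors": "1",
    }


request = build_tag_targeted_request(
    "AWS-RestartEC2Instance",
    "Environment",
    "Production",
    "arn:aws:iam::123456789012:role/ExampleAutomationRole",  # placeholder ARN
)
print(request["Targets"])

# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("ssm").start_automation_execution(**request)
```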
Worked Examples
Example 1: Remediating a Public S3 Bucket
Problem: A user accidentally makes an S3 bucket public, violating security policy.
- Detection: AWS Config rule `s3-bucket-public-read-prohibited` identifies the bucket as non-compliant.
- Remediation Setup: Associate the AWS-managed runbook `AWS-DisableS3BucketPublicReadWrite` with the Config rule.
- Execution: AWS Config passes the non-compliant bucket's name to the runbook's `S3BucketName` parameter automatically.
- Result: The runbook calls the S3 `PutPublicAccessBlock` API, securing the bucket in seconds.
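The remediation setup in this example might be expressed as the payload for AWS Config's `PutRemediationConfigurations` API. This is a sketch under assumptions: the role ARN and retry settings are placeholders, and the runbook parameter name `S3BucketName` reflects my understanding of the managed runbook, so verify it against the runbook's parameter list before use.

```python
def build_remediation_config(rule_name: str, role_arn: str) -> dict:
    """Build one remediation configuration linking a Config rule to a runbook."""
    return {
        "ConfigRuleName": rule_name,
        "TargetType": "SSM_DOCUMENT",
        "TargetId": "AWS-DisableS3BucketPublicReadWrite",
        "Parameters": {
            # Fixed value: the role the automation runs as (placeholder ARN).
            "AutomationAssumeRole": {"StaticValue": {"Values": [role_arn]}},
            # Dynamic value: Config substitutes the non-compliant resource's ID,
            # which for an S3 bucket is the bucket name.
            "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
        },
        "Automatic": True,               # remediate without manual approval
        "MaximumAutomaticAttempts": 3,   # assumed retry budget
        "RetryAttemptSeconds": 60,       # assumed retry window
    }


config = build_remediation_config(
    "s3-bucket-public-read-prohibited",
    "arn:aws:iam::123456789012:role/ExampleRemediationRole",  # placeholder ARN
)
print(config["Parameters"]["S3BucketName"])

# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("config").put_remediation_configurations(
#       RemediationConfigurations=[config])
```

The `RESOURCE_ID` placeholder is what lets one remediation configuration serve every bucket the rule flags.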
Example 2: EC2 Status Check Recovery
Problem: An EC2 instance fails its underlying system status check.
- Detection: A CloudWatch Alarm is set for the metric `StatusCheckFailed_System`.
- Action: The alarm action is set to "EC2 Action: Recover this instance."
- Remediation: Behind the scenes, AWS triggers a recovery workflow that migrates the instance to a new physical host while retaining its ID, IP, and metadata.
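The alarm in this example could be defined with a payload like the one below, suitable for CloudWatch's `PutMetricAlarm` API. The helper name, alarm name, period, and evaluation settings are assumptions for illustration; the instance ID and Region are placeholders.

```python
def build_recover_alarm(instance_id: str, region: str) -> dict:
    """Build keyword arguments for an alarm that recovers an EC2 instance
    when its system status check fails."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,              # seconds per datapoint (assumed)
        "EvaluationPeriods": 2,    # two consecutive failures (assumed)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 action: recover the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


alarm = build_recover_alarm("i-0123456789abcdef0", "us-east-1")
print(alarm["AlarmActions"][0])

# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)
```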
Checkpoint Questions
- What is the difference between an SSM Runbook and an SSM Command Document?
- Which AWS service is best suited for triggering a remediation runbook when a specific API call is logged in CloudTrail?
- How does the `aws:branch` action improve the flexibility of a custom runbook?
- Why is it a best practice to use an "Automation Service Role" rather than your own user permissions to run runbooks?
[!TIP] Answer to Q1: Runbooks (Automation) are used for complex, multi-step workflows across many AWS services, while Command Documents are primarily used with Run Command to run scripts directly on a managed node's operating system.
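[!WARNING] Always ensure your SSM Automation service role has the `iam:PassRole` permission for any roles it needs to pass to other services, and that the user or service starting the automation is allowed to pass the service role itself. Otherwise, the automation will fail with an "Access Denied" error during execution.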
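A minimal sketch of an IAM policy statement granting `iam:PassRole` is shown below, built as a Python dictionary. The role ARN is a placeholder, and restricting the pass to `ssm.amazonaws.com` through the `iam:PassedToService` condition key is an assumed hardening choice rather than a requirement stated in this guide.

```python
import json


def build_pass_role_policy(role_arn: str) -> dict:
    """Build a policy document allowing `role_arn` to be passed to Systems Manager."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                # Scope the permission to one role rather than "*".
                "Resource": role_arn,
                # Assumed restriction: only allow passing the role to SSM.
                "Condition": {
                    "StringEquals": {"iam:PassedToService": "ssm.amazonaws.com"}
                },
            }
        ],
    }


policy = build_pass_role_policy(
    "arn:aws:iam::123456789012:role/ExampleAutomationRole"  # placeholder ARN
)
print(json.dumps(policy, indent=2))
```

Scoping `Resource` to a specific role keeps the permission from becoming a general privilege-escalation path.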