Mastering SSM Automation for Automated Remediation

Execute SSM Automation runbooks for remediation

Automated remediation is the process of using software tools to identify, investigate, and resolve operational issues without human intervention. In AWS, AWS Systems Manager (SSM) Automation is the primary engine for this "self-healing" capability, allowing you to execute complex workflows called runbooks to fix common configuration issues and system failures.


Learning Objectives

After studying this guide, you should be able to:

  • Define the role of SSM Automation documents (runbooks) in operational excellence.
  • Distinguish between predefined and custom runbooks.
  • Configure EventBridge rules to trigger automated remediation workflows.
  • Implement AWS Config auto-remediation using Systems Manager.
  • Identify common remediation use cases like instance recovery and security group fixes.

Key Terms & Glossary

  • Runbook (Automation Document): A JSON or YAML file that defines the actions Systems Manager performs on your managed instances and other AWS resources.
  • Managed Node: Any machine configured for use with Systems Manager (EC2 instances, edge devices, or on-premises servers via Hybrid Activations).
  • Remediation: The act of correcting a fault or a non-compliant state (e.g., restarting a stopped service or closing an open port).
  • Execution Mode (Manual/Automatic): How a runbook starts: manually via the console or CLI, or automatically in response to events or schedules.
  • Idempotency: A property of a runbook whereby running it multiple times leaves resources in the same end state as running it once, with no additional side effects.
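
To make the idempotency entry concrete, here is a minimal sketch in Python. It does not call AWS; the `ports` set is a hypothetical in-memory stand-in for a security group's open ports, and `ensure_port_closed` is an illustrative name, not an SSM API.

```python
from __future__ import annotations


def ensure_port_closed(open_ports: set[int], port: int) -> bool:
    """Close `port` if it is open; return True only when a change was made."""
    if port not in open_ports:
        return False  # already compliant, so do nothing
    open_ports.discard(port)
    return True


# Hypothetical stand-in for the ports a security group currently allows.
ports = {22, 443, 3389}
first = ensure_port_closed(ports, 3389)   # first run makes the change
second = ensure_port_closed(ports, 3389)  # repeat run is a no-op
print(first, second, sorted(ports))
```

The second call changes nothing, which is the behavior you want from a remediation step that may be triggered more than once for the same finding.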

The "Big Idea"

In traditional IT, when a server fails, a human receives a page, logs in, and runs a script. In the cloud, we shift to Event-Driven Operations. The "Big Idea" here is to treat operational knowledge as code. By codifying troubleshooting steps into SSM Runbooks, you reduce the Mean Time to Repair (MTTR) and eliminate human error during high-pressure incidents.


Formula / Concept Box

| Component | Description | Example |
| --- | --- | --- |
| Trigger | The "When" (event or schedule) | CloudWatch Alarm, AWS Config Rule, EventBridge |
| Logic | The "What" (steps and branching) | SSM Automation Document (YAML/JSON) |
| Target | The "Where" (affected resources) | Resource ID, Tags, Resource Group |
| Permission | The "How" (IAM capabilities) | SSM Automation Service Role (iam:PassRole) |
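
As a rough illustration of how the four components fit together, the sketch below assembles the keyword arguments for the Systems Manager `StartAutomationExecution` API as a plain Python dictionary. The instance ID and role ARN are placeholders, and `build_start_request` is a hypothetical helper name; nothing here calls AWS.

```python
import json


def build_start_request(instance_id: str, role_arn: str) -> dict:
    """Assemble keyword arguments for StartAutomationExecution."""
    return {
        # Logic: the runbook to run (an AWS-managed document named in this guide).
        "DocumentName": "AWS-RestartEC2Instance",
        "Parameters": {
            # Target: the resource to act on. Parameter values are lists of strings.
            "InstanceId": [instance_id],
            # Permission: the role Automation assumes while running the steps.
            "AutomationAssumeRole": [role_arn],
        },
    }


# Trigger: whatever calls this (a person, a schedule, or an event handler).
request = build_start_request(
    "i-0123456789abcdef0",                            # placeholder instance ID
    "arn:aws:iam::111122223333:role/AutomationRole",  # placeholder role ARN
)
print(json.dumps(request, indent=2))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("ssm").start_automation_execution(**request)`.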

Hierarchical Outline

  • SSM Automation Overview
    • Automation Documents: Structured as a series of steps (e.g., aws:executeAwsApi, aws:runCommand).
    • Action Types: Script execution, API calls, branching logic (aws:branch), and approvals.
  • Triggering Remediation
    • Amazon EventBridge: Routes state changes (e.g., EC2 Instance State Change) to SSM Automation targets.
    • AWS Config: Monitors resource compliance and triggers SSM documents when a resource is "NON_COMPLIANT."
    • Amazon CloudWatch: Alarms on metric thresholds (e.g., high CPU) start remediation, typically by routing the alarm state change through EventBridge to a runbook.
  • Runbook Varieties
    • AWS-Managed (Predefined): Built-in runbooks for common tasks (e.g., AWS-RestartEC2Instance, AWS-PatchInstanceWithRollback).
    • Custom Runbooks: User-defined documents for specialized application logic.
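
The EventBridge path in the outline above can be sketched as an event pattern plus a target definition. This is an illustrative sketch only: the account ID, the runbook name `MyRemediationRunbook`, and the role name are hypothetical, and the target ARN format should be checked against the EventBridge documentation for your runbook before use.

```python
import json

ACCOUNT_ID = "111122223333"  # placeholder account
REGION = "us-east-1"         # placeholder Region

# Match EC2 instances that enter the "stopped" state.
event_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

# Target: a custom runbook owned by this account (hypothetical name).
target = {
    "Id": "remediate-stopped-instance",
    "Arn": (
        f"arn:aws:ssm:{REGION}:{ACCOUNT_ID}:"
        "automation-definition/MyRemediationRunbook:$DEFAULT"
    ),
    # Role EventBridge assumes in order to start the automation.
    "RoleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/EventBridgeStartAutomationRole",
    "InputTransformer": {
        # Pull the instance ID out of the event...
        "InputPathsMap": {"instance": "$.detail.instance-id"},
        # ...and hand it to the runbook's InstanceId parameter. The placeholder
        # is left unquoted so EventBridge substitutes the JSON value.
        "InputTemplate": '{"InstanceId": [<instance>]}',
    },
}

print(json.dumps(event_pattern, indent=2))
print(json.dumps(target, indent=2))
```

With boto3, these would roughly correspond to `put_rule(Name=..., EventPattern=json.dumps(event_pattern))` followed by `put_targets(Rule=..., Targets=[target])` on the `events` client.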

Visual Anchors

Remediation Workflow

This flowchart illustrates how a system failure moves from detection to resolution.

Logic Structure of a Runbook

This diagram represents the internal flow of an SSM Automation document with branching logic.
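
As a stand-in for that flow, here is a hedged sketch of a small custom runbook with one `aws:branch` step, written as a Python dictionary that serializes to the JSON document format. The step names and the start-if-stopped logic are illustrative assumptions, not an AWS-provided runbook; the helper simply checks that every branch target names a real step.

```python
import json

runbook = {
    "schemaVersion": "0.3",
    "description": "Start an instance only if it is stopped (illustrative).",
    "assumeRole": "{{ AutomationAssumeRole }}",
    "parameters": {
        "InstanceId": {"type": "String"},
        "AutomationAssumeRole": {"type": "String", "default": ""},
    },
    "mainSteps": [
        {   # Step 1: read the current instance state through an API call.
            "name": "getState",
            "action": "aws:executeAwsApi",
            "inputs": {"Service": "ec2", "Api": "DescribeInstances",
                       "InstanceIds": ["{{ InstanceId }}"]},
            "outputs": [{"Name": "state", "Type": "String",
                         "Selector": "$.Reservations[0].Instances[0].State.Name"}],
        },
        {   # Step 2: branch on the value captured in step 1.
            "name": "chooseAction",
            "action": "aws:branch",
            "inputs": {
                "Choices": [{"NextStep": "startInstance",
                             "Variable": "{{ getState.state }}",
                             "StringEquals": "stopped"}],
                "Default": "noActionNeeded",
            },
        },
        {"name": "startInstance", "action": "aws:changeInstanceState",
         "inputs": {"InstanceIds": ["{{ InstanceId }}"], "DesiredState": "running"},
         "isEnd": True},
        {"name": "noActionNeeded", "action": "aws:sleep",
         "inputs": {"Duration": "PT1S"}, "isEnd": True},
    ],
}


def undefined_branch_targets(doc: dict) -> list:
    """Return branch targets (NextStep/Default) that name no existing step."""
    names = {step["name"] for step in doc["mainSteps"]}
    missing = []
    for step in doc["mainSteps"]:
        if step["action"] != "aws:branch":
            continue
        wanted = [choice["NextStep"] for choice in step["inputs"]["Choices"]]
        if "Default" in step["inputs"]:
            wanted.append(step["inputs"]["Default"])
        missing += [name for name in wanted if name not in names]
    return missing


print(undefined_branch_targets(runbook))  # an empty list means all targets exist
print(json.dumps(runbook, indent=2))
```

A local check like this is no substitute for validating the document in Systems Manager, but it catches the common mistake of a branch pointing at a renamed step.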

Definition-Example Pairs

  • Step-by-Step Execution: The ability to run runbooks one step at a time for debugging.
    • Example: Running a complex multi-account patching runbook in "Interactive Mode" to verify each account's status before proceeding.
  • Assume Role: The IAM role that Systems Manager assumes to perform actions on your behalf.
    • Example: Creating a role with ec2:StopInstances permissions so the runbook can remediate a runaway instance.
  • Dynamic Target: Using tags to define which resources a runbook should act upon.
    • Example: Targeting all instances with the tag Environment: Production for a security patch.
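
The pairs above map onto fields of the `StartAutomationExecution` request. The sketch below builds two such requests as dictionaries: one that targets instances by tag with rate control, and one that runs step by step in Interactive mode. The runbook choice (`AWS-RestartEC2Instance`), the concurrency values, and the helper names are assumptions for illustration.

```python
def build_tag_targeted_request(role_arn: str, tag_key: str, tag_value: str) -> dict:
    """Run the runbook once per instance that carries the given tag."""
    return {
        "DocumentName": "AWS-RestartEC2Instance",
        # Assume Role: the role Automation uses to act on your behalf.
        "Parameters": {"AutomationAssumeRole": [role_arn]},
        # Dynamic Target: each matching instance ID is fed to this parameter.
        "TargetParameterName": "InstanceId",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        # Rate control (illustrative values): limit blast radius.
        "MaxConcurrency": "10%",
        "MaxErrors": "1",
    }


def build_interactive_request(instance_id: str, role_arn: str) -> dict:
    """Run the runbook one step at a time for debugging."""
    return {
        "DocumentName": "AWS-RestartEC2Instance",
        "Parameters": {
            "InstanceId": [instance_id],
            "AutomationAssumeRole": [role_arn],
        },
        # Step-by-step execution: each step waits for an operator to continue.
        "Mode": "Interactive",
    }


ROLE = "arn:aws:iam::111122223333:role/AutomationRole"  # placeholder
by_tag = build_tag_targeted_request(ROLE, "Environment", "Production")
stepwise = build_interactive_request("i-0123456789abcdef0", ROLE)
print(by_tag["Targets"], stepwise["Mode"])
```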

Worked Examples

Example 1: Remediating a Public S3 Bucket

Problem: A user accidentally makes an S3 bucket public, violating security policy.

  1. Detection: AWS Config rule s3-bucket-public-read-prohibited identifies the bucket as non-compliant.
  2. Remediation Setup: Associate the AWS-managed runbook AWS-DisableS3BucketPublicReadWrite with the Config rule.
  3. Execution: AWS Config passes the non-compliant bucket's resource ID to the runbook's S3BucketName parameter automatically.
  4. Result: The runbook calls the S3 PutPublicAccessBlock API (IAM action s3:PutBucketPublicAccessBlock), securing the bucket in seconds.
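
The association in steps 2 and 3 can be sketched as the payload for the AWS Config `PutRemediationConfigurations` API. This is a minimal sketch under stated assumptions: the role ARN is a placeholder, the retry values are illustrative, and the Config rule is assumed to have been deployed under the name shown.

```python
import json

ROLE_ARN = "arn:aws:iam::111122223333:role/AutomationRole"  # placeholder

remediation = {
    "ConfigRuleName": "s3-bucket-public-read-prohibited",
    "TargetType": "SSM_DOCUMENT",
    "TargetId": "AWS-DisableS3BucketPublicReadWrite",
    "ResourceType": "AWS::S3::Bucket",
    "Parameters": {
        # Fixed value: the role Automation assumes for every remediation run.
        "AutomationAssumeRole": {"StaticValue": {"Values": [ROLE_ARN]}},
        # Dynamic value: Config substitutes the non-compliant resource's ID.
        "S3BucketName": {"ResourceValue": {"Value": "RESOURCE_ID"}},
    },
    "Automatic": True,              # remediate without a manual click
    "MaximumAutomaticAttempts": 5,  # illustrative retry budget
    "RetryAttemptSeconds": 60,      # illustrative retry window
}

print(json.dumps(remediation, indent=2))
```

With boto3, this would be sent as `boto3.client("config").put_remediation_configurations(RemediationConfigurations=[remediation])`.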

Example 2: EC2 Status Check Recovery

Problem: An EC2 instance fails its underlying system status check.

  1. Detection: A CloudWatch Alarm is set for the metric StatusCheckFailed_System.
  2. Action: The alarm action is set to "EC2 Action: Recover this instance."
  3. Remediation: Behind the scenes, AWS triggers a recovery workflow that migrates the instance to a new physical host while retaining its instance ID, private IP addresses, Elastic IP addresses, and instance metadata.
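
The alarm in steps 1 and 2 could be defined roughly as follows. The sketch only builds the keyword arguments for the CloudWatch `PutMetricAlarm` API; the alarm name, period, and evaluation settings are illustrative assumptions, and the instance ID is a placeholder.

```python
def build_recover_alarm(instance_id: str, region: str) -> dict:
    """Keyword arguments for an alarm that recovers an impaired instance."""
    return {
        "AlarmName": f"recover-{instance_id}",  # illustrative naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per datapoint (illustrative)
        "EvaluationPeriods": 2,  # two failing datapoints in a row
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in EC2 alarm action that triggers instance recovery.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


alarm = build_recover_alarm("i-0123456789abcdef0", "us-east-1")
print(alarm["AlarmActions"])
```

Note that this path uses a built-in EC2 alarm action rather than an SSM runbook; with boto3 the dictionary would be passed as `boto3.client("cloudwatch").put_metric_alarm(**alarm)`.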

Checkpoint Questions

  1. What is the difference between an SSM Runbook and an SSM Command Document?
  2. Which AWS service is best suited for triggering a remediation runbook when a specific API call is logged in CloudTrail?
  3. How does the aws:branch action improve the flexibility of a custom runbook?
  4. Why is it a best practice to use an "Automation Service Role" rather than your own user permissions to run runbooks?

[!TIP] Answer to Q1: Runbooks (Automation documents) orchestrate multi-step workflows across many AWS services and resources, while Command documents are used with Run Command to run scripts and commands inside the operating system of a managed node.

[!WARNING] The user or role that starts an automation needs the iam:PassRole permission on the Automation service role, and that role's trust policy must allow ssm.amazonaws.com to assume it. If either is missing, the automation fails with an "Access Denied" error.
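
A minimal sketch of the two IAM policy documents involved, built as Python dictionaries. The role ARN is a placeholder, and the `iam:PassedToService` condition is an optional tightening added here as an assumption about good practice, not something stated above.

```python
import json

SERVICE_ROLE_ARN = "arn:aws:iam::111122223333:role/AutomationRole"  # placeholder

# Trust policy attached to the Automation service role:
# lets Systems Manager assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ssm.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy attached to whoever starts the automation:
# lets them hand the service role to Systems Manager.
pass_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iam:PassRole",
        "Resource": SERVICE_ROLE_ARN,
        # Optional tightening: only allow passing the role to Systems Manager.
        "Condition": {
            "StringEquals": {"iam:PassedToService": "ssm.amazonaws.com"}
        },
    }],
}

print(json.dumps(trust_policy, indent=2))
print(json.dumps(pass_role_policy, indent=2))
```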
