Chapter Study Guide: Event-Driven Remediation on AWS

This guide covers the automation of operational tasks and incident responses using AWS services, a core domain for the SysOps Administrator Associate (SOA-C03) exam.

Learning Objectives

By the end of this study guide, you should be able to:

Configure Amazon EventBridge rules to trigger automated remediation.
Route events to targets such as AWS Lambda and AWS Systems Manager (SSM) Automation.
Implement automated instance recovery using EC2 status checks.
Manage fleet-wide updates and patching using SSM Patch Manager.
Integrate AWS Health and Security Hub findings into automated notification and remediation workflows.

Key Terms & Glossary

EventBridge: A serverless event bus that makes it easy to connect applications using data from your own applications, integrated SaaS applications, and AWS services.
SSM Automation: A capability of AWS Systems Manager that simplifies common maintenance and deployment tasks of Amazon EC2 instances and other AWS resources.
Runbook: A document (JSON or YAML) that defines the actions that Systems Manager performs on your managed instances and other AWS resources.
Remediation: The process of correcting a fault or deficiency (e.g., automatically restarting a failed service or closing an open security group).
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time.
RTO (Recovery Time Objective): The maximum acceptable amount of time to restore a system after a failure.

The "Big Idea"

In traditional IT, operations are often reactive: a human receives an alert and manually fixes the problem. In the cloud, we shift to event-driven remediation. This means infrastructure monitors its own state; when a failure or state change occurs (an event), AWS automatically triggers a script or workflow to fix it without human intervention. This minimizes MTTR (Mean Time To Repair) and ensures consistency across large-scale environments.

Formula / Concept Box

Concept	Description	Core Components
Event Pattern	The JSON structure used by EventBridge to match incoming events.	Source, Detail-Type, Detail
Remediation Target	The service that executes the fix.	Lambda, SSM Automation, Step Functions
EC2 Auto-Recovery	Specific action for hardware failure.	CloudWatch Alarm + `StatusCheckFailed_System`
AWS Config Rules	Monitoring resource configurations.	Trigger -> Evaluation -> Remediation

Hierarchical Outline

Event-Driven Components
- Event Sources: CloudWatch Alarms, AWS Config, Security Hub, AWS Health.
- The Bus: Amazon EventBridge (Default, Custom, or SaaS buses).
- The Targets: Lambda (custom code) or SSM (predefined runbooks).
Systems Manager (SSM) Operations
- Automation Runbooks: Executing multi-step workflows.
- Patch Manager: Automating security updates for EC2 fleets.
- Inventory: Tracking software and configurations.
High Availability & Continuity
- Multi-AZ Deployments: Automatic failover for RDS/Aurora.
- Backup & Restore: Using AWS Backup for centralized protection.
- S3 Replication: Cross-Region Replication (CRR) for regional resilience.

Visual Anchors

Event-Driven Remediation Workflow

Loading Diagram...

EC2 Auto-Recovery Logic

Compiling TikZ diagram…

⏳

Running TeX engine…

This may take a few seconds

Definition-Example Pairs

SSM Document (Runbook): A definition of the steps to take during automation.
- Example: AWS-RestartEC2Instance is a managed runbook that can be triggered when a health check fails.
CloudFormation Drift: When the actual state of a resource differs from the template it was deployed with.
- Example: A user manually changes a Security Group rule. CloudFormation detects this "drift" and can be configured to alert or remediate.
AWS Health Events: Communications regarding service issues or scheduled maintenance.
- Example: AWS notifies you of an upcoming EC2 retirement; EventBridge triggers a Lambda to migrate the workload during off-peak hours.

Worked Examples

Example 1: Remediating Unencrypted S3 Buckets

Scenario: Your company policy forbids unencrypted S3 buckets. You want to automate the fix.

Monitor: AWS Config is enabled with the managed rule s3-bucket-server-side-encryption-enabled.
Detect: A user creates a bucket without encryption. AWS Config marks the bucket as "Non-compliant."
Trigger: An EventBridge rule listens for AWS Config Compliance Change events.
Action: The rule triggers an SSM Automation runbook AWS-EnableS3BucketEncryption.
Verification: The runbook applies AES-256 encryption. The next Config evaluation marks the bucket "Compliant."

Example 2: EC2 Memory Alert Remediation

Scenario: An application has a known memory leak. When memory usage > 90%, the service must restart.

Monitor: Install the CloudWatch Agent on the EC2 instance to collect the mem_used_percent metric.
Alarm: Create a CloudWatch Alarm for memory usage > 90%.
Action: Configure the Alarm to send a notification to an SNS Topic.
Target: An AWS Lambda function subscribed to the SNS Topic executes a Remote Shell command via SSM Run Command to restart the specific service.

Checkpoint Questions

What is the primary difference between a System Status Check and an Instance Status Check in EC2?
Which service should you use to automate patching across a fleet of 500 EC2 instances?
How does Security Hub interact with Amazon EventBridge to perform remediation?
What is the benefit of using SSM Automation over a custom Lambda function for simple tasks like restarting an instance?
Explain the difference between Pilot Light and Warm Standby disaster recovery strategies in terms of RTO.