AWS Health and Incident Management Study Guide
AWS maintains a high availability standard, typically targeting 99.9% uptime for most services. When degradations or scheduled maintenance do occur, AWS provides a suite of tools that communicate these events and support automated remediation. This guide covers the visibility and response mechanisms available to a SysOps Administrator.
Learning Objectives
- Distinguish between the Public AWS Health Dashboard and the Personal Health Dashboard (PHD).
- Explain how to aggregate health events across multiple accounts using AWS Organizations.
- Describe the role of the AWS Health API in incident response and integration with third-party tools.
- Configure automated remediation workflows using Amazon EventBridge and AWS Lambda.
- Understand the physical and logical security measures AWS employs for incident management and decommissioning.
Key Terms & Glossary
- AWS Health Dashboard (Public): A global status page showing the current state of all AWS services across all regions without requiring an AWS login.
- AWS Personal Health Dashboard (PHD): A personalized view of AWS service health specifically affecting the resources in your account.
- AWS Health API: A programmatic interface (available to Business/Enterprise support) used to ingest health data into external systems like Slack or Splunk.
- AWS Health Aware (AHA): An open-source serverless solution that automates the delivery of health alerts to communication channels.
- Automated Remediation: The process of using code (Lambda) or automated tasks (SSM) to fix issues triggered by a health event without manual intervention.
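To make the Health API and AHA entries more concrete, here is a minimal sketch of turning a health event into a chat message. It assumes the event arrives in the shape AWS Health events take when delivered through EventBridge (`detail.eventTypeCode`, `detail.affectedEntities`, and so on). The function name `format_health_alert` and all sample values are illustrative, and the `{"text": ...}` payload is the simple form accepted by Slack incoming webhooks. Nothing is sent anywhere; the sketch only builds the payload.

```python
import json


def format_health_alert(event: dict) -> dict:
    """Build a Slack-style webhook payload from an AWS Health event dict."""
    detail = event.get("detail", {})
    # Affected resources are listed under detail.affectedEntities.
    entities = [e["entityValue"] for e in detail.get("affectedEntities", [])]
    lines = [
        f"AWS Health: {detail.get('eventTypeCode', 'UNKNOWN')}",
        f"Service: {detail.get('service', 'UNKNOWN')}",
        f"Category: {detail.get('eventTypeCategory', 'UNKNOWN')}",
        f"Account/Region: {event.get('account', '?')} / {event.get('region', '?')}",
        f"Affected: {', '.join(entities) if entities else 'none listed'}",
    ]
    return {"text": "\n".join(lines)}


# Sample event; every value here is made up for illustration.
sample_event = {
    "source": "aws.health",
    "account": "123456789012",
    "region": "us-east-1",
    "detail": {
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED",
        "eventTypeCategory": "scheduledChange",
        "affectedEntities": [{"entityValue": "i-0abc1234def567890"}],
    },
}

if __name__ == "__main__":
    print(json.dumps(format_health_alert(sample_event), indent=2))
```

Keeping the formatter separate from the delivery step means the same function could feed Slack, Microsoft Teams, or a ticketing system with only the transport changed.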
The "Big Idea"
In the cloud, "everything fails all the time" (Werner Vogels). The goal of AWS Health and Incident Management is not just to see that something is broken, but to gain actionable visibility. By shifting from reactive monitoring (looking at a dashboard) to proactive automation (triggering a Lambda function when a health event occurs), organizations minimize downtime and maintain business continuity even during AWS-side infrastructure events.
Formula / Concept Box
| Tool | Primary Purpose | Key Feature |
|---|---|---|
| Service Health Dashboard | Global Visibility | No AWS login required; regional status. |
| Personal Health Dashboard | Resource Specificity | Shows impact on YOUR specific EC2/RDS instances. |
| EventBridge | Trigger Engine | Routes health events to SNS, Lambda, or SSM. |
| AWS Health API | External Integration | Requires Business/Enterprise support plan. |
| NIST 800-88 | Compliance | The standard AWS uses for decommissioning media. |
Hierarchical Outline
- I. Visibility Layers
- Public Status: Global view of all services; used for initial triage.
- Personalized Status: Logged-in view; includes local time zone conversion for events.
- Organizational View: Aggregates health events from all member accounts into the management account dashboard.
- II. Programmatic Access & Automation
- AWS Health API: Ingests events into Slack, Microsoft Teams, or ticketing systems.
- Amazon EventBridge Integration:
- Rules capture specific health events (e.g., "EC2 Scheduled Maintenance").
- Targets include AWS Lambda for code-based fixes or SSM Automation for runbooks.
- III. Physical & Incident Infrastructure
- AWS Incident Management Team: 24/7/365 coverage for infrastructure-level response.
- Data Center Security: Locations are isolated and autonomous with independent power/fire suppression.
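The EventBridge rules mentioned in section II of the outline are driven by an event pattern. The sketch below writes one such pattern as a Python dict and pairs it with a deliberately simplified matcher so the filtering logic can be tried locally. The names `HEALTH_PATTERN` and `matches` are illustrative, and the matcher handles only exact-value lists and nested objects; real EventBridge matching supports more operators than this.

```python
# Event pattern for an EventBridge rule: each key lists the accepted values.
HEALTH_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EC2"],
        "eventTypeCategory": ["scheduledChange"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: exact-value lists and nested objects only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the matching sub-object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Sample events; values are made up for illustration.
ec2_maintenance = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "EC2", "eventTypeCategory": "scheduledChange"},
}
rds_issue = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {"service": "RDS", "eventTypeCategory": "issue"},
}

if __name__ == "__main__":
    print(matches(HEALTH_PATTERN, ec2_maintenance))
    print(matches(HEALTH_PATTERN, rds_issue))
```

Serialized with `json.dumps`, a dict like `HEALTH_PATTERN` is the kind of document a rule's event pattern expects; narrowing or widening the lists controls which health events reach the Lambda or SSM target.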
Visual Anchors
Health Event Automation Flow
Public vs. Personal Health Dashboards
\begin{tikzpicture}[scale=0.8]
  \draw[thick, fill=blue!10] (0,0) circle (3cm);
  \draw[thick, fill=green!10] (1.5,0) circle (1.5cm);
  \node at (0, 2.3) {\textbf{Public Dashboard}};
  \node at (0, 1.8) {\small All Services / All Regions};
  \node at (2, 0) {\textbf{PHD}};
  \node[align=center] at (2, -0.6) {\tiny Your Specific\\ \tiny Resources};
\end{tikzpicture}
Definition-Example Pairs
- Event Aggregation: The process of collecting data from multiple sources into a single pane of glass.
- Example: A company with 50 AWS accounts uses AWS Organizations to see all EBS failure alerts in one central dashboard rather than logging into 50 different consoles.
- Service Event Guidance: Specific steps provided by AWS to mitigate a degradation.
- Example: A PHD alert for an RDS instance might include a recommendation to failover to a Multi-AZ standby because of underlying hardware issues.
- NIST 800-88: A federal standard for media sanitization.
- Example: When an AWS hard drive reaches its end-of-life or fails, AWS technicians physically destroy or magnetically wipe the drive according to NIST 800-88 to ensure no customer data is recoverable.
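The Event Aggregation entry above can be illustrated with a small sketch that groups health events from several accounts into one view. It assumes the events have already been collected as plain dicts with `account` and `eventTypeCode` keys. The function name `aggregate_by_event_type` and the sample records are hypothetical, and no AWS Organizations or Health API call is made.

```python
from collections import defaultdict


def aggregate_by_event_type(events: list[dict]) -> dict[str, list[str]]:
    """Map each event type code to the sorted account IDs it affects."""
    accounts_by_type: dict[str, set[str]] = defaultdict(set)
    for ev in events:
        # A set drops duplicate reports from the same account.
        accounts_by_type[ev["eventTypeCode"]].add(ev["account"])
    return {code: sorted(accts) for code, accts in sorted(accounts_by_type.items())}


# Sample records from three member accounts; values are made up.
sample_events = [
    {"account": "222222222222", "eventTypeCode": "AWS_EBS_VOLUME_LOST"},
    {"account": "111111111111", "eventTypeCode": "AWS_EBS_VOLUME_LOST"},
    {"account": "111111111111", "eventTypeCode": "AWS_EBS_VOLUME_LOST"},
    {
        "account": "333333333333",
        "eventTypeCode": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED",
    },
]

if __name__ == "__main__":
    for code, accounts in aggregate_by_event_type(sample_events).items():
        print(code, accounts)
```

Grouping by event type rather than by account answers the question an operator usually asks first: which accounts are hit by this particular event.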
Worked Examples
Scenario: Automated Response to EC2 Retirement
Problem: AWS sends a notification that an EC2 instance is scheduled for retirement due to hardware degradation. The admin wants to automate the restart of this instance during a maintenance window.
Step-by-Step Solution:
- Event Detection: Use Amazon EventBridge. Create a rule with the event source `aws.health` and service `EC2`.
- Define Pattern: Filter for `AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED`.
- Set Target: Select SSM Automation as the target.
- Execute Runbook: Choose the `AWS-RestartEC2Instance` runbook.
- Verification: The instance is automatically stopped and started on healthy hardware before the retirement date, and an SNS notification is sent to the team.
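The worked example wires EventBridge directly to SSM Automation. As an alternative sketch, a Lambda target could start the same runbook in code, which leaves room for extra checks before acting. The handler below assumes the EventBridge shape of an AWS Health event and uses the boto3 SSM call `start_automation_execution`. The optional `ssm` parameter and the `FakeSsm` class are illustrative additions so the logic can be exercised without AWS credentials. IAM permissions, maintenance-window scheduling, and the SNS notification are not covered here.

```python
RETIREMENT_CODE = "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED"


def handler(event, context, ssm=None):
    """Start AWS-RestartEC2Instance for each instance named in a retirement event."""
    detail = event.get("detail", {})
    if event.get("source") != "aws.health" or detail.get("eventTypeCode") != RETIREMENT_CODE:
        return {"started": []}  # Not the event this function handles.

    if ssm is None:
        # Imported lazily so the sketch can run locally with a fake client.
        import boto3
        ssm = boto3.client("ssm")

    started = []
    for entity in detail.get("affectedEntities", []):
        instance_id = entity["entityValue"]
        response = ssm.start_automation_execution(
            DocumentName="AWS-RestartEC2Instance",
            Parameters={"InstanceId": [instance_id]},
        )
        started.append(
            {"instance": instance_id, "execution": response["AutomationExecutionId"]}
        )
    return {"started": started}


class FakeSsm:
    """Stand-in client for local testing; records calls instead of calling AWS."""

    def __init__(self):
        self.calls = []

    def start_automation_execution(self, **kwargs):
        self.calls.append(kwargs)
        return {"AutomationExecutionId": f"exec-{len(self.calls)}"}


if __name__ == "__main__":
    # Sample event; the instance ID is made up.
    event = {
        "source": "aws.health",
        "detail": {
            "eventTypeCode": RETIREMENT_CODE,
            "affectedEntities": [{"entityValue": "i-0abc1234def567890"}],
        },
    }
    print(handler(event, None, ssm=FakeSsm()))
```

Injecting the client keeps the handler testable; when Lambda invokes `handler(event, context)`, the default `ssm=None` path creates a real boto3 client.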
Checkpoint Questions
- Which AWS support plans are required to access the AWS Health API directly?
- True or False: You must be logged into an AWS account to view the status of the US-EAST-1 S3 service on the public health dashboard.
- What service allows you to trigger a Lambda function automatically when a "Scheduled Change" event appears in your Health Dashboard?
- How does AWS ensure data is not recoverable when a storage device is decommissioned?
- Where can you find the AWS Health Aware (AHA) solution if you want to customize your alert delivery?
Answers
- Business and Enterprise support plans.
- False. The Public Health Dashboard is accessible to everyone at health.aws.amazon.com/health/status.
- Amazon EventBridge.
- By following the NIST 800-88 standards for media decommissioning.
- In the GitHub AWS Samples repository.