AWS Health and Incident Management

AWS designs its services for high availability, and each service's SLA states its own commitment (commonly 99.9% or higher). When degradations or scheduled maintenance do occur, AWS provides tools to communicate these events and to trigger automated remediation. This guide covers the visibility and response mechanisms available to a SysOps Administrator.

Learning Objectives

Distinguish between the Public AWS Health Dashboard and the Personal Health Dashboard (PHD).
Explain how to aggregate health events across multiple accounts using AWS Organizations.
Describe the role of the AWS Health API in incident response and integration with third-party tools.
Configure automated remediation workflows using Amazon EventBridge and AWS Lambda.
Understand the physical and logical security measures AWS employs for incident management and decommissioning.

Key Terms & Glossary

AWS Health Dashboard (Public): A global status page showing the current state of all AWS services across all regions without requiring an AWS login.
AWS Personal Health Dashboard (PHD): A personalized view of AWS service health showing events that affect the resources in your account. In the current console it appears as the "Your account health" view of the AWS Health Dashboard.
AWS Health API: A programmatic interface (available with Business, Enterprise On-Ramp, or Enterprise Support plans) used to pull health data into external systems such as Slack or Splunk.
AWS Health Aware (AHA): An open-source serverless solution that automates the delivery of health alerts to communication channels.
Automated Remediation: The process of using code (Lambda) or automated tasks (SSM) to fix issues triggered by a health event without manual intervention.

The "Big Idea"

In the cloud, "everything fails all the time" (Werner Vogels). The goal of AWS Health and Incident Management is not just to see that something is broken, but to gain actionable visibility. By shifting from reactive monitoring (looking at a dashboard) to proactive automation (triggering a Lambda function when a health event occurs), organizations minimize downtime and maintain business continuity even during AWS-side infrastructure events.

Formula / Concept Box

Tool	Primary Purpose	Key Feature
AWS Health Dashboard (Public)	Global Visibility	No AWS login required; shows status per service and region.
Personal Health Dashboard	Resource Specificity	Shows impact on your own resources (e.g., EC2 or RDS instances).
EventBridge	Trigger Engine	Routes health events to SNS, Lambda, or SSM Automation.
AWS Health API	External Integration	Requires a Business, Enterprise On-Ramp, or Enterprise Support plan.
NIST 800-88	Compliance	The media sanitization standard AWS follows when decommissioning storage devices.

Hierarchical Outline

I. Visibility Layers
- Public Status: Global view of all services; used for initial triage.
- Personalized Status: Logged-in view; includes local time zone conversion for events.
- Organizational View: Aggregates health events from all member accounts into the dashboard of the management account (or a delegated administrator account).
II. Programmatic Access & Automation
- AWS Health API: Lets external tools such as Slack, Microsoft Teams, or ticketing systems pull health events programmatically.
- Amazon EventBridge Integration:
  - Rules capture specific health events (e.g., "EC2 Scheduled Maintenance").
  - Targets include AWS Lambda for code-based fixes or SSM Automation for runbooks.
III. Physical & Incident Infrastructure
- AWS Incident Management Team: 24/7/365 coverage for infrastructure-level response.
- Data Center Security: Locations are isolated and autonomous with independent power/fire suppression.
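
The EventBridge integration in section II of the outline can be sketched as an event pattern. The snippet below is a minimal illustration, not a definitive rule: the pattern fields follow the AWS Health event format, while the `matches` helper and `SAMPLE_EVENT` are hypothetical names added here to check the pattern locally. The helper covers only exact-value matching, not the full EventBridge pattern syntax.

```python
import json

# Event pattern for an EventBridge rule that matches EC2 scheduled-change
# events emitted by AWS Health.
HEALTH_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {
        "service": ["EC2"],
        "eventTypeCategory": ["scheduledChange"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified local check (hypothetical helper): every pattern key must
    exist in the event, and each leaf value must be one of the allowed values.
    Real EventBridge matching supports more operators (prefix, anything-but)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical, trimmed event used only for illustration.
SAMPLE_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {
        "service": "EC2",
        "eventTypeCategory": "scheduledChange",
        "eventTypeCode": "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED",
    },
}

if __name__ == "__main__":
    # The JSON form is what would be supplied as the rule's event pattern.
    print(json.dumps(HEALTH_PATTERN, indent=2))
    print(matches(HEALTH_PATTERN, SAMPLE_EVENT))
```

Keeping the pattern broad (service plus category) and filtering on `eventTypeCode` in the target is one option; narrowing the pattern itself keeps unrelated events from invoking the target at all.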

Visual Anchors

[Figure: Health Event Automation Flow]

[Figure: Public vs. Personal Health Dashboards]

Definition-Example Pairs

Event Aggregation: The process of collecting data from multiple sources into a single pane of glass.
- Example: A company with 50 AWS accounts uses AWS Organizations to see all EBS failure alerts in one central dashboard rather than logging into 50 different consoles.
Service Event Guidance: Specific steps provided by AWS to mitigate a degradation.
- Example: A PHD alert for an RDS instance might recommend failing over to a Multi-AZ standby because of underlying hardware issues.
NIST 800-88: A federal standard for media sanitization.
- Example: When an AWS storage device fails or reaches end-of-life, AWS decommissions it using techniques described in NIST 800-88 (such as degaussing or physical destruction) so that no customer data is recoverable.
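
The Event Aggregation example above can also be approached programmatically. The sketch below assumes organizational view is enabled, the caller runs in the management (or delegated administrator) account, and the account has a support plan that includes the AWS Health API. `events_by_account` and `_pages` are hypothetical helper names; the client is passed in so the logic can be exercised without AWS credentials.

```python
from collections import defaultdict


def _pages(call, key, **kwargs):
    """Yield items from a paginated AWS Health call that uses nextToken."""
    while True:
        page = call(**kwargs)
        yield from page.get(key, [])
        token = page.get("nextToken")
        if not token:
            return
        kwargs["nextToken"] = token


def events_by_account(health, services=("EBS",)):
    """Map each affected account ID to the open event type codes that hit it.

    `health` is expected to behave like boto3.client("health").
    """
    grouped = defaultdict(list)
    event_filter = {"services": list(services), "eventStatusCodes": ["open"]}
    for event in _pages(
        health.describe_events_for_organization, "events", filter=event_filter
    ):
        # Organization events do not carry account IDs; look them up per event.
        for account_id in _pages(
            health.describe_affected_accounts_for_organization,
            "affectedAccounts",
            eventArn=event["arn"],
        ):
            grouped[account_id].append(event["eventTypeCode"])
    return dict(grouped)


# Example use (assumes boto3 is installed and credentials are configured):
#   import boto3
#   health = boto3.client("health", region_name="us-east-1")
#   print(events_by_account(health))
```

Injecting the client keeps the grouping logic independent of boto3, which makes it easy to test against a stub before pointing it at a real organization.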

Worked Examples

Scenario: Automated Response to EC2 Retirement

Problem: AWS sends a notification that an EC2 instance is scheduled for retirement due to hardware degradation. The admin wants to automate the restart of this instance during a maintenance window.

Step-by-Step Solution:

Event Detection: In Amazon EventBridge, create a rule whose event pattern matches the source aws.health and the service EC2.
Define Pattern: Narrow the pattern to the event type code AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED.
Set Target: Select Systems Manager (SSM) Automation as the rule target.
Execute Runbook: Choose the AWS-RestartEC2Instance runbook, which stops and then starts the instance.
Verification: Confirm that the instance was stopped and started before the retirement date; for an EBS-backed instance, the stop/start cycle moves it to healthy hardware. To notify the team, add an SNS topic as a second target on the rule.
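
As an alternative to targeting SSM Automation directly, the same rule could invoke a Lambda function that starts the runbook. The following is a sketch under stated assumptions: the event carries instance IDs in `detail.affectedEntities`, the function's role may call `ssm:StartAutomationExecution`, and `affected_instance_ids` and `handle_retirement` are hypothetical helper names. It does not handle the maintenance-window scheduling mentioned in the problem statement.

```python
RETIREMENT_CODE = "AWS_EC2_PERSISTENT_INSTANCE_RETIREMENT_SCHEDULED"


def affected_instance_ids(event: dict) -> list:
    """Return EC2 instance IDs listed in an AWS Health event's affectedEntities."""
    entities = event.get("detail", {}).get("affectedEntities", [])
    return [
        e["entityValue"]
        for e in entities
        if e.get("entityValue", "").startswith("i-")
    ]


def handle_retirement(event: dict, ssm_client) -> list:
    """Start the AWS-RestartEC2Instance runbook for each affected instance.

    Returns the automation execution IDs; ignores other event types.
    """
    if event.get("detail", {}).get("eventTypeCode") != RETIREMENT_CODE:
        return []
    execution_ids = []
    for instance_id in affected_instance_ids(event):
        response = ssm_client.start_automation_execution(
            DocumentName="AWS-RestartEC2Instance",
            Parameters={"InstanceId": [instance_id]},
        )
        execution_ids.append(response["AutomationExecutionId"])
    return execution_ids


def lambda_handler(event, context):
    # Imported lazily so the helpers above can be tested without boto3.
    import boto3

    return handle_retirement(event, boto3.client("ssm"))
```

The direct SSM Automation target needs less code; a Lambda target is worth considering when extra logic is needed first, such as checking tags or skipping instance store-backed instances whose local data would be lost on stop.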

Checkpoint Questions

Which AWS support plans are required to access the AWS Health API directly?
True or False: You must be logged into an AWS account to view the status of the US-EAST-1 S3 service on the public health dashboard.
What service allows you to trigger a Lambda function automatically when a "Scheduled Change" event appears in your Health Dashboard?
How does AWS ensure data is not recoverable when a storage device is decommissioned?
Where can you find the AWS Health Aware (AHA) solution if you want to customize your alert delivery?

Answers

Business, Enterprise On-Ramp, and Enterprise support plans.
False. The Public Health Dashboard is accessible to everyone at health.aws.amazon.com/health/status.
Amazon EventBridge.
By following the NIST 800-88 standards for media decommissioning.
In the GitHub AWS Samples repository.
