Identifying Metrics for High Availability (HA)

This guide covers the critical task of translating business requirements into technical metrics within the AWS ecosystem, specifically focusing on how to monitor and design for high availability as part of the SAA-C03 curriculum.

Learning Objectives

After studying this guide, you should be able to:

Distinguish between SLAs, SLOs, and SLIs in a cloud context.
Calculate workload availability based on successful vs. failed requests.
Identify key CloudWatch metrics (Latency, Throughput, Availability) required for HA.
Contrast basic monitoring (5-minute) with detailed monitoring (1-minute) and high-resolution metrics.
Map business recovery objectives (RPO/RTO) to architectural decisions.

Key Terms & Glossary

SLA (Service Level Agreement): A legal commitment between a service provider and a client regarding the expected level of service (e.g., "99.9% uptime").
SLO (Service Level Objective): An internal target for reliability (e.g., "The web server must keep average CPU utilization at or below 65%").
SLI (Service Level Indicator): The actual quantitative measure of an SLO (e.g., the specific CloudWatch metric for CPU utilization).
Availability: The percentage of time a workload is functional and accessible. For request-driven workloads it is commonly calculated as: $(Successful\ Requests / Total\ Requests) \times 100$.
High Resolution Metric: CloudWatch metrics with a sub-minute resolution (down to 1 second), useful for transient spikes.
Durability: The ability of data to remain intact and uncorrupted over time (often measured in "11 nines" for S3).

The "Big Idea"

High Availability (HA) is not a binary "on/off" state; it is a measurable performance standard driven by business needs. In AWS, HA is achieved by combining redundant infrastructure (Multi-AZ) with proactive monitoring (CloudWatch). The goal of a Solutions Architect is to select the minimum set of metrics that prove the system is meeting its business-defined "nines" while avoiding unnecessary costs from over-monitoring.

Formula / Concept Box

Concept	Formula / Rule	Business Impact
Availability (%)	$100 \times (Successes / Total\ Requests)$	Determines if you are meeting the SLA.
Compound Availability	$Availability_A \times Availability_B$	Total availability of serial dependencies.
Redundant Availability	$1 - (Unavailability_A \times Unavailability_B)$	Availability of parallel/redundant systems.
The "Nines"	99.9% = "3 Nines"	Approx. 8.76 hours downtime/year.
The "Nines"	99.99% = "4 Nines"	Approx. 52.6 minutes downtime/year.
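
The first row and the "Nines" rows above can be checked with a few lines of Python. This is a minimal sketch: the function names `availability_pct` and `downtime_per_year` are illustrative, and a 365-day year is assumed.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def availability_pct(successes: int, total: int) -> float:
    """Request-based availability: 100 * successes / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def downtime_per_year(availability_percent: float) -> timedelta:
    """Downtime per year that a given availability target still allows."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


print(availability_pct(999_500, 1_000_000))  # 99.95
print(downtime_per_year(99.9))               # about 8.76 hours
print(downtime_per_year(99.99))              # about 52.6 minutes
```

Each extra "nine" divides the allowed downtime by ten, which is why the jump from three to four nines moves the budget from hours to minutes.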

Hierarchical Outline

Defining the Business Requirement
- Availability Targets: Determining if the business needs 99% vs 99.999%.
- RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time).
- RTO (Recovery Time Objective): Maximum acceptable downtime to restore service.
Monitoring with Amazon CloudWatch
- Metric Resolutions:
  - Standard Resolution: 1-minute intervals.
  - High Resolution: 1-second intervals (useful for sensitive apps).
- Monitoring Levels:
  - Basic: 5-minute frequency (Default).
  - Detailed: 1-minute frequency (Paid feature).
Core Metrics for HA Architecture
- Compute: CPU Utilization, Memory (published only when the CloudWatch agent is installed), Status Checks.
- Storage: IOPS, Throughput, Latency, Burst Balance (EBS).
- Network: Requests per second, 4xx/5xx error rates (ALB/CLB).
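
To connect the network metrics above to an SLO, the sketch below turns raw request and error counts into an error-rate SLI and compares it with a target. The function names and the 0.1% target are assumptions for illustration; on an ALB the inputs would typically come from the `RequestCount` and `HTTPCode_Target_5XX_Count` metrics summed over the same period.

```python
def error_rate_pct(error_count: int, request_count: int) -> float:
    """Error-rate SLI: percentage of requests that returned an error."""
    if request_count == 0:
        return 0.0  # no traffic means no observed errors
    return 100.0 * error_count / request_count


def meets_slo(error_count: int, request_count: int, max_error_pct: float) -> bool:
    """True when the measured error rate is within the SLO threshold."""
    return error_rate_pct(error_count, request_count) <= max_error_pct


# Hypothetical 5-minute window: 12 server errors out of 48,000 requests.
print(error_rate_pct(12, 48_000))   # 0.025
print(meets_slo(12, 48_000, 0.1))   # True
```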

Visual Anchors

The Metric Translation Flow

Visualizing Availability vs. Downtime

This diagram illustrates how each additional "nine" cuts the allowed downtime by a factor of ten.

Definition-Example Pairs

Metric: TargetResponseTime
- Definition: The time elapsed, in seconds, after the request leaves the load balancer until a response from the target is received.
- Example: A business requires that users never wait more than 1 second for a page load. The architect sets a CloudWatch alarm on the ALB's TargetResponseTime metric to trigger at 0.8s.
Metric: SurgeQueueLength
- Definition: The total number of requests that are pending routing to a healthy instance (Classic Load Balancer).
- Example: If this metric spikes, it indicates the backend instances are overwhelmed, signaling a need to scale out the compute fleet via Auto Scaling.
Metric: BurstBalance
- Definition: Used for EBS gp2 volumes; indicates how much "burst credit" is left for high IOPS activity.
- Example: A database performing a nightly batch job may exhaust its credits. Monitoring this prevents the DB from slowing down significantly once credits hit 0%.
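
The TargetResponseTime example above can be expressed as alarm settings. The sketch below only builds the parameter dictionary; it does not call AWS. The keys match the keyword arguments of the boto3 CloudWatch client's `put_metric_alarm`, while the alarm name, the load balancer identifier, the `Average` statistic, and the three evaluation periods are assumptions chosen for illustration.

```python
def target_response_time_alarm(
    alarm_name: str,
    load_balancer: str,
    threshold_seconds: float = 0.8,
) -> dict:
    """Build alarm settings for an ALB's TargetResponseTime metric."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Average",        # assumption: alarm on the average
        "Period": 60,                  # seconds per evaluated data point
        "EvaluationPeriods": 3,        # assumption: 3 consecutive breaches
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Hypothetical identifiers; replace with values from your own account.
params = target_response_time_alarm(
    alarm_name="web-alb-slow-responses",
    load_balancer="app/my-alb/0123456789abcdef",
)
print(params["Threshold"])  # 0.8

# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Setting the threshold at 0.8s rather than at the 1-second business limit leaves time to react before users notice.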

Worked Examples

Example 1: Calculating Compound Availability

Problem: You have a web application running on an EC2 instance (99.9% availability) that depends on an RDS database (99.95% availability). What is the total theoretical availability of the system?

Solution: Since these are serial dependencies (the web app needs the DB to function), we multiply the availabilities expressed as fractions: $0.999 \times 0.9995 = 0.9985005$
Result: Approximately 99.85%. Note that the total system availability is lower than that of the weakest component (99.9%).
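
The same calculation extends to any number of serial dependencies. A minimal sketch, with `serial_availability` as an illustrative name:

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a chain where every component must be up."""
    return prod(components)


chain = [0.999, 0.9995]  # EC2 instance, RDS database (from the problem)
total = serial_availability(chain)
print(f"{total:.7f}")        # 0.9985005
print(total < min(chain))    # True: worse than the weakest component
```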

Example 2: Designing for "Six Nines"

Problem: A business requires 99.9999% availability. How is this achieved using Availability Zones (AZs)?

Solution: Assume each AZ offers 99.95% availability (0.05% unavailability) and that AZs fail independently. If the workload runs in two AZs and either one can serve all traffic, the system is down only when both fail at the same time:

Unavailability per AZ: $1 - 0.9995 = 0.0005$
Probability of both failing: $0.0005 \times 0.0005 = 0.00000025$
System Availability: $1 - 0.00000025 = 0.99999975$
Result: 99.999975%, which meets the 99.9999% ("Six Nines") target by running redundant components in parallel.
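
The redundancy formula can also be turned around to ask how many independent copies a target requires. This is a sketch under the same independence assumption; `redundant_availability` and `copies_needed` are illustrative names.

```python
def redundant_availability(availability: float, copies: int) -> float:
    """Availability of `copies` independent, parallel components."""
    return 1.0 - (1.0 - availability) ** copies


def copies_needed(availability: float, target: float) -> int:
    """Smallest number of parallel copies that reaches `target`."""
    if not (0.0 < availability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("availability and target must be between 0 and 1")
    copies = 1
    while redundant_availability(availability, copies) < target:
        copies += 1
    return copies


print(f"{redundant_availability(0.9995, 2):.8f}")  # 0.99999975
print(copies_needed(0.9995, 0.999999))             # 2
```

Real AZ failures are not perfectly independent, so the computed figure is a theoretical upper bound rather than a guarantee.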

Checkpoint Questions

What is the default monitoring interval for Amazon CloudWatch metrics for an EC2 instance?
If a business requires a Recovery Point Objective (RPO) of 0, which replication strategy must be used for their database?
A user reports that their application is slow, but CloudWatch shows low CPU utilization. Which other two categories of metrics should the architect check?
How long does CloudWatch retain metrics stored at a 1-minute resolution?
What is the mathematical difference between calculating availability for serial vs. redundant components?

[!TIP] Answers for Recall:

5 minutes (Basic Monitoring).

Synchronous replication (e.g., Multi-AZ RDS).

Disk I/O (Latency/IOPS) and Network In/Out.

15 days.

Serial = Multiplied probabilities ( $A \times B$ ); Redundant = 1 - (Unavailability Product).

AWS Study Guide: Metrics for Highly Available Solutions
