AWS Study Guide: Metrics for Highly Available Solutions
Identifying metrics based on business requirements to deliver a highly available solution
Identifying Metrics for High Availability (HA)
This guide covers the critical task of translating business requirements into technical metrics within the AWS ecosystem, specifically focusing on how to monitor and design for high availability as part of the SAA-C03 curriculum.
Learning Objectives
After studying this guide, you should be able to:
- Distinguish between SLAs, SLOs, and SLIs in a cloud context.
- Calculate workload availability based on successful vs. failed requests.
- Identify key CloudWatch metrics (Latency, Throughput, Availability) required for HA.
- Contrast basic monitoring (5-minute) with detailed monitoring (1-minute) and high-resolution metrics.
- Map business recovery objectives (RPO/RTO) to architectural decisions.
Key Terms & Glossary
- SLA (Service Level Agreement): A legal commitment between a service provider and a client regarding the expected level of service (e.g., "99.9% uptime").
- SLO (Service Level Objective): An internal, measurable target for reliability (e.g., "The web server must keep average CPU utilization below 65%").
- SLI (Service Level Indicator): The actual quantitative measure of an SLO (e.g., the specific CloudWatch metric for CPU utilization).
- Availability: The percentage of time a workload is functional and accessible, calculated as: $100 \times (Successes / Total\ Requests)$.
- High Resolution Metric: CloudWatch metrics with a sub-minute resolution (down to 1 second), useful for transient spikes.
- Durability: The ability of data to remain intact and uncorrupted over time (often measured in "11 nines" for S3).
The "Big Idea"
High Availability (HA) is not a binary "on/off" state; it is a measurable performance standard driven by business needs. In AWS, HA is achieved by combining redundant infrastructure (Multi-AZ) with proactive monitoring (CloudWatch). The goal of a Solutions Architect is to select the minimum set of metrics that prove the system is meeting its business-defined "nines" while avoiding unnecessary costs from over-monitoring.
Formula / Concept Box
| Concept | Formula / Rule | Business Impact |
|---|---|---|
| Availability (%) | $100 \times (Successes / Total\ Requests)$ | Determines if you are meeting the SLA. |
| Compound Availability | $Availability_A \times Availability_B$ | Total availability of serial dependencies. |
| Redundant Availability | $1 - (Unavailability_A \times Unavailability_B)$ | Availability of parallel/redundant systems. |
| The "Nines" | 99.9% = "3 Nines" | Approx. 8h 45m downtime/year. |
| The "Nines" | 99.99% = "4 Nines" | Approx. 52m downtime/year. |
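The formulas in the table can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative (not part of any AWS SDK), and the downtime figure assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def availability_pct(successes: int, total: int) -> float:
    """Availability (%) = 100 * successes / total requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


def serial_availability(*components: float) -> float:
    """Compound availability of serial dependencies: product of availabilities."""
    result = 1.0
    for a in components:
        result *= a
    return result


def redundant_availability(*components: float) -> float:
    """Availability of parallel components: 1 - product of unavailabilities."""
    unavailable = 1.0
    for a in components:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability given as a fraction (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    print(round(availability_pct(9_990, 10_000), 3))
    print(round(serial_availability(0.999, 0.9995), 7))
    print(round(redundant_availability(0.9995, 0.9995), 8))
    print(round(downtime_minutes_per_year(0.9999), 2))
```

Availabilities are passed as fractions (0.999, not 99.9) so that the serial and redundant formulas compose without unit conversions.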
Hierarchical Outline
- Defining the Business Requirement
  - Availability Targets: Determining if the business needs 99% vs 99.999%.
  - RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time).
  - RTO (Recovery Time Objective): Maximum acceptable downtime to restore service.
- Monitoring with Amazon CloudWatch
  - Metric Resolutions:
    - Standard Resolution: 1-minute intervals.
    - High Resolution: 1-second intervals (useful for latency-sensitive apps).
  - Monitoring Levels:
    - Basic: 5-minute frequency (Default).
    - Detailed: 1-minute frequency (Paid feature).
- Core Metrics for HA Architecture
  - Compute: CPU Utilization, Memory (requires the CloudWatch agent), Status Checks.
  - Storage: IOPS, Throughput, Latency, Burst Balance (EBS).
  - Network: Requests per second, 4xx/5xx error rates (ALB/CLB).
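One way to connect monitoring frequency to RTO is to estimate how long an alarm needs before it can fire. The sketch below is a deliberate simplification: it treats detection time as the metric period multiplied by the number of evaluation periods and ignores metric delivery delays. The function names and the `recovery_seconds` estimate are hypothetical.

```python
def detection_seconds(period_seconds: int, evaluation_periods: int) -> int:
    """Rough time for an alarm to observe enough breaching data points."""
    return period_seconds * evaluation_periods


def fits_rto(period_seconds: int, evaluation_periods: int,
             recovery_seconds: int, rto_seconds: int) -> bool:
    """True if detection plus the estimated recovery time stays within the RTO."""
    total = detection_seconds(period_seconds, evaluation_periods) + recovery_seconds
    return total <= rto_seconds


if __name__ == "__main__":
    # Assumed scenario: 10-minute RTO, 5-minute recovery, 3 evaluation periods.
    for label, period in (("basic (5-minute)", 300), ("detailed (1-minute)", 60)):
        print(label, detection_seconds(period, 3), fits_rto(period, 3, 300, 600))
```

Under these assumptions, basic monitoring alone consumes more than the whole RTO budget on detection, which is the kind of trade-off that justifies paying for detailed monitoring.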
Visual Anchors
The Metric Translation Flow
Visualizing Availability vs. Downtime
This diagram illustrates how increasing "nines" exponentially decreases allowed downtime.
Definition-Example Pairs
- Metric: TargetResponseTime
  - Definition: The time elapsed, in seconds, after the request leaves the load balancer until a response from the target is received.
  - Example: A business requires that users never wait more than 1 second for a page load. The architect sets a CloudWatch alarm on the ALB's TargetResponseTime metric to trigger at 0.8s.
- Metric: SurgeQueueLength
  - Definition: The total number of requests that are pending routing to a healthy instance (Classic Load Balancer).
  - Example: If this metric spikes, it indicates the backend instances are overwhelmed, signaling a need to scale out the compute fleet via Auto Scaling.
- Metric: BurstBalance
  - Definition: Used for EBS gp2 volumes; indicates how much "burst credit" is left for high IOPS activity.
  - Example: A database performing a nightly batch job may exhaust its credits. Monitoring this prevents the DB from slowing down significantly once credits hit 0%.
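The TargetResponseTime example above could be expressed as alarm parameters. The sketch below only builds the parameter dictionary so it runs without AWS credentials. The alarm name, the load balancer dimension value, the choice of the `Average` statistic, and the three 1-minute evaluation periods are all assumptions for illustration; the keys follow the CloudWatch `PutMetricAlarm` API.

```python
def target_response_time_alarm(load_balancer: str,
                               alarm_name: str = "alb-slow-responses",
                               threshold_seconds: float = 0.8,
                               period_seconds: int = 60,
                               evaluation_periods: int = 3) -> dict:
    """Build PutMetricAlarm parameters for an ALB latency alarm."""
    return {
        "AlarmName": alarm_name,  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        # The LoadBalancer dimension value looks like "app/<name>/<id>".
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Average",  # assumption; a percentile may fit an SLO better
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


if __name__ == "__main__":
    params = target_response_time_alarm("app/my-alb/1234567890abcdef")  # hypothetical ALB
    print(params["MetricName"], params["Threshold"])
    # With boto3 installed and credentials configured, the alarm could be created with:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Setting the threshold at 0.8s rather than 1s leaves headroom so the alarm fires before the business limit is breached.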
Worked Examples
Example 1: Calculating Compound Availability
Problem: You have a web application running on an EC2 instance (99.9% availability) that depends on an RDS database (99.95% availability). What is the total theoretical availability of the system?
Solution: Since these are serial dependencies (the web app needs the DB to function), we multiply the availabilities: $0.999 \times 0.9995 = 0.9985005$
Result: Approximately 99.85%. Note that the total system availability is lower than that of the weakest link (99.9%).
Example 2: Designing for "Six Nines"
Problem: A business requires 99.9999% availability. How is this achieved using Availability Zones (AZs)?
Solution: If one AZ has an availability of 99.95% (0.05% downtime), operating across two independent AZs allows you to calculate the probability of both failing:
- Unavailability per AZ: $1 - 0.9995 = 0.0005$
- Probability of both failing: $0.0005 \times 0.0005 = 0.00000025$
- System Availability: $1 - 0.00000025 = 0.99999975$

Result: This yields 99.999975%, which meets the 99.9999% ("Six Nines") target by utilizing redundant components in parallel.
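The same reasoning can be turned around to ask how many redundant copies a target requires. This is a minimal sketch that, like the worked example, assumes component failures are fully independent; the function name and the `max_copies` safety limit are illustrative.

```python
def redundant_copies_needed(component_availability: float,
                            target_availability: float,
                            max_copies: int = 10) -> int:
    """Smallest number of independent parallel copies that meets the target."""
    if not 0.0 < component_availability < 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be between 0 and 1")

    allowed_unavailability = 1.0 - target_availability
    combined_unavailability = 1.0
    for copies in range(1, max_copies + 1):
        # Every copy must fail at once for the system to be unavailable.
        combined_unavailability *= 1.0 - component_availability
        if combined_unavailability <= allowed_unavailability:
            return copies
    raise ValueError("target not reachable within max_copies")


if __name__ == "__main__":
    print(redundant_copies_needed(0.9995, 0.999999))
```

Real AZ failures are not perfectly independent (shared control planes, regional events), so the result is best read as a lower bound on the redundancy required.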
Checkpoint Questions
- What is the default monitoring interval for Amazon CloudWatch metrics for an EC2 instance?
- If a business requires a Recovery Point Objective (RPO) of 0, which replication strategy must be used for their database?
- A user reports that their application is slow, but CloudWatch shows low CPU utilization. Which other two categories of metrics should the architect check?
- How long does CloudWatch retain metrics stored at a 1-minute resolution?
- What is the mathematical difference between calculating availability for serial vs. redundant components?
[!TIP] Answers for Recall:
- 5 minutes (Basic Monitoring).
- Synchronous replication (e.g., Multi-AZ RDS).
- Disk I/O (Latency/IOPS) and Network In/Out.
- 15 days.
- Serial = Multiplied probabilities ($A \times B$); Redundant = 1 - (Unavailability Product).