AWS Study Guide: Metrics for Highly Available Solutions

Identifying metrics based on business requirements to deliver a highly available solution

Identifying Metrics for High Availability (HA)

This guide covers the critical task of translating business requirements into technical metrics within the AWS ecosystem, specifically focusing on how to monitor and design for high availability as part of the SAA-C03 curriculum.

Learning Objectives

After studying this guide, you should be able to:

  • Distinguish between SLAs, SLOs, and SLIs in a cloud context.
  • Calculate workload availability based on successful vs. failed requests.
  • Identify key CloudWatch metrics (Latency, Throughput, Availability) required for HA.
  • Contrast basic monitoring (5-minute) with detailed monitoring (1-minute) and high-resolution metrics.
  • Map business recovery objectives (RPO/RTO) to architectural decisions.

Key Terms & Glossary

  • SLA (Service Level Agreement): A legal commitment between a service provider and a client regarding the expected level of service (e.g., "99.9% uptime").
  • SLO (Service Level Objective): An internal, measurable reliability target, usually set tighter than the SLA it supports (e.g., "Average web server CPU utilization stays at or below 65%").
  • SLI (Service Level Indicator): The actual quantitative measure of an SLO (e.g., the specific CloudWatch metric for CPU utilization).
  • Availability: The percentage of time a workload is functional and accessible, calculated as: $(\text{Successful Requests} / \text{Total Requests}) \times 100$.
  • High Resolution Metric: CloudWatch metrics with a sub-minute resolution (down to 1 second), useful for transient spikes.
  • Durability: The ability of data to remain intact and uncorrupted over time (often measured in "11 nines" for S3).
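The availability definition above can be turned into a small SLI calculation. The following is a minimal sketch, assuming request counts have already been collected (for example from load balancer metrics); the function names and the sample numbers are illustrative and do not come from any AWS API.

```python
def availability_percent(successful_requests: int, total_requests: int) -> float:
    """Return availability as (successful / total) * 100."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return successful_requests / total_requests * 100


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """Compare the measured indicator (SLI) against the target (SLO)."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    # Illustrative numbers only: 999,412 successes out of 1,000,000 requests.
    sli = availability_percent(999_412, 1_000_000)
    print(f"SLI = {sli:.4f}%, meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```

Keeping the SLI calculation separate from the SLO comparison mirrors the glossary: the indicator is a measurement, while the objective is the threshold it is judged against.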

The "Big Idea"

High Availability (HA) is not a binary "on/off" state; it is a measurable performance standard driven by business needs. In AWS, HA is achieved by combining redundant infrastructure (Multi-AZ) with proactive monitoring (CloudWatch). The goal of a Solutions Architect is to select the minimum set of metrics that prove the system is meeting its business-defined "nines" while avoiding unnecessary costs from over-monitoring.

Formula / Concept Box

| Concept | Formula / Rule | Business Impact |
| --- | --- | --- |
| Availability (%) | $100 \times (\text{Successes} / \text{Total Requests})$ | Determines if you are meeting the SLA. |
| Compound Availability | $Availability_A \times Availability_B$ | Total availability of serial dependencies. |
| Redundant Availability | $1 - (Unavailability_A \times Unavailability_B)$ | Availability of parallel/redundant systems. |
| The "Nines" | 99.9% = "3 Nines" | Approx. 8h 45m downtime/year. |
| The "Nines" | 99.99% = "4 Nines" | Approx. 52m downtime/year. |
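The downtime figures for the "nines" follow directly from the availability percentage. Here is a rough sketch of that arithmetic, assuming a 365-day year (leap years ignored); the helper name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each additional nine divides the downtime budget by ten, which is why moving from 99.9% to 99.99% shrinks the allowance from hours to under an hour per year.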

Hierarchical Outline

  1. Defining the Business Requirement
    • Availability Targets: Determining if the business needs 99% vs 99.999%.
    • RPO (Recovery Point Objective): Maximum acceptable data loss (measured in time).
    • RTO (Recovery Time Objective): Maximum acceptable downtime to restore service.
  2. Monitoring with Amazon CloudWatch
    • Metric Resolutions:
      • Standard Resolution: 1-minute intervals.
      • High Resolution: 1-second intervals (useful for sensitive apps).
    • Monitoring Levels:
      • Basic: 5-minute frequency (Default).
      • Detailed: 1-minute frequency (Paid feature).
  3. Core Metrics for HA Architecture
    • Compute: CPU Utilization, Memory (requires the CloudWatch agent; EC2 does not publish it by default), Status Checks.
    • Storage: IOPS, Throughput, Latency, Burst Balance (EBS).
    • Network: Requests per second, 4xx/5xx error rates (ALB/CLB).
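For custom application metrics, the standard versus high-resolution choice in the outline above is made per data point through the `StorageResolution` field of CloudWatch's `PutMetricData` call. The sketch below only builds the request parameters and does not contact AWS. The namespace `ExampleApp/Checkout` and metric name `RequestLatency` are hypothetical placeholders.

```python
from datetime import datetime, timezone


def build_put_metric_data_kwargs(latency_ms: float, high_resolution: bool = False) -> dict:
    """Build keyword arguments for CloudWatch PutMetricData (custom metric)."""
    return {
        "Namespace": "ExampleApp/Checkout",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "RequestLatency",  # hypothetical metric name
                "Timestamp": datetime.now(timezone.utc),
                "Value": latency_ms,
                "Unit": "Milliseconds",
                # 1 = high resolution (1-second), 60 = standard resolution (1-minute)
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }


if __name__ == "__main__":
    kwargs = build_put_metric_data_kwargs(12.5, high_resolution=True)
    print(kwargs)
    # With boto3 installed and credentials configured, this could be sent as:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_data(**kwargs)
```

Basic and detailed monitoring, by contrast, are settings on AWS-published metrics such as those for EC2 instances, so they are enabled on the resource rather than per data point.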

Visual Anchors

The Metric Translation Flow

[Diagram: The Metric Translation Flow]

Visualizing Availability vs. Downtime

[Diagram: allowed downtime per year plotted against the number of "nines". Each additional nine cuts the allowed downtime by a factor of ten.]

Definition-Example Pairs

  • Metric: TargetResponseTime

    • Definition: The time elapsed, in seconds, after the request leaves the load balancer until a response from the target is received.
    • Example: A business requires that users never wait more than 1 second for a page load. The architect sets a CloudWatch alarm on the ALB's TargetResponseTime metric to trigger at 0.8s.
  • Metric: SurgeQueueLength

    • Definition: The total number of requests that are pending routing to a healthy instance (Classic Load Balancer).
    • Example: If this metric spikes, it indicates the backend instances are overwhelmed, signaling a need to scale out the compute fleet via Auto Scaling.
  • Metric: BurstBalance

    • Definition: Applies to EBS gp2 volumes (and to st1/sc1 for throughput credits); reports the percentage of burst credits remaining in the volume's burst bucket.
    • Example: A database performing a nightly batch job may exhaust its credits. Monitoring this prevents the DB from slowing down significantly once credits hit 0%.
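The TargetResponseTime example above (alarm at 0.8s against a 1-second business limit) could be expressed as CloudWatch alarm parameters roughly as follows. This sketch only builds the arguments for `PutMetricAlarm`. The alarm name, the load balancer dimension value, the `Average` statistic, and the three 1-minute evaluation periods are assumptions chosen for illustration, not values from the guide.

```python
def target_response_time_alarm(load_balancer: str, threshold_seconds: float = 0.8) -> dict:
    """Build keyword arguments for a CloudWatch alarm on ALB TargetResponseTime."""
    return {
        "AlarmName": "alb-target-response-time-high",  # hypothetical name
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Average",        # assumption; a percentile may suit strict targets better
        "Period": 60,                  # seconds per evaluation window
        "EvaluationPeriods": 3,        # assumption: 3 consecutive breaches before alarming
        "Threshold": threshold_seconds,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


if __name__ == "__main__":
    # "app/example-alb/0123456789abcdef" is a made-up dimension value.
    kwargs = target_response_time_alarm("app/example-alb/0123456789abcdef")
    print(kwargs)
    # With boto3 installed and credentials configured, this could be created with:
    #   import boto3
    #   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Setting the threshold below the business limit leaves headroom to react (for example by scaling out) before users actually see a breach.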

Worked Examples

Example 1: Calculating Compound Availability

Problem: You have a web application running on an EC2 instance (99.9% availability) that depends on an RDS database (99.95% availability). What is the total theoretical availability of the system?

Solution: Since these are serial dependencies (the web app needs the DB to function), we multiply the availabilities: $0.999 \times 0.9995 = 0.9985005$

Result: Approximately 99.85%. Note that the total system availability is lower than the availability of its weakest component (99.9%).
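The same multiplication extends to any number of serial dependencies. A minimal sketch, with availabilities expressed as fractions between 0 and 1 (the function name is illustrative):

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain where every component must be up."""
    if not availabilities:
        raise ValueError("at least one component is required")
    if any(not 0 <= a <= 1 for a in availabilities):
        raise ValueError("availabilities must be fractions between 0 and 1")
    return prod(availabilities)


if __name__ == "__main__":
    total = serial_availability(0.999, 0.9995)  # EC2 tier, then RDS tier
    print(f"Serial availability: {total:.7f} ({total * 100:.2f}%)")
```

Because every factor is at most 1, adding another serial dependency can only lower the product, never raise it.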

Example 2: Designing for "Six Nines"

Problem: A business requires 99.9999% availability. How is this achieved using Availability Zones (AZs)?

Solution: If one AZ has an availability of 99.95% (0.05% downtime), operating across two independent AZs allows you to calculate the probability of both failing:

  1. Unavailability per AZ: $1 - 0.9995 = 0.0005$
  2. Probability of both failing: $0.0005 \times 0.0005 = 0.00000025$
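  3. System Availability: $1 - 0.00000025 = 0.99999975$

Result: Approximately 99.999975%, which exceeds the 99.9999% ("Six Nines") target by running redundant components in parallel. The calculation assumes the two AZs fail independently of each other.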
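The redundancy calculation can be generalized to ask how many AZs a target requires. The sketch below assumes every AZ has the same availability and that failures are fully independent, which is an idealization; the function names are illustrative.

```python
def redundant_availability(az_availability: float, az_count: int) -> float:
    """Availability when the system is up if at least one of az_count AZs is up."""
    if not 0 <= az_availability <= 1:
        raise ValueError("az_availability must be a fraction between 0 and 1")
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    return 1 - (1 - az_availability) ** az_count


def min_az_count(az_availability: float, target: float) -> int:
    """Smallest number of independent AZs whose combined availability meets target."""
    if not 0 < az_availability <= 1:
        raise ValueError("az_availability must be in (0, 1]")
    if not 0 <= target < 1:
        raise ValueError("target must be in [0, 1)")
    count = 1
    while redundant_availability(az_availability, count) < target:
        count += 1
    return count


if __name__ == "__main__":
    print(redundant_availability(0.9995, 2))
    print(min_az_count(0.9995, 0.999999))
```

With a per-AZ availability of 0.9995, two AZs are enough for a six-nines target under these assumptions; correlated failures (such as a regional outage) would make the real figure lower.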

Checkpoint Questions

  1. What is the default monitoring interval for Amazon CloudWatch metrics for an EC2 instance?
  2. If a business requires a Recovery Point Objective (RPO) of 0, which replication strategy must be used for their database?
  3. A user reports that their application is slow, but CloudWatch shows low CPU utilization. Which other two categories of metrics should the architect check?
  4. How long does CloudWatch retain metrics stored at a 1-minute resolution?
  5. What is the mathematical difference between calculating availability for serial vs. redundant components?

[!TIP] Answers for Recall:

  1. 5 minutes (Basic Monitoring).
  2. Synchronous replication (e.g., Multi-AZ RDS).
  3. Disk I/O (Latency/IOPS) and Network In/Out.
  4. 15 days.
  5. Serial = Multiplied probabilities ($A \times B$); Redundant = 1 - (Unavailability Product).