Cloud Availability: Designing for Production and Non-Production Workloads

This study guide explores the critical task of determining availability requirements for various workload classes, a core skill for the AWS Certified Solutions Architect - Associate (SAA-C03) exam. It focuses on balancing reliability, complexity, and cost.

Learning Objectives

After studying this guide, you should be able to:

Define and calculate availability using "nines" terminology.
Differentiate between hard and soft dependencies in a cloud architecture.
Calculate total system availability for both serial (dependent) and parallel (redundant) components.
Align availability targets (99% to 99.99%+) with specific workload classes like production, dev/test, and disaster recovery.
Identify the trade-offs between cost, complexity, and uptime.

Key Terms & Glossary

Availability: The percentage of time a workload is available for use. It is usually measured annually.
SLA (Service Level Agreement): A contract from a provider (like AWS) defining the operational uptime goal and the compensation (credits) if that goal is missed.
Hard Dependency: A component that must be operational for the entire workload to function. If it fails, the system fails.
Soft Dependency: A component that, if it fails, allows the workload to continue operating, perhaps with degraded performance or reduced features (e.g., a read-replica).
RTO (Recovery Time Objective): The maximum acceptable delay between the interruption of service and restoration of service.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can lose 4 hours of data").

The "Big Idea"

Reliability is a core requirement in cloud architecture, but it is not free. Each additional "nine" of availability sharply increases both the cost and the complexity of the system. As an architect, your job is not to build every system to 99.999% availability, but to match the availability target to the business value of the workload. A production payroll system requires significantly more investment in redundancy than a developer's sandbox environment.

Formula / Concept Box

Calculation Type	Description	Formula
Hard Dependencies	Calculating availability for components in a series (all must work).	$A_{total} = A_1 \times A_2 \times \dots \times A_n$
Redundant Components	Calculating availability for components in parallel (only one must work).	$A_{total} = 1 - (F_1 \times F_2 \times \dots \times F_n)$ (where $F_i$ is the failure rate of component $i$, expressed as a fraction)
Failure Rate	The complement of availability.	$F = 1 - A$
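
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than an official AWS calculation: the function names are made up for this guide, availabilities are passed as fractions (0.99, not 99), and the parallel case assumes component failures are independent, which real deployments only approximate.

```python
from math import prod


def serial_availability(availabilities):
    """Hard dependencies in series: every component must work."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Redundant components in parallel: at least one must work.

    Assumes independent failures (an idealization).
    """
    failure_rates = [1 - a for a in availabilities]  # F = 1 - A
    return 1 - prod(failure_rates)


if __name__ == "__main__":
    # Two 99% components chained in series, then placed in parallel.
    print(f"serial:   {serial_availability([0.99, 0.99]):.4f}")
    print(f"parallel: {parallel_availability([0.99, 0.99]):.4f}")
```

Chaining two 99% components in series gives about 0.9801, while running the same two components redundantly gives about 0.9999, which is why hard dependencies lower availability and redundancy raises it.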

Hierarchical Outline

I. Classes of Workloads
- Production Workloads: High business impact; targets 99.9% to 99.99%+. Uses Multi-AZ and automated failover.
- Non-Production (Dev/Test): Lower impact; targets ~99%. Often uses single instances to save costs.
II. Dependency Management
- Serial (Hard) Dependencies: Multiplying availability reduces the total (e.g., $0.99 \times 0.99 = 0.9801$).
- Parallel (Redundant) Components: Using multiple AZs drastically increases availability.
III. The Multi-AZ Strategy
- Compute: Distributing EC2 instances across AZs behind an ALB.
- Database: Synchronous replication between primary and standby instances in different AZs.
IV. Measuring Success
- SLA: AWS's promise of uptime.
- RTO/RPO: Defining how fast and how much data is recovered during a disaster.

Visual Anchors

The Cost-Availability Trade-off

This diagram illustrates how costs and complexity rise as you move toward higher availability targets.

[Figure: cost and complexity versus availability target]

Workload Dependency Logic

[Diagram: workload dependency logic]

Definition-Example Pairs

99% Availability (Two Nines): A target where ~3.65 days of downtime per year is acceptable.
- Example: A non-critical internal employee training portal that is only used during business hours.
99.9% Availability (Three Nines): A target where ~8.8 hours of downtime per year is acceptable.
- Example: A production application using multiple EC2 instances but a single-instance database that requires manual recovery.
99.99% Availability (Four Nines): A target where ~52 minutes of downtime per year is acceptable.
- Example: A high-traffic e-commerce site using an Application Load Balancer, Multi-AZ EC2 instances, and Multi-AZ Amazon RDS.
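
The downtime figures for each target follow directly from the availability percentage. The sketch below shows the arithmetic, assuming a 365-day year (8,760 hours); the function name `annual_downtime_hours` is hypothetical and used only for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def annual_downtime_hours(availability_percent):
    """Hours of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = annual_downtime_hours(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

Under these assumptions, 99% allows roughly 87.6 hours (about 3.65 days), 99.9% roughly 8.76 hours, and 99.99% roughly 0.88 hours (about 52.6 minutes) of downtime per year.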

Worked Examples

Scenario 1: Calculating Hard Dependencies

Problem: You have a web application on a single EC2 instance (90% availability) connecting to an RDS instance (99.95% availability). What is the total application availability?

Solution:

Identify the availabilities: $0.90$ and $0.9995$.
Multiply them: $0.90 \times 0.9995 = 0.89955$.
Convert to percentage: approximately 89.96%.
Result: The downtime fraction is $1 - 0.89955 = 0.10045$, or about 37 days a year ($0.10045 \times 365 \approx 36.7$), which is unacceptable for production.

Scenario 2: Calculating Redundancy

Problem: You move the EC2 instances to two separate Availability Zones. Each instance has a 10% failure rate (90% availability). What is the new availability of the compute layer?

Solution:

Identify failure rates: $0.10$ for each.
Multiply failure rates: $0.10 \times 0.10 = 0.01$ (the chance both fail simultaneously, assuming the failures are independent).
Subtract from 100%: $1.00 - 0.01 = 0.99$.
Result: 99% availability for the compute layer.
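
To see how the redundant compute layer changes the end-to-end figure, the two scenarios can be combined in a short sketch. It reuses the 90% instance and 99.95% RDS figures from the problems above; the function names and the three-instance case are illustrative additions, and failures are again assumed to be independent.

```python
def redundant_availability(instance_availability, count):
    """Availability of `count` identical instances when only one must work.

    Assumes independent failures (an idealization).
    """
    return 1 - (1 - instance_availability) ** count


def end_to_end(compute_availability, database_availability):
    """Compute layer and database are hard dependencies, so multiply."""
    return compute_availability * database_availability


if __name__ == "__main__":
    rds = 0.9995  # database availability from Scenario 1
    for count in (1, 2, 3):
        compute = redundant_availability(0.90, count)
        total = end_to_end(compute, rds)
        print(f"{count} instance(s): compute {compute:.4f}, end to end {total:.6f}")
```

With two instances the compute layer reaches 0.99, but the application as a whole is about 0.9895 because the database is still a hard dependency. Adding instances keeps improving the compute layer, yet the end-to-end result can never exceed the availability of the database it depends on.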

Checkpoint Questions

If a system has a "soft dependency" on a read-replica for reporting, does a failure of that replica take the main production site down?
Which is higher: the availability of a single EC2 instance or the availability of two EC2 instances in different AZs?
An organization requires an RTO of 15 minutes. Does this refer to the amount of data they can lose or the time it takes to get back online?
Why does AWS provide service credits instead of guaranteeing 100% uptime?

Answers

No. Soft dependencies allow the workload to continue (perhaps without reporting features).
Two EC2 instances in different AZs (redundancy increases availability).
The time it takes to get back online.
Because failures are inevitable in distributed systems; SLAs define the financial consequences of those failures, not the impossibility of them.
