Optimizing Performance: Sizes, Speeds, and Business Requirements
Sizes and speeds needed to meet business requirements
Performance efficiency in the cloud is not a static destination but a continuous cycle of matching resource sizes and speeds to the shifting needs of a business. This guide focuses on characterizing workloads, selecting the right compute and storage classes, and understanding the metrics that define success in AWS architectures.
Learning Objectives
- Characterize workloads as compute, storage, or memory-driven to inform architectural choices.
- Evaluate business requirements using RTO (Recovery Time Objective) and RPO (Recovery Point Objective) metrics.
- Calculate required IOPS and throughput for storage volumes based on database engine page sizes.
- Differentiate between vertical and horizontal scaling and their impact on system performance.
- Identify networking limits for both internal AWS traffic and on-premises connectivity.
Key Terms & Glossary
- IOPS (Input/Output Operations Per Second): A measurement of the number of read and write operations a storage device can perform per second.
- Throughput: The amount of data (usually in MB/s or Gbps) that can be transferred from one place to another in a given time.
- RTO (Recovery Time Objective): The maximum acceptable delay between a service interruption and restoration of service (measured in time).
- RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time (e.g., "we can lose up to 15 minutes of data").
- Vertical Scaling: Increasing the capacity of a single resource, such as upgrading an EC2 instance to a larger size.
- Horizontal Scaling: Increasing capacity by adding more resources, such as adding more EC2 instances to an Auto Scaling group.
The "Big Idea"
[!IMPORTANT] Success in AWS performance design hinges on right-sizing. Over-provisioning leads to wasted cost, while under-provisioning leads to bottlenecks. By monitoring workload metrics and using the elasticity of the cloud, architects can move away from "guessing" capacity and move toward data-driven, automated scaling that meets business-defined performance goals.
Formula / Concept Box
| Concept | Formula / Rule | Notes |
|---|---|---|
| Required IOPS | $\text{Throughput (KB/s)} \div \text{Page Size (KB)}$ | Page sizes vary (MySQL = 16 KB, Oracle = 8 KB) |
| gp2 Baseline IOPS | $3 \times \text{Storage Size in GB}$ | Minimum 100, maximum 16,000 IOPS |
| Burst Duration | $\frac{\text{Credit Balance}}{3000 - (3 \times \text{Storage Size in GB})}$ | Result in seconds; for gp2 volumes under 1 TB |
| VPN Throughput | Max 1.25 Gbps | Per VPN tunnel |
| Direct Network | Up to 200 Gbps | Private AWS internal networking |
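
The gp2 rules in the table can be checked with a short Python sketch. The function names are illustrative, and the full credit balance of 5.4 million I/O credits is an assumption not stated in this guide (it is consistent with the 37.5-minute burst figure quoted for a 200 GB volume). The burst calculation uses the clamped baseline in the denominator, which matches the table's formula for any volume of roughly 34 GB or more.

```python
GP2_MIN_IOPS = 100
GP2_MAX_IOPS = 16_000
GP2_BURST_IOPS = 3_000
# Assumption: a full gp2 credit bucket holds 5.4 million I/O credits.
GP2_FULL_CREDIT_BALANCE = 5_400_000


def gp2_baseline_iops(size_gb: float) -> float:
    """Baseline IOPS: 3 per GB, clamped to the 100..16,000 range."""
    return min(max(3 * size_gb, GP2_MIN_IOPS), GP2_MAX_IOPS)


def gp2_burst_seconds(size_gb: float,
                      credit_balance: float = GP2_FULL_CREDIT_BALANCE) -> float:
    """Seconds a volume can sustain 3,000 IOPS before credits run out."""
    baseline = gp2_baseline_iops(size_gb)
    if baseline >= GP2_BURST_IOPS:
        raise ValueError("baseline already meets or exceeds the burst rate")
    # Credits drain at (burst rate - baseline) credits per second.
    return credit_balance / (GP2_BURST_IOPS - baseline)


if __name__ == "__main__":
    print(gp2_baseline_iops(200))        # baseline for a 200 GB volume
    print(gp2_burst_seconds(200) / 60)   # burst duration in minutes
```

For a 200 GB volume this gives a 600 IOPS baseline and a burst window of 2,250 seconds, or 37.5 minutes.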
Hierarchical Outline
- Workload Characterization
  - Compute-Oriented: High CPU needs (e.g., batch processing).
  - Memory-Driven: High RAM needs (e.g., in-memory databases, caching).
  - Storage-Focused: High I/O or throughput needs (e.g., data warehousing).
- Compute & Networking Optimization
  - Vertical Scaling: Resizing instances to change CPU/RAM/Network specs.
  - Horizontal Scaling: Adding instances via Parallelism (SQS, Read Replicas).
  - Networking: AWS internal speeds (200 Gbps) vs. On-premises VPN (1.25 Gbps).
- Storage Performance (RDS & EBS)
  - IOPS vs. Throughput: Understanding the inverse relationship between page size and IOPS.
  - Storage Types: General Purpose SSD (gp2) vs. Provisioned IOPS (io1/io2).
  - The Nitro System: New generation instance classes (m6i, r5) that offload virtualization to hardware.
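
To get a feel for the gap between the two networking limits in the outline, the following Python sketch estimates how long a bulk transfer would take at each rate. It is a rough, idealized calculation: it assumes decimal gigabytes, a link that sustains its full rated speed, and no protocol overhead. The function name and the 1,000 GB dataset are illustrative.

```python
def transfer_seconds(data_gb: float, link_gbps: float) -> float:
    """Idealized transfer time: data size in GB, link speed in gigabits/s."""
    data_gigabits = data_gb * 8  # 1 byte = 8 bits
    return data_gigabits / link_gbps


if __name__ == "__main__":
    dataset_gb = 1_000  # hypothetical dataset size
    for label, gbps in [("VPN tunnel", 1.25), ("AWS internal", 200)]:
        seconds = transfer_seconds(dataset_gb, gbps)
        print(f"{label}: {seconds:.0f} s ({seconds / 3600:.2f} h)")
```

Under these assumptions, 1,000 GB needs about 6,400 seconds (under two hours) over a single VPN tunnel and about 40 seconds at 200 Gbps. Real transfers will be slower.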
Visual Anchors
Scaling Strategy Decision Tree
RPO and RTO Visualized
\begin{tikzpicture}[node distance=2cm, every node/.style={font=\small}]
  % Timeline
  \draw[thick, -{Stealth}] (0,0) -- (10,0) node[right] {Time};

  % Disaster Point
  \draw[red, ultra thick] (5,-0.5) -- (5,1.5) node[above] {Disaster Event};

  % RPO
  \draw[dashed, blue, thick] (2,0) -- (2,1);
  \draw[<->, blue, thick] (2,0.5) -- (5,0.5);
  \node[blue, below] at (3.5,0) {RPO (Data Loss)};

  % RTO
  \draw[dashed, orange, thick] (8,0) -- (8,1);
  \draw[<->, orange, thick] (5,0.5) -- (8,0.5);
  \node[orange, below] at (6.5,0) {RTO (Downtime)};

  % Labels
  \node at (1, -0.5) {Last Backup};
  \node at (9, -0.5) {Service Restored};
\end{tikzpicture}
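
The timeline above maps directly onto two subtractions: data loss is the gap between the last backup and the disaster, and downtime is the gap between the disaster and service restoration. The Python sketch below compares both against business targets. The function name, the timestamps, and the one-hour RTO target are hypothetical values chosen for illustration; the 15-minute RPO echoes the glossary example.

```python
from datetime import datetime, timedelta


def recovery_report(last_backup: datetime, disaster: datetime,
                    restored: datetime, rpo: timedelta, rto: timedelta) -> dict:
    """Compare an incident's actual data loss and downtime with RPO/RTO."""
    data_loss = disaster - last_backup   # work done since the last backup is lost
    downtime = restored - disaster       # time the service was unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical incident: backup at 09:45, failure at 10:00, recovery at 11:30.
report = recovery_report(
    last_backup=datetime(2024, 1, 1, 9, 45),
    disaster=datetime(2024, 1, 1, 10, 0),
    restored=datetime(2024, 1, 1, 11, 30),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=1),
)

if __name__ == "__main__":
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this made-up incident the 15-minute RPO is just met, while the 90-minute outage misses the one-hour RTO.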
Definition-Example Pairs
- Parallelism: Designing systems so many tasks run simultaneously rather than one after another.
  - Example: Using Amazon SQS to decouple an application, allowing multiple worker instances to process messages independently from a queue.
- Nitro System: A collection of dedicated hardware and a lightweight hypervisor that delivers high performance and security for EC2 instances.
  - Example: Moving from a `db.m4` to a `db.m6i` instance to leverage the Nitro System for higher network bandwidth (up to 40 Gbps) and disk throughput.
- Burst Balance: A credit system that allows volumes to temporarily exceed their baseline performance.
  - Example: A 200 GB gp2 volume with a baseline of 600 IOPS can burst to 3,000 IOPS for approximately 37.5 minutes while the credit balance lasts.
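
The Parallelism pair above describes workers pulling messages independently from a shared queue. The Python sketch below imitates that pattern inside a single process, using the standard library's `queue.Queue` as a stand-in for Amazon SQS and threads as stand-ins for worker instances. It does not call the SQS API; the function name and the uppercase "processing" step are placeholders.

```python
import queue
import threading


def process_messages(messages, worker_count=4):
    """Drain a shared queue with several workers; return processed results."""
    work_queue = queue.Queue()
    for message in messages:
        work_queue.put(message)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                message = work_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker exits
            processed = message.upper()  # placeholder for real work
            with results_lock:
                results.append(processed)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(sorted(process_messages(["order-1", "order-2", "order-3"], worker_count=2)))
```

Because no worker depends on another, adding workers raises capacity without changing the producer. Completion order is not guaranteed, which is why the usage line sorts the results.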
Worked Examples
Example 1: Calculating IOPS for Throughput
Scenario: A MariaDB database (16 KB page size) requires a disk throughput of 100 MB/s. How many IOPS must the storage volume support?
- Convert Throughput to KB/s: $100 \text{ MB/s} \times 1024 = 102,400 \text{ KB/s}$.
- Divide by Page Size: $102,400 \text{ KB/s} \div 16 \text{ KB} = 6,400 \text{ IOPS}$.
- Conclusion: The architect must choose an EBS volume or RDS storage configuration that supports at least 6,400 IOPS.
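
The same two steps can be written as a small Python helper, which makes it easy to repeat the calculation for other page sizes. This is a sketch only: the function name is illustrative, and it assumes every I/O operation moves exactly one page.

```python
def required_iops(throughput_mb_s: float, page_size_kb: float) -> float:
    """IOPS needed to sustain a throughput when each I/O moves one page."""
    throughput_kb_s = throughput_mb_s * 1024  # MB/s -> KB/s
    return throughput_kb_s / page_size_kb


if __name__ == "__main__":
    print(required_iops(100, 16))  # MariaDB/MySQL, 16 KB pages
    print(required_iops(100, 8))   # Oracle, 8 KB pages
```

At 100 MB/s, a 16 KB page size needs 6,400 IOPS, while an 8 KB page size needs 12,800 IOPS: halving the page size doubles the IOPS required for the same throughput.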
Example 2: Sizing gp2 for Baseline Performance
Scenario: A company needs 1,200 IOPS consistently without relying on burst credits. What is the minimum size of a gp2 volume required?
- Use the 3 IOPS per GB Rule: $\text{Size} = \frac{1200 \text{ IOPS}}{3 \text{ IOPS/GB}}$.
- Calculation: $400 \text{ GB}$.
- Conclusion: A 400 GB gp2 volume will provide a steady 1,200 IOPS baseline.
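
Inverting the 3 IOPS per GB rule gives a quick sizing helper, sketched below in Python. The function name is illustrative. The sketch rounds up to a whole gigabyte, rejects targets above the 16,000 IOPS gp2 ceiling, and ignores both the 100 IOPS floor and any service-specific minimum volume size.

```python
import math

GP2_IOPS_PER_GB = 3
GP2_MAX_IOPS = 16_000


def min_gp2_size_gb(target_iops: float) -> int:
    """Smallest gp2 size (whole GB) whose baseline meets the target IOPS."""
    if target_iops > GP2_MAX_IOPS:
        raise ValueError("gp2 baseline cannot exceed 16,000 IOPS")
    return math.ceil(target_iops / GP2_IOPS_PER_GB)


if __name__ == "__main__":
    print(min_gp2_size_gb(1200))  # the scenario from Example 2
```

For the 1,200 IOPS scenario this returns 400, matching the calculation above.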
Checkpoint Questions
- What is the maximum network throughput for a private VPN connection to AWS?
- If an Oracle database uses an 8 KB page size, how many I/O operations occur when writing 16 KB of data?
- Which metric defines the amount of data loss a business can tolerate: RTO or RPO?
- What is the minimum storage size for an Amazon RDS volume?
- How does horizontal scaling differ from vertical scaling in terms of resource management?
Answers
- 1.25 Gbps.
- Two I/O operations (16 KB / 8 KB = 2).
- RPO (Recovery Point Objective).
- 20 GB.
- Horizontal scaling adds more instances (e.g., Auto Scaling), while vertical scaling increases the size of an existing instance (e.g., moving from m5.large to m5.xlarge).