Unit 3: Continuous Improvement for Existing Solutions — SAP-C02 Study Guide
This guide focuses on Domain 3 of the AWS Certified Solutions Architect - Professional (SAP-C02) exam, covering the strategies necessary to evolve and optimize existing cloud workloads across operational excellence, security, performance, reliability, and cost.
Learning Objectives
After studying this guide, you should be able to:
- Design a strategy to improve operational excellence using runbooks and playbooks.
- Evaluate and implement security improvements for existing production workloads.
- Identify performance bottlenecks and propose architectural enhancements.
- Enhance reliability through automation and post-incident analysis.
- Execute cost optimization via rightsizing, storage tiering, and pricing model selection.
Key Terms & Glossary
- Runbook: A documented procedure to achieve a specific, often routine, outcome (e.g., a deployment or a backup check).
- Playbook: A documented process used to investigate and resolve unexpected issues or failures.
- Infrastructure as Code (IaC): The practice of managing and provisioning infrastructure through machine-readable definition files (e.g., AWS CloudFormation, CDK).
- Rightsizing: The process of matching instance types and sizes to your workload performance and capacity requirements at the lowest possible cost.
- Drift Detection: The identification of changes in infrastructure that cause it to deviate from its intended state defined in IaC.
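Drift detection comes down to comparing a declared state with an observed one. The sketch below is a minimal illustration of that idea only; it does not call CloudFormation's drift detection, and the property names and values are hypothetical.

```python
from typing import Any


def detect_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {property: (desired_value, actual_value)} for every mismatch.

    A property missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical template values versus what is actually deployed.
template = {"InstanceType": "m5.large", "Monitoring": True}
deployed = {"InstanceType": "m5.2xlarge", "Monitoring": True, "Tag:Owner": "ops"}

drift_report = detect_drift(template, deployed)
print(drift_report)
```

Here the report flags the changed instance type and the manually added tag, which are the kinds of out-of-band changes an improvement cycle would reconcile back into the IaC definition.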
The "Big Idea"
[!IMPORTANT] Cloud architecture is not a static event; it is a continuous lifecycle. The "Big Idea" here is that an architect's job only truly begins once a solution is live. By leveraging the feedback loops of monitoring and automated remediation, organizations move from "reactive firefighting" to "proactive evolution."
Formula / Concept Box
| Concept | Core Rule / Formula |
|---|---|
| Cost Optimization | $\text{Cost} = \text{Quantity} \times \text{Unit Price}$. Focus on reducing Quantity (rightsizing) and Unit Price (Savings Plans). |
| Availability Goal | $\text{Availability} = \dfrac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$. Professional focus: Minimize MTTR (Mean Time To Repair) via automation. |
| Data Transfer | Internal > Intra-region > Inter-region > Internet (in order of preference). Always architect to keep data as close to the compute as possible to minimize latency and cost. |
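To see why minimizing MTTR matters, the standard steady-state relation Availability = MTBF / (MTBF + MTTR) can be evaluated for a manual and an automated recovery. This is a rough sketch: the relation is the common reliability-engineering formula, and the MTBF and MTTR figures are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical workload that fails about once every 30 days (720 hours).
manual = availability(mtbf_hours=720, mttr_hours=4)        # engineer paged, fixes by hand
automated = availability(mtbf_hours=720, mttr_hours=0.25)  # automated remediation

print(f"manual recovery:    {manual:.4%}")
print(f"automated recovery: {automated:.4%}")
```

With the failure rate held constant, shortening the repair time alone raises availability, which is why automation is the lever emphasized for existing workloads.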
Hierarchical Outline
- Operational Excellence (Task 3.1)
  - Automation: Use Systems Manager and EventBridge for self-healing.
  - Documentation: Transition from manual wiki pages to Runbooks-as-Code.
- Security Improvement (Task 3.2)
  - Visibility: Implement AWS Config and Security Hub for continuous compliance.
  - Least Privilege: Regularly audit IAM policies using IAM Access Analyzer.
- Performance & Reliability (Tasks 3.3 & 3.4)
  - Bottleneck Analysis: Use CloudWatch Synthetics and X-Ray for deep-dive tracing.
  - Resilience: Implement Multi-AZ/Multi-Region failover for legacy single-node systems.
- Cost Optimization (Task 3.5)
  - Compute: Migrate from On-Demand to Savings Plans or Spot Instances (for stateless, fault-tolerant workloads).
  - Storage: Automate lifecycle policies for S3 (moving from Standard to IA or Glacier).
Visual Anchors
The Continuous Improvement Loop
Cost-Performance Trade-off Matrix
\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Performance};
  \draw[->] (0,0) -- (0,6) node[above] {Cost};
  \draw[thick, blue] (0.5,0.5) .. controls (1,4) and (4,5) .. (5.5,5.5);
  \node at (3,5) [align=center, font=\small] {Over-provisioned\\(Waste)};
  \node at (5,2) [align=center, font=\small] {Rightsized\\(Optimal)};
  \draw[dashed, red] (1,1) circle (0.5);
  \node at (2,0.5) [font=\tiny] {Under-provisioned};
\end{tikzpicture}
Definition-Example Pairs
- Concept: Post-Incident Analysis
  - Definition: A process to identify the root cause of an event and implement preventive measures.
  - Example: After an application outage due to a full disk, the architect creates an Amazon CloudWatch Alarm that triggers an AWS Lambda function to clear logs automatically when disk usage hits 80%.
- Concept: Storage Tiering
  - Definition: Moving data between different storage classes based on access frequency.
  - Example: Moving financial records older than 90 days from S3 Standard to S3 Glacier Deep Archive to save up to 95% in storage costs.
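The storage tiering example can be expressed as an S3 lifecycle rule. The sketch below only builds the configuration dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The prefix, rule ID format, and bucket name are hypothetical, and the API call itself is left as a comment so the block runs without AWS credentials.

```python
import json


def build_lifecycle_config(prefix: str, days: int, storage_class: str) -> dict:
    """Build a lifecycle configuration containing a single transition rule."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}-after-{days}d",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": storage_class}],
            }
        ]
    }


# Objects under the (hypothetical) prefix move to Glacier Deep Archive after 90 days.
config = build_lifecycle_config("financial-records/", 90, "DEEP_ARCHIVE")
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the dict could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=config)
```

Keeping the rule as data makes it easy to review in version control before it is applied to a production bucket.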
Worked Examples
Problem: Over-provisioned EC2 Fleet
Scenario: A company has a fleet of 50 m5.2xlarge instances running at an average CPU utilization of 10%. They are currently using On-Demand pricing.
Step-by-Step Optimization:
- Analyze: Use AWS Compute Optimizer to identify specific sizing recommendations.
- Rightsize: Downsize instances to m5.large (reducing cost by 75%).
- Select Pricing Model: Apply a Compute Savings Plan for the baseline usage (saving an additional 30% or more).
- Automate: Implement an Auto Scaling Group to handle peaks rather than keeping all 50 instances running 24/7.
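The combined effect of the rightsizing and pricing steps can be checked with simple arithmetic. In this sketch the 75% and 30% figures come from the steps above, while the hourly rate and the 730-hour month are illustrative assumptions, not current AWS prices. The Auto Scaling step is not modeled.

```python
HOURS_PER_MONTH = 730            # assumed average month length in hours
ASSUMED_2XLARGE_HOURLY = 0.384   # hypothetical On-Demand rate in USD; check current pricing


def monthly_cost(instances: int, hourly_rate: float, discount: float = 0.0) -> float:
    """Monthly cost of a fleet running 24/7 at the given rate and discount."""
    return instances * hourly_rate * HOURS_PER_MONTH * (1 - discount)


baseline = monthly_cost(50, ASSUMED_2XLARGE_HOURLY)
# Rightsizing to m5.large cuts the per-instance rate by 75%.
rightsized = monthly_cost(50, ASSUMED_2XLARGE_HOURLY * 0.25)
# A Savings Plan then takes a further 30% off the rightsized bill.
committed = monthly_cost(50, ASSUMED_2XLARGE_HOURLY * 0.25, discount=0.30)

total_saving = 1 - committed / baseline
print(f"baseline:   ${baseline:,.2f}")
print(f"rightsized: ${rightsized:,.2f}")
print(f"committed:  ${committed:,.2f}")
print(f"total saving: {total_saving:.1%}")
```

Because the discounts multiply (0.25 × 0.70 = 0.175 of the original bill), the combined saving is about 82.5% regardless of the hourly rate assumed.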
Checkpoint Questions
- What is the primary difference between a Runbook and a Playbook?
- Which AWS service would you use to find instances that are underutilized and could be downsized?
- How does Infrastructure as Code (IaC) contribute to reliability during an improvement cycle?
- Why are small, incremental changes preferred over large "big bang" updates in operational evolution?
Muddy Points & Cross-Refs
- Reserved Instances (RI) vs. Savings Plans: Students often confuse these. Savings Plans are generally more flexible (Compute Savings Plans apply across instance families, sizes, and Regions, and also cover Fargate and Lambda), while RIs are the older model but still relevant for specific services like RDS or Redshift.
- Operational Excellence vs. Reliability: These overlap. Remember: Operational Excellence is about the processes (how you work), while Reliability is about the workload behavior (how the system stays up).
- See Also: Domain 2: Design for New Solutions (to compare how to build right the first time vs. fixing existing issues).
Comparison Tables
Operational Documentation Comparison
| Feature | Runbook | Playbook |
|---|---|---|
| Primary Purpose | Execution of routine tasks | Investigation of issues |
| Trigger | Scheduled or requested | Unexpected failure/incident |
| Example | Weekly database maintenance | Response to a DDoS attack |
| Automation Goal | Fully automated (Run Command) | Semi-automated (Guidance + Tools) |
Pricing Strategy Comparison
| Model | Best For... | Typical Savings |
|---|---|---|
| On-Demand | Spiky, unpredictable workloads | 0% (Baseline) |
| Savings Plans | Steady-state compute across account | up to 72% (up to 66% for Compute Savings Plans) |
| Spot Instances | Stateless, fault-tolerant batch jobs | up to 90% |
| Reserved Instances | Databases and specific long-term apps | up to 72% |
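One way to reason about the table is a break-even check: a commitment is billed for every hour whether or not the capacity is used, while On-Demand is billed only for hours actually run. The sketch below normalizes the On-Demand rate to 1 and is a simplification under that assumption. It ignores how a commitment is shared across a fleet, and the discount values are illustrative.

```python
def cheaper_model(utilization: float, commitment_discount: float) -> str:
    """Compare relative hourly cost of On-Demand versus a commitment.

    utilization: fraction of hours the capacity is actually needed (0..1).
    commitment_discount: discount off the On-Demand rate (0..1).
    """
    on_demand_cost = utilization             # pay only for hours used
    committed_cost = 1 - commitment_discount  # pay for every hour at the discounted rate
    return "commitment" if committed_cost < on_demand_cost else "on-demand"


# With a 30% discount the break-even utilization is 70%.
print(cheaper_model(utilization=0.50, commitment_discount=0.30))
print(cheaper_model(utilization=0.90, commitment_discount=0.30))
```

The break-even point is a utilization of 1 minus the discount, which is why commitments suit steady-state baselines and On-Demand suits spiky workloads.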