Unit 3: Continuous Improvement for Existing Solutions — SAP-C02 Study Guide

This guide focuses on Domain 3 of the AWS Certified Solutions Architect - Professional (SAP-C02) exam, covering the strategies necessary to evolve and optimize existing cloud workloads across operational excellence, security, performance, reliability, and cost.

Learning Objectives

After studying this guide, you should be able to:

  • Design a strategy to improve operational excellence using runbooks and playbooks.
  • Evaluate and implement security improvements for existing production workloads.
  • Identify performance bottlenecks and propose architectural enhancements.
  • Enhance reliability through automation and post-incident analysis.
  • Execute cost optimization via rightsizing, storage tiering, and pricing model selection.

Key Terms & Glossary

  • Runbook: A documented procedure to achieve a specific, often routine, outcome (e.g., a deployment or a backup check).
  • Playbook: A documented process used to investigate and resolve unexpected issues or failures.
  • Infrastructure as Code (IaC): The practice of managing and provisioning infrastructure through machine-readable definition files (e.g., AWS CloudFormation, CDK).
  • Rightsizing: The process of matching instance types and sizes to your workload performance and capacity requirements at the lowest possible cost.
  • Drift Detection: The identification of changes in infrastructure that cause it to deviate from its intended state defined in IaC.
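Drift detection can be pictured as a diff between the declared state and the observed state. The following is a minimal sketch of that idea only: the `detect_drift` helper and the `template` / `deployed` dictionaries are hypothetical illustrations, not an AWS API. In practice a service such as CloudFormation drift detection performs this comparison for supported resources.

```python
from typing import Any


def detect_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {property: (desired_value, actual_value)} for every property that differs.

    A property that is absent on one side is reported with None on that side.
    """
    missing = object()  # sentinel marking "property not present"
    drift = {}
    for key in sorted(desired.keys() | actual.keys()):
        want = desired.get(key, missing)
        have = actual.get(key, missing)
        if want != have:
            drift[key] = (
                None if want is missing else want,
                None if have is missing else have,
            )
    return drift


# Hypothetical IaC definition versus what is actually running.
template = {"InstanceType": "m5.large", "Monitoring": True}
deployed = {"InstanceType": "m5.2xlarge", "Monitoring": True, "Tags": ["manual-fix"]}

for prop, (want, have) in detect_drift(template, deployed).items():
    print(f"{prop}: template={want!r} deployed={have!r}")
```

An empty result means the deployed resource still matches its definition; any entry is a candidate for either reverting the manual change or updating the template.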

The "Big Idea"

[!IMPORTANT] Cloud architecture is not a static event; it is a continuous lifecycle. The "Big Idea" here is that an architect's job only truly begins once a solution is live. By leveraging the feedback loops of monitoring and automated remediation, organizations move from "reactive firefighting" to "proactive evolution."

Formula / Concept Box

| Concept | Core Rule / Formula |
| --- | --- |
| Cost Optimization | $\text{Total Cost} = (\text{Unit Price} \times \text{Quantity}) - \text{Savings}$. Focus on reducing Quantity (rightsizing) and Unit Price (Savings Plans). |
| Availability Goal | $\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$. Professional focus: Minimize MTTR (Mean Time To Repair) via automation. |
| Data Transfer | Internal > Intra-region > Inter-region > Internet. Always architect to keep data as close to the compute as possible to minimize latency and cost. |
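To see why the availability formula rewards a shorter MTTR, here is a small sketch that evaluates it for two recovery times. The MTBF of 1,000 hours and both MTTR values are assumed figures chosen purely for illustration.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


mtbf = 1000.0  # assumed mean time between failures, in hours

# Assumed repair times: a manual recovery versus an automated one.
for label, mttr in [("manual recovery", 1.0), ("automated recovery", 0.1)]:
    a = availability(mtbf, mttr)
    downtime_hours = (1 - a) * HOURS_PER_YEAR
    print(f"{label}: {a:.4%} available, about {downtime_hours:.2f} h downtime per year")
```

With MTBF held constant, cutting MTTR by a factor of ten cuts expected downtime by roughly the same factor, which is why automation of recovery is emphasized.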

Hierarchical Outline

  1. Operational Excellence (Task 3.1)
    • Automation: Use Systems Manager and EventBridge for self-healing.
    • Documentation: Transition from manual wiki pages to Runbooks-as-Code.
  2. Security Improvement (Task 3.2)
    • Visibility: Implement AWS Config and Security Hub for continuous compliance.
    • Least Privilege: Regularly audit IAM policies using IAM Access Analyzer.
  3. Performance & Reliability (Tasks 3.3 & 3.4)
    • Bottleneck Analysis: Use CloudWatch Synthetics and X-Ray for deep-dive tracing.
    • Resilience: Implement Multi-AZ/Multi-Region failover for legacy single-node systems.
  4. Cost Optimization (Task 3.5)
    • Compute: Migrate from On-Demand to Savings Plans or Spot Instances (for stateless, fault-tolerant workloads).
    • Storage: Automate lifecycle policies for S3 (moving from Standard to IA or Glacier).
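As an illustration of the self-healing item in the outline, the sketch below shows an EventBridge event pattern that selects CloudWatch alarms entering the `ALARM` state, which could then be routed to a remediation target. The `matches` function is a hypothetical, simplified matcher written for this guide (exact-value matching only); real EventBridge matching supports more operators. The alarm name `disk-usage-high` is also an assumption.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified pattern matching: every key in the pattern must exist in the
    event, nested dicts are matched recursively, and each leaf value in the
    event must be one of the values listed in the pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


# Event pattern for a rule that fires when any CloudWatch alarm goes into ALARM.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

# Trimmed-down sample event; the alarm name is hypothetical.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "disk-usage-high", "state": {"value": "ALARM"}},
}

if matches(ALARM_PATTERN, sample_event):
    print("Rule would trigger the remediation target")
```

The pattern dictionary is the part that carries over to a real rule; the matcher exists only to make the selection logic visible.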

Visual Anchors

The Continuous Improvement Loop

[Diagram: The Continuous Improvement Loop (not available in this text version)]

Cost-Performance Trade-off Matrix

\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (6,0) node[right] {Performance};
  \draw[->] (0,0) -- (0,6) node[above] {Cost};
  \draw[thick, blue] (0.5,0.5) .. controls (1,4) and (4,5) .. (5.5,5.5);
  \node at (3,5) [align=center, font=\small] {Over-provisioned\\(Waste)};
  \node at (5,2) [align=center, font=\small] {Rightsized\\(Optimal)};
  \draw[dashed, red] (1,1) circle (0.5);
  \node at (2,0.5) [font=\tiny] {Under-provisioned};
\end{tikzpicture}

Definition-Example Pairs

  • Concept: Post-Incident Analysis
    • Definition: A process to identify the root cause of an event and implement preventive measures.
    • Example: After an application outage due to a full disk, the architect creates an Amazon CloudWatch Alarm that triggers an AWS Lambda function to clear logs automatically when disk usage hits 80%.
  • Concept: Storage Tiering
    • Definition: Moving data between different storage classes based on access frequency.
    • Example: Moving financial records older than 90 days from S3 Standard to S3 Glacier Deep Archive to save up to 95% in storage costs.
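The storage tiering example can be sketched as an S3 lifecycle rule. The block below only builds the configuration dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the `archive_rule` helper, the `financial-records/` prefix, and the bucket name in the comment are hypothetical. The per-GB prices are assumed list prices used only to check the "up to 95%" figure, and they vary by Region and over time.

```python
def archive_rule(prefix: str, days: int, storage_class: str = "DEEP_ARCHIVE") -> dict:
    """Build one lifecycle rule that transitions objects under `prefix`
    to `storage_class` once they are `days` old."""
    return {
        "ID": f"archive-{prefix.strip('/')}-after-{days}d",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


lifecycle_configuration = {"Rules": [archive_rule("financial-records/", 90)]}

# With boto3 installed and credentials configured, the dict could be applied as:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )

# Assumed storage prices in USD per GB-month (illustrative only).
STANDARD_PER_GB = 0.023
DEEP_ARCHIVE_PER_GB = 0.00099

price_reduction = 1 - DEEP_ARCHIVE_PER_GB / STANDARD_PER_GB
print(f"Storage price reduction: {price_reduction:.1%}")
```

The calculation covers storage price only. Retrieval fees, request charges, and the minimum storage duration for archive classes also affect the real saving, so the transition age should reflect how rarely the data is read.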

Worked Examples

Problem: Over-provisioned EC2 Fleet

Scenario: A company has a fleet of 50 m5.2xlarge instances running at an average CPU utilization of 10%. They are currently using On-Demand pricing.

Step-by-Step Optimization:

  1. Analyze: Use AWS Compute Optimizer to identify specific sizing recommendations.
  2. Rightsize: Downsize instances to m5.large (reducing cost by 75%).
  3. Select Pricing Model: Apply a Compute Savings Plan for the baseline usage (saving an additional 30% or more).
  4. Automate: Implement an Auto Scaling Group to handle peaks rather than keeping all 50 instances running 24/7.
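The arithmetic behind steps 2 and 3 can be checked with a short calculation. This is a sketch under stated assumptions: the hourly prices are assumed On-Demand rates for illustration, 730 is used as the number of hours in a month, and the 30% Savings Plan discount is the lower bound quoted in step 3. Step 4 (Auto Scaling) is not modeled.

```python
HOURS_PER_MONTH = 730

# Assumed On-Demand prices in USD per hour (illustrative; check current pricing).
PRICE_PER_HOUR = {"m5.2xlarge": 0.384, "m5.large": 0.096}


def monthly_cost(instance_type: str, count: int, discount: float = 0.0) -> float:
    """Monthly cost of `count` instances running 24/7, after a fractional discount."""
    return PRICE_PER_HOUR[instance_type] * count * HOURS_PER_MONTH * (1 - discount)


before = monthly_cost("m5.2xlarge", 50)
rightsized = monthly_cost("m5.large", 50)
with_savings_plan = monthly_cost("m5.large", 50, discount=0.30)

print(f"Before:            ${before:,.2f}/month")
print(f"After rightsizing: ${rightsized:,.2f}/month ({1 - rightsized / before:.0%} lower)")
print(f"With Savings Plan: ${with_savings_plan:,.2f}/month ({1 - with_savings_plan / before:.1%} lower overall)")
```

Under these assumptions the two discounts compound: rightsizing removes 75% of the bill, and the Savings Plan then removes 30% of what remains, for an overall reduction of about 82.5%.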

Checkpoint Questions

  1. What is the primary difference between a Runbook and a Playbook?
  2. Which AWS service would you use to find instances that are underutilized and could be downsized?
  3. How does Infrastructure as Code (IaC) contribute to reliability during an improvement cycle?
  4. Why are small, incremental changes preferred over large "big bang" updates in operational evolution?

Muddy Points & Cross-Refs

  • Reserved Instances (RI) vs. Savings Plans: Students often confuse these. Compute Savings Plans are generally more flexible (they apply across instance families, sizes, and Regions, whereas EC2 Instance Savings Plans are tied to one instance family in one Region). RIs are the older model but remain relevant for specific services like RDS or Redshift.
  • Operational Excellence vs. Reliability: These overlap. Remember: Operational Excellence is about the processes (how you work), while Reliability is about the workload behavior (how the system stays up).
  • See Also: Domain 2: Design for New Solutions (to compare how to build right the first time vs. fixing existing issues).

Comparison Tables

Operational Documentation Comparison

| Feature | Runbook | Playbook |
| --- | --- | --- |
| Primary Purpose | Execution of routine tasks | Investigation of issues |
| Trigger | Scheduled or requested | Unexpected failure/incident |
| Example | Weekly database maintenance | Response to a DDoS attack |
| Automation Goal | Fully automated (Run Command) | Semi-automated (Guidance + Tools) |

Pricing Strategy Comparison

| Model | Best For... | Savings vs. On-Demand |
| --- | --- | --- |
| On-Demand | Spiky, unpredictable workloads | 0% (Baseline) |
| Savings Plans | Steady-state compute across account | up to 72% |
| Spot Instances | Stateless, fault-tolerant batch jobs | up to 90% |
| Reserved Instances | Databases and specific long-term apps | up to 72% |
