
Free AWS Certified CloudOps Engineer - Associate (SOA-C03) Study Resources

Run production AWS with confidence: the operations certification for engineers who keep cloud systems reliable, secure, and cost-efficient. The material spans monitoring and remediation, reliability and business continuity, deployment and automation, security, networking, and cost optimization, and includes an AI tutor and real-format practice questions. It is aimed at SysOps and CloudOps engineers operating live AWS workloads.

840 Practice Questions
Mock Exams
148 Study Notes
1,200 Flashcard Decks
Source Materials


AWS Certified CloudOps Engineer - Associate (SOA-C03) Study Notes & Guides

148 AI-generated study notes covering the full AWS Certified CloudOps Engineer - Associate (SOA-C03) curriculum. Showing 10 complete guides below.

Curriculum Overview (820 words)

Curriculum Overview: Advanced Observability Services

Advanced Observability Services


[!NOTE] Course Alignment: This curriculum overview aligns closely with Domain 1 of the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam: Monitoring, Logging, Analysis, Remediation, and Performance Optimization.

Welcome to the Advanced Observability Services curriculum. As cloud environments transition toward modern, containerized, and microservice-driven architectures, traditional monitoring is no longer sufficient. This curriculum bridges the gap between basic resource checks and full-stack, automated observability.

Prerequisites

Before diving into Advanced Observability Services, learners must establish a solid baseline in cloud operations and AWS fundamentals. You should be comfortable with the following:

AWS Management & Core Services: Proficiency in navigating the AWS Management Console and executing commands via the AWS CLI. Familiarity with EC2, VPC, and IAM basics.
Basic CloudWatch: Prior experience setting up simple CloudWatch alarms (e.g., CPU Utilization) and viewing basic metrics.
Container Fundamentals: A conceptual understanding of Docker containers, Amazon Elastic Container Service (ECS), and Amazon Elastic Kubernetes Service (EKS).
JSON & Query Syntax: Basic ability to read JSON responses and familiarity with querying structures (like JMESPath).

Module Breakdown

This curriculum is structured to take you from foundational centralized logging up to highly automated, multi-account observability platforms.

Module	Title	Focus Area	Difficulty
1	Centralized Logging & Analysis	CloudTrail, CloudWatch Logs Insights, log aggregation	Beginner
2	Advanced CloudWatch Metrics	Custom metrics, anomaly detection, cross-account dashboards	Intermediate
3	Container & OS-Level Observability	CloudWatch Agent, EC2, ECS, EKS metrics	Intermediate
4	Open-Source Monitoring Integrations	Amazon Managed Service for Prometheus & Grafana	Advanced
5	Event-Driven Remediation	EventBridge, Lambda, SSM Automation Runbooks	Expert

Observability Flow

Figure 1: Observability flow (Mermaid diagram)

Learning Objectives per Module

By progressing through the curriculum, learners will achieve specific, testable outcomes critical to the role of a CloudOps Engineer.

Module 1: Centralized Logging & Analysis

Audit effectively: Configure AWS CloudTrail for comprehensive account auditing and data event tracking.
Query at scale: Write queries in the purpose-built CloudWatch Logs Insights query language to perform complex searches across application and system logs.

Module 2: Advanced CloudWatch Metrics

Implement intelligent alerting: Set up CloudWatch alarms featuring static and dynamic thresholds (anomaly detection).
Centralize visibility: Design and deploy customizable, shareable CloudWatch Dashboards that aggregate data across multiple AWS Regions and accounts.
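CloudWatch anomaly detection builds an expected band for a metric from a trained model of its history and alarms when values leave that band. The sketch below is a deliberately simplified stand-in for that idea, assuming a band of mean ± k standard deviations over a trailing window; it is not CloudWatch's actual model, and `band_threshold_breaches`, `window`, and `k` are hypothetical names.

```python
from statistics import mean, stdev

def band_threshold_breaches(values, window=5, k=2.0):
    """Return indexes of points outside mean ± k*stdev of the preceding window.

    Simplified illustration of a dynamic threshold; CloudWatch anomaly
    detection uses its own trained model, not this formula.
    """
    breaches = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        centre = mean(history)
        spread = stdev(history)
        lower, upper = centre - k * spread, centre + k * spread
        if not lower <= values[i] <= upper:
            breaches.append(i)
    return breaches

# A static threshold of 80% would never fire here, but the jump to 70 is abnormal.
cpu = [40, 41, 39, 42, 40, 41, 70, 41]
print(band_threshold_breaches(cpu))  # [6]
```

The point of the comparison: a static threshold encodes one fixed number, while a band adapts to what is normal for the metric, so it can flag a jump that stays under any reasonable static limit.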

Module 3: Container & OS-Level Observability

Deepen system monitoring: Configure and manage the CloudWatch agent to collect deep system-level metrics and internal logs from EC2 instances.
Observe modern workloads: Integrate monitoring agents within Amazon ECS and Amazon EKS clusters to track task and pod health.
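The CloudWatch agent is driven by a JSON configuration file. As a sketch, the snippet below builds a minimal configuration that collects memory and disk utilization plus one log file, assuming the agent's `metrics`/`metrics_collected` and `logs`/`logs_collected` layout; the log group name is hypothetical, and the keys should be checked against the agent's configuration reference before use.

```python
import json

def build_agent_config(log_file, log_group):
    """Build a minimal CloudWatch agent config: memory, disk, and one log file."""
    return {
        "metrics": {
            "namespace": "CWAgent",
            # Tag each data point with the instance ID so metrics stay per-instance.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["*"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": log_file,
                            "log_group_name": log_group,
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        },
    }

config = build_agent_config("/var/log/messages", "hypothetical-app-system-logs")
print(json.dumps(config, indent=2))
```

On an instance, the resulting file is typically loaded with `amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:<path>`; across a fleet, the same JSON is commonly stored in Parameter Store and applied through Systems Manager.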

Module 4: Open-Source Monitoring Integrations

Adopt open standards: Explain the architecture and benefits of Amazon Managed Service for Prometheus.
Visualize beautifully: Identify use cases and configure Amazon Managed Grafana to create rich, interactive visual dashboards compatible with open-source tools.

Module 5: Event-Driven Remediation

Automate responses: Configure Amazon EventBridge rules to trigger remediation actions automatically upon state changes.
Deploy runbooks: Execute predefined and custom Systems Manager (SSM) Automation runbooks to self-heal infrastructure without human intervention.

Success Metrics

How will you know you have mastered the Advanced Observability Services curriculum? Your success will be measured by your ability to:

Deploy the CloudWatch Agent Programmatically: Successfully use SSM or User Data to install and configure the CloudWatch agent across a fleet of simulated EC2 and EKS nodes.
Resolve an Incident Using Insights: Given a simulated application failure, identify the root cause within 5 minutes using CloudWatch Logs Insights and VPC Flow Logs.
Create a Multi-Account Grafana Dashboard: Successfully link metrics from at least two different AWS accounts into a single Managed Grafana visualization.
Achieve Zero-Touch Remediation: Build an EventBridge rule that detects a stopped EC2 instance or a full EBS volume, automatically triggering an SSM runbook to remediate the issue.
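An EventBridge rule fires when an incoming event matches its JSON event pattern. The pattern below targets EC2 instances entering the `stopped` state; the `matches` function is a hypothetical helper that implements only the exact-value subset of EventBridge matching (each leaf is a list of allowed values), not prefix, numeric, or other content filters.

```python
def matches(pattern, event):
    """Return True if event satisfies pattern (exact-value subset of EventBridge matching).

    Each pattern leaf is a list of allowed values; nested dicts recurse.
    Content filters such as prefix or numeric matching are not handled.
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True

stopped_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}

# Hypothetical event; the instance ID is made up.
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(matches(stopped_pattern, sample_event))  # True
```

In a real deployment, the rule's target would be an SSM Automation runbook (for example, `AWS-StartEC2Instance`) that receives the instance ID from the event.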

Real-World Application

In modern enterprise environments, downtime translates directly into lost revenue and damaged reputation. Traditional monitoring focuses on "what is broken?" (e.g., a server is down). Advanced observability answers "why is it broken, and how can we prevent it?"

Scenario: Imagine working for a global e-commerce platform during a flash sale. An unexpected spike in traffic causes memory exhaustion on several backend containers.

Without advanced observability: Customers experience timeout errors. The operations team spends 45 minutes manually SSHing into servers to read logs and restart services.
With advanced observability: The CloudWatch agent reports container memory metrics, and an anomaly-detection alarm flags the spike within minutes. Metrics are pushed to a unified Grafana dashboard. EventBridge receives the alarm's state-change event and triggers an AWS Lambda function that automatically scales up the Amazon ECS cluster and cycles the unhealthy containers, resolving the issue before most end users notice.

Real-World Observability Architecture

Figure 2: Real-world observability architecture (Mermaid diagram)

[!TIP] Career Impact: Mastering these tools shifts your role from reactive administrator (fixing broken things) to proactive engineer (designing self-healing systems), a highly sought-after skill in DevOps and Site Reliability Engineering (SRE) roles.

Curriculum Overview (811 words)

Amazon CloudWatch Metrics and Alarms: Curriculum Overview

Amazon CloudWatch Metrics and Alarms


[!NOTE] This curriculum aligns with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam domain: Monitoring, Logging, Analysis, Remediation, and Performance Optimization.

Prerequisites

Before embarking on this curriculum, learners must possess a foundational understanding of the AWS ecosystem to ensure they can fully grasp advanced monitoring concepts.

Compute Services Fluency: Basic understanding of Amazon EC2, AWS Lambda, Amazon ECS, and Amazon EKS.
Operational Foundations: Proficiency using the AWS Management Console and the AWS Command Line Interface (CLI).
IAM Principles: Knowledge of Identity and Access Management (IAM) roles and policies, specifically the principle of least privilege required for resource monitoring.
Networking Basics: Understanding of VPCs, subnets, and security groups to comprehend network-level metrics.

Module Breakdown

This curriculum is structured to take you from foundational monitoring concepts to advanced, automated remediation strategies.

Figure 1: Mermaid diagram

Module	Core Focus	Difficulty	Estimated Time
1. Fundamentals	Metrics, Namespaces, Dashboards	Beginner	2 Hours
2. CW Agent	EC2/Container Logs & Custom Metrics	Intermediate	3 Hours
3. Alarms & SNS	Static/Dynamic Thresholds, Composite Alarms	Intermediate	3 Hours
4. Remediation	EventBridge, SSM Automation Runbooks	Advanced	4 Hours

Learning Objectives per Module

Module 1: CloudWatch Fundamentals

Analyze Standard Metrics: Interpret default metrics reported by AWS services at 1-minute and 5-minute intervals (e.g., the Lambda metrics Invocations, Duration, Errors, and Throttles).
Implement Custom Metrics: Define and publish custom business or application-level metrics to specific CloudWatch Namespaces.
Design Dashboards: Create customizable, cross-region, and cross-account CloudWatch dashboards to visualize health across the entire AWS infrastructure.

Module 2: Advanced Collection & The CW Agent

Deploy the CloudWatch Agent: Configure and manage the CW Agent on EC2 instances to collect granular OS-level metrics (e.g., memory utilization, disk space) and application logs.
Monitor Containers: Implement monitoring for Amazon Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS) clusters.
Log Analytics: Utilize CloudWatch Logs Insights to query log streams (e.g., filtering Lambda log streams for RequestID, billed duration, and memory size).
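Each Lambda invocation ends with a `REPORT` log line. Logs Insights discovers its fields server-side, so a query along the lines of `filter @type = "REPORT" | fields @requestId, @billedDuration, @memorySize` returns them directly. To show what those fields correspond to, the sketch below parses a single `REPORT` line locally; the regex and the sample request ID are assumptions for illustration.

```python
import re

REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)"
    r".*?Billed Duration: (?P<billed_ms>\d+) ms"
    r".*?Memory Size: (?P<memory_mb>\d+) MB"
)

def parse_report(line):
    """Extract request ID, billed duration (ms), and memory size (MB) from a REPORT line."""
    m = REPORT_RE.search(line)
    if m is None:
        return None  # not a REPORT line
    return {
        "request_id": m.group("request_id"),
        "billed_ms": int(m.group("billed_ms")),
        "memory_mb": int(m.group("memory_mb")),
    }

# Hypothetical REPORT line (tab-separated, as Lambda writes it).
line = ("REPORT RequestId: 8f5e1c2a-1234-4abc-9def-0123456789ab\t"
        "Duration: 102.25 ms\tBilled Duration: 103 ms\t"
        "Memory Size: 128 MB\tMax Memory Used: 71 MB")
print(parse_report(line))
```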

Module 3: Alarms, Thresholds, & Notifications

Configure CloudWatch Alarms: Set up static and anomaly-detection (dynamic) thresholds to monitor resource health.
Build Composite Alarms: Combine multiple alarms to reduce alarm fatigue and trigger actions only when specific multi-condition criteria are met.
Implement Notifications: Configure alarms to push alerts to Amazon Simple Notification Service (SNS) topics for email, SMS, or third-party ticketing integration.
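A composite alarm evaluates a rule expression over its child alarms, built from `ALARM()`, `OK()`, and `INSUFFICIENT_DATA()` and joined with `AND`, `OR`, and `NOT`, for example `ALARM(HighCPU) AND ALARM(HighLatency)`. The sketch below mimics only that AND case locally to show why it reduces noise; the alarm names are hypothetical, and this is not a parser for the real rule language.

```python
def composite_state(child_states, required):
    """Return "ALARM" only if every alarm named in `required` is in ALARM, else "OK".

    Mimics a rule such as ALARM(HighCPU) AND ALARM(HighLatency); real composite
    alarms also track INSUFFICIENT_DATA and support OR and NOT.
    """
    if all(child_states.get(name) == "ALARM" for name in required):
        return "ALARM"
    return "OK"

rule = ["HighCPU", "HighLatency"]  # hypothetical child alarm names
print(composite_state({"HighCPU": "ALARM", "HighLatency": "OK"}, rule))     # OK
print(composite_state({"HighCPU": "ALARM", "HighLatency": "ALARM"}, rule))  # ALARM
```

A CPU spike alone no longer pages anyone; the notification fires only when high CPU coincides with user-visible latency.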

[!TIP] Remember the key Lambda metrics that typically drive alarms: Errors (function or runtime failures), Duration (watch high percentiles such as p95 or p99, which capture the slowest 1-5% of invocations), and Throttles (invocations rejected because a concurrency limit was reached).
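Percentiles matter because an average can hide a slow tail. The sketch below uses the nearest-rank method on hypothetical Duration samples; CloudWatch computes percentile statistics server-side, and its exact method may differ from this one.

```python
import math

def nearest_rank_percentile(samples, pct):
    """Return the nearest-rank percentile of samples for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical Duration samples (ms): mostly ~100 ms with one slow outlier.
durations_ms = [120, 95, 110, 105, 2300, 98, 101, 99, 115, 97]
print(sum(durations_ms) / len(durations_ms))      # 324.0 (the average)
print(nearest_rank_percentile(durations_ms, 50))  # 101
print(nearest_rank_percentile(durations_ms, 95))  # 2300
```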

Module 4: Automated Remediation & Operations

Event-Driven Architectures: Use Amazon EventBridge to route state changes and enrich events.
Automate Remediation: Trigger custom or predefined AWS Systems Manager (SSM) Automation runbooks to self-heal infrastructure.
Auto Scaling Integration: Trigger EC2 Auto Scaling policies or Amazon Aurora Auto Scaling (adding Aurora Replicas) based on sustained alarm states.

Core Formula: Calculating Metric Impact

Understanding the mathematical relationship of metrics is crucial for setting effective alarms. For example, to calculate the application error rate for AWS Lambda:

$\text{Error Rate (\%)} = \left( \frac{\text{Total Errors}}{\text{Total Invocations}} \right) \times 100$
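The same formula as a small function, using the Sum statistic of `Errors` and `Invocations` over one period. In CloudWatch itself the ratio is usually expressed as a metric math expression such as `100*(errors/invocations)`, where `errors` and `invocations` are metric IDs you assign; the function name below is hypothetical.

```python
def error_rate_percent(errors, invocations):
    """Lambda error rate as a percentage of invocations (0 when there were no invocations)."""
    if invocations == 0:
        return 0.0
    return errors * 100 / invocations

# Hypothetical Sum values for one period: 12 errors out of 4,000 invocations.
print(error_rate_percent(12, 4000))  # 0.3
```

Guarding the zero-invocation case avoids a division error during idle periods, which is also why such alarms often set an explicit missing-data treatment.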

Success Metrics

How do you know you have mastered this curriculum? A successful candidate will be able to demonstrate the following hands-on capabilities:

Independent Remediation: Successfully configure an alarm that detects high CPU utilization on an EC2 instance, triggers EventBridge, and executes an SSM runbook to automatically restart the instance.
Visibility Architecture: Build a unified CloudWatch Dashboard that displays custom metrics, Lambda error rates, and EC2 memory utilization in a single pane of glass.
Troubleshooting Prowess: Given a simulated Lambda throttling event, successfully query CloudWatch Logs Insights to isolate the affected RequestIDs and identify the capacity constraint.
Cost-Aware Monitoring: Ensure custom metrics and extensive log ingestion are optimized to prevent unnecessary AWS spend.

Real-World Application

In modern Cloud Operations (CloudOps), monitoring is not just about watching graphs; it is about building self-healing systems.

Imagine a scenario where an e-commerce platform goes viral. Suddenly, your AWS Lambda functions experience a 500% spike in traffic. Without proper monitoring, your functions will throttle silently, leading to a degraded customer experience and lost revenue.

By applying the concepts in this curriculum, you establish a resilient architecture:

Figure 2: Mermaid diagram

Mastering CloudWatch Metrics and Alarms empowers you to transition from a reactive administrator (putting out fires) to a proactive CloudOps Engineer (preventing the fires from starting). This is a critical skill set for maintaining the Operational Excellence and Reliability pillars of the AWS Well-Architected Framework.

Curriculum Overview (810 words)

Curriculum Overview: Amazon EBS Performance, Troubleshooting, and Cost Optimization

Analyze Amazon Elastic Block Store (Amazon EBS) performance metrics, troubleshoot issues, and optimize volume types to improve performance and reduce cost


[!NOTE] Target Audience: SysOps Administrators, Cloud Operations Engineers, and candidates preparing for the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam. Focus Area: Task 1.3.2 - Analyze Amazon Elastic Block Store (Amazon EBS) performance metrics, troubleshoot issues, and optimize volume types to improve performance and reduce cost.

Amazon Elastic Block Store (EBS) provides block-level storage volumes for use with EC2 instances. In real-world cloud operations, ensuring these volumes are highly performant and cost-effective is a critical day-to-day responsibility. This curriculum outlines the structured learning path to mastering EBS performance monitoring, troubleshooting bottlenecks, and implementing right-sizing strategies.

Prerequisites

Before diving into this curriculum, learners should have a solid foundation in the following areas:

Cloud Computing Fundamentals: Understanding of virtualization, guest operating systems, and basic cloud economics.
AWS EC2 Basics: Experience deploying, stopping, starting, and connecting to Amazon EC2 instances.
Storage Concepts: Differentiating between block storage (EBS), object storage (S3), and file storage (EFS).
AWS CloudWatch Basics: Familiarity with viewing metrics, creating simple alarms, and navigating the CloudWatch console.
CLI / IAM Setup: Access to an AWS account with necessary IAM permissions to create, modify, and monitor EC2 instances and EBS volumes.

Module Breakdown

This curriculum is divided into four progressive modules, moving from foundational architecture to advanced troubleshooting and cost optimization.

Module	Title	Difficulty	Est. Time
Module 1	EBS Architecture & Volume Profiles	Beginner	1.5 Hours
Module 2	Monitoring EBS with CloudWatch	Intermediate	2.0 Hours
Module 3	Troubleshooting Performance Bottlenecks	Advanced	2.5 Hours
Module 4	Cost & Performance Optimization Strategies	Intermediate	2.0 Hours

Module 1 Deep-Dive

Focuses on the fundamentals of block storage, distinguishing between SSD-backed (gp2, gp3, io1, io2) and HDD-backed (st1, sc1) volumes. Learners will explore baseline performance metrics, Input/Output Operations Per Second (IOPS), and throughput ceilings.

Module 2 Deep-Dive

Centers on Amazon CloudWatch metrics specifically for EBS. Key topics include understanding VolumeQueueLength, BurstBalance, VolumeReadBytes, and VolumeWriteOps.

Learning Objectives per Module

Module 1: EBS Architecture & Volume Profiles

Categorize the eight different Amazon EBS volume types based on their underlying hardware (SSD vs. HDD) and ideal use cases.
Explain the concept of baseline IOPS versus burstable IOPS.
Calculate expected performance using standard AWS formulas (e.g., $IOPS = \frac{Throughput}{I/O\ Size}$).
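The IOPS/throughput relationship is easiest to internalize with units attached. The sketch below assumes throughput in MiB/s and I/O size in KiB (hence the factor of 1,024); the function names are hypothetical.

```python
def iops_from_throughput(throughput_mib_s, io_size_kib):
    """IOPS = throughput / I/O size (throughput in MiB/s, I/O size in KiB)."""
    return throughput_mib_s * 1024 / io_size_kib

def throughput_from_iops(iops, io_size_kib):
    """Throughput in MiB/s = IOPS x I/O size (I/O size in KiB)."""
    return iops * io_size_kib / 1024

print(iops_from_throughput(250, 256))  # 1000.0
print(throughput_from_iops(3000, 16))  # 46.875
```

Applied to a gp3 volume's baseline of 3,000 IOPS and 125 MiB/s: with 16 KiB I/O, 3,000 IOPS moves only about 47 MiB/s, so IOPS is the limit; with 256 KiB I/O, 125 MiB/s is reached at 500 IOPS, so throughput is the limit.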

Module 2: Monitoring EBS with CloudWatch

Interpret core CloudWatch metrics (VolumeReadOps, VolumeWriteOps, VolumeReadBytes, VolumeWriteBytes).
Analyze the VolumeQueueLength metric to distinguish between normal operations and potential latency issues.
Configure baseline performance alerts using CloudWatch Alarms to proactively catch BurstBalance depletion.

Module 3: Troubleshooting Performance Bottlenecks

Identify when an EC2 instance's EBS bandwidth limit, rather than the volume itself, is throttling EBS performance.
Enable and validate EBS-Optimization on compatible EC2 instance types.
Diagnose initialization latency on volumes restored from snapshots and mitigate it using Fast Snapshot Restore (FSR) or by reading every block once to initialize the volume (for example, with `dd` or `fio`).

Module 4: Cost & Performance Optimization Strategies

Execute online volume modifications (Elastic Volumes) to upgrade or downgrade volume types without downtime.
Right-size provisioned IOPS based on historical CloudWatch data to prevent over-provisioning.
Design tag-based Amazon Data Lifecycle Manager policies to automate snapshot creation and retention, and identify orphaned (unattached) volumes and stale snapshots for cleanup.
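Finding orphaned volumes is a filtering problem. The sample dictionary below mirrors the `Volumes` list of an EC2 DescribeVolumes response, where `State` is `available` for a volume that is not attached to any instance; the volume IDs are hypothetical, and `find_unattached_volumes` is an illustrative helper, not an AWS API.

```python
def find_unattached_volumes(describe_volumes_response):
    """Return IDs of volumes in the 'available' (unattached) state."""
    return [
        vol["VolumeId"]
        for vol in describe_volumes_response.get("Volumes", [])
        if vol.get("State") == "available"
    ]

# Hypothetical response shape (IDs and sizes made up).
sample = {
    "Volumes": [
        {"VolumeId": "vol-0aaa", "State": "in-use", "Size": 100, "VolumeType": "gp3"},
        {"VolumeId": "vol-0bbb", "State": "available", "Size": 500, "VolumeType": "io1"},
    ]
}
print(find_unattached_volumes(sample))  # ['vol-0bbb']
```

With boto3, the same filtering can be pushed server-side by passing `Filters=[{"Name": "status", "Values": ["available"]}]` to `describe_volumes`, which is preferable for large accounts; results are paginated either way.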

Success Metrics

How will you know you have mastered this curriculum? Upon completion, learners should be able to consistently demonstrate the following:

Metric Interpretation: Given a CloudWatch graph showing depleted BurstBalance and high VolumeQueueLength, correctly diagnose the bottleneck and propose the optimal volume upgrade.
Cost Reduction: Successfully identify over-provisioned io1/io2 volumes and transition them to gp3 while maintaining required IOPS, calculating the monthly cost savings.
Architectural Alignment: Match specific application workloads (e.g., transactional databases vs. big data log processing) to the correct EBS volume type with 100% accuracy.

EBS Optimization Lifecycle

Figure 1: EBS optimization lifecycle (Mermaid diagram)

Real-World Application

Why does mastering EBS matter in a SysOps or Cloud Engineering career?

Scenario: The "Slow" Production Database

Imagine you are an on-call SysOps Administrator. Users report that the flagship e-commerce application is timing out. You check the EC2 dashboard and see CPU and memory are within normal limits.

By applying the skills from this curriculum, you dive into CloudWatch and observe that the VolumeQueueLength for the database's EBS volume is skyrocketing, and the BurstBalance has hit 0%.

Because you understand EBS performance, you realize the current gp2 volume's burst bucket is depleted. You quickly use the Elastic Volumes feature to modify the volume to gp3, explicitly provisioning higher baseline IOPS. The application recovers seamlessly with no downtime.

Diagnostic Decision Tree

Figure 2: Diagnostic decision tree (Mermaid diagram)

[!IMPORTANT] Cost Implication: Blindly throwing higher-tier volumes (like io2) at a performance problem is an easy but expensive fix. A skilled CloudOps engineer uses metrics like VolumeReadBytes and VolumeWriteBytes to determine the actual required I/O size, ensuring the company only pays for the performance it genuinely needs.
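The actual I/O size follows from two CloudWatch metrics: average I/O size = VolumeReadBytes / VolumeReadOps (and likewise for writes), using the Sum statistic over the same period. A minimal sketch with hypothetical numbers:

```python
def average_io_size_kib(bytes_sum, ops_sum):
    """Average I/O size in KiB from the Sum statistics of *Bytes and *Ops over one period."""
    if ops_sum == 0:
        return 0.0
    return bytes_sum / ops_sum / 1024

# Hypothetical 5-minute Sum values for VolumeReadBytes and VolumeReadOps.
print(average_io_size_kib(262_144_000, 1_000))  # 256.0
```

Large average I/O sizes point toward throughput-oriented provisioning (or an st1 volume for sequential workloads). Small sizes point toward IOPS.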

Resource Links

To supplement this curriculum, learners are encouraged to reference:

AWS Documentation: Amazon EBS Volume Types
AWS Documentation: Amazon CloudWatch Metrics for Amazon EBS
AWS CLI Reference: aws ec2 modify-volume command specification.

Curriculum Overview (878 words)

Curriculum Overview: Amazon EBS Performance, Troubleshooting, and Optimization

Analyze Amazon Elastic Block Store (Amazon EBS) performance metrics, troubleshoot issues, and optimize volume types to improve performance and reduce cost


Welcome to the comprehensive curriculum for analyzing, troubleshooting, and optimizing Amazon Elastic Block Store (Amazon EBS). This course track aligns with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam objectives (Task 1.3.2) and focuses on ensuring block storage architectures are performant, reliable, and cost-effective.

Prerequisites

Before diving into EBS performance tuning and troubleshooting, learners must have a foundational understanding of the following concepts:

Cloud Computing Basics: Familiarity with the AWS Well-Architected Framework, specifically the Performance Efficiency and Cost Optimization pillars.
Amazon EC2 Fundamentals: Understanding of the EC2 instance lifecycle, how instances attach to storage, and basic network traffic concepts.
Storage Paradigms: Knowledge of raw, unformatted block storage versus file and object storage, and why block storage is preferred for databases and boot volumes.
AWS Management Tools: Basic proficiency navigating the AWS Management Console and utilizing the AWS CLI for querying resources.

Module Breakdown

This curriculum is structured into four progressive modules, transitioning from foundational block storage concepts to advanced troubleshooting and optimization techniques.

Module	Title	Difficulty	Core Focus
Module 1	EBS Architecture & Volume Types	Beginner	Storage classes, IOPS vs. Throughput, Pricing models
Module 2	Monitoring EBS with CloudWatch	Intermediate	Key metrics (`BurstBalance`, `VolumeQueueLength`)
Module 3	Troubleshooting Performance Issues	Advanced	Identifying bottlenecks, network contention, and snapshot latency
Module 4	Cost & Performance Optimization	Advanced	Rightsizing, EBS-Optimized instances, Fast Snapshot Restore

[!NOTE] The modules are designed to be taken sequentially, as the optimization techniques in Module 4 heavily rely on the metric analysis skills developed in Module 2.

Learning Objectives per Module

Module 1: EBS Architecture & Volume Types

Differentiate between the eight different Amazon EBS volume types (e.g., gp2, gp3, io1, io2, st1, sc1).
Identify workload characteristics to determine if an application is transaction-intensive (requires high IOPS) or throughput-intensive (requires high MB/s).
Evaluate the pricing models associated with storage size versus provisioned performance.

Module 2: Monitoring EBS with CloudWatch

Define and track critical EBS CloudWatch metrics, including VolumeReadBytes, VolumeWriteBytes, VolumeReadOps, and VolumeWriteOps.
Analyze VolumeQueueLength to determine the number of pending I/O requests and assess host-to-EBS network link health.
Monitor BurstBalance for gp2, st1, and sc1 volumes to predict and alert on performance throttling.

Deeper Dive into Burst Balance

Certain volume types (gp2, st1, and sc1) operate on a burst bucket model. They accrue credits while running below their baseline and spend them when demand exceeds the baseline. If the BurstBalance metric reaches 0%, the volume is throttled to its baseline performance level, causing significant application latency.
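For gp2, AWS documents the burst duration as the credit balance divided by (burst IOPS minus baseline IOPS), with a baseline of 3 IOPS per GiB (minimum 100), a burst rate of 3,000 IOPS, and a full bucket of 5.4 million I/O credits. The sketch below assumes a full bucket and a workload driving exactly 3,000 IOPS; the function names are hypothetical.

```python
GP2_MAX_CREDITS = 5_400_000  # I/O credits in a full gp2 burst bucket
GP2_BURST_IOPS = 3_000

def gp2_baseline_iops(size_gib):
    """gp2 baseline: 3 IOPS per GiB, minimum 100, maximum 16,000."""
    return min(max(3 * size_gib, 100), 16_000)

def gp2_burst_seconds(size_gib, credits=GP2_MAX_CREDITS):
    """Seconds a gp2 volume can sustain 3,000 IOPS before BurstBalance hits 0%.

    Returns None when the baseline already meets or exceeds the burst rate.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    return credits / (GP2_BURST_IOPS - baseline)

print(gp2_burst_seconds(100))   # 2000.0
print(gp2_burst_seconds(500))   # 3600.0
print(gp2_burst_seconds(1000))  # None
```

Because BurstBalance is a percentage of the full bucket, a partially drained volume can be modeled by passing `credits=GP2_MAX_CREDITS * burst_balance / 100`.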

Module 3: Troubleshooting Performance Issues

Diagnose I/O bottlenecks by correlating VolumeQueueLength with operating system-level metrics.
Identify the "latency penalty" associated with initializing volumes from EBS Snapshots.
Distinguish between EBS volume limits and EC2 instance-level bandwidth limits.

Module 4: Cost & Performance Optimization

Enable and configure EBS-optimization on supported Amazon EC2 instances to separate storage traffic from standard network traffic.
Implement Fast Snapshot Restore (FSR) to bypass initialization latency for critical recovery operations.
Rightsize volume I/O and capacity based on historical CloudWatch data to eliminate over-provisioning.

Visual Anchors

Workload to Volume Type Decision Matrix

Understanding how to map workload requirements to the correct volume type is a critical SysOps skill. Use this decision tree to optimize both performance and cost.

Figure 1: Workload-to-volume-type decision matrix (Mermaid diagram)

Burst Balance Depletion Over Time

This diagram illustrates how an intensive workload depletes the burst credit balance of a gp2 volume over time, eventually leading to performance throttling.

Figure 2: Burst balance depletion over time (TikZ diagram)

Success Metrics

How do you know you have mastered this curriculum? You will be able to successfully:

Metric Interpretation: Look at a CloudWatch dashboard showing high VolumeQueueLength and low BurstBalance and immediately diagnose an under-provisioned gp2 volume.
Cost Reduction: Audit an AWS account using Cost Explorer and identify oversized provisioned IOPS (io1/io2) volumes that can be safely downgraded to gp3 based on historical usage metrics.
Architectural Optimization: Successfully provision an EC2 instance with EBS-optimization enabled, ensuring that standard network traffic does not contend with storage I/O.
Disaster Recovery SLA Compliance: Implement Fast Snapshot Restore to ensure an initialized volume is ready for production immediately, meeting aggressive RTO (Recovery Time Objective) targets.

Real-World Application

Why does this matter in the field?

Imagine you are the SysOps Administrator for a high-traffic e-commerce platform during a flash sale. Your backend relational database is running on an EC2 instance backed by a standard gp2 EBS volume. As thousands of users simultaneously add items to their carts, the database performs heavy, random read/write operations.

Without an understanding of EBS performance:

The gp2 burst bucket entirely depletes.
The BurstBalance drops to zero, and the volume throttles to its baseline IOPS.
The VolumeQueueLength spikes as I/O requests back up.
Users experience extreme latency, shopping carts fail to load, and the company loses significant revenue.

By applying the skills in this curriculum, you would proactively monitor these metrics via CloudWatch alarms. You would recognize the bottleneck and seamlessly modify the volume type to gp3 or io2 (Provisioned IOPS), adjust the EC2 instance type to one that supports a higher EBS-optimized throughput, and ensure your system handles the flash sale smoothly.

Study Guide (985 words)

Mastering EBS and S3 Performance Metrics: AWS CloudOps Study Guide

Analyze EBS and S3 performance metrics


This guide covers the critical metrics and optimization strategies for Amazon Elastic Block Store (EBS) and Amazon Simple Storage Service (S3), specifically aligned with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam objectives.

Learning Objectives

After studying this chapter, you should be able to:

Analyze critical EBS performance metrics like VolumeQueueLength and BurstBalance to identify bottlenecks.
Remediate performance issues by optimizing volume types and enabling features like Fast Snapshot Restore.
Optimize S3 performance using Multi-part uploads and S3 Transfer Acceleration.
Automate remediation strategies using CloudWatch alarms and SSM Automation runbooks.

Key Terms & Glossary

IOPS (Input/Output Operations Per Second): A measure of the number of read and write operations performed per second. Essential for transaction-heavy workloads like databases.
Throughput: The amount of data transferred to or from a volume per second, usually measured in MB/s. Essential for streaming or large data processing.
Burst Balance: A metric for gp2, st1, and sc1 volumes representing the amount of "burst" credits remaining to exceed baseline performance.
Queue Length: The number of pending I/O requests for a device. High queue length often indicates a bottleneck.
S3 Transfer Acceleration: A bucket-level feature that enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket using Amazon CloudFront’s globally distributed Edge Locations.

The "Big Idea"

In a cloud environment, storage performance is not just about choosing the right "disk." It is a dynamic balance between latency, throughput, and cost. Effective CloudOps involves shifting from reactive troubleshooting to proactive monitoring. By mastering CloudWatch metrics, you can identify when an application is exceeding its IOPS allotment and automatically scale or switch volume types before the user experience degrades.

Formula / Concept Box

Concept	Metric / Formula	Key Interpretation
Throughput Formula	$Throughput = IOPS \times I/O\ Size$	Larger I/O sizes require more throughput for the same number of IOPS.
EBS Health	`VolumeQueueLength`	Should stay low for transaction-intensive (SSD) workloads; higher values are normal for throughput-intensive (HDD) workloads.
Burst Health	`BurstBalance`	If it reaches 0%, the volume is throttled to its baseline performance.
S3 Efficiency	Multi-part Upload	Recommended for objects > 100 MB; Required for objects > 5 GB.

Hierarchical Outline

Amazon EBS Performance Analysis
- Critical CloudWatch Metrics
  - VolumeReadOps / VolumeWriteOps: Used to calculate total IOPS.
  - VolumeQueueLength: Identifying bottlenecks in the OS or network link.
  - BurstBalance: Monitoring credit depletion for burstable volumes.
- Performance Optimization
  - EBS-Optimized Instances: Ensuring dedicated bandwidth for storage traffic.
  - Fast Snapshot Restore (FSR): Eliminating the latency penalty of first-touch reads on volumes restored from snapshots.
  - Volume Type Switching: Moving from gp2 to gp3 or io2 for predictable performance.
Amazon S3 Performance Optimization
- Transfer Optimization
  - Multi-part Upload: Parallelizing uploads for higher throughput and reliability.
  - S3 Transfer Acceleration: Using Edge Locations to reduce latency over long distances.
- Storage Management
  - S3 Lifecycle Policies: Automating transitions to lower-cost tiers based on access patterns.
  - DataSync: Simplifying large-scale data transfers into S3.
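VolumeReadOps and VolumeWriteOps report operation counts per period, not a rate, so turning them into IOPS means dividing their Sum by the period length in seconds. A minimal sketch with hypothetical numbers:

```python
def average_iops(read_ops_sum, write_ops_sum, period_seconds):
    """Average IOPS over a CloudWatch period from the Sum of VolumeReadOps and VolumeWriteOps."""
    return (read_ops_sum + write_ops_sum) / period_seconds

# Hypothetical Sum values for one 5-minute (300 s) period.
print(average_iops(90_000, 30_000, 300))  # 400.0
```

Use the same period you queried the metrics with. Dividing a 5-minute Sum by 60 would overstate IOPS fivefold.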

Visual Anchors

EBS Performance Troubleshooting Flow

Figure 1: EBS performance troubleshooting flow (Mermaid diagram)

Data Transfer Comparison

Figure 2: Data transfer comparison (TikZ diagram)

Definition-Example Pairs

Metric: VolumeQueueLength
- Definition: The number of I/O requests waiting to be processed by the storage device.
- Real-World Example: In a busy grocery store, the "Queue Length" is the number of people waiting in line. If the cashier (EBS volume) is too slow, the line grows. For a database, a long line means the application has to wait to save data, causing lag.
Feature: S3 Lifecycle Policies
- Definition: A set of rules that define actions that Amazon S3 applies to a group of objects (e.g., transition to Glacier or expiration).
- Real-World Example: An office that keeps physical files in a desk for 30 days (S3 Standard), moves them to a filing cabinet for 90 days (S3 Standard-IA), and eventually sends them to an off-site warehouse for 7 years (Glacier) before shredding them (Expiration).
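The office analogy maps directly onto an S3 lifecycle rule. The sketch below builds the configuration as a Python dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. It assumes the prefix `records/`, the rule ID, and 7 years approximated as 7 × 365 days, all hypothetical. Lifecycle days count from object creation, so each threshold is cumulative.

```python
def office_files_rule(prefix):
    """Lifecycle rule mirroring the desk -> cabinet -> warehouse -> shredder analogy."""
    standard_days = 30       # "desk": S3 Standard
    ia_days = 90             # "filing cabinet": S3 Standard-IA
    archive_days = 7 * 365   # "warehouse": S3 Glacier, about 7 years
    return {
        "ID": "office-files-retention",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": standard_days, "StorageClass": "STANDARD_IA"},
            {"Days": standard_days + ia_days, "StorageClass": "GLACIER"},
        ],
        # Days are counted from object creation, so expiration is cumulative.
        "Expiration": {"Days": standard_days + ia_days + archive_days},
    }

lifecycle_configuration = {"Rules": [office_files_rule("records/")]}
print(lifecycle_configuration["Rules"][0]["Expiration"])  # {'Days': 2675}
```

With boto3, this would be applied as `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_configuration)`.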

Worked Examples

Example 1: Troubleshooting Throttled EBS

Scenario: A developer reports that a database on an Amazon EC2 instance is experiencing high latency every afternoon.

Metric Analysis: You check CloudWatch and see BurstBalance for the gp2 volume dropping to 0% at 2:00 PM and staying there until 4:00 PM.
Diagnosis: The workload is exceeding the baseline IOPS provided by the current volume size, depleting the burst bucket.
Remediation:
- Short term: Increase the size of the gp2 volume (which increases baseline IOPS).
- Long term: Migrate to a gp3 volume to provision higher IOPS independently of storage size, ensuring more cost-effective performance.
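The short-term fix has a cost that is easy to quantify. Because gp2 baseline IOPS is 3 per GiB (minimum 100, ceiling 16,000), hitting a target baseline means buying capacity you may not need. The helper below is a hypothetical sketch of that sizing rule.

```python
import math

GP2_BASELINE_CEILING = 16_000  # highest gp2 baseline IOPS

def gp2_size_for_baseline(target_iops):
    """Smallest gp2 size (GiB) whose baseline (3 IOPS per GiB, minimum 100) meets target_iops.

    Returns None when the target exceeds the gp2 baseline ceiling.
    """
    if target_iops > GP2_BASELINE_CEILING:
        return None
    if target_iops <= 100:
        return 1  # every gp2 volume gets at least 100 baseline IOPS
    return math.ceil(target_iops / 3)

print(gp2_size_for_baseline(6_000))  # 2000
```

A sustained 6,000 IOPS on gp2 therefore requires a 2,000 GiB volume. On gp3, the same IOPS is provisioned directly (for example, `aws ec2 modify-volume --volume-id <id> --volume-type gp3 --iops 6000`) without growing the volume.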

Example 2: Optimizing Large File Uploads to S3

Scenario: You need to upload a 50 GB database backup file from an on-premises server in London to an S3 bucket in Tokyo.

Action 1: Enable S3 Transfer Acceleration on the bucket to utilize the AWS global network.
Action 2: Use the AWS CLI or SDK to perform a Multi-part Upload.
Benefit: If a network interruption occurs, only the failed part (e.g., 100 MB) needs to be re-uploaded instead of the entire 50 GB file.
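Multipart uploads are bounded by S3's limits: at most 10,000 parts, and each part between 5 MiB and 5 GiB (the last part may be smaller). The sketch below plans part size and count under those limits. It assumes the backup is 50 GiB and a preferred part size of 100 MiB to match the example; `plan_parts` is a hypothetical helper.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
MIN_PART = 5 * MIB   # minimum part size (except the last part)
MAX_PART = 5 * GIB   # maximum part size
MAX_PARTS = 10_000   # maximum parts per upload

def plan_parts(object_bytes, part_bytes=100 * MIB):
    """Return (part_size, part_count) for a multipart upload, growing the part size if needed."""
    part_bytes = max(part_bytes, MIN_PART)
    # Grow the part size until the object fits within the 10,000-part limit.
    part_bytes = max(part_bytes, math.ceil(object_bytes / MAX_PARTS))
    if part_bytes > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    return part_bytes, math.ceil(object_bytes / part_bytes)

size, count = plan_parts(50 * GIB)
print(size // MIB, count)  # 100 512
```

In practice, the AWS CLI's `aws s3 cp` switches to multipart automatically, and its part size can be tuned with the `multipart_chunksize` S3 configuration setting.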

Checkpoint Questions

Which CloudWatch metric is the most direct indicator that an EBS volume is acting as a bottleneck due to pending I/O requests?
For a throughput-intensive application using HDD volumes (st1), is a high VolumeQueueLength always considered a failure state? Why or why not?
What is the minimum object size for which AWS recommends using Multi-part uploads for S3?
How does enabling "Fast Snapshot Restore" affect the performance of an EBS volume newly created from a snapshot?
Which AWS service can be used to automate the modification of an EBS volume type when a CloudWatch alarm is triggered?

Answers

VolumeQueueLength.
No. HDD volumes are less sensitive to latency and can actually benefit from higher queue lengths for large, sequential I/O.
100 MB (and multipart upload is required for objects larger than 5 GB, the maximum size of a single PUT).
It eliminates the latency penalty (initialization/pre-warming) by ensuring the volume is fully initialized at creation.
AWS Systems Manager (SSM) Automation combined with Amazon EventBridge.

Curriculum Overview (703 words)

Curriculum Overview: Analyzing Events with the AWS Personal Health Dashboard

Analyze events using the AWS Personal Health Dashboard

This curriculum overview details the learning path for mastering the AWS Personal Health Dashboard and the AWS Health API. By the end of this curriculum, learners will be able to monitor, analyze, and automate responses to service-level interruptions and planned changes within an AWS environment.

Prerequisites

Before beginning this module, learners should have a foundational understanding of the following AWS concepts and services:

AWS Management Console & CLI: Basic navigation and programmatic access.
Core AWS Services: General familiarity with Amazon EC2, Amazon S3, and AWS Lambda.
Amazon EventBridge (formerly CloudWatch Events): Understanding of event-driven architectures and rule routing.
AWS Organizations (Optional but recommended): Knowledge of multi-account management strategies.
Foundational Cloud Monitoring: Familiarity with the concepts of uptime, availability, and incident response.

[!NOTE] A working understanding of availability math is helpful. AWS service level agreements typically commit to monthly uptime of 99.9% or higher, and the exact figure varies by service. Availability is calculated as:
$\text{Availability (\%)} = \left( \frac{\text{Total Time} - \text{Downtime}}{\text{Total Time}} \right) \times 100$
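As a worked instance of the formula, the sketch below totals a list of outage durations over a 30-day month (43,200 minutes). The outage figures are made up for illustration.

```python
from typing import Iterable

MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def availability_percent(total_minutes: float, outages_minutes: Iterable[float]) -> float:
    """Availability (%) = (Total Time - Downtime) / Total Time x 100."""
    downtime = sum(outages_minutes)
    if total_minutes <= 0 or not 0 <= downtime <= total_minutes:
        raise ValueError("need total > 0 and 0 <= downtime <= total")
    return (total_minutes - downtime) / total_minutes * 100


# Two hypothetical incidents in one month: 20 minutes and 23.2 minutes.
print(round(availability_percent(MONTH_MINUTES, [20, 23.2]), 4))
```

A total of 43.2 minutes of downtime in a 30-day month is exactly the budget a 99.9% target allows, so any further outage that month breaches it.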

Module Breakdown

The curriculum is structured into four progressive modules, transitioning from fundamental visibility to advanced automated remediation.

Module	Title	Difficulty	Core Focus
Module 1	AWS Health Fundamentals	Beginner	Public vs. Personal Health Dashboards, UI navigation, and time zone configuration.
Module 2	Multi-Account Visibility	Intermediate	AWS Organizations integration, centralized event aggregation.
Module 3	Automated Event Remediation	Advanced	Amazon EventBridge integration, AWS Lambda triggers, and SNS notifications.
Module 4	Enterprise Integrations & AHA	Expert	AWS Health API, AWS Health Aware (AHA) framework, Slack/Teams routing.

Architectural Context

The following diagram illustrates how AWS Health events flow from the core infrastructure to the end-user or automated remediation systems.

Figure 1: Mermaid diagram

Learning Objectives per Module

Module 1: AWS Health Fundamentals

Differentiate between the public AWS Health Dashboard (all global service events) and the AWS Personal Health Dashboard (personalized to your active resources).
Configure the Personal Health Dashboard settings, including local time zone preferences and console notifications.
Analyze alerts for event information, affected resources, and AWS-recommended troubleshooting guidance.

Module 2: Multi-Account Visibility

Enable AWS Health organizational view using AWS Organizations.
Aggregate health events from multiple member accounts into a single, centralized management account dashboard.
Identify cross-account impact during a localized AWS service degradation.

Module 3: Automated Event Remediation

Define Amazon EventBridge rules that specifically target AWS Health events (event source aws.health).
Deploy AWS Lambda functions to execute automated runbooks (e.g., auto-restarting a degraded EC2 instance).
Route specific event categories (scheduled changes vs. critical notifications) to targeted operational teams.
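To make the rule definitions concrete, the sketch below expresses two EventBridge event patterns for AWS Health events (source `aws.health`, detail-type `AWS Health Event`). It then routes sample events with a deliberately simplified matcher. The matcher handles only exact-value lists and nested objects, not EventBridge's prefix, numeric, or `anything-but` operators. The team names and sample event are hypothetical.

```python
from typing import Any, Dict, List

# Event patterns as they could be written in EventBridge rules.
SCHEDULED_EC2_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"service": ["EC2"], "eventTypeCategory": ["scheduledChange"]},
}
ISSUE_PATTERN = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
    "detail": {"eventTypeCategory": ["issue"]},
}

# Hypothetical mapping of operational teams to the rule each one receives.
ROUTES = {"maintenance-team": SCHEDULED_EC2_PATTERN, "on-call-team": ISSUE_PATTERN}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified EventBridge matching: every pattern field must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern object: recurse into the corresponding event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # List of allowed values; an array field matches if any element is allowed.
            values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in values):
                return False
    return True


def route(event: Dict[str, Any]) -> List[str]:
    return [team for team, pattern in ROUTES.items() if matches(pattern, event)]


sample_event = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "detail": {
        "service": "EC2",
        "eventTypeCategory": "scheduledChange",
        "eventTypeCode": "AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED",
    },
}
print(route(sample_event))
```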

Module 4: Enterprise Integrations & AHA

Evaluate support plan requirements (Business, Enterprise On-Ramp, or Enterprise Support) to unlock direct AWS Health API access.
Deploy the AWS Health Aware (AHA) solution using AWS CloudFormation.
Integrate AHA with external operational channels such as Slack, Microsoft Teams, Splunk, or Datadog.

Success Metrics

How will you know you have mastered this curriculum? You should be able to consistently demonstrate the following capabilities:

Dashboard Configuration: Successfully locate the Personal Health Dashboard and filter events by region, service, and status.
Event Routing: Create a working EventBridge rule that captures an AWS Health scheduled maintenance event and successfully routes it to an SNS topic.
Automation Readiness: Write a basic Lambda script that parses the JSON payload of an AWS Health event and outputs the affected Resource IDs to CloudWatch Logs.
AHA Deployment (Stretch): Successfully launch the AWS Health Aware CloudFormation stack and receive a test notification in a third-party chat application.
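The automation-readiness metric can be met with a handler along these lines. This is a minimal sketch. It assumes the AWS Health event shape delivered through EventBridge, with affected resource IDs in the top-level `resources` array and in `detail.affectedEntities[].entityValue`. The sample event is fabricated for illustration. In Lambda, anything written with `print` lands in the function's CloudWatch Logs log stream.

```python
import json
from typing import Any, Dict, List


def lambda_handler(event: Dict[str, Any], context: Any) -> List[str]:
    detail = event.get("detail", {})
    entity_ids = [
        entity["entityValue"]
        for entity in detail.get("affectedEntities", [])
        if "entityValue" in entity
    ]
    # Merge both sources of resource IDs, dropping duplicates but keeping order.
    resource_ids = list(dict.fromkeys(list(event.get("resources", [])) + entity_ids))
    # print() output from a Lambda function is written to CloudWatch Logs.
    print(json.dumps({
        "eventTypeCode": detail.get("eventTypeCode"),
        "eventTypeCategory": detail.get("eventTypeCategory"),
        "affectedResources": resource_ids,
    }))
    return resource_ids


sample = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "resources": ["i-0123456789abcdef0"],
    "detail": {
        "service": "EC2",
        "eventTypeCode": "AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED",
        "eventTypeCategory": "scheduledChange",
        "affectedEntities": [
            {"entityValue": "i-0123456789abcdef0"},
            {"entityValue": "i-0fedcba9876543210"},
        ],
    },
}
lambda_handler(sample, None)
```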

[!IMPORTANT] The AWS Health API is available only to customers on an AWS Business, Enterprise On-Ramp, or Enterprise Support plan. Ensure your lab environment has an appropriate tier. Otherwise, rely on EventBridge routing of AWS Health events, which works on any support plan.

Real-World Application

In a real-world CloudOps or SysOps role, relying solely on reactive customer complaints to discover infrastructure degradation is a critical failure.

The Scenario: AWS detects degraded underlying hardware hosting one of your mission-critical Amazon EC2 instances.

The Application: Instead of waiting for the instance to fail completely, AWS publishes a scheduled maintenance event to your AWS Personal Health Dashboard. Because you implemented the lessons from this curriculum:

The event is immediately caught by an Amazon EventBridge rule.
The rule triggers an AWS Systems Manager (SSM) Automation runbook.
The runbook stops and then starts the instance during an off-peak maintenance window. For an EBS-backed instance, the start places it on healthy underlying hardware.
The AWS Health Aware (AHA) solution posts a summary of the incident and the automated resolution directly into your CloudOps Slack channel.

AWS Health Aware (AHA) Routing Flow

Figure 2: Mermaid diagram

By leveraging these tools, you transform unpredictable cloud infrastructure events into fully automated, actionable, and trackable workflows.

Study Guide (820 words)

Analyzing Security Findings: Amazon Inspector and AWS Security Hub

Analyze findings from Security Hub and Inspector

This guide focuses on the centralized management and analysis of security alerts within an AWS environment, specifically through the lens of Amazon Inspector and AWS Security Hub.

Learning Objectives

After studying this guide, you should be able to:

Analyze and group findings within the Amazon Inspector console.
Configure suppression rules to filter out noise in vulnerability reports.
Explain the benefits of centralizing security findings in AWS Security Hub.
Identify the prerequisites for exporting Inspector findings to Amazon S3.
Implement automated remediation workflows using EventBridge and Security Hub.

Key Terms & Glossary

Finding: A detailed report of a potential security issue or vulnerability identified by an AWS security service.
CVE (Common Vulnerabilities and Exposures): A list of publicly disclosed computer security flaws. Inspector maps findings to these IDs.
Suppression Rule: A set of criteria in Inspector used to automatically hide findings that are known risks or non-issues.
Insight: A collection of related findings in Security Hub that helps identify specific areas of risk (e.g., "S3 buckets with public read access").
Security Standard: A set of controls (e.g., CIS AWS Foundations) that Security Hub uses to measure your compliance.

The "Big Idea"

In a modern cloud environment, "security fatigue" is a real threat—admins are often overwhelmed by thousands of disconnected alerts. The Big Idea here is Centralized Visibility. Instead of checking EC2, ECR, and IAM separately, AWS uses Security Hub as a "single pane of glass" to aggregate findings from Inspector (vulnerabilities), GuardDuty (threats), and Macie (data privacy), allowing for prioritized, automated remediation.

Formula / Concept Box

Feature	Amazon Inspector	AWS Security Hub
Primary Goal	Vulnerability & Reachability Scanning	Centralized Security Posture Management
Scan Targets	EC2 instances, ECR images, Lambda	Integrated AWS Services & 3rd Party Tools
Logic	Scans software packages and network paths	Aggregates findings and checks against Standards
Output	Findings (CSV/JSON/S3/Security Hub)	Insights, Compliance Scores, Actions

Hierarchical Outline

Amazon Inspector: Vulnerability Management
- Finding Types: Software package vulnerabilities and network reachability.
- Analysis Techniques: Grouping by account, instance, or finding state.
- Suppression Rules: Using filters to exclude specific findings from the view.
- Exporting Data:
  - EventBridge: For real-time notifications/remediation.
  - S3 Buckets: For long-term archival (requires AWS KMS encryption).
AWS Security Hub: The Aggregator
- Centralization: Collects findings from Inspector, GuardDuty, and Config.
- Compliance: Automated checks against standards (PCI DSS, CIS, AWS Best Practices).
- Dashboards: Visualizing security trends and high-priority issues.
- Automation: Triggering Lambda or SSM via EventBridge custom actions.

Visual Anchors

Finding Lifecycle Flow

Figure 1: Mermaid diagram

Logical Architecture of Analysis

Figure 2: TikZ diagram

Definition-Example Pairs

Suppression Rule
- Definition: A filter that hides specific findings based on criteria like AMI ID or Severity.
- Example: Suppressing all "Medium" severity findings on a legacy development server that is scheduled for decommissioning next week.
Finding Grouping
- Definition: Organizing findings by shared attributes to identify patterns.
- Example: Grouping by "Vulnerability ID" to see if one specific outdated library is present across 50 different EC2 instances.
Security Standard
- Definition: A prepackaged set of security best practices used for automated auditing.
- Example: Using the CIS AWS Foundations Benchmark to automatically detect if the 'root' user has an active access key.

Worked Examples

Example 1: Exporting Inspector Findings to S3

Scenario: You need to archive all Inspector findings for compliance auditing over the next 7 years.

Create an S3 Bucket: Ensure the bucket exists in the target region.
Configure KMS: Create a symmetric KMS key. Inspector requires a customer-managed key to encrypt findings during the export process.
Set Permissions: Update the S3 bucket policy and KMS key policy to allow the Inspector service principal (inspector2.amazonaws.com) to perform s3:PutObject and kms:GenerateDataKey.
Generate Report: In the Inspector console, select "Reports," choose S3 as the destination, and provide the KMS Key ARN.
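The permission step can be sketched as two policy documents built in Python. This is a minimal illustration. It assumes only the actions named in the step above (`s3:PutObject` for the bucket, `kms:GenerateDataKey` for the key) and adds an `aws:SourceAccount` condition as a confused-deputy guard. The bucket name and account ID are placeholders. The Amazon Inspector documentation may require additional actions, so compare against it before use.

```python
import json
from typing import Any, Dict

INSPECTOR_PRINCIPAL = {"Service": "inspector2.amazonaws.com"}


def inspector_export_policies(bucket: str, account_id: str) -> Dict[str, Dict[str, Any]]:
    """Return a bucket policy and a KMS key policy statement for Inspector exports."""
    # Only accept requests that Inspector makes on behalf of this account.
    condition = {"StringEquals": {"aws:SourceAccount": account_id}}
    bucket_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowInspectorExport",
            "Effect": "Allow",
            "Principal": INSPECTOR_PRINCIPAL,
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/*",  # objects inside the bucket
            "Condition": condition,
        }],
    }
    # A key policy applies to the key it is attached to, so Resource is "*".
    key_statement = {
        "Sid": "AllowInspectorEncrypt",
        "Effect": "Allow",
        "Principal": INSPECTOR_PRINCIPAL,
        "Action": ["kms:GenerateDataKey"],
        "Resource": "*",
        "Condition": condition,
    }
    return {"bucket_policy": bucket_policy, "key_policy_statement": key_statement}


policies = inspector_export_policies("example-inspector-findings", "111122223333")
print(json.dumps(policies["bucket_policy"], indent=2))
```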

Example 2: Auto-Remediation Workflow

Scenario: Automatically stop an EC2 instance if Security Hub reports it has a "Critical" vulnerability.

Finding Source: Inspector detects a critical CVE on instance i-12345 and sends it to Security Hub.
Security Hub Ingestion: Security Hub imports the finding in the AWS Security Finding Format (ASFF), carrying over its CRITICAL severity label.
EventBridge Rule: Create a rule with an event pattern matching Source: aws.securityhub and Severity.Label: CRITICAL.
Target: Set the target to an SSM Automation document AWS-StopEC2Instance.
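The rule and the hand-off to AWS-StopEC2Instance can be sketched as follows. The event pattern uses the Security Hub detail-type `Security Hub Findings - Imported` and the ASFF `Severity.Label` field. Because `detail.findings` is an array of objects, a small helper extracts the instance IDs by reading each finding's `Resources` entries of type `AwsEc2Instance`. The helper and sample event are hypothetical.

```python
from typing import Any, Dict, List

CRITICAL_FINDING_PATTERN = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {"findings": {"Severity": {"Label": ["CRITICAL"]}}},
}


def critical_instance_ids(event: Dict[str, Any]) -> List[str]:
    """Collect EC2 instance IDs from CRITICAL findings in a Security Hub event."""
    ids: List[str] = []
    for finding in event.get("detail", {}).get("findings", []):
        if finding.get("Severity", {}).get("Label") != "CRITICAL":
            continue
        for resource in finding.get("Resources", []):
            if resource.get("Type") == "AwsEc2Instance":
                # ASFF Ids for instances are ARNs ending in instance/<instance-id>.
                ids.append(resource["Id"].rsplit("/", 1)[-1])
    return list(dict.fromkeys(ids))  # de-duplicate, keep order


sample = {
    "source": "aws.securityhub",
    "detail-type": "Security Hub Findings - Imported",
    "detail": {"findings": [
        {"Severity": {"Label": "CRITICAL"},
         "Resources": [{"Type": "AwsEc2Instance",
                        "Id": "arn:aws:ec2:us-east-1:111122223333:instance/i-12345"}]},
        {"Severity": {"Label": "LOW"},
         "Resources": [{"Type": "AwsEc2Instance",
                        "Id": "arn:aws:ec2:us-east-1:111122223333:instance/i-67890"}]},
    ]},
}
print(critical_instance_ids(sample))
```

EventBridge matches a pattern field against an array if any element matches. A batch containing one critical finding therefore triggers the rule even when other findings in the same batch are not critical. Filtering again inside the target, as the helper does, avoids stopping the wrong instances.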

Checkpoint Questions

Which service provides prepackaged security standards like PCI DSS and CIS?
What is mandatory for Amazon Inspector to export findings to an Amazon S3 bucket?
How do you exclude known low-risk vulnerabilities from the Inspector console view without deleting them?
To which AWS service does Inspector automatically export findings for real-time remediation triggers?

Answers

AWS Security Hub.
A Customer Managed KMS Key for encryption.
Create a Suppression Rule.
Amazon EventBridge.

Study Guide (1,050 words)

SOA-C03 Study Guide: Performance Analysis & Automated Remediation

Analyze performance metrics and automate remediation strategies by using AWS services and functionality (for example, CloudWatch, AWS User Notifications, AWS Lambda, AWS Systems Manager, CloudTrail, auto scaling)

This guide focuses on Content Domain 1 of the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam, specifically the ability to analyze metrics and implement self-healing architectures.

Learning Objectives

After studying this guide, you should be able to:

Analyze CloudWatch metrics to identify performance bottlenecks and system failures.
Configure EventBridge rules to route operational events to remediation targets.
Implement AWS Systems Manager (SSM) Automation runbooks for common issues.
Automate EC2 instance recovery and scaling based on health and performance triggers.
Utilize AWS Health events to proactively respond to service-level interruptions.

Key Terms & Glossary

CloudWatch Alarm: A mechanism that watches a single metric over a specified time period and performs one or more actions based on the value of the metric relative to a threshold.
EventBridge (formerly CloudWatch Events): A serverless event bus that makes it easy to connect applications using data from your own applications, integrated SaaS applications, and AWS services.
SSM Automation Runbook: A document that defines the actions that Systems Manager performs on your managed instances and other AWS resources.
Metric Filter: A way to extract metric data from log groups in CloudWatch Logs.
Target: The resource or endpoint that EventBridge sends an event to when a rule's pattern is matched (e.g., Lambda, SSM, SNS).

The "Big Idea"

[!IMPORTANT] The core philosophy of modern SysOps is "Detection to Remediation without Intervention."

Instead of a human responder manually fixing a disk space issue or restarting a service, we build a closed-loop system:

Detect (CloudWatch Metrics/Logs)
Evaluate (CloudWatch Alarms)
Act (EventBridge -> Lambda/SSM)
Verify (Status Checks/Metrics return to normal).

Formula / Concept Box

Component	Role	Example
Producer	Generates the event/metric	Amazon EC2, CloudTrail, AWS Health
Evaluator	Decides if action is needed	CloudWatch Alarm (Static or Anomaly Detection)
Router	Connects the signal to the fix	Amazon EventBridge Rules
Remediator	Executes the corrective logic	AWS Lambda, SSM Automation, Auto Scaling

Hierarchical Outline

Monitoring & Data Collection
- Standard Metrics: CPU, Network, Disk I/O (available by default).
- Custom Metrics: Memory utilization, disk space usage, and swap usage (require the CloudWatch Agent).
- Log Processing: Using Metric Filters to turn matching log patterns into numeric CloudWatch metrics that alarms can watch.
Event-Driven Response
- EventBridge: Matching patterns (e.g., EC2 State Change) and routing to targets.
- AWS Health API: Responding to scheduled maintenance or regional service outages.
Remediation Tools
- SSM Automation: Predefined runbooks for patching, restarting, and resource optimization.
- AWS Lambda: Custom Python/Node scripts for complex logic (e.g., updating Route 53 during a failover).
- Auto Scaling: Dynamic (metric-driven), Scheduled, and Predictive (forecast from historical patterns) scaling.

Visual Anchors

Automated Remediation Flow

Figure 1: Mermaid diagram

Performance Optimization Cycle

Figure 2: TikZ diagram

Definition-Example Pairs

Anomaly Detection: A CloudWatch feature that applies machine learning to a metric's history to create a baseline of expected behavior.
- Example: Identifying a sudden drop in application requests that occurs at 2:00 PM on a Tuesday, which usually sees high traffic.
Predictive Scaling: An Auto Scaling policy that uses machine learning to predict future traffic and schedule capacity changes in advance.
- Example: An e-commerce site scaling up EC2 instances on Friday morning in anticipation of a weekend sale based on the last 3 months of data.
EC2 Status Check Remediation: Automatically recovering an instance if the underlying hardware fails.
- Example: Using a CloudWatch Alarm on StatusCheckFailed_System to trigger the Recover action, which moves the instance to new hardware while keeping the same instance ID, private IP addresses, Elastic IP addresses, and instance metadata.
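The recover alarm can be expressed as the keyword arguments for CloudWatch's `PutMetricAlarm` API. This is a minimal sketch. It assumes a one-minute period with two evaluation periods (a tuning choice, not a requirement) and uses a hypothetical alarm-naming scheme. The recover action is addressed with the `arn:aws:automate:<region>:ec2:recover` ARN.

```python
from typing import Any, Dict


def recover_alarm_request(instance_id: str, region: str) -> Dict[str, Any]:
    """Keyword arguments for a CloudWatch PutMetricAlarm call that recovers an instance."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,             # status check metrics are published every minute
        "EvaluationPeriods": 2,   # assumption: two failing minutes before acting
        "Threshold": 1.0,         # the metric is 1 when the system check fails
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


request = recover_alarm_request("i-0123456789abcdef0", "us-east-1")
# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**request)
print(request["AlarmActions"])
```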

Worked Examples

Scenario: Remediating Low Disk Space on EC2

Problem: An application server stops responding because the root EBS volume is 100% full.

Step-by-Step Solution:

Metric Collection: Install the CloudWatch Agent on the EC2 instance to collect disk_used_percent (this is not a standard metric).
Alarm Creation: Create a CloudWatch Alarm that triggers when disk_used_percent > 80% for 5 minutes.
EventBridge Rule: Create an EventBridge rule that triggers when the Alarm enters the ALARM state.
Target Selection: Set the target to an SSM Automation Runbook (e.g., AWS-ExpandVolumes or a custom script to clear /tmp files).
Verification: The alarm should return to OK once the cleanup/expansion is complete.
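Steps 1 and 3 can be sketched as two JSON documents. The first is a fragment of a CloudWatch agent configuration that collects `used_percent` for the root filesystem, which the agent publishes as `disk_used_percent`. The second is an EventBridge pattern for CloudWatch alarm state changes into `ALARM`. The alarm name is a placeholder, and the agent fragment omits the rest of a full configuration file.

```python
import json
from typing import Any, Dict

# Fragment of a CloudWatch agent config (the "metrics" section only).
AGENT_CONFIG = {
    "metrics": {
        "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
        "metrics_collected": {
            "disk": {
                "measurement": ["used_percent"],  # published as disk_used_percent
                "resources": ["/"],               # root filesystem only
                "metrics_collection_interval": 60,
            }
        },
    }
}


def alarm_state_pattern(alarm_name: str) -> Dict[str, Any]:
    """EventBridge pattern matching a specific alarm entering the ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"alarmName": [alarm_name], "state": {"value": ["ALARM"]}},
    }


print(json.dumps(AGENT_CONFIG, indent=2))
print(json.dumps(alarm_state_pattern("root-disk-above-80"), indent=2))
```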

Scenario: Lambda Performance Tuning

Problem: A Lambda function is frequently throttling or timing out.

Step-by-Step Solution:

Analyze Metrics: Check Throttles, Duration, and Errors in CloudWatch.
Optimization: Use AWS Compute Optimizer to analyze the function's memory allocation.
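Action: If Compute Optimizer reports the function as memory-constrained, increase the memory setting, which also proportionally increases the available CPU. This addresses timeouts caused by slow execution. Throttles, by contrast, come from concurrency limits (account-level or reserved concurrency), which a memory increase does not change.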
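The memory-to-CPU relationship can be approximated numerically. This rough sketch rests on three assumptions. First, AWS's documented reference point that a function configured with 1,769 MB gets the equivalent of one vCPU. Second, a linear scale between the 128 MB and 10,240 MB memory limits. Third, billing by GB-seconds (memory in GB multiplied by duration in seconds). The helper names are illustrative.

```python
MB_PER_VCPU = 1_769           # documented point where a function has ~1 vCPU
MIN_MB, MAX_MB = 128, 10_240  # Lambda memory setting limits


def approx_vcpus(memory_mb: int) -> float:
    """Linear approximation of the CPU share allotted at a given memory size."""
    if not MIN_MB <= memory_mb <= MAX_MB:
        raise ValueError("memory must be between 128 and 10,240 MB")
    return round(memory_mb / MB_PER_VCPU, 2)


def gb_seconds(memory_mb: int, duration_ms: float) -> float:
    """Compute units billed for one invocation: memory (GB) x duration (s)."""
    return (memory_mb / 1024) * (duration_ms / 1000)


# A CPU-bound function that runs twice as fast with twice the memory
# costs the same GB-seconds per invocation.
print(approx_vcpus(1024), gb_seconds(1024, 1000))
print(approx_vcpus(2048), gb_seconds(2048, 500))
```

This is why raising memory for a CPU-bound function can cure timeouts at little or no extra cost. The trade-off only holds if duration actually drops, which the Duration metric confirms.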

Checkpoint Questions

Which metric requires the CloudWatch Agent to be installed on an EC2 instance? (Answer: Memory utilization or Disk space usage).
What is the difference between an EventBridge Rule and a CloudWatch Alarm? (Answer: An Alarm monitors a specific threshold over time; a Rule matches a state change or event pattern instantaneously).
How can you automate the recovery of an EC2 instance that failed a system status check? (Answer: Create a CloudWatch Alarm for the StatusCheckFailed_System metric and add an 'EC2 Action' to 'Recover').
True or False: Predictive scaling is best for workloads that have random, unpredictable traffic spikes. (Answer: False. It requires historical patterns to work effectively).
What service allows you to integrate AWS Health events with Slack or Microsoft Teams? (Answer: Amazon EventBridge or the AWS Health Aware (AHA) solution).

Study Guide (890 words)

Study Guide: Analyzing Spend Patterns with AWS Cost Explorer

Analyze spend patterns using AWS Cost Explorer

This guide covers the fundamental and advanced capabilities of AWS Cost Explorer for the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam, focusing on visualization, forecasting, and cost management strategies.

Learning Objectives

By the end of this study guide, you will be able to:

Enable and Configure AWS Cost Explorer within the Billing and Cost Management console.
Identify Dimensions used for filtering cost data, including Regions, Services, and Cost Allocation Tags.
Differentiate between the visualization capabilities of Cost Explorer and the raw data of Cost and Usage Reports (CUR).
Forecast Future Spend and evaluate Reserved Instance (RI) utilization/coverage.
Manage Permissions and understand the impact of AWS Organizations on historical data visibility.

Key Terms & Glossary

AWS Cost Explorer (CE): A tool that enables you to visualize, understand, and manage your AWS costs and usage over time.
Cost and Usage Report (CUR): The most granular AWS billing dataset, which can be integrated into Redshift or QuickSight.
Cost Allocation Tags: Metadata (key-value pairs) applied to resources used to categorize and track costs at a granular level (e.g., Project: SecretAlpha).
Forecast: A prediction of future costs, up to 12 months ahead, based on historical usage patterns.
Paginated API Request: A programmatic call to retrieve data from Cost Explorer, which incurs a specific per-request fee.

The "Big Idea"

While raw billing data (CUR) is useful for deep data science and SQL-based auditing, AWS Cost Explorer is the primary engine for human-driven visualization. It transforms complex billing rows into actionable insights, allowing administrators to answer "Why did our EC2 spend spike last Tuesday?" or "Will we stay under budget for the next quarter?"

Formula / Concept Box

Feature	Limit / Metric	Cost / Note
Custom Reports	Up to 50 reports	Included in free UI access
Historical Data	12 months	Takes a few days to populate initially
Forecasting Range	12 months ahead	Based on previous usage trends
Data Refresh	Every 24 hours	Current month data has ~24h latency
API Access	$0.01 per request	Charge applies to paginated API calls
IAM Access	No default access	Must be explicitly granted via policy

Hierarchical Outline

I. Enabling Cost Explorer
- Enabled via Billing and Cost Management console.
- Org Impact: Management accounts can block member account access.
- Historical Access: Joining an Org hides pre-org data; leaving hides membership-era data.
II. Data Visualization & Analysis
- Preconfigured Views: Top five cost-accruing services, daily/monthly costs.
- Filtering Dimensions: Account, Service, Region, Instance Type, Tag.
- RI/Savings Plans: Monitoring utilization (the percentage of purchased commitment actually used) and coverage (the percentage of eligible usage that commitments cover).
III. Reporting Capabilities
- Custom Reports: Create up to 50 tailored views (e.g., CFO-specific reports).
- Exporting: Data is the same as CUR but formatted for visual consumption.
IV. Security and Governance
- IAM Policies: Essential for access control (no default access for users).
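- Cost Allocation Tags: Applied to resources (for example, via Resource Groups & Tag Editor) and activated for cost tracking in the Billing and Cost Management console.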

Visual Anchors

Data Flow and Access

Figure 1: Mermaid diagram

Historical Data Retention Logic

Figure 2: TikZ diagram

Definition-Example Pairs

Dimension Filtering: Selecting specific criteria to isolate costs.
- Example: A SysOps admin filters by Region: us-east-1 and Service: Amazon RDS to investigate a specific database bill increase in North Virginia.
RI Utilization Report: A report showing how much of your purchased Reserved Instance discount is actually being used.
- Example: If you purchased 10 RI instances but only run 7, the report shows 70% utilization, signaling a need to increase usage or avoid future purchases.
Forecasted Spend: A prediction based on current trends.
- Example: Predicting that "Project Secret" will cost $5,000 next month because its usage has increased 10% weekly for the last 3 months.
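The utilization figure and its counterpart, coverage, reduce to two ratios. This is a minimal sketch. It assumes both are measured in instance-hours for RIs that match the running instances. The instance counts mirror the example, and the 730-hour month is an approximation.

```python
HOURS_PER_MONTH = 730  # approximate hours in a month


def ri_utilization(used_ri_hours: float, purchased_ri_hours: float) -> float:
    """Percentage of purchased RI hours that were applied to running usage."""
    return used_ri_hours / purchased_ri_hours * 100


def ri_coverage(covered_hours: float, eligible_running_hours: float) -> float:
    """Percentage of eligible running hours that received an RI discount."""
    return covered_hours / eligible_running_hours * 100


# 10 RIs purchased, 7 matching instances running all month.
purchased = 10 * HOURS_PER_MONTH
running = 7 * HOURS_PER_MONTH
print(ri_utilization(running, purchased))   # 7 of 10 RIs used
print(ri_coverage(running, running))        # every running hour is covered
```

Low utilization with full coverage, as here, points to over-purchasing. High utilization with low coverage points to room for more commitments.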

Worked Examples

Scenario 1: The CFO's Special Report

Request: The CFO needs a monthly report showing the cost and usage of EC2 instances tagged with Department: Finance across three specific member accounts.

Step-by-Step Solution:

Enable Tags: Ensure Department is activated as a Cost Allocation Tag in the Billing console.
Filter: Open Cost Explorer and set the filter for Service (EC2), Tag (Department: Finance), and Linked Account (select the 3 accounts).
Save: Click "Save as" to create one of the 50 custom reports.
Visualize: Set the granularity to "Monthly" and the chart type to "Bar" to show trends.
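The same report can be requested programmatically with the Cost Explorer `GetCostAndUsage` API. The sketch only builds the request. The account IDs and dates are placeholders. The SERVICE value is assumed to be the name Cost Explorer shows for EC2 instance usage. Per the table above, each API request is billed at $0.01, unlike the console.

```python
import json
from typing import Any, Dict, List


def finance_ec2_report_request(start: str, end: str, account_ids: List[str]) -> Dict[str, Any]:
    """Build GetCostAndUsage arguments for Finance-tagged EC2 spend, by month."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # End date is exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"And": [
            {"Dimensions": {"Key": "SERVICE",
                            "Values": ["Amazon Elastic Compute Cloud - Compute"]}},
            {"Tags": {"Key": "Department", "Values": ["Finance"]}},
            {"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": account_ids}},
        ]},
    }


request = finance_ec2_report_request(
    "2024-01-01", "2024-04-01",
    ["111111111111", "222222222222", "333333333333"],  # hypothetical member accounts
)
print(json.dumps(request, indent=2))
# With boto3 installed and credentials configured:
#   boto3.client("ce").get_cost_and_usage(**request)
```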

Scenario 2: Programmatic Cost Auditing

Problem: A developer writes a script that calls the Cost Explorer API every minute to update a custom dashboard.

Result: The account incurs unexpected charges.

Explanation: Each paginated API request costs $0.01. At 60 requests/hour × 24 hours = 1,440 requests/day, the script costs $14.40/day, or about $432 over a 30-day month. The solution is to cache results or use the free UI for non-automated needs.
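The arithmetic and the caching fix can both be sketched in a few lines. This is a minimal illustration. It assumes the $0.01-per-request price from the table above, and a hypothetical `fetch` callable stands in for the real Cost Explorer call. A query that returns several pages is billed once per page. Because Cost Explorer data refreshes roughly once a day, an hourly cache loses little freshness.

```python
import time
from typing import Any, Callable

PRICE_PER_REQUEST = 0.01  # USD per paginated Cost Explorer API request


def daily_api_cost(requests_per_hour: int) -> float:
    """Daily API spend for a script polling at a fixed rate."""
    return round(requests_per_hour * 24 * PRICE_PER_REQUEST, 2)


class TTLCache:
    """Reuse one fetched result until ttl_seconds have elapsed."""

    def __init__(self, fetch: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical stand-in for the real API call
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Any = None
        self._expires_at = float("-inf")
        self.fetch_count = 0         # number of billable calls made

    def get(self) -> Any:
        now = self._clock()
        if now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
            self.fetch_count += 1
        return self._value


print("Polling every minute:", daily_api_cost(60), "USD/day")
print("Polling every hour:", daily_api_cost(1), "USD/day")

# Demo with a fake clock: a dashboard refreshing once a minute for an hour.
fake_now = [0.0]
demo = TTLCache(fetch=lambda: {"cost": "placeholder"}, ttl_seconds=3600,
                clock=lambda: fake_now[0])
for minute in range(60):
    fake_now[0] = minute * 60.0
    demo.get()
print("Billable calls in one hour:", demo.fetch_count)
```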

Checkpoint Questions

How many months of historical data can be viewed in AWS Cost Explorer?
What is the cost associated with using the Cost Explorer User Interface (UI)?
If an IAM user has AdministratorAccess, do they automatically have access to Cost Explorer?
What happens to a standalone account's historical data when it joins an AWS Organization?
How many custom reports can be created in a single account?

Answers

12 months (and it can forecast 12 months forward).
Free (Only API access incurs a cost).
No. Access to Cost Explorer must be explicitly granted via IAM policies; there is no default access for users.
The account no longer has access to cost and usage data from the time prior to joining the organization (though it regains it if it leaves).
50 custom reports.

Curriculum Overview (863 words)

AWS Well-Architected Principles & CloudOps Engineering Curriculum Overview

Apply Well-Architected principles to support AWS workloads

[!NOTE] Course Overview: A comprehensive curriculum focused on deploying, managing, and operating scalable, highly available, and fault-tolerant systems on AWS, directly aligned with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam domains.

Prerequisites

To be successful in this curriculum, learners must possess foundational knowledge in general IT operations and cloud computing principles before beginning.

General IT Experience

Operations Role: At least 1 year of experience in a systems administrator or related IT operations role.
Networking Basics: Understanding of core networking concepts including DNS, TCP/IP, and firewalls.
Scripting & OS: Familiarity with at least one scripting language (e.g., Python, Bash) and major operating systems (Linux/Windows).
Modern Workflows: Basic understanding of containerization (Docker), orchestration, CI/CD pipelines, and version control (Git).

AWS Knowledge

Core Services: Hands-on familiarity with AWS storage (S3, EBS), compute (EC2), and networking services (VPC).
AWS Interfaces: Prior experience navigating the AWS Management Console and executing basic commands via the AWS CLI.

Module Breakdown

This curriculum is designed to progressively build your operational capabilities, culminating in advanced automation and remediation skills.

Module	Title	Difficulty	Core Well-Architected Pillar Focus
1	AWS Operational Foundations	Beginner	Operational Excellence
2	Monitoring, Logging & Observability	Intermediate	Performance Efficiency
3	Performance & Cost Optimization	Intermediate	Cost Optimization
4	Reliability & Business Continuity	Advanced	Reliability
5	Security & Compliance	Advanced	Security
6	Deployment & Automation	Advanced	Operational Excellence

Curriculum Progression Flow

Figure 1: Mermaid diagram

Learning Objectives per Module

Module 1: AWS Operational Foundations

Understand the Well-Architected Framework: Describe the six pillars (Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability).
Master the CLI: Execute commands and analyze outputs using JMESPath query syntax to extract targeted JSON data.
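The AWS CLI applies a JMESPath query passed to `--query` client-side, to the JSON response. The sketch below shows one CLI invocation in comments and a plain-Python equivalent of its expression, which helps when reasoning about what a query returns. The sample response is trimmed and fabricated.

```python
from typing import Any, Dict, List, Optional

# CLI form:
#   aws ec2 describe-instances \
#       --filters Name=instance-state-name,Values=running \
#       --query 'Reservations[].Instances[].InstanceId' --output text
# --filters is evaluated by the EC2 API; --query (JMESPath) runs in the CLI.


def instance_ids(response: Dict[str, Any], state: Optional[str] = None) -> List[str]:
    """Python equivalent of Reservations[].Instances[].InstanceId,
    with an optional state filter standing in for --filters."""
    return [
        instance["InstanceId"]
        for reservation in response.get("Reservations", [])  # Reservations[]
        for instance in reservation.get("Instances", [])      # .Instances[]
        if state is None or instance.get("State", {}).get("Name") == state
    ]


sample_response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-aaa", "State": {"Name": "running"}}]},
        {"Instances": [{"InstanceId": "i-bbb", "State": {"Name": "stopped"}},
                       {"InstanceId": "i-ccc", "State": {"Name": "running"}}]},
    ]
}
print(instance_ids(sample_response))
print(instance_ids(sample_response, state="running"))
```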

Module 2: Monitoring, Logging, and Observability

Implement CloudWatch: Configure static and dynamic alarms for anomalous behavior.
Centralize Auditing: Enable AWS CloudTrail, deliver its events to CloudWatch Logs, and query them with CloudWatch Logs Insights.
Extend Observability: Deploy the CloudWatch Agent on EC2 and ECS to capture deep system-level metrics.

Module 3: Performance and Cost Optimization

Rightsize Compute: Utilize AWS Compute Optimizer to interpret performance metrics and adjust instance families.
Optimize Storage: Analyze EBS IOPS and switch volume types to maximize efficiency while reducing monthly spend.
Implement FinOps: Configure AWS Budgets and Cost Anomaly Detection to proactively manage cloud expenditures.

Module 4: Reliability and Business Continuity

Architect High Availability: Implement Multi-AZ deployments for RDS and configure Route 53 DNS-level failover.
Design Disaster Recovery: Compare strategies (Pilot Light vs. Warm Standby) and evaluate RPO/RTO metrics.
Automate Backups: Utilize AWS Backup to create centralized retention vaults for EC2, RDS, and EFS.

Module 5: Security and Compliance

Enforce Least Privilege: Implement granular IAM identity-based and resource-based policies.
Protect Data: Manage encryption keys using AWS KMS and rotate sensitive database credentials via Secrets Manager.
Audit Compliance: Deploy AWS Config to record configuration changes and automatically flag noncompliant resources.

Module 6: Deployment, Provisioning, and Automation

Adopt Infrastructure as Code (IaC): Manage complex resources using AWS CloudFormation and remediate stack drift.
Automate Remediation: Connect EventBridge to AWS Systems Manager (SSM) Automation runbooks to self-heal infrastructure.

Figure 2: Automated remediation workflow (Mermaid diagram)

Success Metrics

How will you know you have mastered the curriculum? Mastery is evaluated through both objective exam readiness and practical engineering benchmarks.

Practical Validation

Zero High-Risk Issues: The ability to review a workload in the AWS Well-Architected Tool, supported by Trusted Advisor checks, and clear all Security and Reliability High-Risk Issues (HRIs).
Automated MTTR Reduction: Successfully configuring self-healing runbooks that reduce your Mean Time To Recovery.

$\text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}$

[!TIP] Highly critical workloads may target "Five Nines" (99.999%) availability, which allows only about 5 minutes of downtime per year. Meeting that budget depends on the automated remediation techniques taught in Module 6, because manual response alone is rarely fast enough.
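Rearranging the availability formula above gives the downtime budget for each target. This short sketch assumes a 365-day year (525,600 minutes).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def yearly_downtime_budget(availability_percent: float) -> float:
    """Downtime minutes per year allowed by Availability = Uptime / (Uptime + Downtime)."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {yearly_downtime_budget(target):.2f} minutes/year")
```

Each additional nine cuts the budget tenfold. The budget shrinks from about 87.6 hours at 99% to a few minutes at 99.999%.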

Assessment Metrics

SOA-C03 Exam Readiness: Consistently scoring 80%+ on practice exams mirroring the official AWS Certified CloudOps Engineer - Associate format.
Troubleshooting Speed: Diagnosing complex VPC connectivity or IAM permission denial issues within 15 minutes using the IAM Policy Simulator and VPC Reachability Analyzer.

Real-World Application

Why does mastering the Well-Architected Framework and CloudOps matter in a professional career?

Terminology in Practice

Infrastructure as Code (IaC)
- Definition: Managing and provisioning computing infrastructure through machine-readable definition files rather than physical hardware configuration or interactive configuration tools.
- Real-World Example: Instead of manually clicking through the AWS Console to build an environment, a CloudOps engineer writes a CloudFormation YAML template that consistently deploys an Auto Scaling Group, ensuring environments are reproducible and version-controlled.
Disaster Recovery (Warm Standby)
- Definition: A DR strategy where a scaled-down version of a fully functional environment is always running in the cloud.
- Real-World Example: An e-commerce business experiences a catastrophic regional outage during Black Friday. It had implemented a Warm Standby in a secondary AWS Region, so Route 53 health checks detect the failure. DNS failover then shifts customer traffic to the standby Region within minutes (bounded by health-check intervals and record TTLs) while the standby scales up to full capacity, avoiding significant lost revenue.

The Operational Mindset

In modern enterprise environments, manual intervention is a bottleneck. By applying these curriculum principles, you transition from a reactive administrator to a proactive CloudOps Engineer. You will save organizations money through automated Spot Instance utilization, protect user data via KMS encryption enforcement, and allow developer teams to deploy faster and safer.

More Study Notes (138)

Auditing AWS Network Protection Services

Audit AWS network protection services (for example, Amazon Route 53 Resolver DNS Firewall, AWS WAF, AWS Shield, AWS Network Firewall) in a single account

820 words

AWS Auditing and Compliance Management: Study Guide

Auditing and Compliance Management

920 words

Mastering Automation: EC2 Image Builder Study Guide

Automate AMI creation using EC2 Image Builder

924 words

Automating AWS Backups and Snapshots Study Guide

Automate snapshots and backups for AWS resources (for example, Amazon EC2 instances, RDS DB instances, Amazon Elastic Block Store [Amazon EBS] volumes, Amazon S3 buckets, DynamoDB tables) by using AWS services (for example, AWS Backup)

875 words

Mastering Automation of Existing AWS Resources (SOA-C03)

Automate the management of existing resources

845 words

Hands-On Lab: Implementing Monitoring, Alarms, and Remediation on AWS

Unit 1: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

1,083 words

AWS Health and Incident Management Study Guide

AWS Health and Incident Management

890 words

AWS Management and Governance Tools: A Comprehensive Study Guide

AWS Management Tools

850 words

AWS Systems Manager (SSM) Operations: Comprehensive Study Guide

AWS Systems Manager (SSM) Operations

945 words

Curriculum Overview: Backup, Restore, and Disaster Recovery

Backup, Restore, and Disaster Recovery

863 words

Centralized Logging and Analysis: AWS Curriculum Overview

Centralized Logging and Analysis

878 words

Curriculum Overview: Centralized Logging and Analysis on AWS

Centralized Logging and Analysis

816 words

Cloud Financial Management & Cost Optimization

Cloud Financial Management

820 words

Curriculum Overview: Troubleshooting with AWS Networking Logs

Collect and interpret networking logs to troubleshoot issues (for example, VPC flow logs, Elastic Load Balancing [ELB] access logs, AWS WAF web ACL logs, CloudFront logs, container logs)

822 words

Mastering Automated Remediation with Amazon EventBridge

Configure Amazon EventBridge rules to trigger remediation

860 words

Amazon CloudWatch: Network Monitoring & Analysis

Configure and analyze Amazon CloudWatch network monitoring services

840 words

EC2 Auto Scaling & Scaling Policies: Curriculum Overview

Configure and manage EC2 Auto Scaling groups and scaling policies

878 words

Curriculum Overview: Configure and Manage Scaling in AWS Managed Databases

Configure and manage scaling in AWS managed databases (for example, Amazon RDS, Amazon DynamoDB)

660 words

CloudWatch Agent Deep Dive: Metrics and Logs for EC2 & Containers

Configure and manage the CloudWatch agent to collect metrics and logs from Amazon EC2 instances, Amazon Elastic Container Service (Amazon ECS) clusters, or Amazon Elastic Kubernetes Service (Amazon EKS) clusters

1,084 words

Curriculum Overview: ELB and Route 53 Health Checks

Configure and troubleshoot Elastic Load Balancing (ELB) and Amazon Route 53 health checks

796 words

Curriculum Overview: Configure and Manage an AWS VPC

Configure a VPC (for example, subnets, route tables, network ACLs, security groups, NAT gateways, internet gateway, egress-only internet gateway)

811 words

Mastering AWS Budgets and Cost Anomaly Detection

Configure AWS Budgets and Cost Anomaly Detection

820 words

Curriculum Overview: Configure AWS CloudTrail for Account Auditing

Configure AWS CloudTrail for account auditing

826 words

Curriculum Overview: AWS Monitoring and Logging (SOA-C03)

Configure AWS monitoring and logging by using AWS services (for example, Amazon CloudWatch, AWS CloudTrail, Amazon Managed Service for Prometheus)

863 words

AWS Notification Services: SNS, CloudWatch Alarms, and Budgets Study Guide

Configure AWS services to send notifications to Amazon Simple Notification Service (Amazon SNS) and to invoke alarms that send notifications to Amazon SNS

1,084 words

Mastering AWS WAF and Shield for Application Security

Configure AWS WAF and Shield for application protection

920 words

Curriculum Overview: Configure CloudWatch Alarms and Anomaly Detection

Configure CloudWatch alarms and anomaly detection

810 words

Curriculum Overview: Configure CloudWatch Alarms and Anomaly Detection

Configure CloudWatch alarms and anomaly detection

831 words

Curriculum Overview: Configure Content and Service Distribution on AWS

Configure content and service distribution (for example, Amazon CloudFront, AWS Global Accelerator)

782 words

Mastering AWS Route 53 Resolver and DNS Security

Configure DNS (for example, Route 53 Resolver)

820 words

Curriculum Overview: Configure Domains, DNS Services, and Content Delivery

Configure domains, DNS services, and content delivery

810 words

AWS Curriculum Overview: Configuring Fault-Tolerant Systems

Configure fault-tolerant systems (for example, Multi-AZ deployments)

785 words

Identity Security & External Trust: IAM Roles Anywhere and MFA

Configure IAM Roles Anywhere and Multi-Factor Authentication (MFA)

942 words

CloudWatch Alarms: Direct Actions, Composite Logic, and EventBridge Integration

Configure, identify, and troubleshoot CloudWatch alarms that can invoke AWS services directly or through Amazon EventBridge (for example, by creating composite alarms and identifying their invokable actions)

842 words

Curriculum Overview: Configuring Private Networking Connectivity

Configure private networking connectivity

820 words

AWS Security Operations: Configuring Reports and Remediating Findings

Configure reports and remediate findings from AWS services (for example, AWS Security Hub, Amazon GuardDuty, AWS Config, Amazon Inspector)

820 words

Mastering AWS Networking: Subnets, Route Tables, and Gateways

Configure subnets, route tables, and gateways

864 words

Curriculum Overview: Configure the CloudWatch Agent on EC2 and Containers

Configure the CloudWatch agent on EC2 and Containers

925 words

AWS Container Operations and Security Study Guide

Container Operations

860 words

Curriculum Overview: Creating and Managing AMIs & Container Images

Create and manage AMIs and container images (for example, Amazon EC2 Image Builder)

863 words

AWS CloudFormation & CDK: Infrastructure as Code Study Guide

Create and manage stacks of resources by using AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK)

950 words

Curriculum Overview: Advanced Amazon CloudWatch Dashboards

Create, implement, and manage customizable and shareable CloudWatch dashboards that display metrics and alarms for AWS resources across multiple accounts and AWS Regions

698 words

AWS Systems Manager Automation: Predefined & Custom Runbooks Overview

Create or run custom and predefined Systems Manager Automation runbooks (for example, by using AWS SDKs or custom scripts) to automate tasks and streamline processes on AWS

639 words

Curriculum Overview: Automating AWS with Systems Manager (SSM) Runbooks

Create or run custom and predefined Systems Manager Automation runbooks (for example, by using AWS SDKs or custom scripts) to automate tasks and streamline processes on AWS

810 words

Curriculum Overview: AWS Systems Manager Automation Runbooks

Create or run custom and predefined Systems Manager Automation runbooks (for example, by using AWS SDKs or custom scripts) to automate tasks and streamline processes on AWS

863 words

Data Protection and Infrastructure Security: Comprehensive Study Guide

Data Protection and Infrastructure Security

925 words

AWS Elastic Beanstalk: Deployment and Lifecycle Management

Deploy applications using AWS Elastic Beanstalk

860 words

Curriculum Overview: The Six Pillars of the AWS Well-Architected Framework

Describe the six pillars of the Well-Architected Framework

813 words

Curriculum Overview: Design CloudWatch Dashboards for Multi-Account Visibility

Design CloudWatch Dashboards for multi-account visibility

686 words

Curriculum Overview: Detect and Remediate CloudFormation Stack Drift

Detect and remediate CloudFormation stack drift

624 words

Curriculum Overview: Enforcing AWS Compliance Requirements

Enforce compliance requirements (for example, AWS Region and service selections)

814 words

Curriculum Overview: Enforcing Governance using AWS Config

Enforce governance using AWS Config

834 words

AWS Shared Storage Solutions: EFS & FSx Curriculum Overview

Evaluate and select shared storage solutions (for example, Amazon Elastic File System [Amazon EFS], Amazon FSx), and optimize the solutions (for example, EFS lifecycle policies) for specific use cases and requirements

863 words

Optimizing Compute Costs: Evaluating Spot Instances and Savings Plans

Evaluate workloads for EC2 Spot Instance and Savings Plans eligibility

945 words

Chapter Study Guide: Event-Driven Remediation on AWS

Event-Driven Remediation

860 words

Curriculum Overview: Mastering AWS CLI Commands and Output Analysis

Execute CLI commands and Analyze CLI output using query and filter parameters

687 words

Curriculum Overview: Mastering the AWS Command Line Interface (CLI)

Execute commands using the AWS Command Line Interface (CLI)

782 words

Mastering SSM Automation for Automated Remediation

Execute SSM Automation runbooks for remediation

945 words

AWS Cloud Development Kit (CDK): The Evolution of Infrastructure as Code

Explain AWS CDK and its role in IaC

820 words

Curriculum Overview: Managed Service for Prometheus and Grafana

Explain the role of Managed Service for Prometheus and Grafana

863 words

AWS Disaster Recovery Procedures: Implementation & Strategy

Follow disaster recovery procedures

865 words

Curriculum Overview: High Availability and Resilience in AWS

High Availability and Resilience

862 words

Curriculum Overview: Identify and Remediate CloudFront Caching Issues

Identify and remediate CloudFront caching issues

863 words

Troubleshooting AWS Deployment Issues: Curriculum Overview

Identify and remediate deployment issues (for example, subnet sizing issues, CloudFormation errors, permissions issues)

811 words

Curriculum Overview: Automating Remediation and Monitoring Metrics (AWS SOA-C03)

Identify and remediate issues by using monitoring and availability metrics

836 words

Mastering Hybrid and Private Connectivity Troubleshooting

Identify and troubleshoot hybrid connectivity issues and private connectivity issues

865 words

AWS Identity and Access Management (IAM) Study Guide

Identity and Access Management

890 words

Study Guide: Implementing and Enforcing Data Classification

Implement and enforce a data classification scheme

820 words

Study Guide: Security and Compliance Management (SOA-C03)

Implement and manage security and compliance tools and policies

985 words

Curriculum Overview: Amazon S3 Performance & Optimization Strategies

Implement and optimize Amazon S3 performance strategies (for example, AWS DataSync, S3 Transfer Acceleration, multipart uploads, S3 Lifecycle policies) to enhance data transfer, storage efficiency, and access patterns

863 words

Curriculum Overview: Implement and Optimize Networking Features and Connectivity

Implement and optimize networking features and connectivity

810 words

Curriculum Overview: Implementing Automated Instance Recovery

Implement automated instance recovery

863 words

Mastering AWS Identity and Access Management (IAM)

Implement AWS Identity and Access Management (IAM) features (for example, password policies, multi-factor authentication [MFA], roles, federated identity, resource policies, policy conditions)

890 words

Curriculum Overview: Implementing AWS Caching for Dynamic Scalability

Implement caching by using AWS services to enhance dynamic scalability (for example, Amazon CloudFront, Amazon ElastiCache)

837 words

AWS Certified SysOps: Mastering Encryption at Rest with AWS KMS

Implement, configure, and troubleshoot encryption at rest (for example, AWS Key Management Service [AWS KMS])

1,342 words

Mastering Encryption in Transit with AWS Certificate Manager (ACM)

Implement, configure, and troubleshoot encryption in transit (for example, AWS Certificate Manager [ACM])

945 words

Curriculum Overview: Implement Custom Metrics and Namespaces

Implement custom metrics and namespaces

822 words

Curriculum Overview: Implementing Custom Metrics & Namespaces in AWS

Implement custom metrics and namespaces

687 words

Curriculum Overview: Implementing Deployment Strategies & Services

Implement deployment strategies and services

810 words

Curriculum Overview: Event-Driven Automation in AWS

Implement event-driven automation by using AWS services and features (for example, AWS Lambda, Amazon S3 Event Notifications)

694 words

Study Guide: Implementing IAM Policies, Roles, and Groups

Implement IAM policies, roles, and groups

945 words

AWS Monitoring & Logging: Metrics, Alarms, and Filters

Implement metrics, alarms, and filters by using AWS monitoring and logging services

732 words

Curriculum Overview: Implement, Monitor, and Optimize EC2 Capabilities

Implement, monitor, and optimize EC2 instances and their associated storage and networking capabilities (for example, EC2 placement groups)

813 words

Secure Multi-Account Strategies in AWS

Implement multi-account strategies securely

925 words

Curriculum Overview: AWS Compute, Storage, and Database Performance Optimization

Implement performance optimization strategies for compute, storage, and database resources

863 words

Curriculum Overview: AWS Performance Optimization Strategies

Implement performance optimization strategies for compute, storage, and database resources

822 words

Curriculum Overview: Implement Private Connectivity Using VPC Endpoints

Implement private connectivity using VPC Endpoints

673 words

AWS Trusted Advisor: Security Remediation and Best Practices

Implement remediation based on the results of AWS Trusted Advisor security checks

920 words

Route 53 Mastery: Routing Policies, Configuration, and Query Logging

Implement Route 53 routing policies, configurations, and query logging

925 words

Curriculum Overview: Implementing Versioning for Storage Services

Implement versioning for storage services (for example, Amazon S3, Amazon FSx)

725 words

Mastering Infrastructure as Code (IaC) and Resource Provisioning

Infrastructure as Code (IaC)

925 words

Curriculum Overview: Integrate AWS Health Events with External Notification Systems

Integrate AWS Health events with external notification systems

834 words

Curriculum Overview: Manage Elastic Load Balancing (ELB) Listeners and Rules

Manage Elastic Load Balancing (ELB) listeners and rules

786 words

Mastering Fleet Updates with AWS Systems Manager Patch Manager

Manage fleet updates with SSM Patch Manager

820 words

Curriculum Overview: Inter-VPC Connectivity via Peering and Transit Gateway

Manage inter-VPC connectivity via Peering and Transit Gateway

728 words

Curriculum Overview: Managing Stacks Using AWS CloudFormation

Manage stacks using AWS CloudFormation

862 words

Study Guide: Managing Workloads on Amazon ECS and EKS

Manage workloads on Amazon ECS and EKS

940 words

Curriculum Overview: Optimizing and Monitoring Amazon RDS Performance

Monitor Amazon RDS metrics (for example, Amazon RDS Performance Insights, CloudWatch alarms), and modify configurations to increase performance efficiency (for example, Performance Insights proactive recommendations, RDS Proxy)

863 words

Curriculum Overview: Network Troubleshooting and Monitoring

Network Troubleshooting and Monitoring

826 words

AWS Compute Optimization & Performance Remediation Curriculum

Optimize compute resources and remediate performance problems by using performance metrics, resource tags, and AWS tools

829 words

Compute Resource Optimization & Performance Remediation in AWS

Optimize compute resources and remediate performance problems by using performance metrics, resource tags, and AWS tools

878 words

Curriculum Overview: Optimize Compute Resources and Remediate Performance

Optimize compute resources and remediate performance problems by using performance metrics, resource tags, and AWS tools

634 words

Study Guide: Optimizing Compute with AWS Compute Optimizer

Optimize compute resources using AWS Compute Optimizer

1,450 words

Mastering the AWS Management Console & CLI Operations

Perform operations using the AWS Management Console

885 words

AWS Resource Provisioning and Maintenance Study Guide

Provision and maintain cloud resources

1,150 words

Mastering Multi-Account and Multi-Region Resource Sharing

Provision and share resources across multiple AWS Regions and accounts (for example, AWS Resource Access Manager [AWS RAM], CloudFormation StackSets)

1,150 words

Curriculum Overview: Querying Log Data with CloudWatch Logs Insights

Query log data using CloudWatch Logs Insights

945 words

AWS Global Infrastructure: From Regions to the Edge

Region, Availability Zone, Edge Location, Local Zone, Wavelength Zone, Outpost, Direct Connect Location

875 words

AWS Resource Maintenance and Application Provisioning: Curriculum Overview

Resource Maintenance and Application Provisioning

863 words

Resource Performance Optimization: AWS SOA-C03 Study Guide

Resource Performance Optimization

940 words

AWS Curriculum Overview: Scalability and Elasticity

Scalability and Elasticity

820 words

Securely Storing Secrets with AWS Secrets Manager: Curriculum Overview

Securely store secrets by using AWS services (for example, AWS Secrets Manager)

820 words

Mastering the AWS Shared Responsibility Model

Shared Responsibility Model

945 words

AWS Global Infrastructure: Curriculum Overview

The AWS Global Infrastructure

836 words

Comprehensive Study Guide: The AWS Well-Architected Framework

The AWS Well-Architected Framework

820 words

Mastering the IAM Policy Simulator: A Troubleshooting Guide

Troubleshoot access issues using IAM Policy Simulator

845 words

Study Guide: Troubleshooting and Auditing AWS Access

Troubleshoot and audit access issues by using AWS tools (for example, AWS CloudTrail, IAM Access Analyzer, IAM policy simulator)

890 words

Mastering VPC Troubleshooting: Connectivity and Configuration

Troubleshoot VPC configurations (for example, subnets, route tables, network ACLs, security groups, transit gateways, NAT gateways)

920 words

Curriculum Overview: Reliability and Business Continuity

Unit 2: Reliability and Business Continuity

686 words

Lab: Building Resilient Storage with S3 Cross-Region Replication

Unit 2: Reliability and Business Continuity

845 words

Curriculum Overview: Deployment, Provisioning, and Automation

Unit 3: Deployment, Provisioning, and Automation

868 words

Lab: Automating Infrastructure and Remediation with CloudFormation and SSM

Unit 3: Deployment, Provisioning, and Automation

820 words

Lab: Hardening AWS Infrastructure with AWS Config and IAM Access Analyzer

Unit 4: Security and Compliance

845 words

Lab: Building High-Performance Content Delivery with Amazon CloudFront and S3 OAC

Unit 5: Networking and Content Delivery

840 words

Curriculum Overview: Unit 6 - Automated Remediation and Remedial Actions

Unit 6: Automated Remediation and Remedial Actions

645 words

Lab: Automated Remediation of Public S3 Buckets with AWS Config and SSM

Unit 6: Automated Remediation and Remedial Actions

820 words

Curriculum Overview: Performance and Cost Optimization (Unit 7)

Unit 7: Performance and Cost Optimization

768 words

Hands-On Lab: Optimizing AWS Resource Performance and Costs

Unit 7: Performance and Cost Optimization

945 words

Automating AWS Operations: Incident Remediation with Systems Manager and EventBridge

Unit 8: AWS Operational Foundations

845 words

Curriculum Overview: AWS Operational Foundations

Unit 8: AWS Operational Foundations

832 words

Curriculum Overview: Automating Resource Deployment with Third-Party Tools

Use and manage third-party tools to automate resource deployment (for example, Terraform, Git)

834 words

Curriculum Overview: Automating AWS Operational Processes

Use AWS services to automate operational processes (for example, AWS Systems Manager)

792 words

AWS EventBridge Mastery: Routing, Enriching, Delivering, and Troubleshooting

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

684 words

Curriculum Overview: Amazon EventBridge Mastery

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

662 words

Curriculum Overview: AWS EventBridge Routing, Enrichment, and Troubleshooting

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

863 words

Curriculum Overview: Mastering AWS EventBridge Routing, Enrichment, and Troubleshooting

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

738 words

Database Restoration & Recovery Strategies: RTO, RPO, and Cost Management

Use various methods to restore databases (for example, point-in-time restore) to meet recovery time objective (RTO), recovery point objective (RPO), and cost requirements

985 words

AWS VPC Administration: Comprehensive Study Guide

VPC Administration

845 words

AWS Certified CloudOps Engineer - Associate (SOA-C03) Practice Questions

Try 15 sample questions from a bank of 840. Answers and detailed explanations included.

Q1 (medium)

A cloud engineer is configuring an Amazon EC2 Image Builder pipeline to automate the creation of a secure, golden Amazon Machine Image (AMI). How does the service manage the compute resources required to apply the build components and validate the resulting image?

A. It launches a temporary build instance from the base image to execute the build components, creates an initial image, and then launches a temporary test instance from that image for validation before terminating both instances.

B. It modifies the underlying EBS snapshot of the base image directly in an offline state to apply the build components, avoiding the need to launch and pay for running EC2 instances.

C. It maintains a single, persistent EC2 build instance that is continuously updated with the latest build components and snapshotted each time the pipeline executes to reduce overall build times.

D. It utilizes AWS Lambda functions to execute the YAML-based build and test components directly against the base AMI's file system without the need to boot a host operating system.

Correct Answer: A

Amazon EC2 Image Builder automates the creation, management, and deployment of customized, secure, and up-to-date server images. When a pipeline runs, the service provisions a temporary build instance from the selected base image. It uses AWS Systems Manager (requiring an appropriate IAM instance profile) to run the YAML-based build components on this instance. Once the build is complete, it creates an intermediate image and then launches a temporary test instance from this new image to run validation components. After testing succeeds, the final AMI is created, and the service automatically terminates both the temporary build and test instances to prevent orphaned resources.

Incorrect Options:

B is incorrect because Image Builder requires running an actual OS environment to execute scripts, apply patches, and install software; it does not modify EBS snapshots offline.
C is incorrect because the service does not use a persistent instance. It ensures a clean, reproducible state by launching a fresh temporary instance from the base image for every pipeline execution.
D is incorrect because AWS Lambda cannot natively mount and execute OS-level configurations directly against an AMI's file system without booting an EC2 instance.
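
The lifecycle in option A can be summarized with a deliberately simplified Python model that only records events in order; it makes no AWS calls. The `terminate_on_failure` flag is an assumption standing in for the infrastructure configuration setting that controls whether failed-build instances are kept for troubleshooting; the event names are illustrative.

```python
from typing import List


def run_pipeline(build_ok: bool, test_ok: bool,
                 terminate_on_failure: bool = True) -> List[str]:
    """Simplified model of one Image Builder pipeline run; returns an event log."""
    log: List[str] = []
    launched: List[str] = []
    succeeded = False

    # Build stage: a fresh temporary instance from the base image on every run.
    launched.append("build instance")
    log.append("launch build instance from base image")
    if build_ok:
        log.append("create image")
        # Test stage: a second temporary instance launched from the new image.
        launched.append("test instance")
        log.append("launch test instance from new image")
        succeeded = test_ok
    log.append("image available" if succeeded else "pipeline failed")

    # Cleanup: after a successful run both temporary instances are terminated;
    # after a failure they are kept only if terminate_on_failure is False.
    if succeeded or terminate_on_failure:
        log.extend(f"terminate {name}" for name in launched)
    return log
```

Because every run starts from the base image rather than from a long-lived instance, the model reflects why option C is wrong: no state carries over between executions.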

Q2 (medium)

When publishing custom metrics to Amazon CloudWatch, which of the following best defines the purpose of a namespace?

A. It acts as a logical container to isolate metrics so that data from different applications are not mistakenly aggregated together.

B. It is a dedicated storage configuration that extends the default retention period of CloudWatch metric data.

C. It is an IAM security boundary that restricts which users have permission to view specific metric dashboards.

D. It dictates the mathematical aggregation method, such as Sum or Average, that is applied to a custom metric.

Correct Answer: A

In Amazon CloudWatch, a namespace acts as a container for metrics. Namespaces are used to logically isolate metrics from different applications, environments, or projects so they are not mistakenly aggregated into the same statistics. Every metric published to CloudWatch must be assigned to a namespace, as there is no default namespace provided.

Option A is the correct definition. Option B is incorrect because namespaces do not configure or affect metric data retention periods. Option C is incorrect because IAM policies and conditions, not namespaces themselves, serve as security boundaries for access control. Option D is incorrect because mathematical aggregation methods (statistics) are specified when querying or graphing the metrics, not by the namespace.
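
The isolation a namespace provides can be illustrated with a small Python sketch. `metric_payload` builds keyword arguments shaped for boto3's CloudWatch `put_metric_data` call (it would be sent with `boto3.client("cloudwatch").put_metric_data(**payload)`), and `aggregate` is a local stand-in showing that identical metric names in different namespaces stay separate. The namespaces `OrdersApp/Prod` and `BillingApp/Prod` are hypothetical, and the sketch ignores dimensions, which CloudWatch also uses to identify a metric.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def metric_payload(namespace: str, name: str, value: float, unit: str = "Count") -> dict:
    """Build kwargs for CloudWatch put_metric_data; a namespace is always required."""
    if not namespace or namespace.startswith("AWS/"):
        # AWS/ is reserved for AWS service namespaces.
        raise ValueError("custom metrics need a non-empty namespace outside AWS/")
    return {
        "Namespace": namespace,
        "MetricData": [{"MetricName": name, "Value": value, "Unit": unit}],
    }


def aggregate(payloads: List[dict]) -> Dict[Tuple[str, str], float]:
    """Sum values per (namespace, metric name); dimensions are ignored here."""
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for payload in payloads:
        for datum in payload["MetricData"]:
            totals[(payload["Namespace"], datum["MetricName"])] += datum["Value"]
    return dict(totals)
```

Two applications that both publish a metric called `Errors` therefore produce two independent series, which is exactly what option A describes.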

Q3 (medium)

A cloud administrator needs to provide a team of developers with secure, interactive shell access to a fleet of Amazon EC2 instances located in private subnets. The organization's strict security policies prohibit opening any inbound ports (such as port 22) and forbid the deployment of bastion hosts. How does AWS Systems Manager (SSM) Session Manager operate to fulfill these requirements?

A. It utilizes the SSM Agent running on the instances to initiate continuous outbound connections to the Systems Manager service via HTTPS.

B. It temporarily opens inbound port 22 on the instances' security groups for the duration of the session, driven by temporary IAM credentials.

C. It establishes a secure VPC Peering connection between the developer's local client network and the private subnets hosting the instances.

D. It routes SSH traffic through an AWS-managed NAT Gateway that securely forwards the interactive session packets to the private instances.

Correct Answer: A

AWS Systems Manager Session Manager provides secure, auditable instance management without opening inbound ports, maintaining bastion hosts, or managing SSH keys. The SSM Agent running on each EC2 instance initiates an outbound HTTPS connection (port 443) to the Systems Manager service endpoints. When a user starts a session from the AWS Management Console or the AWS CLI, Systems Manager carries the interactive session traffic over this agent-initiated channel. Because the connection is opened from inside the instance, no inbound security group rules are required. Instances in private subnets still need a network path to those endpoints, either through a NAT gateway or through interface VPC endpoints for Systems Manager.

Option B is incorrect because Session Manager never opens inbound ports; it relies strictly on outbound HTTPS. Option C is incorrect because VPC peering connects two VPCs, not user clients to instances for shell access. Option D is incorrect because a NAT gateway provides outbound internet access for private subnets but does not route or forward inbound SSH connections. Therefore, the correct answer is A.
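
The security-group side of this answer can be checked with a short Python sketch. The function takes one security group shaped like an entry of the EC2 `describe_security_groups` response (`IpPermissions` for inbound, `IpPermissionsEgress` for outbound) and reports whether it has no inbound rules while still allowing outbound HTTPS. The function name is hypothetical, and the check deliberately ignores destination CIDRs and prefix lists, which a real audit would also inspect.

```python
def meets_no_inbound_policy(group: dict) -> bool:
    """True if the group allows no inbound traffic but permits outbound TCP 443,
    which is what the SSM Agent needs to reach Systems Manager endpoints."""
    if group.get("IpPermissions"):
        return False  # any inbound rule violates the "no open ports" policy
    for rule in group.get("IpPermissionsEgress", []):
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            return True  # "-1" means all protocols and ports outbound
        if protocol == "tcp" and rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", -1):
            return True
    return False
```

Once the group passes this check and the agent can reach the endpoints, a session is opened from the client side with `aws ssm start-session --target <instance-id>`; no port is opened on the instance at any point.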

Q4 (medium)

A solutions architect is configuring an Application Load Balancer (ALB) to route HTTP and HTTPS traffic to multiple backend microservices. They have defined several custom listener rules based on specific URL paths and host headers.

What happens when an incoming request does not match the conditions of any of these custom listener rules?

A. The ALB processes the request using the listener's mandatory default rule, typically forwarding the traffic to a default target group.

B. The ALB drops the connection and automatically returns an error to the client because no matching routing instruction was found.

C. The ALB evaluates the default rule first, and only processes the custom rules if the default target group becomes unhealthy.

D. The ALB falls back to Layer 4 routing, forwarding the request based solely on the destination IP address and port number.

Correct Answer: A

An Application Load Balancer (ALB) listener evaluates incoming requests against rules that consist of IF/THEN conditions. Custom rules allow routing based on Layer 7 application data, such as specific host headers or URL paths. Every listener is required to have a mandatory default rule. If an incoming request does not match the conditions of any custom rules, the ALB automatically processes the request using this default rule, which typically forwards it to a default target group.

Option B is incorrect because the ALB does not drop the connection if custom rules fail to match; it relies on the mandatory default rule instead. Option C is incorrect because the ALB evaluates custom rules by priority before ever applying the default rule as a final fallback. Option D is incorrect because ALBs operate entirely at Layer 7 and use the default rule for unmatched traffic, rather than falling back to Layer 4 network routing (which is a characteristic of Network Load Balancers).
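
The evaluation order described above can be modeled in a few lines of Python. This is a simplified sketch: real ALB path and host conditions use wildcard patterns (for example `/api/*`) and host matching is case-insensitive, whereas this model uses exact host matches and plain path prefixes. The target group names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    priority: int                # lower number is evaluated first
    host: Optional[str]          # None means "any host"
    path_prefix: Optional[str]   # None means "any path"
    target_group: str


def route(rules: List[Rule], default_target: str, host: str, path: str) -> str:
    """Evaluate custom rules in priority order; fall back to the default rule."""
    for rule in sorted(rules, key=lambda r: r.priority):
        host_ok = rule.host is None or rule.host == host
        path_ok = rule.path_prefix is None or path.startswith(rule.path_prefix)
        if host_ok and path_ok:  # all conditions in a rule must match
            return rule.target_group
    return default_target  # the listener's default rule catches everything else
```

The final `return` is the point of the question: unmatched traffic is never dropped, it simply reaches the default action.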

Q5 (medium)

A cloud financial management team needs to review their company's AWS monthly billing. Every month, the team requires a visual breakdown of historical cloud spend grouped by department. The company already utilizes Cost Allocation Tags to identify which resources belong to which department.

What is the most operationally efficient native AWS solution to provide this recurring, customized visual breakdown without requiring manual data parsing?

A. Export the AWS Cost and Usage Reports (CUR) to an Amazon S3 bucket each month and manually generate the required charts using a spreadsheet application.

B. Configure AWS Budgets to automatically group the historical spend by department and send a detailed visual breakdown to the team's email.

C. Use AWS Cost Explorer to filter and group the historical spend by the department tag, and save this specific view as a custom report for quick access every month.

D. Input the company's current architecture into the AWS Pricing Calculator to model and visualize the exact monthly costs incurred by each department.

Correct Answer: C

AWS Cost Explorer lets users analyze cloud spend in depth by filtering and grouping data using dimensions such as AWS Region, AWS service, and cost allocation tags (for example, a department tag). It provides native, built-in graphical visualizations of historical spend. Users can also save a specific filtering and grouping view as a saved report, so the team can open the exact same visual breakdown every month without recreating it.

Option A is incorrect because exporting raw Cost and Usage Reports (CUR) and manually building charts in a spreadsheet is highly inefficient compared to using the native saved reports in Cost Explorer.

Option B is incorrect because AWS Budgets is designed for threshold-based alerting and automated actions, not for generating complex visual reporting of historical data.

Option D is incorrect because the AWS Pricing Calculator is an estimation tool used to model costs for future, hypothetical architectures, not to analyze actual, incurred historical usage.

Therefore, the correct answer is C.
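
The saved Cost Explorer report is the intended answer, but the same tag grouping is also available programmatically through the Cost Explorer `get_cost_and_usage` API, which can help when scripting checks. The Python sketch below builds the request keyword arguments and sums the response per tag value; it assumes the tag key is `department` and that the tag has been activated as a cost allocation tag. Group keys for a tag come back as `<tag key>$<tag value>`, with an empty value for untagged spend. The request would be sent with `boto3.client("ce").get_cost_and_usage(**cost_request(...))`; the end date is exclusive.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict


def cost_request(start: str, end: str, tag_key: str = "department") -> dict:
    """Kwargs for Cost Explorer get_cost_and_usage: monthly cost grouped by a tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # YYYY-MM-DD, End is exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


def cost_by_tag(response: dict) -> Dict[str, Decimal]:
    """Total cost per tag value across all periods in a get_cost_and_usage response."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            # Keys look like "department$engineering"; "department$" means untagged.
            _, _, value = group["Keys"][0].partition("$")
            amount = Decimal(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[value or "(untagged)"] += amount
    return dict(totals)
```

`Decimal` is used because Cost Explorer returns amounts as strings, and summing them as floats would introduce rounding noise in billing totals.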

Q6 (medium)

A company decides to migrate its self-managed relational database hosted on Amazon EC2 instances to Amazon Relational Database Service (Amazon RDS). By making this architectural change, the company's IT team no longer needs to perform operating system patching, database software updates, routine backups, or hardware provisioning.

According to the AWS Well-Architected Framework, which design principle does this architectural decision best demonstrate?

A. Stop spending money on undifferentiated heavy lifting.

B. Stop guessing capacity.

C. Perform operations as code.

D. Adopt a consumption model.

Correct Answer: A

The correct answer is A: Stop spending money on undifferentiated heavy lifting.

In the AWS Well-Architected Framework, "undifferentiated heavy lifting" refers to time-consuming IT maintenance tasks that do not add unique value or differentiate a company's core business from its competitors; avoiding spending on them is a design principle of the Cost Optimization pillar. Managing database servers, applying OS patches, and handling routine backups are typical examples. When the company adopts a managed service such as Amazon RDS, AWS assumes responsibility for these administrative tasks, which frees the company's engineers to focus on application development, feature delivery, and business logic.

Let's evaluate the incorrect options:

B (Stop guessing capacity) is one of the framework's general design principles and a design principle of the Reliability pillar. It focuses on using elasticity and automatic scaling to match computing resource supply with workload demand, rather than over-provisioning or under-provisioning hardware upfront.
C (Perform operations as code) is a principle of the Operational Excellence pillar. It encourages defining your entire workload (both applications and infrastructure) as code and updating it using version control and automation, which is not the primary focus of switching to a managed database.
D (Adopt a consumption model) is a Cost Optimization principle that relates to paying strictly for the computing resources you consume based on dynamic demand, rather than paying for static resources regardless of usage. While moving to the cloud enables this, offloading administrative database tasks specifically targets removing undifferentiated heavy lifting.

Q7 (medium)

A company currently has three VPCs: VPC A, VPC B, and VPC C. VPC A is peered with VPC B, and VPC B is peered with VPC C. Currently, resources in VPC A cannot communicate with resources in VPC C. The company plans to add 30 more VPCs in the next six months and requires full communication between all current and future VPCs. What is the most operationally efficient and scalable solution to establish this connectivity?

A. Create an AWS Transit Gateway and attach all current and future VPCs to it.

B. Create a direct VPC peering connection between VPC A and VPC C, and continue creating a full mesh of VPC peering connections as new VPCs are added.

C. Update the route tables in VPC B to forward traffic from VPC A's peering connection to VPC C's peering connection.

D. Deploy a NAT Gateway in VPC B to translate traffic coming from VPC A and route it through the peering connection to VPC C.


Correct Answer: A

VPC peering connections are strictly non-transitive: traffic from VPC A cannot route through VPC B to reach VPC C. Because of this, updating route tables (Option C) or deploying a NAT Gateway in the intermediary VPC (Option D) will not bypass this fundamental restriction.

Creating a direct peering connection between VPC A and VPC C (Option B) solves the immediate connectivity issue, but it does not scale. A full mesh network of $n$ VPCs requires $\frac{n(n-1)}{2}$ peering connections. For the planned 33 VPCs, this would require $\frac{33 \times 32}{2} = 528$ peering connections, creating massive operational overhead.

AWS Transit Gateway (Option A) provides a highly scalable hub-and-spoke architecture. By attaching all current and future VPCs to a central Transit Gateway, traffic can be routed between all attached networks with minimal configuration, making it the correct answer.
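The scaling argument is simple arithmetic. This short Python sketch compares the number of peering connections a full mesh needs with the number of Transit Gateway attachments a hub-and-spoke design needs (one per VPC). It deliberately ignores per-VPC peering quotas and Transit Gateway route table design.

```python
def full_mesh_peering_connections(n: int) -> int:
    """Connections needed so every pair of n VPCs is directly peered."""
    # Peering is non-transitive, so every unordered pair needs its own connection.
    return n * (n - 1) // 2


def transit_gateway_attachments(n: int) -> int:
    """Attachments needed when every VPC connects to one central Transit Gateway."""
    return n


if __name__ == "__main__":
    for vpcs in (3, 33, 34):
        print(
            f"{vpcs} VPCs: {full_mesh_peering_connections(vpcs)} peering connections "
            f"vs {transit_gateway_attachments(vpcs)} Transit Gateway attachments"
        )
```

Going from 33 to 34 VPCs adds 33 new peering connections (528 to 561) but only one new Transit Gateway attachment.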

Q8 (easy)

What is the primary benefit of using Amazon Managed Service for Prometheus and Amazon Managed Grafana together?

A. They provide an open-source compatible monitoring and visualization solution without the operational overhead of managing the underlying infrastructure.

B. They act as automated remediation engines that execute scripts to fix configuration drift across AWS resources.

C. They serve as proprietary, closed-source tools designed exclusively to manage on-premises bare-metal server environments.

D. They provide centralized auditing of AWS API calls and user authentication events across multiple AWS accounts.


Correct Answer: A

A is correct. Amazon Managed Service for Prometheus and Amazon Managed Grafana provide a fully managed, serverless environment to ingest, store, query, and visualize operational metrics. Their primary value is allowing organizations to use familiar open-source tools while eliminating the operational burden of provisioning, scaling, and maintaining the self-hosted infrastructure.

B is incorrect. These services are specifically built for monitoring and visualization, not for automated remediation or fixing configuration drift (which is typically handled by services like AWS Systems Manager or AWS Config).

C is incorrect. They are based on popular open-source projects (Prometheus and Grafana) and are primarily designed for cloud and containerized workloads, rather than being proprietary tools exclusively for on-premises servers.

D is incorrect. Centralized auditing of API calls and user authentication events is the primary function of AWS CloudTrail, not Prometheus or Grafana.

Q9 (easy)

Which combination of AWS services is primarily used to implement centralized auditing of API activity and aggregate log management across an AWS infrastructure?

A. AWS CloudTrail and Amazon CloudWatch Logs

B. AWS Config and AWS Systems Manager

C. Amazon Inspector and AWS WAF

D. AWS Trusted Advisor and AWS Artifact


Correct Answer: A

To achieve centralized logging and analysis in AWS, two primary services are used in tandem:

AWS CloudTrail provides account auditing by continuously recording API activity across the AWS infrastructure: management events are recorded by default, and data events are recorded when a trail is configured to capture them.
Amazon CloudWatch Logs provides centralized log management, allowing administrators to aggregate, store, and analyze both system and application logs.

CloudTrail can be configured to deliver its logs directly to CloudWatch Logs, which enables real-time analysis, alerting, and metric extraction based on API activity.

Looking at the incorrect options:

AWS Config and AWS Systems Manager are primarily used for tracking resource configuration changes and operational automation, not for centralizing API logs.
Amazon Inspector and AWS WAF are security services focused on automated vulnerability assessments and web application firewalls, respectively.
AWS Trusted Advisor and AWS Artifact provide best-practice recommendations and access to compliance reports, but they do not aggregate or analyze real-time logs.

Therefore, the correct answer is A.

Q10 (medium)

An organization uses AWS CloudFormation to manage the infrastructure for a complex application. As the environment has grown, their monolithic CloudFormation template is approaching the hard limit of 500 resources. The architecture team wants to refactor the template to overcome this limit and promote the reusability of standard components, such as a baseline VPC. Furthermore, they require that the lifecycles of these components remain tightly coupled, ensuring that creating, updating, or deleting the parent deployment automatically cascades to all underlying components.

Which architectural approach best meets these requirements?

A. Define AWS::CloudFormation::Stack resources within a parent template to create nested stacks, modularizing the architecture while maintaining a tightly coupled lifecycle.

B. Deploy separate, independent stacks for each component and share exported resource values using the Fn::ImportValue intrinsic function.

C. Implement AWS CloudFormation StackSets to distribute identical component templates across multiple AWS accounts and Regions.

D. Generate a snapshot of the stack's configuration using Change Sets to bypass the resource limit and enable instantaneous rollbacks.


Correct Answer: A

Option A is correct. Nested stacks are created by defining an AWS::CloudFormation::Stack resource within a parent CloudFormation template. This pattern allows large, monolithic deployments to be decomposed into smaller, reusable components, effectively overcoming the 500-resource limit per template. Because nested stacks are created, updated, and deleted as a direct part of the parent stack's operations, their lifecycles remain tightly coupled.
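Here is a minimal sketch of the nested-stack pattern. The Python function builds a parent template as a dictionary that can be serialized to JSON, with two AWS::CloudFormation::Stack resources. The template URLs, the child templates, the VpcCidr parameter, and the VpcId output are all hypothetical, and each child template would need to declare them. Each nested stack is a stack of its own, so its resources count toward its own 500-resource quota. In the parent, each nested stack counts as a single resource.

```python
import json


def parent_template(network_template_url: str, app_template_url: str) -> dict:
    """Build a parent template that composes two (hypothetical) child stacks."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # Reusable baseline VPC component, defined in its own template.
            "NetworkStack": {
                "Type": "AWS::CloudFormation::Stack",
                "Properties": {
                    "TemplateURL": network_template_url,
                    "Parameters": {"VpcCidr": "10.0.0.0/16"},
                },
            },
            "AppStack": {
                "Type": "AWS::CloudFormation::Stack",
                "Properties": {
                    "TemplateURL": app_template_url,
                    "Parameters": {
                        # GetAtt on a nested stack reads one of the child stack's Outputs,
                        # which also makes AppStack depend on NetworkStack.
                        "VpcId": {"Fn::GetAtt": ["NetworkStack", "Outputs.VpcId"]},
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    template = parent_template(
        "https://example-bucket.s3.amazonaws.com/network.yaml",
        "https://example-bucket.s3.amazonaws.com/app.yaml",
    )
    print(json.dumps(template, indent=2))
```

Creating, updating, or deleting the parent stack built from this template cascades to both child stacks.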

Option B is incorrect. While cross-stack references (using Fn::ImportValue) allow independent stacks to share information and can break up large templates, they create loosely coupled dependencies. The scenario specifically requires a tightly coupled lifecycle where actions cascade directly from a single parent.

Option C is incorrect. AWS CloudFormation StackSets are designed to deploy the same template across multiple independent AWS accounts and AWS Regions simultaneously. They are not used for decomposing a single application's template to overcome resource limits.

Option D is incorrect. Change Sets allow administrators to preview changes to a stack before executing an update. They do not increase the maximum resource limit per template or provide a mechanism for modularizing infrastructure code.

Q11 (hard)

A company has multiple departments sharing a single AWS account. The finance team wants to visually analyze the historical AWS spending trends for each department over the past 12 months and generate a cost forecast for the next quarter. They require a built-in graphical interface within the AWS Management Console without using external tools. Which solution should a Cloud Practitioner recommend to meet these requirements?

A. Use AWS Cost Explorer and filter the costs by department using active cost allocation tags.

B. Configure AWS Cost and Usage Reports (CUR) to generate interactive departmental spending graphs.

C. Create individual AWS Budgets for each department to visualize their historical spending patterns.

D. Use the AWS Pricing Calculator to input the past year's usage data and forecast future departmental costs.


Correct Answer: A

AWS Cost Explorer provides an easy-to-use graphical interface that lets you visualize, understand, and manage your AWS costs and usage over time. It lets you view historical data for up to the last 12 months and forecast spending for the next 12 months. By activating cost allocation tags (such as a 'department' tag), you can filter and group the visual data by department.

Option B is incorrect because AWS Cost and Usage Reports (CUR) delivers raw, comprehensive cost data as CSV or Parquet files to an Amazon S3 bucket; it does not provide built-in visual graphs without external tools such as Amazon QuickSight.

Option C is incorrect because AWS Budgets is primarily used for setting custom spending thresholds and sending alerts when those thresholds are exceeded, rather than for exploratory historical data visualization.

Option D is incorrect because the AWS Pricing Calculator is used to estimate the costs of planned future infrastructure, not to analyze historical usage or generate visual forecasts based on past data.

Therefore, Option A is the correct answer.
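The Cost Explorer API backs the console view. As a sketch, this Python function builds the keyword arguments for boto3's Cost Explorer `get_cost_and_usage` call and groups monthly unblended cost by a cost allocation tag. The tag key `department` is an assumption taken from the scenario. The tag must already be activated as a cost allocation tag before it can be used for grouping.

```python
from datetime import date


def cost_by_tag_request(start: date, end: date, tag_key: str = "department") -> dict:
    """Keyword arguments for a monthly cost breakdown grouped by one tag key."""
    return {
        # Cost Explorer treats Start as inclusive and End as exclusive.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


if __name__ == "__main__":
    print(cost_by_tag_request(date(2024, 1, 1), date(2025, 1, 1)))
```

With credentials configured, you could send the request with `boto3.client("ce").get_cost_and_usage(**request)`.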

Q12 (hard)

A company is migrating a complex hybrid application to AWS and wants to maximize cost savings without compromising the availability of critical components. The architecture consists of three distinct workloads:

Workload A: A monolithic legacy application running on Amazon EC2 that requires 24/7 steady-state compute and cannot tolerate interruptions.
Workload B: A set of stateless, asynchronous background processing jobs containerized on AWS Fargate that can be stopped and restarted at any time without data loss.
Workload C: An event-driven architecture utilizing AWS Lambda functions with a highly predictable, consistent daily execution pattern.

Which combination of pricing models should a solutions architect recommend to meet these requirements most cost-effectively?

A. Purchase a Compute Savings Plan to cover the consistent compute usage of Workloads A and C, and utilize Fargate Spot capacity for Workload B.

B. Purchase an EC2 Instance Savings Plan to cover Workloads A, B, and C, and apply the Savings Plan discount to Spot instances for Workload B to maximize combined savings.

C. Provision Spot instances for Workload A to achieve up to 90% savings, and purchase Reserved Instances for Workloads B and C to provide flexible compute discounts.

D. Purchase a Compute Savings Plan for Workload A, and provision Spot instances for Workloads B and C by making a 1-year upfront commitment to the Spot capacity pool.


Correct Answer: A

The correct answer is A.

Workload A (steady-state, uninterruptible EC2) and Workload C (predictable AWS Lambda usage) are ideal candidates for a Compute Savings Plan, which provides up to 66% savings in exchange for a 1-year or 3-year commitment and applies flexibly across Amazon EC2, AWS Fargate, and AWS Lambda. Workload B is stateless and fault-tolerant, which makes it a good fit for Fargate Spot: it runs on spare capacity at up to a 70% discount off the Fargate price (EC2 Spot Instances offer up to 90%), but it is subject to interruption.

Why the other options are incorrect:

Option B is incorrect because AWS Savings Plan discounts cannot be stacked with Spot instance usage. Additionally, an EC2 Instance Savings Plan only applies to a specific instance family within a specific region on EC2, and does not provide discounts for AWS Fargate or AWS Lambda.
Option C is incorrect because Spot instances are highly inappropriate for Workload A, which cannot tolerate interruptions. Furthermore, Reserved Instances apply only to specific services such as Amazon EC2 and Amazon RDS; they do not provide the flexible discount across AWS Fargate and AWS Lambda that a Compute Savings Plan does.
Option D is incorrect because Spot instances do not require a 1-year or 3-year commitment; they are a dynamic, pay-as-you-go model based on available spare capacity. Additionally, AWS Lambda does not support Spot pricing.

Q13 (medium)

A security engineer is using AWS Firewall Manager to configure centralized logging for AWS Network Firewall policies across an AWS Organization. The goal is to capture network traffic flow logs and store them directly in an Amazon S3 bucket for analysis.

How should the engineer configure the target S3 bucket and its permissions to ensure Firewall Manager can successfully deliver the logs?

A. Create the S3 bucket in the Firewall Manager administrator account and attach a bucket policy that grants the Firewall Manager service the s3:GetBucketPolicy and s3:PutBucketPolicy permissions.

B. Create the S3 bucket in a dedicated Log Archive member account and attach an IAM role to the Firewall Manager service granting s3:PutObject permissions.

C. Create the S3 bucket in the Firewall Manager administrator account and configure AWS CloudTrail to capture Network Firewall traffic as data events to route to the bucket.

D. Create the S3 bucket in any Organization account, configure the firewalls to send logs to Amazon CloudWatch Logs, and use a subscription filter to deliver them to the S3 bucket.


Correct Answer: A

To successfully implement centralized logging for AWS Network Firewall using AWS Firewall Manager, specific architectural and permission constraints must be met:

Bucket Location: A strict architectural constraint of AWS Firewall Manager is that the destination Amazon S3 bucket must be located within the Firewall Manager administrator account. It cannot be located in a member account, such as a dedicated Log Archive account.
Permissions: Firewall Manager requires permission to modify the bucket policy to allow the firewall endpoints to write logs to it. Therefore, the target S3 bucket must have an existing bucket policy that explicitly grants the Firewall Manager service (fms.amazonaws.com) the s3:GetBucketPolicy and s3:PutBucketPolicy permissions.
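This is a minimal sketch of that bucket policy, built as a Python dictionary. The bucket name is hypothetical. The statement grants only the two actions named above to the Firewall Manager service principal. Before applying it, confirm the exact statement against the current Firewall Manager documentation.

```python
import json


def fms_log_bucket_policy(bucket_name: str) -> dict:
    """Bucket policy letting Firewall Manager read and update this bucket's policy."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowFirewallManagerBucketPolicyManagement",
                "Effect": "Allow",
                "Principal": {"Service": "fms.amazonaws.com"},
                # Bucket-level actions, so the resource is the bucket ARN itself.
                "Action": ["s3:GetBucketPolicy", "s3:PutBucketPolicy"],
                "Resource": bucket_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket in the Firewall Manager administrator account.
    print(json.dumps(fms_log_bucket_policy("example-fms-network-firewall-logs"), indent=2))
```

You could apply the serialized policy with `boto3.client("s3").put_bucket_policy(Bucket=..., Policy=json.dumps(policy))`.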

Why the other options are incorrect:

Option B is incorrect because the S3 bucket cannot be in a member account, and Firewall Manager relies on bucket policies for this setup, not just an IAM role with s3:PutObject.
Option C is incorrect because AWS CloudTrail is designed to capture API management events, not raw network traffic flow logs generated by Network Firewall.
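Option D is incorrect because routing firewall logs through CloudWatch Logs and a subscription filter bypasses Firewall Manager's native centralized S3 logging. A subscription filter also cannot write to S3 directly (it needs an intermediary such as Amazon Kinesis Data Firehose). For centralized logging managed by Firewall Manager, the bucket still cannot be placed in an arbitrary Organization account.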

Therefore, the correct answer is A.

Q14 (hard)

A Solutions Architect is troubleshooting an Application Load Balancer (ALB) that routes traffic to several target groups. The ALB has an HTTPS listener configured with the following custom rules:

Priority 10: If the Host header is *.example.com, forward to TargetGroup-Main.
Priority 20: If the Host header is admin.example.com AND the Path is /dashboard/*, forward to TargetGroup-Admin.

Users report that when they navigate to https://admin.example.com/dashboard/settings, they are incorrectly routed to the main website (TargetGroup-Main) instead of the administration portal (TargetGroup-Admin).

Which action will resolve this routing issue so that the dashboard traffic reaches TargetGroup-Admin while maintaining the current routing for other subdomains?

A. Change the priority of the rule routing to TargetGroup-Admin to a numerical value lower than 10.

B. Add a "Pass" action to the rule at Priority 10 so that the ALB evaluates subsequent rules if a more specific path match exists in the rule set.

C. Modify the listener's mandatory default rule to include the admin.example.com host and /dashboard/* path conditions, as default rules are evaluated first.

D. Change the logical operator in the rule at Priority 20 from AND to OR to force the ALB to evaluate the path condition independently of the host header priority.


Correct Answer: A


Application Load Balancer (ALB) listener rules are evaluated in order of their priority, starting from the lowest numeric value (e.g., Priority 1 is evaluated before Priority 10). When a request matches all the conditions of a rule, the ALB applies the specified action (such as forwarding to a target group) and stops evaluating any subsequent rules.

In the scenario described, the incoming request for https://admin.example.com/dashboard/settings matches the wildcard condition *.example.com in the rule at Priority 10. That rule has a lower priority number than the more specific rule at Priority 20, so the ALB evaluates it first, routes the traffic to TargetGroup-Main, and stops evaluating. This situation is often described as rule shadowing. For the dashboard traffic to reach TargetGroup-Admin, the rule with both conditions (admin.example.com AND /dashboard/*) must be given a priority number lower than 10 (e.g., Priority 5).
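A small Python simulation illustrates the first-match behavior. It is a simplified model, not the ALB implementation. It uses `fnmatch.fnmatchcase` for the `*` wildcard, which also accepts `[...]` patterns that ALB does not support. It matches host names case-insensitively and paths case-sensitively, and it combines the host and path conditions with a logical AND.

```python
from fnmatch import fnmatchcase
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    priority: int
    host: str
    path: Optional[str]  # None means the rule has no path condition
    target_group: str


def route(rules: List[Rule], host: str, path: str, default: str = "default-action") -> str:
    """Return the target of the first matching rule, in ascending priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        host_ok = fnmatchcase(host.lower(), rule.host.lower())
        path_ok = rule.path is None or fnmatchcase(path, rule.path)
        if host_ok and path_ok:
            return rule.target_group  # first match wins; later rules are never checked
    # "default-action" is a hypothetical label for the listener's default rule.
    return default


main_rule = Rule(10, "*.example.com", None, "TargetGroup-Main")
admin_rule = Rule(20, "admin.example.com", "/dashboard/*", "TargetGroup-Admin")
print(route([main_rule, admin_rule], "admin.example.com", "/dashboard/settings"))

# Fix: give the specific rule a priority number lower than 10.
fixed_rules = [main_rule, admin_rule._replace(priority=5)]
print(route(fixed_rules, "admin.example.com", "/dashboard/settings"))
print(route(fixed_rules, "www.example.com", "/"))
```

The three calls print TargetGroup-Main, TargetGroup-Admin, and TargetGroup-Main. Other subdomains keep their routing after the fix.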

Incorrect Options:

Option B is incorrect because ALB rules do not support a "Pass" or "Continue" action to defer routing decisions to lower-priority rules. Once a rule matches, evaluation stops immediately.
Option C is incorrect because the mandatory default rule of an ALB listener is always evaluated last, not first, and only applies if no custom rules match the request. Additionally, default rules cannot be configured with specific routing conditions.
Option D is incorrect because multiple condition types within a single ALB rule (such as a host and a path condition) are always evaluated using a logical AND. This behavior cannot be changed to a logical OR, and even if it could, it would not bypass the fact that Priority 10 intercepts the request first.

Q15 (medium)

A company routes traffic to an Application Load Balancer (ALB) using an Amazon Route 53 Alias record. The SysOps Administrator has enabled the "Evaluate Target Health" feature on the Alias record to support a failover routing policy.

How does Route 53 determine when to trigger a failover away from this primary ALB under this configuration?

A. The global Route 53 health checkers bypass the ALB and send HTTP requests directly to the backend Amazon EC2 instances to evaluate their health.

B. Route 53 monitors the status of a manually associated, standalone Route 53 health check that pings the ALB's DNS endpoint from multiple geographic locations.

C. Route 53 evaluates the health status reported natively by the ALB's target group health checks and initiates failover if all registered targets become unhealthy.

D. Route 53 monitors the ALB's target group and initiates an immediate failover to the secondary record if even a single backend target fails its health check.


Correct Answer: C

To understand how Route 53 determines endpoint health with Alias records, we must look at how the Evaluate Target Health feature functions:

Integration with ELB/ALB: When "Evaluate Target Health" is set to "Yes" on an Alias record pointing to an Application Load Balancer, Route 53 natively utilizes the load balancer's internal target group health checks. It does not bypass the ALB to check instances directly.
No Standalone Checks Needed: This feature eliminates the need to create, manage, and pay for standard, standalone Route 53 health checks that monitor the ALB's endpoint externally.
Failover Conditions: An ALB is considered healthy by Route 53 as long as its target group has at least one healthy target. Route 53 will only consider the ALB unhealthy (and thereby trigger the failover routing policy) if all registered targets in the ALB's target group fail their health checks. If the ALB has several target groups, it is considered unhealthy when any one of them contains only unhealthy targets.
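The failover decision in the list above reduces to a one-line check. This Python sketch models the single-target-group case from this scenario. The record names are hypothetical. Treating an empty target group as unhealthy is an assumption of the sketch, and it does not model ALBs with several target groups.

```python
from typing import Sequence


def alb_healthy_for_route53(target_health: Sequence[bool]) -> bool:
    """Healthy while at least one registered target passes its health check."""
    # An empty sequence is treated as unhealthy in this sketch (assumption).
    return any(target_health)


def answer_record(primary_targets: Sequence[bool]) -> str:
    """Which failover record Route 53 would answer with (hypothetical names)."""
    return "primary-alb" if alb_healthy_for_route53(primary_targets) else "secondary-record"


print(answer_record([True, False, False]))   # one healthy target is enough
print(answer_record([False, False, False]))  # every target failed: fail over
```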

Therefore, Route 53 relies entirely on the native ALB health checks and requires complete target failure to initiate a failover.


These are 15 of 840 questions available.

AWS Certified CloudOps Engineer - Associate (SOA-C03) Flashcards

1,200 flashcards for spaced-repetition study. Showing 30 sample cards below.

Advanced Observability Services on AWS (5 cards shown)

Question

CloudWatch Agent

Answer

A software package installed on Amazon EC2 instances, on-premises servers, or container clusters to collect system-level metrics and logs.

[!TIP] By default, AWS can only see hypervisor-level metrics (like CPU and Network). The CloudWatch Agent is required to see internal guest OS metrics like Memory Utilization and Disk Space Used.

Question

Amazon Managed Service for Prometheus (AMP)

Answer

A serverless, Prometheus-compatible monitoring and alerting service designed for containerized environments and microservices.

[!NOTE] It is primarily used to monitor applications running on Amazon EKS (Elastic Kubernetes Service) without the operational overhead of managing Prometheus infrastructure.

Question

Amazon Managed Grafana

Answer

A fully managed service based on the open-source Grafana project, used for data visualization and operational dashboards.

It allows teams to query, correlate, and visualize metrics, logs, and traces from multiple data sources in one place.

Common Integrations:

Amazon Managed Service for Prometheus
Amazon CloudWatch
AWS X-Ray
External databases

Question

System-Level Metrics vs. Hypervisor Metrics

Answer

The distinction between what AWS monitors by default versus what requires specialized agents within the OS.

Metric Type	Visibility	Examples
Hypervisor	Default (Agentless)	CPU Utilization, Disk Read/Write Ops, Network In/Out
System-Level	Requires CloudWatch Agent	Memory Utilization, Disk Space Available, Swap Usage, Page Faults

[!WARNING] If a question asks how to monitor memory on an EC2 instance, the answer is always to install and configure the CloudWatch Agent!
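To sketch what "configure the CloudWatch Agent" means for memory and disk, this Python snippet prints a minimal agent configuration. It collects `mem_used_percent` and disk `used_percent` for all mounted file systems. Where you save the file is up to you. The agent must be pointed at the file (for example with its `amazon-cloudwatch-agent-ctl -a fetch-config` command), and the instance needs an IAM role that allows it to publish metrics.

```python
import json


def memory_and_disk_agent_config() -> dict:
    """Minimal CloudWatch Agent config for memory and disk-space metrics."""
    return {
        "metrics": {
            # Tag each data point with the instance ID so metrics are per instance.
            "append_dimensions": {"InstanceId": "${aws:InstanceId}"},
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                # "*" collects disk usage for every mounted file system.
                "disk": {"measurement": ["used_percent"], "resources": ["*"]},
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(memory_and_disk_agent_config(), indent=2))
```

By default, the agent publishes these metrics under the `CWAgent` namespace.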

Question

CloudWatch Container Insights

Answer

A feature of CloudWatch used to collect, aggregate, and summarize operational metrics and logs from containerized applications.


[!TIP] It automatically generates dashboards tracking performance at the cluster, node, pod, and task levels for Amazon ECS, EKS, and AWS Fargate.

Amazon CloudWatch Metrics and Alarms (5 cards shown)

Question

Amazon CloudWatch Dashboards

Answer

Customizable, shareable home pages in the CloudWatch console used to monitor your AWS resources.

[!TIP] A single dashboard can display metrics and alarms for AWS resources across multiple accounts and multiple AWS Regions in one centralized view.

Question

CloudWatch Agent

Answer

A software package deployed to compute resources to collect system-level metrics and logs.

It is primarily used for gathering internal system metrics and logs from:

Amazon EC2 instances
Amazon ECS clusters
Amazon EKS clusters

[!WARNING] Default CloudWatch monitoring for EC2 only tracks hypervisor-visible metrics (like CPU and network I/O). You must install the CloudWatch Agent to capture operating system metrics like Memory and Disk Space utilization.

Question

CloudWatch Anomaly Detection

Answer

A CloudWatch feature that applies machine learning algorithms to continuously analyze metrics, establish an expected baseline, and trigger alarms based on dynamic thresholds.

Instead of creating a static threshold (e.g., trigger when CPU > 80%), anomaly detection automatically adjusts to natural metric patterns over time.

[!TIP] This is ideal for metrics with predictable daily or weekly trends (like varying website traffic), helping to significantly reduce false alarms.

Question

CloudWatch Custom Metrics

Answer

Application-specific or business-level data points published to CloudWatch that are not automatically tracked by AWS services.

When creating a custom metric, you define a Namespace to act as a container, isolating these metrics from standard AWS service metrics.

Metric Type	Provided By	Example
Standard	AWS Services (Default)	EC2 CPU Utilization, Lambda Invocations
Custom	PutMetricData API	Number of e-commerce checkout failures
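This is a minimal sketch of the PutMetricData request for the checkout-failure example, built as keyword arguments for boto3's CloudWatch `put_metric_data` call. The namespace, metric name, and dimension are hypothetical. The function rejects namespaces beginning with `AWS/`, which are reserved for AWS service metrics.

```python
from typing import Dict, Optional


def put_metric_data_request(
    namespace: str,
    metric_name: str,
    value: float,
    dimensions: Optional[Dict[str, str]] = None,
    unit: str = "Count",
) -> dict:
    """Keyword arguments for publishing one custom metric data point."""
    if namespace.startswith("AWS/"):
        raise ValueError("Custom metric namespaces cannot begin with 'AWS/'")
    return {
        "Namespace": namespace,  # the container that isolates custom metrics
        "MetricData": [
            {
                "MetricName": metric_name,
                "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
                "Value": value,
                "Unit": unit,
            }
        ],
    }


if __name__ == "__main__":
    print(put_metric_data_request("ECommerce/Checkout", "CheckoutFailures", 3, {"Environment": "prod"}))
```

Passing the result to `boto3.client("cloudwatch").put_metric_data(**request)` would publish the data point.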

Question

Composite Alarms

Answer

A CloudWatch alarm that evaluates the states of multiple other underlying alarms to determine its own state.

By aggregating alarms, they help reduce alert noise and "alert fatigue."

[!NOTE] Composite alarms use logical operators such as AND, OR, and NOT.

Example Use Case: Only trigger a high-severity incident notification if an application's Error Rate is high AND Database CPU Utilization is > 90%.

When a composite alarm changes state, it can take actions such as sending an Amazon SNS notification, and its state changes can also be routed through Amazon EventBridge.
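The example use case can be expressed as a composite alarm rule. This Python sketch builds keyword arguments for boto3's CloudWatch `put_composite_alarm` call. The alarm names and the SNS topic ARN are hypothetical. The two underlying metric alarms (application error rate, and database CPU above 90%) must already exist.

```python
def composite_alarm_request(name: str, error_alarm: str, db_cpu_alarm: str, topic_arn: str) -> dict:
    """Composite alarm that fires only when both underlying alarms are in ALARM."""
    rule = f'ALARM("{error_alarm}") AND ALARM("{db_cpu_alarm}")'
    return {
        "AlarmName": name,
        "AlarmRule": rule,
        "ActionsEnabled": True,
        "AlarmActions": [topic_arn],  # notify only when the combined condition holds
    }


if __name__ == "__main__":
    request = composite_alarm_request(
        "checkout-high-severity",
        "checkout-error-rate-high",
        "orders-db-cpu-above-90",
        "arn:aws:sns:us-east-1:111122223333:oncall-high-severity",
    )
    print(request["AlarmRule"])
```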

Amazon CloudWatch Network Monitoring and Troubleshooting (5 cards shown)

Question

VPC Flow Logs

Answer

A feature that captures metadata about the IP traffic going to and from network interfaces (ENIs) within an Amazon VPC.

Flow logs can be enabled at the VPC, subnet, or individual ENI level, and the log data can be published to Amazon CloudWatch Logs, Amazon S3, or Amazon Kinesis Data Firehose.

[!NOTE] Flow logs do not capture actual packet payloads (they are not packet sniffers). They only capture metadata such as source/destination IP addresses, ports, protocols, and whether traffic was ACCEPT or REJECT.

Example: Analyzing a flow log record containing REJECT to determine that an overly restrictive Security Group or Network ACL is blocking legitimate inbound web traffic.

Question

VPC Reachability Analyzer

Answer

A configuration analysis tool that performs automated network path validation between a source and a destination within your AWS environment.

[!TIP] No actual network packets are sent! It uses automated reasoning to verify if your logical configuration (Route Tables, NACLs, Security Groups) allows a path.

Example: Troubleshooting an SSH connection timeout by running the Reachability Analyzer from an Internet Gateway (IGW) to an EC2 instance. The analyzer will highlight the exact configuration issue, such as a missing subnet route or a blocking Security Group rule.


Question

CloudWatch Logs Insights

Answer

An interactive query engine used to rapidly search, filter, and analyze log data stored in Amazon CloudWatch Logs.

It features a purpose-built query language with commands like fields, filter, stats, and sort to extract actionable operational intelligence from massive log groups.

Common Network Use Case: Querying VPC Flow Logs to find the top 10 source IP addresses generating rejected connection attempts.

Example Query:

```text
filter action = "REJECT"
| stats count(*) as rejectedCount by srcAddr
| sort rejectedCount desc
| limit 10
```

Question

CloudWatch Network Monitor

Answer

An active network monitoring service that provides continuous visibility into network performance metrics—specifically packet loss and latency—between AWS and on-premises environments, or across AWS Regions.

It publishes these performance metrics directly to CloudWatch, allowing operations teams to set up proactive alerts before users report sluggish application performance.

[!WARNING] Don't confuse this with VPC Flow Logs! While Flow Logs tell you what traffic is occurring, Network Monitor tells you how well the connection is performing.

Example: Monitoring a hybrid architecture connected via AWS Direct Connect and configuring a CloudWatch Alarm to trigger an incident ticket if packet loss exceeds 2% for 5 consecutive minutes.

Question

CloudWatch Metric Filters

Answer

A CloudWatch feature that scans incoming log events for specific patterns and extracts numerical data, turning those log events into custom CloudWatch metrics.

Once a metric filter generates a metric, you can graph it on a CloudWatch Dashboard or use it to trigger a CloudWatch Alarm.

Log Source	Metric Filter Target	Resulting Metric Example
VPC Flow Logs	Match `REJECT` actions	`RejectedConnectionsCount`
CloudFront Logs	Match `4xx` HTTP status codes	`ClientErrorRate`

Example: Applying the filter pattern [version, account_id, interface_id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action="REJECT", log_status] to a VPC Flow Log group to track anomalous spikes in denied network traffic.
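You can also create the same filter through the CloudWatch Logs `PutMetricFilter` API. This Python sketch builds keyword arguments for boto3's `put_metric_filter` call using the pattern above. The log group name, filter name, and metric namespace are hypothetical. Each matching REJECT record adds 1 to the metric. `defaultValue` is the value emitted when ingested log events do not match.

```python
FLOW_LOG_REJECT_PATTERN = (
    "[version, account_id, interface_id, srcaddr, dstaddr, srcport, dstport, "
    'protocol, packets, bytes, start, end, action="REJECT", log_status]'
)


def reject_metric_filter_request(log_group_name: str) -> dict:
    """Keyword arguments for a metric filter that counts REJECT flow log records."""
    return {
        "logGroupName": log_group_name,
        "filterName": "RejectedConnections",
        "filterPattern": FLOW_LOG_REJECT_PATTERN,
        "metricTransformations": [
            {
                "metricName": "RejectedConnectionsCount",
                "metricNamespace": "Custom/VPCFlowLogs",
                "metricValue": "1",  # add 1 for every matching log event
                "defaultValue": 0.0,
            }
        ],
    }


if __name__ == "__main__":
    print(reject_metric_filter_request("example-vpc-flow-logs"))
```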

Amazon CloudWatch & Network Monitoring Services (5 cards shown)

Question

VPC Flow Logs

Answer

A feature that enables you to capture information about the IP traffic going to and from network interfaces in your Virtual Private Cloud (VPC).

Flow log data can be published to Amazon CloudWatch Logs, Amazon S3, or Amazon Kinesis Data Firehose.

[!TIP] Use VPC Flow Logs to troubleshoot overly restrictive security groups or network ACLs by looking for REJECT records.

Example Flow Log Record:

```text
2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK
```
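A record like this is easier to read with the field names attached. This Python sketch assumes the default flow log format (the version 2 fields in their default order) and splits a record on whitespace. A custom log format would need a different field list.

```python
from typing import Dict

# Default (version 2) flow log fields, in order.
DEFAULT_FIELDS = (
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
)

SAMPLE = (
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)


def parse_flow_log_record(line: str) -> Dict[str, str]:
    """Map a default-format flow log record to field names (values stay strings)."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}")
    return dict(zip(DEFAULT_FIELDS, values))


if __name__ == "__main__":
    record = parse_flow_log_record(SAMPLE)
    window = int(record["end"]) - int(record["start"])
    print(record["action"], record["srcaddr"], "->", f'{record["dstaddr"]}:{record["dstport"]}', f"{window}s")
```

For the sample record, the parsed fields describe accepted TCP traffic (protocol 6) from 172.31.16.139 port 20641 to 172.31.16.21 port 22 (SSH): 20 packets and 4,249 bytes over a 60-second capture window.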

Question

VPC Reachability Analyzer

Answer

A network diagnostic tool that performs automated network path validation between a source and a destination in your VPCs.

It analyzes the network configuration to determine whether two resources can communicate, and if not, it identifies the blocking component (e.g., a specific Security Group rule or missing Route Table entry).

[!NOTE] Reachability Analyzer does not send actual packets over the network. It uses automated reasoning to build a mathematical model of your network configurations.

Question

CloudWatch Logs Insights

Answer

A fully managed, interactive log analytics service in AWS that allows you to search and analyze your log data using a purpose-built query language.

It is highly useful for diagnosing network connectivity issues by parsing massive volumes of VPC Flow Logs or Route 53 query logs.

Example Query (Finding rejected SSH traffic):

```text
filter interfaceId = "eni-0a1b2c3d4e5f6a7b8"
| filter action = "REJECT" and dstPort = 22
| stats count(*) as rejectedCount by srcAddr
| sort rejectedCount desc
```

Question

VPC Traffic Mirroring

Answer

An AWS feature that allows you to copy network traffic from an Elastic Network Interface (ENI) of an EC2 instance and send it to out-of-band security and monitoring appliances.

Unlike VPC Flow Logs, which capture only connection metadata, Traffic Mirroring captures the actual packets, including their payloads.

[!TIP] Use Traffic Mirroring for deep packet inspection (DPI), intrusion detection/prevention systems (IDS/IPS), or diagnosing complex application-level network issues.

Question

CloudWatch Anomaly Detection

Answer

A feature that applies machine learning algorithms to continuously analyze network and system metrics, creating a model of expected baseline behavior.

Instead of setting static thresholds (e.g., "Alert if NetworkIn > 500MB"), anomaly detection dynamically calculates standard deviations and triggers an alarm only when metrics fall outside the expected band.

Threshold Type	Use Case
Static Threshold	Known hard limits (e.g., 90% disk space full)
Anomaly Detection	Fluctuating traffic patterns (e.g., unexpected spike in ELB requests or network traffic)
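The minimal sketch below builds the `put_metric_alarm` arguments for an anomaly detection alarm on the EC2 `NetworkIn` metric. The instance ID, the 5-minute period, the three evaluation periods, the alarm name, and the band width of 2 standard deviations are all assumptions.

```python
def anomaly_alarm_request(instance_id: str, band_width_stddev: float = 2.0) -> dict:
    """Build put_metric_alarm arguments for an anomaly detection alarm on NetworkIn."""
    return {
        "AlarmName": f"NetworkIn-anomaly-{instance_id}",
        # Fire when the metric goes above or below the expected band.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "NetworkIn",
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                # Band = model's expected range, widened by N standard deviations.
                "Id": "ad1",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width_stddev:g})",
                "Label": "NetworkIn (expected)",
                "ReturnData": True,
            },
        ],
        # Compare m1 against the band instead of a static Threshold value.
        "ThresholdMetricId": "ad1",
    }
```

Passing the result to `boto3.client("cloudwatch").put_metric_alarm(**anomaly_alarm_request("i-..."))` would create the alarm. Building the arguments in a separate function keeps them easy to inspect and test.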

Amazon EBS Performance Metrics and Optimization

Question

VolumeQueueLength

Answer

A CloudWatch metric that measures the number of pending I/O requests for an EBS volume device.

[!TIP]

Transaction-intensive workloads (SSD-backed volumes): keep the queue length low and plenty of IOPS available to minimize latency.

Throughput-intensive workloads (HDD-backed volumes): these are less sensitive to latency and can benefit from a higher queue length when performing large, sequential I/O operations.
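Little's law from queueing theory gives a rough rule of thumb that ties these ideas together. This is an approximation, not an AWS-documented formula: the average queue length is about the I/O rate multiplied by the average latency.

$$
\text{VolumeQueueLength} \approx \text{IOPS} \times \text{average I/O latency (s)}
$$

For example, a volume serving 3,000 IOPS at 1 ms average latency should show a queue length of about $3000 \times 0.001 = 3$. If the queue length climbs while IOPS stays flat, average latency is rising.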

Question

BurstBalance

Answer

A CloudWatch metric that reports the percentage of burst credits remaining in the burst bucket: I/O credits for gp2 volumes, throughput credits for st1 and sc1 volumes.

[!WARNING] When the bucket is depleted (a balance of 0%), the volume is throttled to its baseline performance. This causes a sudden drop in IOPS (gp2) or throughput (st1/sc1) and increased latency.
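For gp2, AWS documents a baseline of 3 IOPS per GiB (minimum 100), a burst rate of 3,000 IOPS, and a full bucket of 5.4 million I/O credits. The minimal sketch below uses those figures to estimate how long a volume can burst before BurstBalance reaches 0%. It simplifies by assuming the volume bursts continuously at 3,000 IOPS. The function names are illustrative.

```python
from typing import Optional

GP2_BUCKET_CREDITS = 5_400_000  # I/O credits in a full gp2 burst bucket
GP2_BURST_IOPS = 3_000


def gp2_baseline_iops(size_gib: int) -> int:
    # 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000.
    return min(max(3 * size_gib, 100), 16_000)


def gp2_burst_seconds(size_gib: int, burst_balance_pct: float = 100.0) -> Optional[float]:
    """Seconds a gp2 volume can sustain 3,000 IOPS before BurstBalance hits 0%.

    Returns None when the baseline already meets or exceeds 3,000 IOPS (no bursting).
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    credits = GP2_BUCKET_CREDITS * burst_balance_pct / 100
    # While bursting, credits drain at (burst rate - baseline refill rate) per second.
    return credits / (GP2_BURST_IOPS - baseline)
```

A 500 GiB volume (baseline 1,500 IOPS) with a full bucket gives 5,400,000 / 1,500 = 3,600 seconds, about one hour of burst.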

Question

EBS-Optimized Instance

Answer

An Amazon EC2 feature that provides dedicated, predictable bandwidth between the EC2 instance and its attached Amazon EBS volumes.

[!NOTE] Enabling this setting prevents EBS volume storage traffic from contending with the instance's standard network traffic, mitigating a common source of poor storage performance.
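On instance types where EBS optimization is optional, you can turn it on through the `EbsOptimized` instance attribute. The instance must be stopped first. The minimal sketch below assumes a boto3 EC2 client is passed in as `ec2`. The helper name is hypothetical.

```python
def ensure_ebs_optimized(ec2, instance_id: str) -> bool:
    """Enable EBS optimization on a stopped instance; return True if a change was made.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    resp = ec2.describe_instances(InstanceIds=[instance_id])
    instance = resp["Reservations"][0]["Instances"][0]
    if instance.get("EbsOptimized"):
        return False  # already enabled (the default on many current-generation types)
    if instance["State"]["Name"] != "stopped":
        raise RuntimeError("stop the instance before changing the EbsOptimized attribute")
    ec2.modify_instance_attribute(InstanceId=instance_id, EbsOptimized={"Value": True})
    return True
```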

Question

First-Access Latency Penalty (Snapshot Initialization)

Answer

The significant performance drop that occurs when a new EBS volume is created from a snapshot, because blocks are pulled down from Amazon S3 only when they are accessed for the first time.

Remediations:

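Amazon EBS Fast Snapshot Restore (FSR): Volumes created from a snapshot that has FSR enabled in their Availability Zone are fully initialized at creation and deliver their provisioned performance immediately (incurs additional cost).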
Manual Initialization: Use OS-level tools (like dd or fio) to manually read all blocks before putting the volume into production.
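Manual initialization simply reads every block once. The minimal Python sketch below does the equivalent of a sequential `dd` read, for illustration only. The device path (for example `/dev/nvme1n1`) is a placeholder and typically requires root. For large volumes, `fio` is generally the better tool because it can issue reads in parallel.

```python
CHUNK_BYTES = 1024 * 1024  # read in 1 MiB chunks, similar to dd bs=1M


def initialize_volume(device_path: str, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Read every block of a device (or file) once and return the bytes read.

    Reading each block forces EBS to fetch it from the snapshot, so later
    reads of that block no longer pay the first-access penalty.
    """
    total = 0
    # buffering=0 skips Python-level buffering; reads go straight to the OS.
    with open(device_path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_bytes)
            if not chunk:  # empty read means end of device
                break
            total += len(chunk)
    return total
```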

Question

VolumeReadOps & VolumeWriteOps

Answer

CloudWatch metrics that track the total number of read and write operations against an EBS volume over a specific time period.

Dividing these counts (Sum statistic) by the length of the period gives the volume's actual read and write IOPS. You can compare those values with the volume's provisioned or baseline IOPS to tell whether the workload is hitting an I/O limit between the guest OS and the EBS volume.

[!TIP] By evaluating these metrics alongside bytes transferred (VolumeReadBytes / VolumeWriteBytes), you can determine the application's average I/O size (bytes / operations) and estimate the total IOPS required when moving to a Provisioned IOPS volume type.
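The arithmetic is simple enough to sketch. The function below takes the CloudWatch Sum statistics for one period and derives IOPS and average I/O size. The function name and the KiB units are illustrative choices.

```python
def ebs_io_profile(read_ops_sum: float, write_ops_sum: float,
                   read_bytes_sum: float, write_bytes_sum: float,
                   period_seconds: int) -> dict:
    """Derive IOPS and average I/O size from CloudWatch Sum statistics for one period."""
    total_ops = read_ops_sum + write_ops_sum
    return {
        "read_iops": read_ops_sum / period_seconds,
        "write_iops": write_ops_sum / period_seconds,
        "total_iops": total_ops / period_seconds,
        # Average I/O size in KiB; idle periods with zero operations report 0.
        "avg_read_kib": read_bytes_sum / read_ops_sum / 1024 if read_ops_sum else 0.0,
        "avg_write_kib": write_bytes_sum / write_ops_sum / 1024 if write_ops_sum else 0.0,
    }
```

Consider a 300-second period with 600,000 reads of 16 KiB and 300,000 writes of 64 KiB. The function returns 2,000 read IOPS, 1,000 write IOPS, and 3,000 total IOPS, with average sizes of 16 KiB and 64 KiB.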

Amazon EventBridge: Routing, Enriching, and Delivering Events

Question

Amazon EventBridge

Answer

A serverless event bus service used to receive, filter, transform, route, and deliver events across AWS services and third-party applications.

[!TIP] Think of it as the central nervous system of your AWS architecture. It continuously receives events from sources (like AWS Security Hub) and routes them to specific targets in near real-time, enabling automated remediation and reducing manual human interaction.

Question

EventBridge Rule

Answer

A configuration that watches an event bus for specific incoming events and routes them to targets for processing.

Rules use event patterns (or schedules) to determine which events to catch.

[!NOTE] When creating rules in the console, you can use predefined patterns that automatically fill in the source and detail type. For example, a predefined pattern can easily capture all new compliance findings directly from AWS Security Hub.
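Outside the console, a rule is created with the EventBridge `put_rule` API. The API takes the event pattern as a JSON string. The minimal sketch below assumes a boto3 EventBridge client is passed in as `events`. It reuses the Security Hub pattern shape from the Event Pattern Filtering card. The rule name, the helper name, and the default bus are assumptions. A scheduled rule would pass `ScheduleExpression` (e.g. `"rate(5 minutes)"`) instead of `EventPattern`.

```python
import json

# Matches imported Security Hub findings whose compliance check failed.
FAILED_FINDINGS_PATTERN = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {"findings": {"Compliance": {"Status": ["FAILED"]}}},
}


def create_rule(events, name: str, pattern: dict, bus: str = "default") -> str:
    """Create (or update, if the name exists) a rule on an event bus; return its ARN.

    `events` is assumed to be a boto3 EventBridge client, e.g. boto3.client("events").
    """
    resp = events.put_rule(
        Name=name,
        EventPattern=json.dumps(pattern),  # the API expects a JSON string, not a dict
        State="ENABLED",
        EventBusName=bus,
    )
    return resp["RuleArn"]
```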

Question

EventBridge Targets

Answer

The destination resources or endpoints that receive an event when an EventBridge rule matches.

A single rule can route an event to multiple targets (up to five per rule) simultaneously.

Common Targets for Automated Remediation Include:

AWS Lambda: Invoking functions for custom code execution
AWS Systems Manager Run Command: Running commands on EC2 instances
AWS Step Functions: Triggering a state machine for complex workflows
Amazon SNS / SQS: Pushing notifications or queuing messages for downstream processing
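Targets are attached to an existing rule with `put_targets`. The minimal sketch below assumes a boto3 EventBridge client is passed in as `events`, and routes one rule to both a Lambda function and an SNS topic. The target IDs and ARNs are placeholders. The Lambda function and the SNS topic also need resource-based policies that allow `events.amazonaws.com` to invoke or publish to them. The console adds those policies for you; this sketch does not.

```python
def attach_targets(events, rule_name: str, lambda_arn: str, topic_arn: str) -> int:
    """Route one rule to a Lambda function and an SNS topic; return the failure count.

    `events` is assumed to be a boto3 EventBridge client, e.g. boto3.client("events").
    """
    resp = events.put_targets(
        Rule=rule_name,
        Targets=[
            {"Id": "remediate", "Arn": lambda_arn},  # custom remediation code
            {"Id": "notify", "Arn": topic_arn},      # human notification
        ],
    )
    # put_targets can partially succeed, so check FailedEntryCount explicitly.
    return resp["FailedEntryCount"]
```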

Question

Event Pattern Filtering

Answer

The process of defining JSON-based matchers within an EventBridge rule to precisely select incoming events based on their data payload.

By specifying exact filter values, you ensure that targets are only invoked for relevant events, cutting down on noise and cost.

Example: Filtering Security Hub findings on finding attributes such as AwsAccountId, Compliance.Status, or RecordState:

```json
{
  "source": ["aws.securityhub"],
  "detail-type": ["Security Hub Findings - Imported"],
  "detail": {
    "findings": {
      "Compliance": {
        "Status": ["FAILED"]
      }
    }
  }
}
```
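The toy matcher below illustrates the core idea behind the pattern above. A leaf in the pattern lists the allowed values. When the event holds an array (such as `findings`), the pattern matches if any element matches. This is a deliberately simplified model, not EventBridge's implementation. It ignores content filters such as prefix or numeric matching, and the real service's handling of arrays of objects can differ. For an authoritative answer, use the EventBridge `TestEventPattern` API (`test_event_pattern` in boto3).

```python
SECURITY_HUB_FAILED_PATTERN = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {"findings": {"Compliance": {"Status": ["FAILED"]}}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified exact-value matcher: every pattern key must exist and match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: descend; if the event holds an array, any element may match.
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(isinstance(c, dict) and matches(expected, c) for c in candidates):
                return False
        else:
            # Leaf: list of allowed values; an event array matches if any element is allowed.
            values = actual if isinstance(actual, list) else [actual]
            if not any(v in expected for v in values):
                return False
    return True
```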

Question

EventBridge Input Transformer

Answer

A feature used to customize, format, or enrich the payload of an event before EventBridge delivers it to a target.

It consists of two components:

Input Path: Uses JSONPath to extract specific values from the original event payload and assign them to variables.
Input Template: Uses those variables to construct a new data structure (e.g., converting a raw JSON error into a human-readable text string for an email notification).

[!TIP] Use the Input Transformer to enrich and deliver well-formatted events so the target service (like SNS) receives exactly what it needs, eliminating the need for an intermediary Lambda function just to parse JSON.
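In the API, the two components are the `InputPathsMap` and `InputTemplate` keys of a target's `InputTransformer`. The minimal sketch below shows that structure, plus a toy renderer that mimics the substitution step. The renderer supports only simple dotted paths and plain `<name>` placeholders. It is an illustration, not EventBridge's implementation: the real service also handles JSON escaping and array paths. The template text is an assumption. The surrounding double quotes make the template a JSON string.

```python
import re

# An InputTransformer block as it could appear in a put_targets target entry.
INPUT_TRANSFORMER = {
    "InputPathsMap": {"account": "$.account", "region": "$.region", "time": "$.time"},
    "InputTemplate": '"Failed compliance finding in account <account> (<region>) at <time>"',
}


def resolve_path(event: dict, path: str):
    """Resolve a simple dotted JSONPath such as $.detail.state (no arrays or filters)."""
    value = event
    for key in path.removeprefix("$.").split("."):
        value = value[key]
    return value


def render(transformer: dict, event: dict) -> str:
    """Toy model of the Input Path + Input Template substitution."""
    variables = {name: str(resolve_path(event, path))
                 for name, path in transformer["InputPathsMap"].items()}
    # Replace each <name> placeholder with its extracted value; unknown names stay as-is.
    return re.sub(r"<([A-Za-z0-9_.-]+)>",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  transformer["InputTemplate"])
```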

Ready to ace AWS Certified CloudOps Engineer - Associate (SOA-C03)?

Access all 840 practice questions, 13 timed mock exams, study notes, and flashcards — no sign-up required.

On This Page

AWS Certified CloudOps Engineer - Associate (SOA-C03) Study Notes & Guides

Curriculum Overview: Advanced Observability Services

Prerequisites

Module Breakdown

Observability Flow

Learning Objectives per Module

Module 1: Centralized Logging & Analysis

Module 2: Advanced CloudWatch Metrics

Module 3: Container & OS-Level Observability

Module 4: Open-Source Monitoring Integrations

Module 5: Event-Driven Remediation

Success Metrics

Real-World Application

Real-World Observability Architecture

Amazon CloudWatch Metrics and Alarms: Curriculum Overview

Prerequisites

Module Breakdown

Learning Objectives per Module

Module 1: CloudWatch Fundamentals

Module 2: Advanced Collection & The CW Agent

Module 3: Alarms, Thresholds, & Notifications

Module 4: Automated Remediation & Operations

Core Formula: Calculating Metric Impact

Success Metrics

Real-World Application

Curriculum Overview: Amazon EBS Performance, Troubleshooting, and Cost Optimization

Prerequisites

Module Breakdown

Learning Objectives per Module

Module 1: EBS Architecture & Volume Profiles

Module 2: Monitoring EBS with CloudWatch

Module 3: Troubleshooting Performance Bottlenecks

Module 4: Cost & Performance Optimization Strategies

Success Metrics

EBS Optimization Lifecycle

Real-World Application

Scenario: The "Slow" Production Database

Diagnostic Decision Tree

Resource Links

Curriculum Overview: Amazon EBS Performance, Troubleshooting, and Optimization

Prerequisites

Module Breakdown

Learning Objectives per Module

Module 1: EBS Architecture & Volume Types

Module 2: Monitoring EBS with CloudWatch

Module 3: Troubleshooting Performance Issues

Module 4: Cost & Performance Optimization

Visual Anchors

Workload to Volume Type Decision Matrix

Burst Balance Depletion Over Time

Success Metrics

Real-World Application

Mastering EBS and S3 Performance Metrics: AWS CloudOps Study Guide

Mastering EBS and S3 Performance Metrics

Learning Objectives

Key Terms & Glossary

The "Big Idea"

Formula / Concept Box

Hierarchical Outline

Visual Anchors

EBS Performance Troubleshooting Flow

Data Transfer Comparison

Definition-Example Pairs

Worked Examples

Example 1: Troubleshooting Throttled EBS

Example 2: Optimizing Large File Uploads to S3

Checkpoint Questions

Curriculum Overview: Analyzing Events with the AWS Personal Health Dashboard

Prerequisites

Module Breakdown

Architectural Context

Learning Objectives per Module

Module 1: AWS Health Fundamentals

Module 2: Multi-Account Visibility