☁️ AWS

Free AWS Certified CloudOps Engineer - Associate (SOA-C03) Study Resources

This comprehensive AWS Certified CloudOps Engineer - Associate (SOA-C03) hive provides study notes, a question bank with practice tests, flashcards, and hands-on labs, all supported by a personal AI tutor to help you master the AWS Certified CloudOps Engineer - Associate (SOA-C03) certification.

  • 840 Practice Questions
  • 12 Mock Exams
  • 148 Study Notes
  • 1,200 Flashcard Decks
  • 2 Source Materials

AWS Certified CloudOps Engineer - Associate (SOA-C03) Study Notes & Guides

148 AI-generated study notes covering the full AWS Certified CloudOps Engineer - Associate (SOA-C03) curriculum. Showing 10 complete guides below.

Curriculum Overview (820 words)

Curriculum Overview: Advanced Observability Services

Advanced Observability Services


[!NOTE] Course Alignment: This curriculum overview aligns closely with Domain 1 of the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam: Monitoring, Logging, Analysis, Remediation, and Performance Optimization.

Welcome to the Advanced Observability Services curriculum. As cloud environments transition toward modern, containerized, and microservice-driven architectures, traditional monitoring is no longer sufficient. This curriculum bridges the gap between basic resource checks and full-stack, automated observability.


Prerequisites

Before diving into Advanced Observability Services, learners must establish a solid baseline in cloud operations and AWS fundamentals. You should be comfortable with the following:

  • AWS Management & Core Services: Proficiency in navigating the AWS Management Console and executing commands via the AWS CLI. Familiarity with EC2, VPC, and IAM basics.
  • Basic CloudWatch: Prior experience setting up simple CloudWatch alarms (e.g., CPU Utilization) and viewing basic metrics.
  • Container Fundamentals: A conceptual understanding of Docker containers, Amazon Elastic Container Service (ECS), and Amazon Elastic Kubernetes Service (EKS).
  • JSON & Query Syntax: Basic ability to read JSON responses and familiarity with querying structures (like JMESPath).

Module Breakdown

This curriculum is structured to take you from foundational centralized logging up to highly automated, multi-account observability platforms.

| Module | Title | Focus Area | Difficulty |
| --- | --- | --- | --- |
| 1 | Centralized Logging & Analysis | CloudTrail, CloudWatch Logs Insights, log aggregation | Beginner |
| 2 | Advanced CloudWatch Metrics | Custom metrics, anomaly detection, cross-account dashboards | Intermediate |
| 3 | Container & OS-Level Observability | CloudWatch Agent, EC2, ECS, EKS metrics | Intermediate |
| 4 | Open-Source Monitoring Integrations | Amazon Managed Service for Prometheus & Grafana | Advanced |
| 5 | Event-Driven Remediation | EventBridge, Lambda, SSM Automation Runbooks | Expert |

[Diagram: Observability Flow]

Learning Objectives per Module

By progressing through the curriculum, learners will achieve specific, testable outcomes critical to the role of a CloudOps Engineer.

Module 1: Centralized Logging & Analysis

  • Audit effectively: Configure AWS CloudTrail for comprehensive account auditing and data event tracking.
  • Query at scale: Write purpose-built syntax queries using CloudWatch Logs Insights to perform complex searches across application and system logs.

Module 2: Advanced CloudWatch Metrics

  • Implement intelligent alerting: Set up CloudWatch alarms featuring static and dynamic thresholds (anomaly detection).
  • Centralize visibility: Design and deploy customizable, shareable CloudWatch Dashboards that aggregate data across multiple AWS Regions and accounts.

Module 3: Container & OS-Level Observability

  • Deepen system monitoring: Configure and manage the CloudWatch agent to collect deep system-level metrics and internal logs from EC2 instances.
  • Observe modern workloads: Integrate monitoring agents within Amazon ECS and Amazon EKS clusters to track task and pod health.

Module 4: Open-Source Monitoring Integrations

  • Adopt open standards: Explain the architecture and benefits of Amazon Managed Service for Prometheus.
  • Visualize beautifully: Identify use cases and configure Amazon Managed Grafana to create rich, interactive visual dashboards compatible with open-source tools.

Module 5: Event-Driven Remediation

  • Automate responses: Configure Amazon EventBridge rules to trigger remediation actions automatically upon state changes.
  • Deploy runbooks: Execute predefined and custom Systems Manager (SSM) Automation runbooks to self-heal infrastructure without human intervention.

Success Metrics

How will you know you have mastered the Advanced Observability Services curriculum? Your success will be measured by your ability to:

  1. Deploy the CloudWatch Agent Programmatically: Successfully use SSM or User Data to install and configure the CloudWatch agent across a fleet of simulated EC2 and EKS nodes.
  2. Resolve an Incident Using Insights: Given a simulated application failure, identify the root cause within 5 minutes using CloudWatch Logs Insights and VPC Flow Logs.
  3. Create a Multi-Account Grafana Dashboard: Successfully link metrics from at least two different AWS accounts into a single Managed Grafana visualization.
  4. Achieve Zero-Touch Remediation: Build an EventBridge rule that detects a stopped EC2 instance or a full EBS volume, automatically triggering an SSM runbook to remediate the issue.
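As a rough illustration of the zero-touch remediation goal, the sketch below builds an EventBridge event pattern that matches EC2 instances entering the `stopped` state. The `matches` helper is a hypothetical, simplified local matcher written only for this example (exact-value matching, none of EventBridge's richer operators), and the sample instance ID is made up.

```python
import json

# Event pattern for EC2 state-change notifications where the state is "stopped".
STOPPED_INSTANCE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified stand-in for EventBridge matching: exact values only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            # Leaf pattern: the event value must be one of the listed values.
            return False
    return True


# Hypothetical sample event (instance ID is illustrative only).
sample_event = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}

print(json.dumps(STOPPED_INSTANCE_PATTERN, indent=2))
print(matches(STOPPED_INSTANCE_PATTERN, sample_event))  # True
```

In a real rule, the JSON pattern would be attached to an EventBridge rule whose target is the SSM Automation runbook; the local matcher is only a way to reason about which events the pattern selects.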

Real-World Application

In modern enterprise environments, downtime translates directly into lost revenue and damaged reputation. Traditional monitoring focuses on "what is broken?" (e.g., a server is down). Advanced observability answers "why is it broken, and how can we prevent it?"

Scenario: Imagine working for a global e-commerce platform during a flash sale. An unexpected spike in traffic causes memory exhaustion on several backend containers.

  • Without advanced observability: Customers experience timeout errors. The operations team spends 45 minutes manually SSHing into servers to read logs and restart services.
  • With advanced observability: The CloudWatch agent publishes container memory metrics, and an anomaly detection alarm flags the spike within minutes. Metrics are pushed to a unified Grafana dashboard. EventBridge detects the CloudWatch alarm state change and triggers an AWS Lambda function that automatically scales up the Amazon ECS cluster and cycles the unhealthy containers, resolving the issue before most end users notice.

[Diagram: Real-World Observability Architecture]

[!TIP] Career Impact: Mastering these tools shifts your role from reactive administrator (fixing broken things) to proactive engineer (designing self-healing systems), a highly sought-after skill in DevOps and Site Reliability Engineering (SRE) roles.

Curriculum Overview (811 words)

Amazon CloudWatch Metrics and Alarms: Curriculum Overview

Amazon CloudWatch Metrics and Alarms


[!NOTE] This curriculum aligns with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam domain: Monitoring, Logging, Analysis, Remediation, and Performance Optimization.

Prerequisites

Before embarking on this curriculum, learners must possess a foundational understanding of the AWS ecosystem to ensure they can fully grasp advanced monitoring concepts.

  • Compute Services Fluency: Basic understanding of Amazon EC2, AWS Lambda, Amazon ECS, and Amazon EKS.
  • Operational Foundations: Proficiency using the AWS Management Console and the AWS Command Line Interface (CLI).
  • IAM Principles: Knowledge of Identity and Access Management (IAM) roles and policies, specifically the principle of least privilege required for resource monitoring.
  • Networking Basics: Understanding of VPCs, subnets, and security groups to comprehend network-level metrics.

Module Breakdown

This curriculum is structured to take you from foundational monitoring concepts to advanced, automated remediation strategies.

| Module | Core Focus | Difficulty | Estimated Time |
| --- | --- | --- | --- |
| 1. Fundamentals | Metrics, Namespaces, Dashboards | Beginner | 2 Hours |
| 2. CW Agent | EC2/Container Logs & Custom Metrics | Intermediate | 3 Hours |
| 3. Alarms & SNS | Static/Dynamic Thresholds, Composite Alarms | Intermediate | 3 Hours |
| 4. Remediation | EventBridge, SSM Automation Runbooks | Advanced | 4 Hours |

Learning Objectives per Module

Module 1: CloudWatch Fundamentals

  • Analyze Standard Metrics: Interpret default metrics reported by AWS services at 1-minute and 5-minute intervals (e.g., Lambda invocations, execution time, errors, and throttling).
  • Implement Custom Metrics: Define and publish custom business or application-level metrics to specific CloudWatch Namespaces.
  • Design Dashboards: Create customizable, cross-region, and cross-account CloudWatch dashboards to visualize health across the entire AWS infrastructure.

Module 2: Advanced Collection & The CW Agent

  • Deploy the CloudWatch Agent: Configure and manage the CW Agent on EC2 instances to collect granular OS-level metrics (e.g., memory utilization, disk space) and application logs.
  • Monitor Containers: Implement monitoring for Amazon Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS) clusters.
  • Log Analytics: Utilize CloudWatch Logs Insights to query log streams (e.g., filtering Lambda log streams for RequestID, billed duration, and memory size).
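To make the Logs Insights objective more concrete, here is a small sketch that assembles a query over Lambda `REPORT` lines using the fields Logs Insights discovers for Lambda logs (`@requestId`, `@billedDuration`, `@memorySize`, `@maxMemoryUsed`). The function name and the default limit are illustrative choices, not part of the curriculum.

```python
def lambda_report_query(limit: int = 20) -> str:
    """Build a Logs Insights query listing the most expensive invocations."""
    commands = [
        # Keep only the per-invocation REPORT summary lines.
        'filter @type = "REPORT"',
        # Select the request ID plus billing and memory fields.
        "fields @requestId, @billedDuration, @memorySize, @maxMemoryUsed",
        # Longest billed durations first.
        "sort @billedDuration desc",
        f"limit {limit}",
    ]
    # Logs Insights commands are chained with the pipe character.
    return " | ".join(commands)


print(lambda_report_query(5))
```

The resulting string could be pasted into the Logs Insights console or, presumably, passed as the `queryString` argument of the CloudWatch Logs `start_query` API against a Lambda log group.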

Module 3: Alarms, Thresholds, & Notifications

  • Configure CloudWatch Alarms: Set up static and anomaly-detection (dynamic) thresholds to monitor resource health.
  • Build Composite Alarms: Combine multiple alarms to reduce alarm fatigue and trigger actions only when specific multi-condition criteria are met.
  • Implement Notifications: Configure alarms to push alerts to Amazon Simple Notification Service (SNS) topics for email, SMS, or third-party ticketing integration.
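A composite alarm is defined by a rule expression over existing alarms. The sketch below, under the assumption that two metric alarms with the hypothetical names shown already exist, builds an `AlarmRule` string that fires only when all of them are in the `ALARM` state, plus a parameter dictionary shaped for the CloudWatch `PutCompositeAlarm` API. The topic ARN is a placeholder.

```python
def composite_rule_all(alarm_names: list[str]) -> str:
    """Return a rule that is true only when every named alarm is in ALARM."""
    return " AND ".join(f'ALARM("{name}")' for name in alarm_names)


def composite_alarm_params(name: str, alarm_names: list[str], sns_topic_arn: str) -> dict:
    """Assemble keyword arguments for a composite alarm (no AWS call is made)."""
    return {
        "AlarmName": name,
        "AlarmRule": composite_rule_all(alarm_names),
        "AlarmActions": [sns_topic_arn],  # notify only on the combined condition
    }


# Hypothetical alarm names and topic ARN, for illustration only.
params = composite_alarm_params(
    "lambda-degraded",
    ["lambda-errors-high", "lambda-throttles-high"],
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",
)
print(params["AlarmRule"])
# ALARM("lambda-errors-high") AND ALARM("lambda-throttles-high")
```

Routing notifications through the composite alarm rather than each child alarm is what reduces alarm fatigue: a single noisy metric no longer pages anyone on its own.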

[!TIP] Remember the key Lambda metrics that typically drive alarms: Errors (logic/runtime failures), Duration (execution time, especially the slowest 1-5% of responses), and Throttles (concurrency limits reached).

Module 4: Automated Remediation & Operations

  • Event-Driven Architectures: Use Amazon EventBridge to route state changes and enrich events.
  • Automate Remediation: Trigger custom or predefined AWS Systems Manager (SSM) Automation runbooks to self-heal infrastructure.
  • Auto Scaling Integration: Trigger EC2 Auto Scaling policies or RDS Aurora Add Replica policies based on sustained alarm states.

Core Formula: Calculating Metric Impact

Understanding the mathematical relationship of metrics is crucial for setting effective alarms. For example, to calculate the application error rate for AWS Lambda:

$$\text{Error Rate (\%)} = \left( \frac{\text{Total Errors}}{\text{Total Invocations}} \right) \times 100$$
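The error-rate formula can be sketched in a few lines of Python. The function name and the sample numbers are illustrative assumptions; in practice the inputs would be the summed Lambda `Errors` and `Invocations` metrics for the same period.

```python
def lambda_error_rate(total_errors: int, total_invocations: int) -> float:
    """Return errors as a percentage of invocations."""
    if total_invocations == 0:
        # No traffic in the window: report 0 instead of dividing by zero.
        return 0.0
    return total_errors * 100 / total_invocations


# Hypothetical sample: 25 errors out of 5,000 invocations in one period.
print(lambda_error_rate(25, 5000))  # 0.5
```

Alarming on this ratio rather than on the raw error count keeps the threshold meaningful as traffic grows or shrinks.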

Success Metrics

How do you know you have mastered this curriculum? A successful candidate will be able to demonstrate the following hands-on capabilities:

  1. Independent Remediation: Successfully configure an alarm that detects high CPU utilization on an EC2 instance, triggers EventBridge, and executes an SSM runbook to automatically restart the instance.
  2. Visibility Architecture: Build a unified CloudWatch Dashboard that displays custom metrics, Lambda error rates, and EC2 memory utilization in a single pane of glass.
  3. Troubleshooting Prowess: Given a simulated Lambda throttling event, successfully query CloudWatch Logs Insights to isolate the affected RequestIDs and identify the capacity constraint.
  4. Cost-Aware Monitoring: Ensure custom metrics and extensive log ingestion are optimized to prevent unnecessary AWS spend.

Real-World Application

In modern Cloud Operations (CloudOps), monitoring is not just about watching graphs; it is about building self-healing systems.

Imagine a scenario where an e-commerce platform goes viral. Suddenly, your AWS Lambda functions experience a 500% spike in traffic. Without proper monitoring, your functions will throttle silently, leading to a degraded customer experience and lost revenue.

By applying the concepts in this curriculum, you establish a resilient architecture:

[Diagram: resilient monitoring architecture]

Mastering CloudWatch Metrics and Alarms empowers you to transition from a reactive administrator (putting out fires) to a proactive CloudOps Engineer (preventing the fires from starting). This is a critical skill set for maintaining the Operational Excellence and Reliability pillars of the AWS Well-Architected Framework.

Curriculum Overview (810 words)

Curriculum Overview: Amazon EBS Performance, Troubleshooting, and Cost Optimization

Analyze Amazon Elastic Block Store (Amazon EBS) performance metrics, troubleshoot issues, and optimize volume types to improve performance and reduce cost


[!NOTE] Target Audience: SysOps Administrators, Cloud Operations Engineers, and candidates preparing for the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam. Focus Area: Task 1.3.2 - Analyze Amazon Elastic Block Store (Amazon EBS) performance metrics, troubleshoot issues, and optimize volume types to improve performance and reduce cost.

Amazon Elastic Block Store (EBS) provides block-level storage volumes for use with EC2 instances. In real-world cloud operations, ensuring these volumes are highly performant and cost-effective is a critical day-to-day responsibility. This curriculum outlines the structured learning path to mastering EBS performance monitoring, troubleshooting bottlenecks, and implementing right-sizing strategies.


Prerequisites

Before diving into this curriculum, learners should have a solid foundation in the following areas:

  • Cloud Computing Fundamentals: Understanding of virtualization, guest operating systems, and basic cloud economics.
  • AWS EC2 Basics: Experience deploying, stopping, starting, and connecting to Amazon EC2 instances.
  • Storage Concepts: Differentiating between block storage (EBS), object storage (S3), and file storage (EFS).
  • AWS CloudWatch Basics: Familiarity with viewing metrics, creating simple alarms, and navigating the CloudWatch console.
  • CLI / IAM Setup: Access to an AWS account with necessary IAM permissions to create, modify, and monitor EC2 instances and EBS volumes.

Module Breakdown

This curriculum is divided into four progressive modules, moving from foundational architecture to advanced troubleshooting and cost optimization.

| Module | Title | Difficulty | Est. Time |
| --- | --- | --- | --- |
| Module 1 | EBS Architecture & Volume Profiles | Beginner | 1.5 Hours |
| Module 2 | Monitoring EBS with CloudWatch | Intermediate | 2.0 Hours |
| Module 3 | Troubleshooting Performance Bottlenecks | Advanced | 2.5 Hours |
| Module 4 | Cost & Performance Optimization Strategies | Intermediate | 2.0 Hours |

Module 1 Deep-Dive

Focuses on the fundamentals of block storage, distinguishing between SSD-backed (gp2, gp3, io1, io2) and HDD-backed (st1, sc1) volumes. Learners will explore baseline performance metrics, Input/Output Operations Per Second (IOPS), and throughput ceilings.

Module 2 Deep-Dive

Centers on Amazon CloudWatch metrics specifically for EBS. Key topics include understanding VolumeQueueLength, BurstBalance, VolumeReadBytes, and VolumeWriteOps.


Learning Objectives per Module

Module 1: EBS Architecture & Volume Profiles

  • Categorize the eight different Amazon EBS volume types based on their underlying hardware (SSD vs. HDD) and ideal use cases.
  • Explain the concept of baseline IOPS versus burstable IOPS.
  • Calculate expected performance using standard AWS formulas (e.g., $IOPS = \frac{Throughput}{I/O\_Size}$).
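The IOPS formula above can be worked through with a short sketch. The function name, the choice of MiB/s and KiB units, and the sample figures are assumptions made for illustration.

```python
def expected_iops(throughput_mib_s: float, io_size_kib: float) -> float:
    """IOPS = Throughput / I/O size, with throughput converted to KiB/s."""
    return throughput_mib_s * 1024 / io_size_kib


# The same 125 MiB/s of throughput yields very different IOPS
# depending on how large each I/O operation is.
print(expected_iops(125, 16))   # 8000.0 (small, database-style I/O)
print(expected_iops(125, 256))  # 500.0  (large, sequential I/O)
```

This is why transaction-heavy workloads tend to hit IOPS limits first while streaming workloads tend to hit throughput limits first.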

Module 2: Monitoring EBS with CloudWatch

  • Interpret core CloudWatch metrics (VolumeReadOps, VolumeWriteOps, VolumeReadBytes, VolumeWriteBytes).
  • Analyze the VolumeQueueLength metric to distinguish between normal operations and potential latency issues.
  • Configure baseline performance alerts using CloudWatch Alarms to proactively catch BurstBalance depletion.
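One way to picture a BurstBalance alert is as the set of parameters a CloudWatch metric alarm would need. The sketch below only builds that dictionary; the alarm name format, the 20% threshold, the evaluation settings, and the example identifiers are assumptions chosen for illustration, not recommended values.

```python
def burst_balance_alarm_params(
    volume_id: str, sns_topic_arn: str, threshold_percent: float = 20.0
) -> dict:
    """Assemble keyword arguments for a low-BurstBalance alarm (no AWS call)."""
    return {
        "AlarmName": f"ebs-burst-balance-low-{volume_id}",
        "Namespace": "AWS/EBS",
        "MetricName": "BurstBalance",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # three consecutive low datapoints
        "Threshold": threshold_percent,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical volume ID and topic ARN.
params = burst_balance_alarm_params(
    "vol-0123456789abcdef0", "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
print(params["AlarmName"])  # ebs-burst-balance-low-vol-0123456789abcdef0
```

The keys follow the CloudWatch `PutMetricAlarm` parameter names, so the dictionary could presumably be passed to an SDK call such as boto3's `put_metric_alarm(**params)`. Alerting well above 0% leaves time to act before throttling begins.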

Module 3: Troubleshooting Performance Bottlenecks

  • Identify when an EC2 instance's network bandwidth is throttling EBS performance.
  • Enable and validate EBS-Optimization on compatible EC2 instance types.
  • Diagnose initialization latency issues and mitigate them using Fast Snapshot Restore (FSR) or manual block access techniques.

Module 4: Cost & Performance Optimization Strategies

  • Execute online volume modifications (Elastic Volumes) to upgrade or downgrade volume types without downtime.
  • Right-size provisioned IOPS based on historical CloudWatch data to prevent over-provisioning.
  • Design lifecycle policies using tags and AWS Data Lifecycle Manager to clean up orphaned volumes and snapshots.

Success Metrics

How will you know you have mastered this curriculum? Upon completion, learners should be able to consistently demonstrate the following:

  1. Metric Interpretation: Given a CloudWatch graph showing depleted BurstBalance and high VolumeQueueLength, correctly diagnose the bottleneck and propose the optimal volume upgrade.
  2. Cost Reduction: Successfully identify over-provisioned io1/io2 volumes and transition them to gp3 while maintaining required IOPS, calculating the monthly cost savings.
  3. Architectural Alignment: Match specific application workloads (e.g., transactional databases vs. big data log processing) to the correct EBS volume type with 100% accuracy.

[Diagram: EBS Optimization Lifecycle]

Real-World Application

Why does mastering EBS matter in a SysOps or Cloud Engineering career?

Scenario: The "Slow" Production Database

Imagine you are an on-call SysOps Administrator. Users report that the flagship e-commerce application is timing out. You check the EC2 dashboard and see CPU and memory are within normal limits.

By applying the skills from this curriculum, you dive into CloudWatch and observe that the VolumeQueueLength for the database's EBS volume is skyrocketing, and the BurstBalance has hit 0%.

Because you understand EBS performance, you realize the current gp2 volume's burst bucket is depleted. You quickly use the Elastic Volumes feature to modify the volume to gp3, explicitly provisioning higher baseline IOPS. The application recovers seamlessly with no downtime.

[Diagram: Diagnostic Decision Tree]

[!IMPORTANT] Cost Implication: Blindly throwing higher-tier volumes (like io2) at a performance problem is an easy but expensive fix. A skilled CloudOps engineer uses metrics like VolumeReadBytes and VolumeWriteBytes, together with VolumeReadOps and VolumeWriteOps, to determine the actual I/O size and rate the workload needs, ensuring the company only pays for the performance it genuinely needs.
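A minimal sketch of that right-sizing arithmetic: dividing the summed byte metric by the summed operation metric over the same period gives the average I/O size, and dividing operations by the period length gives the average IOPS. The function names and sample datapoint are hypothetical.

```python
def average_io_size_kib(total_bytes: float, total_ops: float) -> float:
    """Average I/O size in KiB from summed bytes and summed operations."""
    if total_ops == 0:
        return 0.0  # idle period: no operations to average over
    return total_bytes / total_ops / 1024


def average_iops(total_ops: float, period_seconds: int) -> float:
    """Average operations per second over one metric period."""
    return total_ops / period_seconds


# Hypothetical 5-minute datapoint: Sum of VolumeWriteOps and VolumeWriteBytes.
write_ops = 90_000
write_bytes = 1_474_560_000

print(average_iops(write_ops, 300))                 # 300.0
print(average_io_size_kib(write_bytes, write_ops))  # 16.0
```

A volume averaging 300 IOPS at 16 KiB per operation is unlikely to need a Provisioned IOPS tier, which is the kind of evidence the cost note above asks for.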


To supplement this curriculum, learners are encouraged to reference:

Curriculum Overview (878 words)

Curriculum Overview: Amazon EBS Performance, Troubleshooting, and Optimization

Analyze Amazon Elastic Block Store (Amazon EBS) performance metrics, troubleshoot issues, and optimize volume types to improve performance and reduce cost


Welcome to the comprehensive curriculum for analyzing, troubleshooting, and optimizing Amazon Elastic Block Store (Amazon EBS). This course track aligns with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam objectives (Task 1.3.2) and focuses on ensuring block storage architectures are performant, reliable, and cost-effective.


Prerequisites

Before diving into EBS performance tuning and troubleshooting, learners must have a foundational understanding of the following concepts:

  • Cloud Computing Basics: Familiarity with the AWS Well-Architected Framework, specifically the Performance Efficiency and Cost Optimization pillars.
  • Amazon EC2 Fundamentals: Understanding of the EC2 instance lifecycle, how instances attach to storage, and basic network traffic concepts.
  • Storage Paradigms: Knowledge of raw, unformatted block storage versus file and object storage, and why block storage is preferred for databases and boot volumes.
  • AWS Management Tools: Basic proficiency navigating the AWS Management Console and utilizing the AWS CLI for querying resources.

Module Breakdown

This curriculum is structured into four progressive modules, transitioning from foundational block storage concepts to advanced troubleshooting and optimization techniques.

| Module | Title | Difficulty | Core Focus |
| --- | --- | --- | --- |
| Module 1 | EBS Architecture & Volume Types | Beginner | Storage classes, IOPS vs. Throughput, Pricing models |
| Module 2 | Monitoring EBS with CloudWatch | Intermediate | Key metrics (BurstBalance, VolumeQueueLength) |
| Module 3 | Troubleshooting Performance Issues | Advanced | Identifying bottlenecks, network contention, and snapshot latency |
| Module 4 | Cost & Performance Optimization | Advanced | Rightsizing, EBS-Optimized instances, Fast Snapshot Restore |

[!NOTE] The modules are designed to be taken sequentially, as the optimization techniques in Module 4 heavily rely on the metric analysis skills developed in Module 2.


Learning Objectives per Module

Module 1: EBS Architecture & Volume Types

  • Differentiate between the eight different Amazon EBS volume types (e.g., gp2, gp3, io1, io2, st1, sc1).
  • Identify workload characteristics to determine if an application is transaction-intensive (requires high IOPS) or throughput-intensive (requires high MB/s).
  • Evaluate the pricing models associated with storage size versus provisioned performance.

Module 2: Monitoring EBS with CloudWatch

  • Define and track critical EBS CloudWatch metrics, including VolumeReadBytes, VolumeWriteBytes, VolumeReadOps, and VolumeWriteOps.
  • Analyze VolumeQueueLength to determine the number of pending I/O requests and assess host-to-EBS network link health.
  • Monitor BurstBalance for gp2, st1, and sc1 volumes to predict and alert on performance throttling.

Deeper Dive into Burst Balance

Certain volume types operate on a burst bucket model. They accrue I/O credits when idle and consume them during heavy traffic. If the BurstBalance metric reaches 0%, the volume is throttled to its baseline performance level, causing significant application latency.
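The burst bucket model can be approximated with simple arithmetic. The sketch below assumes the documented gp2 figures (3 IOPS per GiB baseline with a 100 IOPS floor and 16,000 IOPS ceiling, bursting up to 3,000 IOPS, and a 5.4 million credit bucket) and a bucket that starts full; the function names and the sample volume are illustrative.

```python
GP2_BUCKET_CREDITS = 5_400_000  # I/O credits in a full gp2 burst bucket
GP2_BURST_IOPS = 3_000          # maximum burst rate


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline is 3 IOPS per GiB, clamped to the 100..16,000 range."""
    return max(100, min(16_000, 3 * size_gib))


def seconds_until_throttled(size_gib: int, demand_iops: int):
    """Time for a full bucket to drain at a sustained demand, or None."""
    baseline = gp2_baseline_iops(size_gib)
    demand = min(demand_iops, GP2_BURST_IOPS)  # bursting is capped at 3,000
    if demand <= baseline:
        return None  # credits are not consumed faster than they accrue
    # Credits drain at the rate by which demand exceeds the baseline refill.
    return GP2_BUCKET_CREDITS / (demand - baseline)


# A 100 GiB gp2 volume (300 IOPS baseline) driven at the full burst rate.
print(gp2_baseline_iops(100))              # 300
print(seconds_until_throttled(100, 3000))  # 2000.0
```

Under these assumptions a small volume can sustain full burst for roughly half an hour, which matches the pattern of latency appearing partway through a busy period rather than at its start.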

Module 3: Troubleshooting Performance Issues

  • Diagnose I/O bottlenecks by correlating VolumeQueueLength with operating system-level metrics.
  • Identify the "latency penalty" associated with initializing volumes from EBS Snapshots.
  • Distinguish between EBS volume limits and EC2 instance-level bandwidth limits.

Module 4: Cost & Performance Optimization

  • Enable and configure EBS-optimization on supported Amazon EC2 instances to separate storage traffic from standard network traffic.
  • Implement Fast Snapshot Restore (FSR) to bypass initialization latency for critical recovery operations.
  • Rightsize volume I/O and capacity based on historical CloudWatch data to eliminate over-provisioning.

Visual Anchors

Workload to Volume Type Decision Matrix

Understanding how to map workload requirements to the correct volume type is a critical SysOps skill. Use this decision tree to optimize both performance and cost.

[Diagram: Workload to Volume Type Decision Matrix]

Burst Balance Depletion Over Time

This diagram illustrates how an intensive workload depletes the burst credit balance of a gp2 volume over time, eventually leading to performance throttling.

[Diagram: Burst Balance Depletion Over Time]

Success Metrics

How do you know you have mastered this curriculum? You will be able to successfully:

  1. Metric Interpretation: Look at a CloudWatch dashboard showing high VolumeQueueLength and low BurstBalance and immediately diagnose an under-provisioned gp2 volume.
  2. Cost Reduction: Audit an AWS account using Cost Explorer and identify oversized provisioned IOPS (io1/io2) volumes that can be safely downgraded to gp3 based on historical usage metrics.
  3. Architectural Optimization: Successfully provision an EC2 instance with EBS-optimization enabled, ensuring that standard network traffic does not contend with storage I/O.
  4. Disaster Recovery SLA Compliance: Implement Fast Snapshot Restore to ensure an initialized volume is ready for production immediately, meeting aggressive RTO (Recovery Time Objective) targets.

Real-World Application

Why does this matter in the field?

Imagine you are the SysOps Administrator for a high-traffic e-commerce platform during a flash sale. Your backend relational database is running on an EC2 instance backed by a standard gp2 EBS volume. As thousands of users simultaneously add items to their carts, the database performs heavy, random read/write operations.

Without an understanding of EBS performance:

  • The gp2 burst bucket entirely depletes.
  • The BurstBalance drops to zero, and the volume throttles to its baseline IOPS.
  • The VolumeQueueLength spikes as I/O requests back up.
  • Users experience extreme latency, shopping carts fail to load, and the company loses significant revenue.

By applying the skills in this curriculum, you would proactively monitor these metrics via CloudWatch alarms. You would recognize the bottleneck and seamlessly modify the volume type to gp3 or io2 (Provisioned IOPS), adjust the EC2 instance type to one that supports a higher EBS-optimized throughput, and ensure your system handles the flash sale smoothly.

Study Guide (985 words)

Mastering EBS and S3 Performance Metrics: AWS CloudOps Study Guide

Analyze EBS and S3 performance metrics


Mastering EBS and S3 Performance Metrics

This guide covers the critical metrics and optimization strategies for Amazon Elastic Block Store (EBS) and Amazon Simple Storage Service (S3), specifically aligned with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam objectives.

Learning Objectives

After studying this chapter, you should be able to:

  • Analyze critical EBS performance metrics like VolumeQueueLength and BurstBalance to identify bottlenecks.
  • Remediate performance issues by optimizing volume types and enabling features like Fast Snapshot Restore.
  • Optimize S3 performance using Multi-part uploads and S3 Transfer Acceleration.
  • Automate remediation strategies using CloudWatch alarms and SSM Automation runbooks.

Key Terms & Glossary

  • IOPS (Input/Output Operations Per Second): A measure of the number of read and write operations performed per second. Essential for transaction-heavy workloads like databases.
  • Throughput: The amount of data transferred to or from a volume per second, usually measured in MB/s. Essential for streaming or large data processing.
  • Burst Balance: A metric for gp2, st1, and sc1 volumes representing the amount of "burst" credits remaining to exceed baseline performance.
  • Queue Length: The number of pending I/O requests for a device. High queue length often indicates a bottleneck.
  • S3 Transfer Acceleration: A bucket-level feature that enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket using Amazon CloudFront’s globally distributed Edge Locations.

The "Big Idea"

In a cloud environment, storage performance is not just about choosing the right "disk." It is a dynamic balance between latency, throughput, and cost. Effective CloudOps involves shifting from reactive troubleshooting to proactive monitoring. By mastering CloudWatch metrics, you can identify when an application is exceeding its IOPS allotment and automatically scale or switch volume types before the user experience degrades.

Formula / Concept Box

| Concept | Metric / Formula | Key Interpretation |
| --- | --- | --- |
| Throughput Formula | $Throughput = IOPS \times I/O\ Size$ | Larger I/O sizes require more throughput for the same number of IOPS. |
| EBS Health | VolumeQueueLength | Low for transaction-intensive; High for throughput-intensive (HDD). |
| Burst Health | BurstBalance | If it reaches 0%, the volume is throttled to its baseline performance. |
| S3 Efficiency | Multi-part Upload | Recommended for objects > 100 MB; Required for objects > 5 GB. |

Hierarchical Outline

  • Amazon EBS Performance Analysis
    • Critical CloudWatch Metrics
      • VolumeReadOps / VolumeWriteOps: Used to calculate total IOPS.
      • VolumeQueueLength: Identifying bottlenecks in the OS or network link.
      • BurstBalance: Monitoring credit depletion for burstable volumes.
    • Performance Optimization
      • EBS-Optimized Instances: Ensuring dedicated bandwidth for storage traffic.
      • Fast Snapshot Restore (FSR): Eliminating the latency penalty of first-touch reads on new volumes.
      • Volume Type Switching: Moving from gp2 to gp3 or io2 for predictable performance.
  • Amazon S3 Performance Optimization
    • Transfer Optimization
      • Multi-part Upload: Parallelizing uploads for higher throughput and reliability.
      • S3 Transfer Acceleration: Using Edge Locations to reduce latency over long distances.
    • Storage Management
      • S3 Lifecycle Policies: Automating transitions to lower-cost tiers based on access patterns.
      • DataSync: Simplifying large-scale data transfers into S3.

Visual Anchors

[Diagram: EBS Performance Troubleshooting Flow]

[Diagram: Data Transfer Comparison]

Definition-Example Pairs

  • Metric: VolumeQueueLength
    • Definition: The number of I/O requests waiting to be processed by the storage device.
    • Real-World Example: In a busy grocery store, the "Queue Length" is the number of people waiting in line. If the cashier (EBS volume) is too slow, the line grows. For a database, a long line means the application has to wait to save data, causing lag.
  • Feature: S3 Lifecycle Policies
    • Definition: A set of rules that define actions that Amazon S3 applies to a group of objects (e.g., transition to Glacier or expiration).
    • Real-World Example: An office that keeps physical files in a desk for 30 days (S3 Standard), moves them to a filing cabinet for 90 days (S3 Standard-IA), and eventually sends them to an off-site warehouse for 7 years (Glacier) before shredding them (Expiration).

Worked Examples

Example 1: Troubleshooting Throttled EBS

Scenario: A developer reports that a database on an Amazon EC2 instance is experiencing high latency every afternoon.

  1. Metric Analysis: You check CloudWatch and see BurstBalance for the gp2 volume dropping to 0% at 2:00 PM and staying there until 4:00 PM.
  2. Diagnosis: The workload is exceeding the baseline IOPS provided by the current volume size, depleting the burst bucket.
  3. Remediation:
    • Short term: Increase the size of the gp2 volume (which increases baseline IOPS).
    • Long term: Migrate to a gp3 volume to provision higher IOPS independently of storage size, ensuring more cost-effective performance.

Example 2: Optimizing Large File Uploads to S3

Scenario: You need to upload a 50 GB database backup file from an on-premises server in London to an S3 bucket in Tokyo.

  1. Action 1: Enable S3 Transfer Acceleration on the bucket to utilize the AWS global network.
  2. Action 2: Use the AWS CLI or SDK to perform a Multi-part Upload.
  3. Benefit: If a network interruption occurs, only the failed part (e.g., 100 MB) needs to be re-uploaded instead of the entire 50 GB file.
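To see what a multi-part upload of this backup might look like, the sketch below counts the parts for a given part size and checks them against the S3 multipart limits (at most 10,000 parts, each at least 5 MiB except the last). It treats the 50 GB file and 100 MB parts as binary units (GiB and MiB) for simplicity; the function name is hypothetical.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
MIN_PART_SIZE = 5 * MIB  # smallest allowed part (the last part may be smaller)
MAX_PARTS = 10_000       # maximum number of parts per upload


def plan_parts(object_size_bytes: int, part_size_bytes: int) -> int:
    """Return how many parts an upload needs, validating multipart limits."""
    if part_size_bytes < MIN_PART_SIZE:
        raise ValueError("part size is below the 5 MiB minimum")
    parts = math.ceil(object_size_bytes / part_size_bytes)
    if parts > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part size")
    return parts


# 50 GiB backup split into 100 MiB parts.
print(plan_parts(50 * GIB, 100 * MIB))  # 512
```

With 512 independent parts, a dropped connection costs one 100 MiB retry instead of restarting the whole transfer, and parts can be sent in parallel.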

Checkpoint Questions

  1. Which CloudWatch metric is the most direct indicator that an EBS volume is acting as a bottleneck due to pending I/O requests?
  2. For a throughput-intensive application using HDD volumes (st1), is a high VolumeQueueLength always considered a failure state? Why or why not?
  3. What is the minimum object size for which AWS recommends using Multi-part uploads for S3?
  4. How does enabling "Fast Snapshot Restore" affect the performance of a newly created EBS volume?
  5. Which AWS service can be used to automate the modification of an EBS volume type when a CloudWatch alarm is triggered?

Answers
  1. VolumeQueueLength.
  2. No. HDD volumes are less sensitive to latency and can actually benefit from higher queue lengths for large, sequential I/O.
  3. 100 MB (multipart upload becomes mandatory for objects larger than 5 GB, the limit for a single PUT).
  4. It eliminates the latency penalty (initialization/pre-warming) by ensuring the volume is fully initialized at creation.
  5. AWS Systems Manager (SSM) Automation combined with Amazon EventBridge.

Curriculum Overview (703 words)

Curriculum Overview: Analyzing Events with the AWS Personal Health Dashboard

Analyze events using the AWS Personal Health Dashboard


This curriculum overview details the learning path for mastering the AWS Personal Health Dashboard and the AWS Health API. By the end of this curriculum, learners will be able to monitor, analyze, and automate responses to service-level interruptions and planned changes within an AWS environment.

Prerequisites

Before beginning this module, learners should have a foundational understanding of the following AWS concepts and services:

  • AWS Management Console & CLI: Basic navigation and programmatic access.
  • Core AWS Services: General familiarity with Amazon EC2, Amazon S3, and AWS Lambda.
  • Amazon EventBridge (formerly CloudWatch Events): Understanding of event-driven architectures and rule routing.
  • AWS Organizations (Optional but recommended): Knowledge of multi-account management strategies.
  • Foundational Cloud Monitoring: Familiarity with the concepts of uptime, availability, and incident response.

[!NOTE] A mathematical understanding of availability goals is helpful. Many AWS service SLAs commit to 99.9% or higher monthly uptime, and the exact figure varies by service. You can calculate availability using the following equation:

$$\text{Availability (\%)} = \left( \frac{\text{Total Time} - \text{Downtime}}{\text{Total Time}} \right) \times 100$$
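As a quick illustration of the availability formula, the sketch below computes availability from downtime and inverts it to get a downtime budget. The function names and the 30-day month are assumptions chosen for the example.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Availability (%) = (total - downtime) / total * 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(total_minutes: float, target_percent: float) -> float:
    """Downtime budget implied by an availability target."""
    return total_minutes * (1 - target_percent / 100)


if __name__ == "__main__":
    month = 30 * 24 * 60  # minutes in an assumed 30-day month
    print(round(allowed_downtime_minutes(month, 99.9), 2), "minutes of downtime allowed")
    print(round(availability_percent(month, 43.2), 3), "% availability")
```

For a 30-day month, a 99.9% target leaves a budget of about 43 minutes of downtime.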

Module Breakdown

The curriculum is structured into four progressive modules, transitioning from fundamental visibility to advanced automated remediation.

| Module | Title | Difficulty | Core Focus |
| --- | --- | --- | --- |
| Module 1 | AWS Health Fundamentals | Beginner | Public vs. Personal Health Dashboards, UI navigation, and time zone configuration. |
| Module 2 | Multi-Account Visibility | Intermediate | AWS Organizations integration, centralized event aggregation. |
| Module 3 | Automated Event Remediation | Advanced | Amazon EventBridge integration, AWS Lambda triggers, and SNS notifications. |
| Module 4 | Enterprise Integrations & AHA | Expert | AWS Health API, AWS Health Aware (AHA) framework, Slack/Teams routing. |

Architectural Context

The following diagram illustrates how AWS Health events flow from the core infrastructure to the end-user or automated remediation systems.

[Diagram: flow of AWS Health events from core infrastructure to end users and automated remediation systems]

Learning Objectives per Module

Module 1: AWS Health Fundamentals

  • Differentiate between the public AWS Health Dashboard (all global service events) and the AWS Personal Health Dashboard (personalized to your active resources).
  • Configure the Personal Health Dashboard settings, including local time zone preferences and console notifications.
  • Analyze alerts for event information, affected resources, and AWS-recommended troubleshooting guidance.

Module 2: Multi-Account Visibility

  • Enable AWS Health organizational view using AWS Organizations.
  • Aggregate health events from multiple member accounts into a single, centralized management account dashboard.
  • Identify cross-account impact during a localized AWS service degradation.

Module 3: Automated Event Remediation

  • Define Amazon EventBridge rules that match AWS Health events (event source aws.health).
  • Deploy AWS Lambda functions to execute automated runbooks (e.g., auto-restarting a degraded EC2 instance).
  • Route specific event categories (scheduled changes vs. critical notifications) to targeted operational teams.
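The Module 3 objectives can be sketched as EventBridge event patterns. The example below builds one pattern per event category so each can be attached to a different rule and target; the helper name is hypothetical, and the field names (`source`, `detail-type`, `detail.eventTypeCategory`) follow the AWS Health event format as I understand it, so confirm them against the AWS Health documentation before use.

```python
import json


def health_event_pattern(categories):
    """Build an EventBridge event pattern that matches AWS Health events
    in the given eventTypeCategory values (e.g. "scheduledChange", "issue")."""
    return {
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": {"eventTypeCategory": list(categories)},
    }


# One pattern per operational team, mirroring the routing objective above.
scheduled_changes = health_event_pattern(["scheduledChange"])
critical_issues = health_event_pattern(["issue"])

if __name__ == "__main__":
    print(json.dumps(scheduled_changes, indent=2))
```

The resulting JSON is what you would paste into a rule's event pattern, or pass as a string when creating the rule programmatically.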

Module 4: Enterprise Integrations & AHA

  • Evaluate support plan requirements (Business or Enterprise Support) to unlock direct AWS Health API access.
  • Deploy the AWS Health Aware (AHA) solution using AWS CloudFormation.
  • Integrate AHA with external operational channels such as Slack, Microsoft Teams, Splunk, or DataDog.

Success Metrics

How will you know you have mastered this curriculum? You should be able to consistently demonstrate the following capabilities:

  1. Dashboard Configuration: Successfully locate the Personal Health Dashboard and filter events by region, service, and status.
  2. Event Routing: Create a working EventBridge rule that captures an AWS Health scheduled maintenance event and successfully routes it to an SNS topic.
  3. Automation Readiness: Write a basic Lambda script that parses the JSON payload of an AWS Health event and outputs the affected Resource IDs to CloudWatch Logs.
  4. AHA Deployment (Stretch): Successfully launch the AWS Health Aware CloudFormation stack and receive a test notification in a third-party chat application.
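A minimal sketch of the Lambda function described in success metric 3 might look like the following. It assumes the event carries affected resources under `detail.affectedEntities[].entityValue`, with the top-level `resources` list as a fallback; the sample event at the bottom is hand-written for illustration and is not a captured AWS payload.

```python
import json


def lambda_handler(event, context):
    """Extract affected resource IDs from an AWS Health event and log them.

    In Lambda, anything printed to stdout is written to CloudWatch Logs.
    """
    detail = event.get("detail", {})
    entities = detail.get("affectedEntities", [])
    resource_ids = [e["entityValue"] for e in entities if "entityValue" in e]

    # Fall back to the top-level resources list when no entities are present.
    if not resource_ids:
        resource_ids = list(event.get("resources", []))

    print(json.dumps({
        "eventTypeCode": detail.get("eventTypeCode"),
        "resourceIds": resource_ids,
    }))
    return {"resourceIds": resource_ids}


# Hypothetical sample event for local testing only.
SAMPLE_EVENT = {
    "source": "aws.health",
    "detail-type": "AWS Health Event",
    "resources": ["i-0123456789abcdef0"],
    "detail": {
        "eventTypeCode": "AWS_EC2_INSTANCE_RETIREMENT_SCHEDULED",
        "eventTypeCategory": "scheduledChange",
        "affectedEntities": [{"entityValue": "i-0123456789abcdef0"}],
    },
}

if __name__ == "__main__":
    lambda_handler(SAMPLE_EVENT, None)
```

Returning the IDs as well as logging them keeps the function easy to test locally and easy to chain into a later remediation step.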

[!IMPORTANT] The AWS Health API is only available to customers on an AWS Business, Enterprise On-Ramp, or Enterprise Support plan. Ensure your lab environment has the appropriate tier, or rely on EventBridge routing for standard accounts.

Real-World Application

In a real-world CloudOps or SysOps role, relying solely on reactive customer complaints to discover infrastructure degradation is a critical failure.

The Scenario: AWS detects degraded underlying hardware hosting one of your mission-critical Amazon EC2 instances.

The Application: Instead of waiting for the instance to fail completely, AWS publishes a scheduled maintenance event to your AWS Personal Health Dashboard. Because you implemented the lessons from this curriculum:

  1. The event is immediately caught by an Amazon EventBridge rule.
  2. The rule triggers an AWS Systems Manager (SSM) Automation runbook.
  3. The runbook automatically stops and then starts the instance during an off-peak maintenance window, which moves it onto healthy underlying hardware.
  4. The AWS Health Aware (AHA) solution posts a summary of the incident and the automated resolution directly into your CloudOps Slack channel.

[Diagram: AWS Health Aware (AHA) routing flow]

By leveraging these tools, you transform unpredictable cloud infrastructure events into fully automated, actionable, and trackable workflows.

Study Guide (820 words)

Analyzing Security Findings: Amazon Inspector and AWS Security Hub

Analyze findings from Security Hub and Inspector


This guide focuses on the centralized management and analysis of security alerts within an AWS environment, specifically through the lens of Amazon Inspector and AWS Security Hub.

Learning Objectives

After studying this guide, you should be able to:

  • Analyze and group findings within the Amazon Inspector console.
  • Configure suppression rules to filter out noise in vulnerability reports.
  • Explain the benefits of centralizing security findings in AWS Security Hub.
  • Identify the prerequisites for exporting Inspector findings to Amazon S3.
  • Implement automated remediation workflows using EventBridge and Security Hub.

Key Terms & Glossary

  • Finding: A detailed report of a potential security issue or vulnerability identified by an AWS security service.
  • CVE (Common Vulnerabilities and Exposures): A list of publicly disclosed computer security flaws. Inspector maps findings to these IDs.
  • Suppression Rule: A set of criteria in Inspector used to automatically hide findings that are known risks or non-issues.
  • Insight: A collection of related findings in Security Hub that helps identify specific areas of risk (e.g., "S3 buckets with public read access").
  • Security Standard: A set of controls (e.g., CIS AWS Foundations) that Security Hub uses to measure your compliance.

The "Big Idea"

In a modern cloud environment, "security fatigue" is a real threat—admins are often overwhelmed by thousands of disconnected alerts. The Big Idea here is Centralized Visibility. Instead of checking EC2, ECR, and IAM separately, AWS uses Security Hub as a "single pane of glass" to aggregate findings from Inspector (vulnerabilities), GuardDuty (threats), and Macie (data privacy), allowing for prioritized, automated remediation.

Formula / Concept Box

| Feature | Amazon Inspector | AWS Security Hub |
| --- | --- | --- |
| Primary Goal | Vulnerability & Reachability Scanning | Centralized Security Posture Management |
| Scan Targets | EC2 instances, ECR images, Lambda | Integrated AWS Services & 3rd Party Tools |
| Logic | Scans software packages and network paths | Aggregates findings and checks against Standards |
| Output | Findings (CSV/JSON/S3/Security Hub) | Insights, Compliance Scores, Actions |

Hierarchical Outline

  1. Amazon Inspector: Vulnerability Management
    • Finding Types: Software package vulnerabilities and network reachability.
    • Analysis Techniques: Grouping by account, instance, or finding state.
    • Suppression Rules: Using filters to exclude specific findings from the view.
    • Exporting Data:
      • EventBridge: For real-time notifications/remediation.
      • S3 Buckets: For long-term archival (requires AWS KMS encryption).
  2. AWS Security Hub: The Aggregator
    • Centralization: Collects findings from Inspector, GuardDuty, and Config.
    • Compliance: Automated checks against standards (PCI DSS, CIS, AWS Best Practices).
    • Dashboards: Visualizing security trends and high-priority issues.
    • Automation: Triggering Lambda or SSM via EventBridge custom actions.

Visual Anchors

[Diagram: Finding lifecycle flow]

[Diagram: Logical architecture of analysis]

Definition-Example Pairs

  • Suppression Rule
    • Definition: A filter that hides specific findings based on criteria like AMI ID or Severity.
    • Example: Suppressing all "Medium" severity findings on a legacy development server that is scheduled for decommissioning next week.
  • Finding Grouping
    • Definition: Organizing findings by shared attributes to identify patterns.
    • Example: Grouping by "Vulnerability ID" to see if one specific outdated library is present across 50 different EC2 instances.
  • Security Standard
    • Definition: A prepackaged set of security best practices used for automated auditing.
    • Example: Using the CIS AWS Foundations Benchmark to automatically detect if the 'root' user has an active access key.

Worked Examples

Example 1: Exporting Inspector Findings to S3

Scenario: You need to archive all Inspector findings for compliance auditing over the next 7 years.

  1. Create an S3 Bucket: Ensure the bucket exists in the target region.
  2. Configure KMS: Create a symmetric KMS key. Inspector requires a customer-managed key to encrypt findings during the export process.
  3. Set Permissions: Update the S3 bucket policy and KMS key policy to allow the Inspector service principal (inspector2.amazonaws.com) to perform s3:PutObject and kms:GenerateDataKey.
  4. Generate Report: In the Inspector console, select "Reports," choose S3 as the destination, and provide the KMS Key ARN.
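Step 3 can be sketched as a bucket policy document. The example below builds only the statement described above (the Inspector service principal with `s3:PutObject`); the bucket name and account ID are placeholders, the `aws:SourceAccount` condition is an added hardening assumption, and the official Inspector documentation may require further actions or conditions, so treat this as a starting point rather than a complete policy.

```python
import json


def inspector_export_bucket_policy(bucket_name: str, account_id: str) -> dict:
    """Build a minimal S3 bucket policy letting Amazon Inspector write reports.

    bucket_name and account_id are caller-supplied placeholders.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowInspectorExport",
                "Effect": "Allow",
                "Principal": {"Service": "inspector2.amazonaws.com"},
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
                # Assumption: restrict the service principal to your own account.
                "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = inspector_export_bucket_policy("example-findings-archive", "111122223333")
    print(json.dumps(policy, indent=2))
```

The KMS key policy needs an equivalent statement granting the same principal `kms:GenerateDataKey`, as noted in step 3.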

Example 2: Auto-Remediation Workflow

Scenario: Automatically stop an EC2 instance if Security Hub reports it has a "Critical" vulnerability.

  1. Finding Source: Inspector detects a critical CVE on instance i-12345 and sends it to Security Hub.
  2. Security Hub Insight: Security Hub flags the finding as critical.
  3. EventBridge Rule: Create a rule with an event pattern matching Source: aws.securityhub and Severity.Label: CRITICAL.
  4. Target: Set the target to an SSM Automation document AWS-StopEC2Instance.
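The rule in step 3 can be expressed as an event pattern, and a remediation function needs the instance ID from the matched finding. The sketch below assumes findings arrive as "Security Hub Findings - Imported" events in the AWS Security Finding Format, with EC2 resources typed `AwsEc2Instance` and identified by ARN; both helper names and the sample event are illustrative.

```python
def critical_findings_pattern() -> dict:
    """EventBridge pattern matching imported Security Hub findings labelled CRITICAL."""
    return {
        "source": ["aws.securityhub"],
        "detail-type": ["Security Hub Findings - Imported"],
        "detail": {"findings": {"Severity": {"Label": ["CRITICAL"]}}},
    }


def ec2_instance_ids(event: dict) -> list:
    """Collect EC2 instance IDs from the findings carried by a matched event."""
    ids = []
    for finding in event.get("detail", {}).get("findings", []):
        for resource in finding.get("Resources", []):
            if resource.get("Type") == "AwsEc2Instance":
                # The resource Id is an ARN ending in "instance/<instance-id>".
                ids.append(resource["Id"].rsplit("/", 1)[-1])
    return ids


# Hypothetical, trimmed-down event for local testing only.
SAMPLE_EVENT = {
    "source": "aws.securityhub",
    "detail-type": "Security Hub Findings - Imported",
    "detail": {
        "findings": [
            {
                "Severity": {"Label": "CRITICAL"},
                "Resources": [
                    {
                        "Type": "AwsEc2Instance",
                        "Id": "arn:aws:ec2:us-east-1:111122223333:instance/i-0123456789abcdef0",
                    }
                ],
            }
        ]
    },
}

if __name__ == "__main__":
    print(ec2_instance_ids(SAMPLE_EVENT))
```

Stopping an instance on every critical finding is disruptive, so in practice a rule like this is often scoped further, for example by tag or by account.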

Checkpoint Questions

  1. Which service provides prepackaged security standards like PCI DSS and CIS?
  2. What is mandatory for Amazon Inspector to export findings to an Amazon S3 bucket?
  3. How do you exclude known low-risk vulnerabilities from the Inspector console view without deleting them?
  4. To which AWS service does Inspector automatically export findings for real-time remediation triggers?

Answers
  1. AWS Security Hub.
  2. A Customer Managed KMS Key for encryption.
  3. Create a Suppression Rule.
  4. Amazon EventBridge.

Study Guide (1,050 words)

SOA-C03 Study Guide: Performance Analysis & Automated Remediation

Analyze performance metrics and automate remediation strategies by using AWS services and functionality (for example, CloudWatch, AWS User Notifications, AWS Lambda, AWS Systems Manager, CloudTrail, auto scaling)


Performance Analysis & Automated Remediation

This guide focuses on Content Domain 1 of the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam, specifically targeting the ability to analyze metrics and implement self-healing architectures.

Learning Objectives

After studying this guide, you should be able to:

  • Analyze CloudWatch metrics to identify performance bottlenecks and system failures.
  • Configure EventBridge rules to route operational events to remediation targets.
  • Implement AWS Systems Manager (SSM) Automation runbooks for common issues.
  • Automate EC2 instance recovery and scaling based on health and performance triggers.
  • Utilize AWS Health events to proactively respond to service-level interruptions.

Key Terms & Glossary

  • CloudWatch Alarm: A mechanism that watches a single metric over a specified time period and performs one or more actions based on the value of the metric relative to a threshold.
  • EventBridge (formerly CloudWatch Events): A serverless event bus that makes it easy to connect applications using data from your own applications, integrated SaaS applications, and AWS services.
  • SSM Automation Runbook: A document that defines the actions that Systems Manager performs on your managed instances and other AWS resources.
  • Metric Filter: A way to extract metric data from log groups in CloudWatch Logs.
  • Target: The resource or endpoint that EventBridge sends an event to when a rule's pattern is matched (e.g., Lambda, SSM, SNS).

The "Big Idea"

[!IMPORTANT] The core philosophy of modern SysOps is "Detection to Remediation without Intervention."

Instead of a human responder manually fixing a disk space issue or restarting a service, we build a closed-loop system:

  1. Detect (CloudWatch Metrics/Logs)
  2. Evaluate (CloudWatch Alarms)
  3. Act (EventBridge -> Lambda/SSM)
  4. Verify (Status Checks/Metrics return to normal).

Formula / Concept Box

| Component | Role | Example |
| --- | --- | --- |
| Producer | Generates the event/metric | Amazon EC2, CloudTrail, AWS Health |
| Evaluator | Decides if action is needed | CloudWatch Alarm (Static or Anomaly Detection) |
| Router | Connects the signal to the fix | Amazon EventBridge Rules |
| Remediator | Executes the corrective logic | AWS Lambda, SSM Automation, Auto Scaling |

Hierarchical Outline

  1. Monitoring & Data Collection
    • Standard Metrics: CPU, Network, Disk I/O (available by default).
    • Custom Metrics: Memory utilization, Disk Swap (requires CloudWatch Agent).
    • Log Processing: Using Metric Filters to turn log patterns into CloudWatch metrics that can be graphed and alarmed on.
  2. Event-Driven Response
    • EventBridge: Matching patterns (e.g., EC2 State Change) and routing to targets.
    • AWS Health API: Responding to scheduled maintenance or regional service outages.
  3. Remediation Tools
    • SSM Automation: Predefined runbooks for patching, restarting, and resource optimization.
    • AWS Lambda: Custom Python/Node scripts for complex logic (e.g., updating Route 53 during a failover).
    • Auto Scaling: Dynamic, Scheduled, and Predictive scaling based on historical patterns.

Visual Anchors

[Diagram: Automated remediation flow]

[Diagram: Performance optimization cycle]

Definition-Example Pairs

  • Anomaly Detection: A CloudWatch feature that applies machine learning to a metric's history to create a baseline of expected behavior.
    • Example: Identifying a sudden drop in application requests that occurs at 2:00 PM on a Tuesday, which usually sees high traffic.
  • Predictive Scaling: An Auto Scaling policy that uses machine learning to predict future traffic and schedule capacity changes in advance.
    • Example: An e-commerce site scaling up EC2 instances on Friday morning in anticipation of a weekend sale based on the last 3 months of data.
  • EC2 Status Check Remediation: Automatically recovering an instance if the underlying hardware fails.
    • Example: Using a CloudWatch Alarm on StatusCheckFailed_System to trigger the Recover action, which moves the instance to new hardware while keeping the same IP and ID.

Worked Examples

Scenario: Remediating Low Disk Space on EC2

Problem: An application server stops responding because the root EBS volume is 100% full.

Step-by-Step Solution:

  1. Metric Collection: Install the CloudWatch Agent on the EC2 instance to collect disk_used_percent (this is not a standard metric).
  2. Alarm Creation: Create a CloudWatch Alarm that triggers when disk_used_percent > 80% for 5 minutes.
  3. EventBridge Rule: Create an EventBridge rule that triggers when the Alarm enters the ALARM state.
  4. Target Selection: Set the target to an SSM Automation runbook (e.g., AWS-ExtendEbsVolume to grow the volume, or a custom runbook that clears /tmp files).
  5. Verification: The alarm should return to OK once the cleanup/expansion is complete.
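Steps 2 and 3 can be sketched as plain data structures: the arguments for a CloudWatch alarm and the EventBridge pattern that reacts to it. The sketch assumes the CloudWatch Agent publishes `disk_used_percent` in the `CWAgent` namespace with `InstanceId` and `path` dimensions; the real dimension set depends on your agent configuration and must match exactly. The function names and alarm name are hypothetical.

```python
def disk_alarm_kwargs(instance_id: str, path: str = "/") -> dict:
    """Arguments for a CloudWatch alarm on root-volume usage above 80% for 5 minutes."""
    return {
        "AlarmName": f"disk-used-high-{instance_id}",
        "Namespace": "CWAgent",
        "MetricName": "disk_used_percent",
        # Assumption: the agent is configured to publish these two dimensions only.
        "Dimensions": [
            {"Name": "InstanceId", "Value": instance_id},
            {"Name": "path", "Value": path},
        ],
        "Statistic": "Average",
        "Period": 300,            # one 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def alarm_state_pattern(alarm_name: str) -> dict:
    """EventBridge pattern matching the alarm's transition into the ALARM state."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"alarmName": [alarm_name], "state": {"value": ["ALARM"]}},
    }


if __name__ == "__main__":
    kwargs = disk_alarm_kwargs("i-0123456789abcdef0")
    print(kwargs["AlarmName"])
    print(alarm_state_pattern(kwargs["AlarmName"]))
```

With boto3 installed and credentials configured, the first dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`; that call is left out here so the sketch runs without AWS access.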

Scenario: Lambda Performance Tuning

Problem: A Lambda function is frequently throttling or timing out.

Step-by-Step Solution:

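  1. Analyze Metrics: Check Throttles, Duration, and Errors in CloudWatch. Throttles point to a concurrency limit, while Duration values close to the configured timeout point to a slow or under-resourced function.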
  2. Optimization: Use AWS Compute Optimizer to analyze the function's memory allocation.
  3. Action: If Compute Optimizer suggests the function is memory-constrained, increase the memory setting (which also proportionally increases CPU power).

Checkpoint Questions

  1. Which metric requires the CloudWatch Agent to be installed on an EC2 instance? (Answer: Memory utilization or Disk space usage).
  2. What is the difference between an EventBridge Rule and a CloudWatch Alarm? (Answer: An Alarm monitors a specific threshold over time; a Rule matches a state change or event pattern instantaneously).
  3. How can you automate the recovery of an EC2 instance that failed a system status check? (Answer: Create a CloudWatch Alarm for the StatusCheckFailed_System metric and add an 'EC2 Action' to 'Recover').
  4. True or False: Predictive scaling is best for workloads that have random, unpredictable traffic spikes. (Answer: False. It requires historical patterns to work effectively).
  5. What service allows you to integrate AWS Health events with Slack or Microsoft Teams? (Answer: Amazon EventBridge or the AWS Health Aware (AHA) solution).

Study Guide (890 words)

Study Guide: Analyzing Spend Patterns with AWS Cost Explorer

Analyze spend patterns using AWS Cost Explorer


This guide covers the fundamental and advanced capabilities of AWS Cost Explorer for the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam, focusing on visualization, forecasting, and cost management strategies.

Learning Objectives

By the end of this study guide, you will be able to:

  • Enable and Configure AWS Cost Explorer within the Billing and Cost Management console.
  • Identify Dimensions used for filtering cost data, including Regions, Services, and Cost Allocation Tags.
  • Differentiate between the visualization capabilities of Cost Explorer and the raw data of Cost and Usage Reports (CUR).
  • Forecast Future Spend and evaluate Reserved Instance (RI) utilization/coverage.
  • Manage Permissions and understand the impact of AWS Organizations on historical data visibility.

Key Terms & Glossary

  • AWS Cost Explorer (CE): A tool that enables you to visualize, understand, and manage your AWS costs and usage over time.
  • Cost and Usage Report (CUR): The most granular AWS billing dataset, which can be integrated into Redshift or QuickSight.
  • Cost Allocation Tags: Metadata (key-value pairs) applied to resources used to categorize and track costs at a granular level (e.g., Project: SecretAlpha).
  • Forecast: A prediction of costs for the next 12 months, based on historical usage patterns.
  • Paginated API Request: A programmatic call to retrieve data from Cost Explorer, which incurs a specific per-request fee.

The "Big Idea"

While raw billing data (CUR) is useful for deep data science and SQL-based auditing, AWS Cost Explorer is the primary engine for human-driven visualization. It transforms complex billing rows into actionable insights, allowing administrators to answer "Why did our EC2 spend spike last Tuesday?" or "Will we stay under budget for the next quarter?"

Formula / Concept Box

| Feature | Limit / Metric | Cost / Note |
| --- | --- | --- |
| Custom Reports | Up to 50 reports | Included in free UI access |
| Historical Data | 12 months | Takes a few days to populate initially |
| Forecasting Range | 12 months ahead | Based on previous usage trends |
| Data Refresh | Every 24 hours | Current month data has ~24h latency |
| API Access | $0.01 per request | Charge applies to paginated API calls |
| IAM Access | No default access | Must be explicitly granted via policy |

Hierarchical Outline

  • I. Enabling Cost Explorer
    • Enabled via Billing and Cost Management console.
    • Org Impact: Management accounts can block member account access.
    • Historical Access: Joining an Org hides pre-org data; leaving hides membership-era data.
  • II. Data Visualization & Analysis
    • Preconfigured Views: Top five cost-accruing services, daily/monthly costs.
    • Filtering Dimensions: Account, Service, Region, Instance Type, Tag.
    • RI/Savings Plans: Monitoring utilization (how much you use) and coverage (how much is covered by the plan).
  • III. Reporting Capabilities
    • Custom Reports: Create up to 50 tailored views (e.g., CFO-specific reports).
    • Exporting: Data is the same as CUR but formatted for visual consumption.
  • IV. Security and Governance
    • IAM Policies: Essential for access control (no default access for users).
    • Cost Allocation Tags: Managed via Resource Groups & Tag Editor.

Visual Anchors

[Diagram: Data flow and access]

[Diagram: Historical data retention logic]

Definition-Example Pairs

  • Dimension Filtering: Selecting specific criteria to isolate costs.
    • Example: A SysOps admin filters by Region: us-east-1 and Service: Amazon RDS to investigate a specific database bill increase in North Virginia.
  • RI Utilization Report: A report showing how much of your purchased Reserved Instance discount is actually being used.
    • Example: If you purchased 10 Reserved Instances but only run 7 matching instances, the report shows 70% utilization, signaling a need to increase usage or avoid future purchases.
  • Forecasted Spend: A prediction based on current trends.
    • Example: Predicting that the "Project Secret" will cost $5,000 next month because usage has increased 10% weekly for the last 3 months.

Worked Examples

Scenario 1: The CFO's Special Report

Request: The CFO needs a report every month showing the utilization of EC2 instances tagged with Department: Finance across three specific member accounts.

Step-by-Step Solution:

  1. Enable Tags: Ensure Department is activated as a Cost Allocation Tag in the Billing console.
  2. Filter: Open Cost Explorer and set the filter for Service (EC2), Tag (Department: Finance), and Linked Account (select the 3 accounts).
  3. Save: Click "Save as" to create one of the 50 custom reports.
  4. Visualize: Set the granularity to "Monthly" and the chart type to "Bar" to show trends.

Scenario 2: Programmatic Cost Auditing

Problem: A developer writes a script that calls the Cost Explorer API every minute to update a custom dashboard.

Result: The account incurs unexpected charges.

Explanation: Each paginated API request costs $0.01, so 60 requests/hour × 24 hours = 1,440 requests/day, or $14.40/day. The solution is to cache results or use the free UI for non-automated needs.
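The charge in this scenario is simple to reproduce. The sketch below uses the $0.01 per-request price quoted in this guide and compares polling every minute with polling once an hour; the function name and the 30-day month are illustrative assumptions.

```python
from decimal import Decimal

COST_PER_REQUEST = Decimal("0.01")  # per paginated Cost Explorer API request


def daily_api_cost(requests_per_hour: int, hours: int = 24) -> Decimal:
    """Daily charge for polling the Cost Explorer API at a fixed rate."""
    return COST_PER_REQUEST * requests_per_hour * hours


if __name__ == "__main__":
    every_minute = daily_api_cost(60)
    hourly = daily_api_cost(1)
    print("every minute:", every_minute, "per day,", every_minute * 30, "per 30 days")
    print("hourly cache refresh:", hourly, "per day")
```

Because Cost Explorer data refreshes roughly once a day, an hourly or daily cache refresh loses little freshness while cutting the bill by a factor of 60 or more.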

Checkpoint Questions

  1. How many months of historical data can be viewed in AWS Cost Explorer?
  2. What is the cost associated with using the Cost Explorer User Interface (UI)?
  3. If an IAM user has AdministratorAccess, do they automatically have access to Cost Explorer?
  4. What happens to a standalone account's historical data when it joins an AWS Organization?
  5. How many custom reports can be created in a single account?

Answers
  1. 12 months (and it can forecast 12 months forward).
  2. Free (Only API access incurs a cost).
  3. No. Access to Cost Explorer must be explicitly granted via IAM policies; there is no default access for users.
  4. The account no longer has access to cost and usage data from the time prior to joining the organization (though it regains it if it leaves).
  5. 50 custom reports.

Curriculum Overview (863 words)

AWS Well-Architected Principles & CloudOps Engineering Curriculum Overview

Apply Well-Architected principles to support AWS workloads


AWS Well-Architected Principles & CloudOps Engineering

[!NOTE] Course Overview: A comprehensive curriculum focused on deploying, managing, and operating scalable, highly available, and fault-tolerant systems on AWS, directly aligned with the AWS Certified CloudOps Engineer - Associate (SOA-C03) exam domains.

Prerequisites

To be successful in this curriculum, learners must possess foundational knowledge in general IT operations and cloud computing principles before beginning.

General IT Experience

  • Operations Role: At least 1 year of experience in a systems administrator or related IT operations role.
  • Networking Basics: Understanding of core networking concepts including DNS, TCP/IP, and firewalls.
  • Scripting & OS: Familiarity with at least one scripting language (e.g., Python, Bash) and major operating systems (Linux/Windows).
  • Modern Workflows: Basic understanding of containerization (Docker), orchestration, and CI/CD pipelines (Git).

AWS Knowledge

  • Core Services: Hands-on familiarity with AWS storage (S3, EBS), compute (EC2), and networking services (VPC).
  • AWS Interfaces: Prior experience navigating the AWS Management Console and executing basic commands via the AWS CLI.

Module Breakdown

This curriculum is designed to progressively build your operational capabilities, culminating in advanced automation and remediation skills.

| Module | Title | Difficulty | Core Well-Architected Pillar Focus |
| --- | --- | --- | --- |
| 1 | AWS Operational Foundations | Beginner | Operational Excellence |
| 2 | Monitoring, Logging & Observability | Intermediate | Performance Efficiency |
| 3 | Performance & Cost Optimization | Intermediate | Cost Optimization |
| 4 | Reliability & Business Continuity | Advanced | Reliability |
| 5 | Security & Compliance | Advanced | Security |
| 6 | Deployment & Automation | Advanced | Operational Excellence |

[Diagram: Curriculum progression flow]

Learning Objectives per Module

Module 1: AWS Operational Foundations

  • Understand the Well-Architected Framework: Describe the six pillars (Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability).
  • Master the CLI: Execute commands and analyze outputs using JMESPath query syntax to extract targeted JSON data.

Module 2: Monitoring, Logging, and Observability

  • Implement CloudWatch: Configure static and dynamic alarms for anomalous behavior.
  • Centralize Auditing: Enable AWS CloudTrail and integrate it with CloudWatch Logs Insights for real-time querying.
  • Extend Observability: Deploy the CloudWatch Agent on EC2 and ECS to capture deep system-level metrics.

Module 3: Performance and Cost Optimization

  • Rightsize Compute: Utilize AWS Compute Optimizer to interpret performance metrics and adjust instance families.
  • Optimize Storage: Analyze EBS IOPS and switch volume types to maximize efficiency while reducing monthly spend.
  • Implement FinOps: Configure AWS Budgets and Cost Anomaly Detection to proactively manage cloud expenditures.

Module 4: Reliability and Business Continuity

  • Architect High Availability: Implement Multi-AZ deployments for RDS and configure Route 53 DNS-level failover.
  • Design Disaster Recovery: Compare strategies (Pilot Light vs. Warm Standby) and evaluate RPO/RTO metrics.
  • Automate Backups: Utilize AWS Backup to create centralized retention vaults for EC2, RDS, and EFS.

Module 5: Security and Compliance

  • Enforce Least Privilege: Implement granular IAM identity-based and resource-based policies.
  • Protect Data: Manage encryption keys using AWS KMS and rotate sensitive database credentials via Secrets Manager.
  • Audit Compliance: Deploy AWS Config to monitor configuration changes and flag noncompliant resources automatically.

Module 6: Deployment, Provisioning, and Automation

  • Adopt Infrastructure as Code (IaC): Manage complex resources using AWS CloudFormation and remediate stack drift.
  • Automate Remediation: Connect EventBridge to AWS Systems Manager (SSM) Automation runbooks to self-heal infrastructure.

[Diagram: automated remediation workflow]

Success Metrics

How will you know you have mastered the curriculum? Mastery is evaluated through both objective exam readiness and practical engineering benchmarks.

Practical Validation

  • Zero High-Risk Issues: The ability to review a workload with the AWS Well-Architected Tool and Trusted Advisor and clear all Security and Reliability High-Risk Issues (HRIs).
  • Automated MTTR Reduction: Successfully configuring self-healing runbooks that reduce your Mean Time To Recovery.

$$\text{Availability} = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}$$

[!TIP] A successful cloud operator aims for "Five Nines" (99.999%) availability. This requires mastering the automated remediation techniques taught in Module 6 so downtime approaches zero.
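To see how little room "Five Nines" leaves, rearrange the availability formula for downtime. Assuming a 365-day year (525,600 minutes), the worked calculation is:

$$\text{Downtime}_{\max} = (1 - 0.99999) \times 525{,}600\ \text{min} \approx 5.26\ \text{min per year}$$

A budget of roughly five minutes per year is shorter than most manual incident responses, which is why the target depends on automated remediation.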

Assessment Metrics

  • SOA-C03 Exam Readiness: Consistently scoring 80%+ on practice exams mirroring the official AWS Certified CloudOps Engineer - Associate format.
  • Troubleshooting Speed: Diagnosing complex VPC connectivity or IAM permission denial issues within 15 minutes using the IAM Policy Simulator and VPC Reachability Analyzer.

Real-World Application

Why does mastering the Well-Architected Framework and CloudOps matter in a professional career?

Terminology in Practice

  • Infrastructure as Code (IaC)

    • Definition: Managing and provisioning computing infrastructure through machine-readable definition files rather than physical hardware configuration or interactive configuration tools.
    • Real-World Example: Instead of manually clicking through the AWS Console to build an environment, a CloudOps engineer writes a CloudFormation YAML template that consistently deploys an Auto Scaling Group, ensuring environments are reproducible and version-controlled.
  • Disaster Recovery (Warm Standby)

    • Definition: A DR strategy where a scaled-down version of a fully functional environment is always running in the cloud.
    • Real-World Example: An e-commerce business experiences a catastrophic regional outage during Black Friday. Because they implemented a Warm Standby in a secondary AWS Region, Route 53 failover routing shifts customer traffic to the backup Region once health checks detect the outage, saving millions of dollars in potential lost revenue.

The Operational Mindset

In modern enterprise environments, manual intervention is a bottleneck. By applying these curriculum principles, you transition from a reactive administrator to a proactive CloudOps Engineer. You will save organizations money through automated Spot Instance utilization, protect user data via KMS encryption enforcement, and allow developer teams to deploy faster and safer.

More Study Notes (138)

Auditing AWS Network Protection Services

Audit AWS network protection services (for example, Amazon Route 53 Resolver DNS Firewall, AWS WAF, AWS Shield, AWS Network Firewall) in a single account

820 words

AWS Auditing and Compliance Management: Study Guide

Auditing and Compliance Management

920 words

Mastering Automation: EC2 Image Builder Study Guide

Automate AMI creation using EC2 Image Builder

924 words

Automating AWS Backups and Snapshots Study Guide

Automate snapshots and backups for AWS resources (for example, Amazon EC2 instances, RDS DB instances, Amazon Elastic Block Store [Amazon EBS] volumes, Amazon S3 buckets, DynamoDB tables) by using AWS services (for example, AWS Backup)

875 words

Mastering Automation of Existing AWS Resources (SOA-C03)

Automate the management of existing resources

845 words

Hands-On Lab: Implementing Monitoring, Alarms, and Remediation on AWS

AWS Certified CloudOps Engineer - Associate (SOA-C03) > Unit 1: Monitoring, Logging, Analysis, Remediation, and Performance Optimization

1,083 words

AWS Health and Incident Management Study Guide

AWS Health and Incident Management

890 words

AWS Management and Governance Tools: A Comprehensive Study Guide

AWS Management Tools

850 words

AWS Systems Manager (SSM) Operations: Comprehensive Study Guide

AWS Systems Manager (SSM) Operations

945 words

Curriculum Overview: Backup, Restore, and Disaster Recovery

Backup, Restore, and Disaster Recovery

863 words

Centralized Logging and Analysis: AWS Curriculum Overview

Centralized Logging and Analysis

878 words

Curriculum Overview: Centralized Logging and Analysis on AWS

Centralized Logging and Analysis

816 words

Cloud Financial Management & Cost Optimization

Cloud Financial Management

820 words

Curriculum Overview: Troubleshooting with AWS Networking Logs

Collect and interpret networking logs to troubleshoot issues (for example, VPC flow logs, Elastic Load Balancing [ELB] access logs, AWS WAF web ACL logs, CloudFront logs, container logs)

822 words

Mastering Automated Remediation with Amazon EventBridge

Configure Amazon EventBridge rules to trigger remediation

860 words

Amazon CloudWatch: Network Monitoring & Analysis

Configure and analyze Amazon CloudWatch network monitoring services

840 words

EC2 Auto Scaling & Scaling Policies: Curriculum Overview

Configure and manage EC2 Auto Scaling groups and scaling policies

878 words

Curriculum Overview: Configure and Manage Scaling in AWS Managed Databases

Configure and manage scaling in AWS managed databases (for example, Amazon RDS, Amazon DynamoDB)

660 words

CloudWatch Agent Deep Dive: Metrics and Logs for EC2 & Containers

Configure and manage the CloudWatch agent to collect metrics and logs from Amazon EC2 instances, Amazon Elastic Container Service (Amazon ECS) clusters, or Amazon Elastic Kubernetes Service (Amazon EKS) clusters

1,084 words

Curriculum Overview: ELB and Route 53 Health Checks

Configure and troubleshoot Elastic Load Balancing (ELB) and Amazon Route 53 health checks

796 words

Curriculum Overview: Configure and Manage an AWS VPC

Configure a VPC (for example, subnets, route tables, network ACLs, security groups, NAT gateways, internet gateway, egress-only internet gateway)

811 words

Mastering AWS Budgets and Cost Anomaly Detection

Configure AWS Budgets and Cost Anomaly Detection

820 words

Curriculum Overview: Configure AWS CloudTrail for Account Auditing

Configure AWS CloudTrail for account auditing

826 words

Curriculum Overview: AWS Monitoring and Logging (SOA-C03)

Configure AWS monitoring and logging by using AWS services (for example, Amazon CloudWatch, AWS CloudTrail, Amazon Managed Service for Prometheus)

863 words

AWS Notification Services: SNS, CloudWatch Alarms, and Budgets Study Guide

Configure AWS services to send notifications to Amazon Simple Notification Service (Amazon SNS) and to invoke alarms that send notifications to Amazon SNS

1,084 words

Mastering AWS WAF and Shield for Application Security

Configure AWS WAF and Shield for application protection

920 words

Curriculum Overview: Configure CloudWatch Alarms and Anomaly Detection

Configure CloudWatch alarms and anomaly detection

810 words

Curriculum Overview: Configure CloudWatch Alarms and Anomaly Detection

Configure CloudWatch alarms and anomaly detection

831 words

Curriculum Overview: Configure Content and Service Distribution on AWS

Configure content and service distribution (for example, Amazon CloudFront, AWS Global Accelerator)

782 words

Mastering AWS Route 53 Resolver and DNS Security

Configure DNS (for example, Route 53 Resolver)

820 words

Curriculum Overview: Configure Domains, DNS Services, and Content Delivery

Configure domains, DNS services, and content delivery

810 words

AWS Curriculum Overview: Configuring Fault-Tolerant Systems

Configure fault-tolerant systems (for example, Multi-AZ deployments)

785 words

Identity Security & External Trust: IAM Roles Anywhere and MFA

Configure IAM Roles Anywhere and Multi-Factor Authentication (MFA)

942 words

CloudWatch Alarms: Direct Actions, Composite Logic, and EventBridge Integration

Configure, identify, and troubleshoot CloudWatch alarms that can invoke AWS services directly or through Amazon EventBridge (for example, by creating composite alarms and identifying their invokable actions)

842 words

Curriculum Overview: Configuring Private Networking Connectivity

Configure private networking connectivity

820 words

AWS Security Operations: Configuring Reports and Remediating Findings

Configure reports and remediate findings from AWS services (for example, AWS Security Hub, Amazon GuardDuty, AWS Config, Amazon Inspector)

820 words

Mastering AWS Networking: Subnets, Route Tables, and Gateways

Configure subnets, route tables, and gateways

864 words

Curriculum Overview: Configure the CloudWatch Agent on EC2 and Containers

Configure the CloudWatch agent on EC2 and Containers

925 words

AWS Container Operations and Security Study Guide

Container Operations

860 words

Curriculum Overview: Creating and Managing AMIs & Container Images

Create and manage AMIs and container images (for example, Amazon EC2 Image Builder)

863 words

AWS CloudFormation & CDK: Infrastructure as Code Study Guide

Create and manage stacks of resources by using AWS CloudFormation and the AWS Cloud Development Kit (AWS CDK)

950 words

Curriculum Overview: Advanced Amazon CloudWatch Dashboards

Create, implement, and manage customizable and shareable CloudWatch dashboards that display metrics and alarms for AWS resources across multiple accounts and AWS Regions

698 words

AWS Systems Manager Automation: Predefined & Custom Runbooks Overview

Create or run custom and predefined Systems Manager Automation runbooks (for example, by using AWS SDKs or custom scripts) to automate tasks and streamline processes on AWS

639 words

Curriculum Overview: Automating AWS with Systems Manager (SSM) Runbooks

Create or run custom and predefined Systems Manager Automation runbooks (for example, by using AWS SDKs or custom scripts) to automate tasks and streamline processes on AWS

810 words

Curriculum Overview: AWS Systems Manager Automation Runbooks

Create or run custom and predefined Systems Manager Automation runbooks (for example, by using AWS SDKs or custom scripts) to automate tasks and streamline processes on AWS

863 words

Data Protection and Infrastructure Security: Comprehensive Study Guide

Data Protection and Infrastructure Security

925 words

AWS Elastic Beanstalk: Deployment and Lifecycle Management

Deploy applications using AWS Elastic Beanstalk

860 words

Curriculum Overview: The Six Pillars of the AWS Well-Architected Framework

Describe the six pillars of the Well-Architected Framework

813 words

Curriculum Overview: Design CloudWatch Dashboards for Multi-Account Visibility

Design CloudWatch Dashboards for multi-account visibility

686 words

Curriculum Overview: Detect and Remediate CloudFormation Stack Drift

Detect and remediate CloudFormation stack drift

624 words

Curriculum Overview: Enforcing AWS Compliance Requirements

Enforce compliance requirements (for example, AWS Region and service selections)

814 words

Curriculum Overview: Enforcing Governance using AWS Config

Enforce governance using AWS Config

834 words

AWS Shared Storage Solutions: EFS & FSx Curriculum Overview

Evaluate and select shared storage solutions (for example, Amazon Elastic File System [Amazon EFS], Amazon FSx), and optimize the solutions (for example, EFS lifecycle policies) for specific use cases and requirements

863 words

Optimizing Compute Costs: Evaluating Spot Instances and Savings Plans

Evaluate workloads for EC2 Spot Instance and Savings Plans eligibility

945 words

Chapter Study Guide: Event-Driven Remediation on AWS

Event-Driven Remediation

860 words

Curriculum Overview: Mastering AWS CLI Commands and Output Analysis

Execute CLI commands and Analyze CLI output using query and filter parameters

687 words

Curriculum Overview: Mastering the AWS Command Line Interface (CLI)

Execute commands using the AWS Command Line Interface (CLI)

782 words

Mastering SSM Automation for Automated Remediation

Execute SSM Automation runbooks for remediation

945 words

AWS Cloud Development Kit (CDK): The Evolution of Infrastructure as Code

Explain AWS CDK and its role in IaC

820 words

Curriculum Overview: Managed Service for Prometheus and Grafana

Explain the role of Managed Service for Prometheus and Grafana

863 words

AWS Disaster Recovery Procedures: Implementation & Strategy

Follow disaster recovery procedures

865 words

Curriculum Overview: High Availability and Resilience in AWS

High Availability and Resilience

862 words

Curriculum Overview: Identify and Remediate CloudFront Caching Issues

Identify and remediate CloudFront caching issues

863 words

Troubleshooting AWS Deployment Issues: Curriculum Overview

Identify and remediate deployment issues (for example, subnet sizing issues, CloudFormation errors, permissions issues)

811 words

Curriculum Overview: Automating Remediation and Monitoring Metrics (AWS SOA-C03)

Identify and remediate issues by using monitoring and availability metrics

836 words

Mastering Hybrid and Private Connectivity Troubleshooting

Identify and troubleshoot hybrid connectivity issues and private connectivity issues

865 words

AWS Identity and Access Management (IAM) Study Guide

Identity and Access Management

890 words

Study Guide: Implementing and Enforcing Data Classification

Implement and enforce a data classification scheme

820 words

Study Guide: Security and Compliance Management (SOA-C03)

Implement and manage security and compliance tools and policies

985 words

Curriculum Overview: Amazon S3 Performance & Optimization Strategies

Implement and optimize Amazon S3 performance strategies (for example, AWS DataSync, S3 Transfer Acceleration, multipart uploads, S3 Lifecycle policies) to enhance data transfer, storage efficiency, and access patterns

863 words

Curriculum Overview: Implement and Optimize Networking Features and Connectivity

Implement and optimize networking features and connectivity

810 words

Curriculum Overview: Implementing Automated Instance Recovery

Implement automated instance recovery

863 words

Mastering AWS Identity and Access Management (IAM)

Implement AWS Identity and Access Management (IAM) features (for example, password policies, multi-factor authentication [MFA], roles, federated identity, resource policies, policy conditions)

890 words

Curriculum Overview: Implementing AWS Caching for Dynamic Scalability

Implement caching by using AWS services to enhance dynamic scalability (for example, Amazon CloudFront, Amazon ElastiCache)

837 words

AWS Certified SysOps: Mastering Encryption at Rest with AWS KMS

Implement, configure, and troubleshoot encryption at rest (for example, AWS Key Management Service [AWS KMS])

1,342 words

Mastering Encryption in Transit with AWS Certificate Manager (ACM)

Implement, configure, and troubleshoot encryption in transit (for example, AWS Certificate Manager [ACM])

945 words

Curriculum Overview: Implement Custom Metrics and Namespaces

Implement custom metrics and namespaces

822 words

Curriculum Overview: Implementing Custom Metrics & Namespaces in AWS

Implement custom metrics and namespaces

687 words

Curriculum Overview: Implementing Deployment Strategies & Services

Implement deployment strategies and services

810 words

Curriculum Overview: Event-Driven Automation in AWS

Implement event-driven automation by using AWS services and features (for example, AWS Lambda, Amazon S3 Event Notifications)

694 words

Study Guide: Implementing IAM Policies, Roles, and Groups

Implement IAM policies, roles, and groups

945 words

AWS Monitoring & Logging: Metrics, Alarms, and Filters

Implement metrics, alarms, and filters by using AWS monitoring and logging services

732 words

Curriculum Overview: Implement, Monitor, and Optimize EC2 Capabilities

Implement, monitor, and optimize EC2 instances and their associated storage and networking capabilities (for example, EC2 placement groups)

813 words

Secure Multi-Account Strategies in AWS

Implement multi-account strategies securely

925 words

Curriculum Overview: AWS Compute, Storage, and Database Performance Optimization

Implement performance optimization strategies for compute, storage, and database resources

863 words

Curriculum Overview: AWS Performance Optimization Strategies

Implement performance optimization strategies for compute, storage, and database resources

822 words

Curriculum Overview: Implement Private Connectivity Using VPC Endpoints

Implement private connectivity using VPC Endpoints

673 words

AWS Trusted Advisor: Security Remediation and Best Practices

Implement remediation based on the results of AWS Trusted Advisor security checks

920 words

Route 53 Mastery: Routing Policies, Configuration, and Query Logging

Implement Route 53 routing policies, configurations, and query logging

925 words

Curriculum Overview: Implementing Versioning for Storage Services

Implement versioning for storage services (for example, Amazon S3, Amazon FSx)

725 words

Mastering Infrastructure as Code (IaC) and Resource Provisioning

Infrastructure as Code (IaC)

925 words

Curriculum Overview: Integrate AWS Health Events with External Notification Systems

Integrate AWS Health events with external notification systems

834 words

Curriculum Overview: Manage Elastic Load Balancing (ELB) Listeners and Rules

Manage Elastic Load Balancing (ELB) listeners and rules

786 words

Mastering Fleet Updates with AWS Systems Manager Patch Manager

Manage fleet updates with SSM Patch Manager

820 words

Curriculum Overview: Inter-VPC Connectivity via Peering and Transit Gateway

Manage inter-VPC connectivity via Peering and Transit Gateway

728 words

Curriculum Overview: Managing Stacks Using AWS CloudFormation

Manage stacks using AWS CloudFormation

862 words

Study Guide: Managing Workloads on Amazon ECS and EKS

Manage workloads on Amazon ECS and EKS

940 words

Curriculum Overview: Optimizing and Monitoring Amazon RDS Performance

Monitor Amazon RDS metrics (for example, Amazon RDS Performance Insights, CloudWatch alarms), and modify configurations to increase performance efficiency (for example, Performance Insights proactive recommendations, RDS Proxy)

863 words

Curriculum Overview: Network Troubleshooting and Monitoring

Network Troubleshooting and Monitoring

826 words

AWS Compute Optimization & Performance Remediation Curriculum

Optimize compute resources and remediate performance problems by using performance metrics, resource tags, and AWS tools

829 words

Compute Resource Optimization & Performance Remediation in AWS

Optimize compute resources and remediate performance problems by using performance metrics, resource tags, and AWS tools

878 words

Curriculum Overview: Optimize Compute Resources and Remediate Performance

Optimize compute resources and remediate performance problems by using performance metrics, resource tags, and AWS tools

634 words

Study Guide: Optimizing Compute with AWS Compute Optimizer

Optimize compute resources using AWS Compute Optimizer

1,450 words

Mastering the AWS Management Console & CLI Operations

Perform operations using the AWS Management Console

885 words

AWS Resource Provisioning and Maintenance Study Guide

Provision and maintain cloud resources

1,150 words

Mastering Multi-Account and Multi-Region Resource Sharing

Provision and share resources across multiple AWS Regions and accounts (for example, AWS Resource Access Manager [AWS RAM], CloudFormation StackSets)

1,150 words

Curriculum Overview: Querying Log Data with CloudWatch Logs Insights

Query log data using CloudWatch Logs Insights

945 words

AWS Global Infrastructure: From Regions to the Edge

Region, Availability Zone, Edge Location, Local Zone, Wavelength Zone, Outpost, Direct Connect Location

875 words

AWS Resource Maintenance and Application Provisioning: Curriculum Overview

Resource Maintenance and Application Provisioning

863 words

Resource Performance Optimization: AWS SOA-C03 Study Guide

Resource Performance Optimization

940 words

AWS Curriculum Overview: Scalability and Elasticity

Scalability and Elasticity

820 words

Securely Storing Secrets with AWS Secrets Manager: Curriculum Overview

Securely store secrets by using AWS services (for example AWS Secrets Manager)

820 words

Mastering the AWS Shared Responsibility Model

Shared Responsibility Model

945 words

AWS Global Infrastructure: Curriculum Overview

The AWS Global Infrastructure

836 words

Comprehensive Study Guide: The AWS Well-Architected Framework

The AWS Well-Architected Framework

820 words

Mastering the IAM Policy Simulator: A Troubleshooting Guide

Troubleshoot access issues using IAM Policy Simulator

845 words

Study Guide: Troubleshooting and Auditing AWS Access

Troubleshoot and audit access issues by using AWS tools (for example, AWS CloudTrail, IAM Access Analyzer, IAM policy simulator)

890 words

Mastering VPC Troubleshooting: Connectivity and Configuration

Troubleshoot VPC configurations (for example, subnets, route tables, network ACLs, security groups, transit gateways, NAT gateways)

920 words

Curriculum Overview: Reliability and Business Continuity

Unit 2: Reliability and Business Continuity

686 words

Lab: Building Resilient Storage with S3 Cross-Region Replication

Unit 2: Reliability and Business Continuity

845 words

Curriculum Overview: Deployment, Provisioning, and Automation

Unit 3: Deployment, Provisioning, and Automation

868 words

Lab: Automating Infrastructure and Remediation with CloudFormation and SSM

Unit 3: Deployment, Provisioning, and Automation

820 words

Lab: Hardening AWS Infrastructure with AWS Config and IAM Access Analyzer

Unit 4: Security and Compliance

845 words

Lab: Building High-Performance Content Delivery with Amazon CloudFront and S3 OAC

Unit 5: Networking and Content Delivery

840 words

Curriculum Overview: Unit 6 - Automated Remediation and Remedial Actions

Unit 6: Automated Remediation and Remedial Actions

645 words

Lab: Automated Remediation of Public S3 Buckets with AWS Config and SSM

Unit 6: Automated Remediation and Remedial Actions

820 words

Curriculum Overview: Performance and Cost Optimization (Unit 7)

Unit 7: Performance and Cost Optimization

768 words

Hands-On Lab: Optimizing AWS Resource Performance and Costs

Unit 7: Performance and Cost Optimization

945 words

Automating AWS Operations: Incident Remediation with Systems Manager and EventBridge

Unit 8: AWS Operational Foundations

845 words

Curriculum Overview: AWS Operational Foundations

Unit 8: AWS Operational Foundations

832 words

Curriculum Overview: Automating Resource Deployment with Third-Party Tools

Use and manage third-party tools to automate resource deployment (for example, Terraform, Git)

834 words

Curriculum Overview: Automating AWS Operational Processes

Use AWS services to automate operational processes (for example, AWS Systems Manager)

792 words

AWS EventBridge Mastery: Routing, Enriching, Delivering, and Troubleshooting

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

684 words

Curriculum Overview: Amazon EventBridge Mastery

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

662 words

Curriculum Overview: AWS EventBridge Routing, Enrichment, and Troubleshooting

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

863 words

Curriculum Overview: Mastering AWS EventBridge Routing, Enrichment, and Troubleshooting

Use EventBridge to route, enrich, and deliver events, and troubleshoot any issues with event bus rules

738 words

Database Restoration & Recovery Strategies: RTO, RPO, and Cost Management

Use various methods to restore databases (for example, point-in-time restore) to meet recovery time objective (RTO), recovery point objective (RPO), and cost requirements

985 words

AWS VPC Administration: Comprehensive Study Guide

VPC Administration

845 words

AWS Certified CloudOps Engineer - Associate (SOA-C03) Practice Questions

Try 15 sample questions from a bank of 840. Answers and detailed explanations included.

Q1 (medium)

A company is experiencing higher-than-expected costs for Amazon Elastic Block Store (Amazon EBS) volumes and snapshots. A cloud architect has been asked to explain the best practices for optimizing these storage costs.

Which statement provides a correct explanation of an effective strategy for reducing Amazon EBS expenses?

A. Implementing Amazon Data Lifecycle Manager (DLM) to automate standard snapshot retention policies prevents manual snapshot sprawl, and minimizes costs because standard snapshots only store incremental block changes.

B. Stopping underutilized Amazon EC2 instances during non-business hours is an effective strategy, because AWS automatically pauses billing for the provisioned capacity of any attached EBS volumes while the instance is stopped.

C. Moving all daily EBS snapshots immediately to the EBS Snapshot Archive tier significantly reduces costs, as this tier uses incremental storage to provide the cheapest option for short-term backup retention.

D. Modifying the configuration of underutilized EBS volumes to automatically transition their data to Amazon S3 Intelligent-Tiering is the most efficient way to reduce active block storage costs.

Correct Answer: A

Option A is the correct answer. Amazon Data Lifecycle Manager (DLM) helps optimize costs by automating the creation, retention, and deletion of EBS snapshots, which prevents the costly sprawl of unmanaged backups. Furthermore, standard EBS snapshots are incremental, meaning you are only billed for the changed blocks rather than a full copy of the volume every time, making it highly cost-effective for regular backups.

Option B is incorrect. Amazon EBS volumes are billed based on their provisioned capacity. These costs continue to accrue even if the volume is detached from an instance or if the attached Amazon EC2 instance is in a stopped state.

Option C is incorrect. The EBS Snapshot Archive tier is designed for long-term retention (90 days or more) and stores a full copy of the block data, not incremental changes. Using it for frequent, short-term backups can actually increase overall storage size and incur early deletion or retrieval fees.

Option D is incorrect. Amazon EBS is block-level storage, while Amazon S3 is object-level storage. You cannot directly transition or tier active, underutilized EBS volumes to Amazon S3 Intelligent-Tiering.
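
The difference between incremental standard snapshots and full-copy archived snapshots can be illustrated with a rough calculation. This is a simplified sketch under stated assumptions: the volume is completely filled with data, each day changes a fixed fraction of distinct blocks, and the per-GiB prices are placeholder values, not quoted AWS rates. It also ignores the archive tier's minimum retention period and retrieval fees, which would make archiving short-lived snapshots look even worse.

```python
# Placeholder prices in USD per GiB-month (assumptions for illustration only).
STANDARD_PRICE_PER_GIB = 0.05
ARCHIVE_PRICE_PER_GIB = 0.0125


def standard_snapshot_gib(volume_gib: float, daily_change: float, snapshots: int) -> float:
    """Stored data for incremental snapshots: one full copy plus daily changes."""
    if snapshots <= 0:
        return 0.0
    return volume_gib + (snapshots - 1) * volume_gib * daily_change


def archived_snapshot_gib(volume_gib: float, snapshots: int) -> float:
    """Stored data when every snapshot is archived as a full copy."""
    return float(volume_gib * max(snapshots, 0))


volume_gib = 500      # hypothetical volume size
daily_change = 0.02   # hypothetical: 2% of blocks change per day
snapshots = 30        # one daily snapshot kept for 30 days

standard_cost = standard_snapshot_gib(volume_gib, daily_change, snapshots) * STANDARD_PRICE_PER_GIB
archive_cost = archived_snapshot_gib(volume_gib, snapshots) * ARCHIVE_PRICE_PER_GIB

print(f"Standard tier (incremental): ${standard_cost:.2f} per month")
print(f"Archive tier (full copies):  ${archive_cost:.2f} per month")
```

Even with a much lower per-GiB price, storing thirty full copies costs more than one full copy plus small daily increments, which is why the archive tier suits a few long-lived snapshots rather than frequent short-term backups.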

Q2 (easy)

Which of the following best describes an AWS CloudFormation stack?

A. A collection of AWS resources that you can create, update, and delete as a single unit.

B. A continuous integration and delivery (CI/CD) pipeline used exclusively to compile, test, and deploy application source code.

C. A manual tagging mechanism used solely to group AWS resources together for billing and cost allocation reports.

D. A point-in-time backup snapshot of all databases and storage volumes within a specific AWS Region.

Correct Answer: A

An AWS CloudFormation stack is a collection of AWS resources that you manage as a single unit. When you use AWS CloudFormation, you define your infrastructure as code (IaC) in a template. AWS CloudFormation then provisions and configures those resources together, ensuring all dependencies are handled automatically.

Option B describes a CI/CD service like AWS CodePipeline. Option C describes AWS Resource Groups or Cost Allocation Tags. Option D describes a backup solution like AWS Backup or Amazon EBS snapshots.

Q3 (medium)

A solutions architect is designing the network architecture for a new application. The application runs on Amazon EC2 instances located in a private subnet and needs to read and write large amounts of data to an Amazon S3 bucket. The company's security policy mandates that traffic between the EC2 instances and Amazon S3 must not traverse the public internet.

Which solution meets these requirements while being the MOST cost-effective?

A. Deploy a NAT Gateway in a public subnet and configure the private subnet's route table to direct S3 traffic to the NAT Gateway.

B. Create a Gateway VPC Endpoint for Amazon S3 and configure the private subnet's route table to direct traffic for S3 to the endpoint.

C. Create an Interface VPC Endpoint (AWS PrivateLink) for Amazon S3 and assign a security group to allow traffic from the EC2 instances.

D. Establish an AWS Site-to-Site VPN connection between the VPC and the Amazon S3 service endpoint.

Correct Answer: B

To securely connect a private subnet to Amazon S3 without routing traffic over the public internet, you should use a VPC Endpoint. There are two types of VPC endpoints for S3: Gateway Endpoints and Interface Endpoints (AWS PrivateLink).

Option B is correct because a Gateway VPC Endpoint keeps traffic within the Amazon network, requires a simple route table update, and is offered at no additional charge, making it the most cost-effective solution for S3.

Option A is incorrect because a NAT Gateway routes traffic through an Internet Gateway using public IP addresses, which violates the requirement to avoid the public internet. Additionally, NAT Gateways incur hourly charges and data processing fees.

Option C is incorrect because while an Interface VPC Endpoint does keep traffic private, it incurs an hourly charge per Availability Zone and a data processing fee per gigabyte. The Gateway Endpoint is more cost-effective.

Option D is incorrect because an AWS Site-to-Site VPN is designed to connect an on-premises network to an AWS VPC, not to connect a VPC directly to an AWS service like S3.
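
As a rough sketch of what creating the Gateway VPC Endpoint involves, the helper below assembles the request parameters for the EC2 `CreateVpcEndpoint` operation. The function name and the sample VPC and route table IDs are hypothetical; the parameter keys follow the EC2 API. Nothing here calls AWS.

```python
from typing import Iterable


def s3_gateway_endpoint_request(vpc_id: str, region: str, route_table_ids: Iterable[str]) -> dict:
    """Build CreateVpcEndpoint parameters for an S3 gateway endpoint."""
    route_tables = list(route_table_ids)
    if not route_tables:
        # A gateway endpoint only takes effect through route table entries.
        raise ValueError("At least one route table ID is required")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": route_tables,
    }


# Hypothetical IDs for the private subnet's VPC and route table.
request = s3_gateway_endpoint_request(
    vpc_id="vpc-0123456789abcdef0",
    region="us-east-1",
    route_table_ids=["rtb-0123456789abcdef0"],
)
print(request)
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("ec2").create_vpc_endpoint(**request)`. Associating the private subnet's route table is what adds the S3 prefix-list route that Option B describes.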

Q4 (medium)

An administrator needs to apply a common set of tags and view consolidated metrics for Amazon EC2 instances, Amazon S3 buckets, and Amazon RDS databases that belong to the "Production" environment. Which feature within the AWS Management Console provides the most efficient way to organize these resources and perform these operations?

A. AWS Systems Manager Parameter Store

B. AWS Resource Groups and Tag Editor

C. AWS CloudFormation StackSets

D. AWS Service Catalog

Correct Answer: B

Option B is correct. AWS Resource Groups and Tag Editor in the AWS Management Console are designed for this use case. Tag Editor lets administrators search for resources across multiple AWS services and apply, edit, or delete tags in bulk. Once the resources are tagged, AWS Resource Groups can be used to create custom console views and consolidate metrics for all resources that share specific tags.

Option A is incorrect. AWS Systems Manager Parameter Store is used for configuration data management, not resource grouping.

Option C is incorrect. AWS CloudFormation StackSets are used to provision infrastructure across multiple accounts and Regions.

Option D is incorrect. AWS Service Catalog is used to manage catalogs of approved IT services.
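
The same tag-based grouping can also be expressed programmatically. The sketch below builds the parameters for the Resource Groups `CreateGroup` operation using a tag filter query; the group name, the tag key `Environment`, and the helper function are assumptions for illustration, and no AWS call is made.

```python
import json
from typing import Iterable


def tag_based_group_request(name: str, tag_key: str, tag_value: str,
                            resource_types: Iterable[str]) -> dict:
    """Build CreateGroup parameters for a tag-based resource group."""
    query = {
        "ResourceTypeFilters": list(resource_types),
        "TagFilters": [{"Key": tag_key, "Values": [tag_value]}],
    }
    return {
        "Name": name,
        # The API expects the query itself as a JSON-encoded string.
        "ResourceQuery": {"Type": "TAG_FILTERS_1_0", "Query": json.dumps(query)},
    }


group_request = tag_based_group_request(
    name="production-resources",   # hypothetical group name
    tag_key="Environment",         # hypothetical tag key
    tag_value="Production",
    resource_types=["AWS::EC2::Instance", "AWS::S3::Bucket", "AWS::RDS::DBInstance"],
)
print(json.dumps(group_request, indent=2))
```

With boto3 available, the dictionary could be passed as `boto3.client("resource-groups").create_group(**group_request)`. The group membership then follows the tags automatically, so newly tagged resources appear without editing the group.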

Q5 (medium)

A solutions architect is configuring a web application with a backend origin located in the us-west-2 region. The application must be served globally via an Amazon CloudFront distribution using a custom root domain (e.g., example.com) and must enforce HTTPS.

Which combination of steps is required to correctly configure the custom domain and HTTPS for this distribution?

A. Request an AWS Certificate Manager (ACM) certificate in the us-west-2 region, add the domain as an Alternate Domain Name in CloudFront, and create an Alias A record in Amazon Route 53.

B. Request an ACM certificate in the us-east-1 region, add the domain as an Alternate Domain Name in CloudFront, and create an Alias A record in Amazon Route 53.

C. Request an ACM certificate in the us-east-1 region, add the domain as an Alternate Domain Name in CloudFront, and create a standard CNAME record for the root domain in Amazon Route 53.

D. Request an ACM certificate in the us-east-1 region, configure the custom domain name directly on the backend origin resource, and point a standard A record to a static IP address of a CloudFront edge location.

Correct Answer: B

Option B is correct. To use a custom SSL/TLS certificate with Amazon CloudFront, the certificate must be requested or imported in the US East (N. Virginia / us-east-1) region, regardless of where the backend origin is located. Additionally, the custom domain must be explicitly added as an Alternate Domain Name (CNAME) in the CloudFront distribution settings. Finally, because the DNS protocol does not allow standard CNAME records at the zone apex (the root domain), you must use an Amazon Route 53 Alias record (such as an Alias A record) to route traffic to the CloudFront distribution.

Option A is incorrect because ACM certificates used for CloudFront must be located in us-east-1, not the region of the backend origin.

Option C is incorrect because standard CNAME records cannot be used for a zone apex (root domain). A Route 53 Alias record must be used instead.

Option D is incorrect because the custom domain needs to be recognized by CloudFront via the Alternate Domain Name configuration, not just the backend origin. Furthermore, CloudFront edge locations use dynamic IP addresses, so pointing a standard A record to a static IP will cause routing failures.
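
The Alias A record from Option B can be sketched as a Route 53 change batch. The distribution domain name below is a made-up placeholder and the helper function is hypothetical; `Z2FDTNDATAQYW2` is the fixed hosted zone ID that Route 53 uses for CloudFront alias targets. The code only builds the request body and does not call AWS.

```python
# Fixed hosted zone ID used for every CloudFront alias target.
CLOUDFRONT_HOSTED_ZONE_ID = "Z2FDTNDATAQYW2"


def apex_alias_change_batch(apex_domain: str, distribution_domain: str) -> dict:
    """Build a ChangeResourceRecordSets batch that aliases a zone apex to CloudFront."""
    return {
        "Comment": "Alias the zone apex to a CloudFront distribution",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": apex_domain,
                    "Type": "A",
                    # Alias records carry an AliasTarget instead of a TTL
                    # and a list of IP addresses.
                    "AliasTarget": {
                        "HostedZoneId": CLOUDFRONT_HOSTED_ZONE_ID,
                        "DNSName": distribution_domain,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


change_batch = apex_alias_change_batch(
    apex_domain="example.com.",
    distribution_domain="d111111abcdef8.cloudfront.net.",  # placeholder distribution
)
print(change_batch)
```

With boto3 available, the batch could be submitted through `change_resource_record_sets` on a `route53` client, passing the hosted zone ID of `example.com` as `HostedZoneId` and this dictionary as `ChangeBatch`. Note that the record contains no IP addresses, which is why it keeps working as CloudFront's edge addresses change.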

Q6 (medium)

A company's finance team needs a recurring, visual breakdown of historical AWS cloud spend over the last six months. The spend must be grouped by department using Cost Allocation Tags. The team wants to access this specific graphical view at the end of every month without having to manually recreate the filters or build charts from raw data.

Which solution meets these requirements with the LEAST operational overhead?

A. Use AWS Cost Explorer to filter by Cost Allocation Tags, group by department, and save the custom report.

B. Configure an AWS Budget to automatically generate and send a detailed visual breakdown of the monthly spend.

C. Export the AWS Cost and Usage Report (CUR) to Amazon S3 and manually build the charts in a spreadsheet application.

D. Use the AWS Pricing Calculator to model the monthly costs and group the data by department.

Correct Answer: A

A is correct. AWS Cost Explorer allows users to analyze cloud spend in depth by filtering and grouping data using dimensions such as Cost Allocation Tags. It provides native, built-in graphical visualizations of historical spend. Users can save custom reports, so a specific filtered view can be reopened every month without manual recreation, which minimizes operational overhead.

B is incorrect. AWS Budgets is designed for threshold-based alerting and automated actions, not for generating complex visual reporting of historical spend breakdowns.

C is incorrect. While exporting the AWS Cost and Usage Report (CUR) to Amazon S3 and using a spreadsheet application is possible, it is highly inefficient and requires manual effort to build charts, increasing operational overhead compared to native saved reports.

D is incorrect. The AWS Pricing Calculator is used to estimate costs for future, hypothetical architectures, not to analyze or visualize actual historical usage.

Q7 (medium)

What is the primary purpose of an AWS Systems Manager (SSM) Automation runbook in the context of operational remediation?

A. It provides a structured document that defines a sequence of steps to automatically fix configuration issues or respond to system failures across AWS resources.
B. It acts as a declarative template used primarily to provision, update, and deploy entire infrastructure architectures from scratch.
C. It is a monitoring feature that defines metric thresholds and sends automated alert notifications when an operational failure occurs.
D. It serves as a serverless compute environment that requires writing custom application code to execute background operational tasks.

Correct Answer: A

A is correct. An AWS Systems Manager (SSM) Automation runbook is a document that defines a structured sequence of steps to automate operational tasks. In the context of remediation, it is used to apply predefined or custom workflows to automatically restore resources to a healthy, compliant state without manual intervention.

B is incorrect. This describes AWS CloudFormation, which uses declarative templates for Infrastructure as Code (IaC) provisioning, rather than executing operational remediation workflows.

C is incorrect. This describes Amazon CloudWatch Alarms, which are used to monitor metrics and trigger alerts. While an alarm can trigger a runbook, the alarm itself is not the remediation workflow.

D is incorrect. This describes AWS Lambda. While Lambda functions can execute custom code for background tasks, SSM Automation runbooks orchestrate the workflow and provide many pre-built, no-code actions specifically designed for operational management.

Q8 (easy)

Which AWS service functions as a fully managed container registry that makes it easy for developers to store, manage, and deploy Docker container images?

A. Amazon Elastic Container Registry (Amazon ECR)
B. Amazon Elastic Container Service (Amazon ECS)
C. Amazon Elastic Kubernetes Service (Amazon EKS)
D. Amazon Simple Storage Service (Amazon S3)

Correct Answer: A

Amazon Elastic Container Registry (Amazon ECR) is a fully managed container registry provided by AWS. It allows developers to securely store, manage, share, and deploy container images and artifacts, integrating directly with AWS container orchestration services.

Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (Amazon EKS) are container orchestration services that manage and run containerized workloads, but they rely on an external registry (like ECR) to pull the actual images. Amazon Simple Storage Service (Amazon S3) is an object storage service; while it stores data, it is not a dedicated registry for managing Docker container images.

Q9 (medium)

A company runs a popular e-commerce application on Amazon EC2 instances with an Amazon RDS database. During flash sales, the application experiences high latency and slow page load times. Performance metrics show that the RDS database is overwhelmed with identical read queries for product catalog data, and the EC2 instances are heavily burdened by serving static product images to global users.

Which combination of AWS services should a solutions architect implement to cache data, offload the backend infrastructure, and enhance the application's dynamic scalability?

A. Amazon CloudFront to cache the static product images at edge locations, and Amazon ElastiCache to cache the frequent database query results.
B. AWS Global Accelerator to cache the static product images at edge locations, and Amazon Elastic File System (Amazon EFS) to cache the database query results.
C. Amazon S3 Transfer Acceleration to serve the static product images, and Amazon CloudFront to cache the database query results.
D. Amazon ElastiCache to serve the static product images directly to end users globally, and Amazon Elastic Block Store (Amazon EBS) to cache the database query results.

Correct Answer: A

To effectively enhance dynamic scalability, caching layers should be introduced both at the edge and at the backend:

  • Amazon CloudFront is a Content Delivery Network (CDN) that caches static and dynamic web content (such as product images) at edge locations worldwide. This offloads the traffic from the origin servers (Amazon EC2), allowing them to scale dynamically without being bogged down by static asset requests.
  • Amazon ElastiCache provides a managed in-memory data store (using Redis or Memcached). It is specifically designed to cache the results of frequent, read-heavy database queries, which significantly reduces the read load on the backend Amazon RDS database and improves response times.

Why the other options are incorrect:

  • Option B: AWS Global Accelerator optimizes network routing to improve application performance and availability, but it does not cache content at the edge like a CDN. Additionally, Amazon EFS is a scalable file storage service, not an in-memory caching layer for database queries.
  • Option C: Amazon S3 Transfer Acceleration speeds up uploads and downloads to S3 buckets over long distances, but it is not the primary edge caching mechanism for distributing assets (that is CloudFront's role). Furthermore, CloudFront cannot be used as an internal cache for backend database queries.
  • Option D: This option reverses the typical roles of the services. Amazon ElastiCache is a backend in-memory cache and cannot serve web assets directly to global end users. Amazon EBS provides persistent block storage volumes for EC2 instances, not low-latency, in-memory caching for database query results.

Q10 (medium)

A company operates multiple AWS accounts across several AWS Regions. A cloud architect needs to deploy a standardized set of IAM roles and logging configurations to every account and Region. Additionally, the company maintains a centralized networking architecture. Application teams must be able to launch their Amazon EC2 instances directly into a set of centrally managed VPC subnets from their own AWS accounts.

Which combination of actions will meet these requirements with the LEAST operational overhead?

A. Use AWS CloudFormation StackSets to provision the IAM roles and logging configurations across all accounts and Regions. Use AWS Resource Access Manager (AWS RAM) to share the centrally managed VPC subnets with the application teams' accounts.
B. Use AWS CloudFormation StackSets to provision the IAM roles, logging configurations, and identical VPC subnets across all accounts and Regions. Use Amazon VPC Peering to connect the newly created subnets to the central network.
C. Use AWS Organizations Service Control Policies (SCPs) to deploy the IAM roles and logging configurations across all accounts. Use AWS Resource Access Manager (AWS RAM) to share the centrally managed VPC subnets with the application teams' accounts.
D. Use AWS CloudFormation StackSets to provision the IAM roles and logging configurations across all accounts and Regions. Use AWS Transit Gateway to share the centrally managed VPC subnets with the application teams' accounts.

Correct Answer: A

To meet the company's requirements, the architect must use services designed for multi-account provisioning and resource sharing.

  • AWS CloudFormation StackSets allows administrators to define a single CloudFormation template and deploy it across multiple AWS accounts and Regions. This is the correct service for provisioning identical baseline infrastructure, such as IAM roles and logging configurations globally.
  • AWS Resource Access Manager (AWS RAM) is used to securely share existing AWS resources across accounts. By sharing a VPC subnet via AWS RAM, participating AWS accounts can launch resources (such as Amazon EC2 instances) directly into that centralized subnet without managing the underlying network infrastructure themselves.

Why the other options are incorrect:

  • Option B is incorrect because provisioning individual, duplicate VPC subnets in every account violates the requirement for a centralized network. Furthermore, Amazon VPC Peering only routes traffic between distinct VPCs; it does not allow cross-account resource deployment into a shared subnet.
  • Option C is incorrect because AWS Organizations Service Control Policies (SCPs) are boundary policies used to restrict maximum permissions across an organization. They cannot be used to provision or deploy infrastructure resources.
  • Option D is incorrect because AWS Transit Gateway is a network transit hub used to interconnect VPCs and on-premises networks. Like VPC Peering, it routes traffic but does not allow accounts to share specific subnets for direct resource deployment.

Q11 (easy)

What is the correct standard syntax for executing a command using the AWS Command Line Interface (CLI)?

A. aws <command> <subcommand> [options and parameters]
B. aws-cli <subcommand> <command> [options and parameters]
C. aws <options and parameters> <command> <subcommand>
D. execute-aws <command> <subcommand> [options]

Correct Answer: A

The AWS CLI uses a specific multipart structure on the command line. It always begins with the base call aws. This is followed by the <command> (which typically represents an AWS service, such as s3 or ec2), and then the <subcommand> (the specific operation to perform on that service, like ls or describe-instances). Finally, any [options and parameters] are appended to the end to modify the command or provide necessary input values. Therefore, Option A is the correct standard syntax.
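
As a rough illustration of that ordering, the sketch below assembles an argument list in Python. The helper name `build_aws_command` is hypothetical and exists only for this example; it builds the command line without running it.

```python
import shlex


def build_aws_command(command, subcommand, options=None):
    """Assemble argv as: aws <command> <subcommand> [options and parameters]."""
    argv = ["aws", command, subcommand]
    # Options and parameters always come last.
    argv.extend(options or [])
    return argv


argv = build_aws_command("ec2", "describe-instances", ["--region", "us-east-1"])
# Join into a shell-safe string for display; nothing is executed here.
command_line = shlex.join(argv)
print(command_line)
```

The service (`ec2`) fills the `<command>` slot and the operation (`describe-instances`) fills the `<subcommand>` slot, matching Option A.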

Q12 (medium)

Which of the following best explains the primary purpose of AWS Systems Manager (SSM) Automation runbooks in the context of operational remediation?

A. They are JSON or YAML documents that define automated, sequential actions to automatically fix common configuration issues or respond to system failures without manual intervention.
B. They are declarative code templates used to provision entirely new infrastructure environments from scratch across multiple AWS accounts.
C. They provide real-time monitoring dashboards that collect and visualize performance metrics and logs to identify potential operational issues.
D. They enable secure, interactive terminal sessions for manually troubleshooting operational issues on instances without requiring SSH keys.

Correct Answer: A

Option A is correct. AWS Systems Manager Automation runbooks are defined in JSON or YAML and execute predefined or custom steps sequentially. For operational remediation, they are often triggered by Amazon EventBridge to automatically resolve system failures or configuration issues, eliminating the need for manual intervention.

Option B is incorrect because it describes AWS CloudFormation, which is used for infrastructure provisioning, not operational automation runbooks. Option C is incorrect because it describes Amazon CloudWatch, which is used for monitoring and observability. Option D is incorrect because it describes AWS Systems Manager Session Manager, which facilitates manual administrative access rather than automated remediation.

Q13 (easy)

What is the primary purpose of the AWS Personal Health Dashboard?

A. To provide alerts and remediation guidance for AWS events that specifically impact your AWS resources
B. To display the general up-to-the-minute status of all AWS services across all AWS Regions
C. To monitor performance metrics such as CPU utilization for your Amazon EC2 instances
D. To track and manage billing, invoicing, and cost allocation for your AWS account

Correct Answer: A

The AWS Personal Health Dashboard provides a personalized view of the performance and availability of the AWS services underlying your specific AWS resources. It provides proactive alerts and remediation guidance when AWS is experiencing events that may directly impact your infrastructure.

  • Option B describes the general AWS Service Health Dashboard, which shows the overall status of AWS services, regardless of whether you use them.
  • Option C describes Amazon CloudWatch, which is used for monitoring resource performance and metrics.
  • Option D describes AWS Billing and Cost Management.

Q14 (medium)

A company is deploying a new web application hosted on a fleet of Amazon EC2 instances. The instances are placed behind an Application Load Balancer (ALB). Security policies require all client traffic to be encrypted in transit using HTTPS.

Which solution meets this requirement with the LEAST operational overhead?

A. Use AWS Certificate Manager (ACM) to provision a public SSL/TLS certificate and configure an HTTPS listener on the ALB to use this certificate.
B. Use AWS Key Management Service (AWS KMS) to generate an SSL/TLS certificate and associate it with the ALB to terminate client connections.
C. Provision a public SSL/TLS certificate using AWS Certificate Manager (ACM), download the certificate, and install it locally on each backend EC2 instance.
D. Import a third-party SSL/TLS certificate into AWS Secrets Manager and configure the ALB to retrieve the certificate for its HTTPS listener.

Correct Answer: A

To secure traffic in transit using an Application Load Balancer (ALB), you must create an HTTPS listener on the load balancer and configure it with an SSL/TLS certificate. AWS Certificate Manager (ACM) integrates natively with Elastic Load Balancing. This allows you to easily provision, manage, and automatically renew public SSL/TLS certificates, and deploy them directly on the ALB to terminate SSL/TLS client connections with minimal operational overhead.

Option B is incorrect because AWS KMS is designed for creating and managing cryptographic keys for encryption at rest, not for provisioning X.509 SSL/TLS certificates for encryption in transit.

Option C is incorrect because public SSL/TLS certificates provisioned through ACM cannot be exported or downloaded. They must be deployed on natively integrated AWS services like ALB, Amazon CloudFront, or Amazon API Gateway.

Option D is incorrect because AWS Secrets Manager does not natively integrate with ALBs for SSL certificate deployment and automatic renewal. ALBs require certificates to be managed by ACM or AWS Identity and Access Management (IAM).

Q15 (hard)

A cloud operations team is investigating a sudden database failure and suspects an underlying AWS infrastructure issue may be responsible. To investigate further, they access the AWS Personal Health Dashboard.

When analyzing the event within this specific dashboard, what unique combination of information will the team find to help them understand and resolve the disruption?

A. A curated list of the team's specific AWS resources impacted by the disruption, detailed event descriptions, and actionable AWS-provided recommendations for remediation.
B. Granular, real-time operating system metrics, such as memory and disk I/O utilization, to determine if the underlying instances exhausted their available resources.
C. A chronological record of all API calls and configuration changes made to the database prior to the failure, helping to identify potential human error.
D. The generalized, public status of all AWS services across all global regions to verify if a systemic, platform-wide outage is actively occurring.

Correct Answer: A

The AWS Personal Health Dashboard provides a personalized view of the performance and availability of the specific AWS services an account is actively using. When analyzing an event, it provides a list of the exact affected resources in your account, detailed event information, and actionable AWS recommendations for remediation.

  • Option B describes Amazon CloudWatch, which is used for analyzing performance metrics.
  • Option C describes AWS CloudTrail, which is used for auditing API calls and configuration changes.
  • Option D describes the AWS Service Health Dashboard (the public dashboard), which provides general global service status but does not filter for or identify your specific impacted resources.

These are 15 of 840 questions available.

AWS Certified CloudOps Engineer - Associate (SOA-C03) Flashcards

1,200 flashcards for spaced-repetition study. Showing 30 sample cards below.

Advanced Observability Services on AWS (5 cards shown)

Question

CloudWatch Agent

Answer

A software package installed on Amazon EC2 instances, on-premises servers, or container clusters to collect system-level metrics and logs.

[!TIP] By default, AWS can only see hypervisor-level metrics (like CPU and Network). The CloudWatch Agent is required to see internal guest OS metrics like Memory Utilization and Disk Space Used.

Question

Amazon Managed Service for Prometheus (AMP)

Answer

A serverless, Prometheus-compatible monitoring and alerting service optimized for containerized environments and microservices.

[!NOTE] It is primarily used to monitor applications running on Amazon EKS (Elastic Kubernetes Service) without the operational overhead of managing Prometheus infrastructure.

Question

Amazon Managed Grafana

Answer

A fully managed service based on the popular open-source platform used for data visualization and operational dashboards.

It allows teams to query, correlate, and visualize metrics, logs, and traces from multiple data sources instantly.

Common Integrations:

  • Amazon Managed Service for Prometheus
  • Amazon CloudWatch
  • AWS X-Ray
  • External databases

Question

System-Level Metrics vs. Hypervisor Metrics

Answer

The distinction between what AWS monitors by default versus what requires specialized agents within the OS.

| Metric Type | Visibility | Examples |
| --- | --- | --- |
| Hypervisor | Default (Agentless) | CPU Utilization, Disk Read/Write Ops, Network In/Out |
| System-Level | Requires CloudWatch Agent | Memory Utilization, Disk Space Available, Swap Usage, Page Faults |

[!WARNING] If a question asks how to monitor memory on an EC2 instance, the answer is always to install and configure the CloudWatch Agent!

Question

CloudWatch Container Insights

Answer

A feature of CloudWatch used to collect, aggregate, and summarize operational metrics and logs from containerized applications.

[!TIP] It automatically generates dashboards tracking performance at the cluster, node, pod, and task levels for Amazon ECS, EKS, and AWS Fargate.

Amazon CloudWatch Metrics and Alarms (5 cards shown)

Question

Amazon CloudWatch Dashboards

Answer

Customizable, shareable home pages in the CloudWatch console used to monitor your AWS resources.

[!TIP] They are highly powerful because they can display metrics and alarms for AWS resources across multiple accounts and multiple AWS Regions within a single centralized view.

Question

CloudWatch Agent

Answer

A software package deployed to compute resources to collect system-level metrics and logs.

It is primarily used for gathering internal system metrics and logs from:

  • Amazon EC2 instances
  • Amazon ECS clusters
  • Amazon EKS clusters

[!WARNING] Default CloudWatch monitoring for EC2 only tracks hypervisor-visible metrics (like CPU and network I/O). You must install the CloudWatch Agent to capture operating system metrics like Memory and Disk Space utilization.

Question

CloudWatch Anomaly Detection

Answer

A CloudWatch feature that applies machine learning algorithms to continuously analyze metrics, establish an expected baseline, and trigger alarms based on dynamic thresholds.

Instead of creating a static threshold (e.g., trigger when CPU > 80%), anomaly detection automatically adjusts to natural metric patterns over time.

[!TIP] This is ideal for metrics with predictable daily or weekly trends (like varying website traffic), helping to significantly reduce false alarms.

Question

CloudWatch Custom Metrics

Answer

Application-specific or business-level data points published to CloudWatch that are not automatically tracked by AWS services.

When creating a custom metric, you define a Namespace to act as a container, isolating these metrics from standard AWS service metrics.

| Metric Type | Provided By | Example |
| --- | --- | --- |
| Standard | AWS Services (Default) | EC2 CPU Utilization, Lambda Invocations |
| Custom | PutMetricData API | Number of e-commerce checkout failures |
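
A minimal sketch of what a custom metric request might look like. The namespace `Example/Ecommerce`, the metric name `CheckoutFailures`, and the `Environment` dimension are made-up values for illustration; the dictionary keys follow the PutMetricData request shape. The block only builds the request, so it runs without AWS credentials.

```python
def checkout_failure_request(count, environment):
    """Build a PutMetricData-style request for a hypothetical checkout-failure metric."""
    return {
        # Custom namespaces isolate these metrics; they must not start with "AWS/".
        "Namespace": "Example/Ecommerce",
        "MetricData": [
            {
                "MetricName": "CheckoutFailures",
                "Dimensions": [{"Name": "Environment", "Value": environment}],
                "Value": float(count),
                "Unit": "Count",
            }
        ],
    }


request = checkout_failure_request(3, "production")
# Assumption: with boto3 installed and credentials configured, this could be sent as
#   boto3.client("cloudwatch").put_metric_data(**request)
print(request["Namespace"], request["MetricData"][0]["MetricName"])
```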

Question

Composite Alarms

Answer

A CloudWatch alarm that evaluates the states of multiple other underlying alarms to determine its own state.

By aggregating alarms, they help reduce alert noise and "alert fatigue."

[!NOTE] Composite alarms use logical operators such as AND, OR, and NOT.

Example Use Case: Only trigger a high-severity incident notification if an application's Error Rate is high AND Database CPU Utilization is > 90%.

When triggered, they can invoke AWS services directly or route events through Amazon EventBridge.
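
The example use case above can be sketched as plain boolean logic. The alarm names `HighErrorRate` and `DbCpuAbove90` are hypothetical, and the function only mimics how an AND rule combines child alarm states; it is not how CloudWatch evaluates alarms internally.

```python
def in_alarm(states, name):
    """Return True when the named child alarm is currently in the ALARM state."""
    return states.get(name) == "ALARM"


def high_severity_incident(states):
    # Mirrors a rule of the form: ALARM("HighErrorRate") AND ALARM("DbCpuAbove90")
    return in_alarm(states, "HighErrorRate") and in_alarm(states, "DbCpuAbove90")


only_errors = {"HighErrorRate": "ALARM", "DbCpuAbove90": "OK"}
both_firing = {"HighErrorRate": "ALARM", "DbCpuAbove90": "ALARM"}
print(high_severity_incident(only_errors), high_severity_incident(both_firing))
```

With AND, a single noisy child alarm is not enough to page anyone, which is the point of aggregating.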

Amazon CloudWatch Network Monitoring and Troubleshooting (5 cards shown)

Question

VPC Flow Logs

Answer

A feature that captures metadata about the IP traffic going to and from network interfaces (ENIs) within an Amazon VPC.

Flow logs can be enabled at the VPC, subnet, or individual ENI level, and the log data can be published to Amazon CloudWatch Logs, Amazon S3, or Amazon Kinesis Data Firehose.

[!NOTE] Flow logs do not capture actual packet payloads (they are not packet sniffers). They only capture metadata such as source/destination IP addresses, ports, protocols, and whether traffic was ACCEPT or REJECT.

Example: Analyzing a flow log record containing REJECT to determine that an overly restrictive Security Group or Network ACL is blocking legitimate inbound web traffic.

Question

VPC Reachability Analyzer

Answer

A configuration analysis tool that performs automated network path validation between a source and a destination within your AWS environment.

[!TIP] No actual network packets are sent! It uses automated reasoning to verify if your logical configuration (Route Tables, NACLs, Security Groups) allows a path.

Example: Troubleshooting an SSH connection timeout by running the Reachability Analyzer from an Internet Gateway (IGW) to an EC2 instance. The analyzer will highlight the exact configuration issue, such as a missing subnet route or a blocking Security Group rule.

Question

CloudWatch Logs Insights

Answer

An interactive query engine used to rapidly search, filter, and analyze log data stored in Amazon CloudWatch Logs.

It features a purpose-built query language with commands like fields, filter, stats, and sort to extract actionable operational intelligence from massive log groups.

Common Network Use Case: Querying VPC Flow Logs to find the top 10 external IP addresses that are generating rejected connection attempts.

Example Query:

```text
fields @timestamp, @message
| filter action = "REJECT"
| stats count(*) as rejectCount by srcAddr
| sort rejectCount desc
| limit 10
```
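
To make the pipeline stages concrete, here is a rough Python equivalent of the same filter, count, sort, and limit steps. It runs over an in-memory list of records with made-up IP addresses; it illustrates the logic only and does not call CloudWatch.

```python
from collections import Counter


def top_rejected_sources(records, limit=10):
    """Count REJECT records per source address and return the most frequent ones."""
    counts = Counter(r["srcAddr"] for r in records if r["action"] == "REJECT")
    # most_common sorts by count descending and applies the limit.
    return counts.most_common(limit)


sample_records = [
    {"srcAddr": "203.0.113.5", "action": "REJECT"},
    {"srcAddr": "198.51.100.7", "action": "REJECT"},
    {"srcAddr": "203.0.113.5", "action": "REJECT"},
    {"srcAddr": "192.0.2.10", "action": "ACCEPT"},
]
top_sources = top_rejected_sources(sample_records)
print(top_sources)
```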

Question

CloudWatch Network Monitor

Answer

An active network monitoring service that provides continuous visibility into network performance metrics—specifically packet loss and latency—between AWS and on-premises environments, or across AWS Regions.

It publishes these performance metrics directly to CloudWatch, allowing operations teams to set up proactive alerts before users report sluggish application performance.

[!WARNING] Don't confuse this with VPC Flow Logs! While Flow Logs tell you what traffic is occurring, Network Monitor tells you how well the connection is performing.

Example: Monitoring a hybrid architecture connected via AWS Direct Connect and configuring a CloudWatch Alarm to trigger an incident ticket if packet loss exceeds 2% for 5 consecutive minutes.

Question

CloudWatch Metric Filters

Answer

A CloudWatch feature that scans incoming log events for specific patterns and extracts numerical data, transforming those logs into standard CloudWatch Metrics.

Once a metric filter generates a metric, you can graph it on a CloudWatch Dashboard or use it to trigger a CloudWatch Alarm.

| Log Source | Metric Filter Target | Resulting Metric Example |
| --- | --- | --- |
| VPC Flow Logs | Match REJECT actions | RejectedConnectionsCount |
| CloudFront Logs | Match 4xx HTTP status codes | ClientErrorRate |

Example: Applying the filter pattern [version, account_id, interface_id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action="REJECT", log_status] to a VPC Flow Log group to track anomalous spikes in denied network traffic.

Amazon CloudWatch & Network Monitoring Services (5 cards shown)

Question

VPC Flow Logs

Answer

A feature that enables you to capture information about the IP traffic going to and from network interfaces in your Virtual Private Cloud (VPC).

Flow log data can be published to Amazon CloudWatch Logs, Amazon S3, or Amazon Kinesis Data Firehose.

[!TIP] Use VPC Flow Logs to troubleshoot overly restrictive security groups or network ACLs by looking for REJECT records.

Example Flow Log Record:

```text
2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK
```
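
The record above can be decoded with a short parser. This is a sketch that assumes the default (version 2) field order; custom flow log formats would need a different field list, and the function name is hypothetical.

```python
# Default flow log format, in order (assumed here; custom formats differ).
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]
NUMERIC = ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end")


def parse_flow_log_record(line):
    """Split a space-delimited flow log record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for key in NUMERIC:
        # Records without traffic data use "-" placeholders; leave those as text.
        if record[key] != "-":
            record[key] = int(record[key])
    return record


record = parse_flow_log_record(
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)
print(record["dstport"], record["protocol"], record["action"])
```

Read this way, the sample is accepted TCP traffic (protocol 6) to port 22, that is, an allowed SSH connection.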

Question

VPC Reachability Analyzer

Answer

A network diagnostic tool that performs automated network path validation between a source and a destination in your VPCs.

It analyzes the network configuration to determine whether two resources can communicate, and if not, it identifies the blocking component (e.g., a specific Security Group rule or missing Route Table entry).

[!NOTE] Reachability Analyzer does not send actual packets over the network. It uses automated reasoning to build a mathematical model of your network configurations.

Question

CloudWatch Logs Insights

Answer

A fully managed, interactive log analytics service in AWS that allows you to search and analyze your log data using a purpose-built query language.

It is highly useful for diagnosing network connectivity issues by parsing massive volumes of VPC Flow Logs or Route 53 query logs.

Example Query (Finding rejected SSH traffic):

```text
filter @logStream = 'eni-0a1b2c3d4e5f6g7h8'
| filter action = 'REJECT' and dstPort = 22
| stats count(*) as rejectCount by srcAddr
| sort rejectCount desc
```

Question

VPC Traffic Mirroring

Answer

An AWS feature that allows you to copy network traffic from an Elastic Network Interface (ENI) of an EC2 instance and send it to out-of-band security and monitoring appliances.

Unlike VPC Flow Logs (which only capture metadata/headers), Traffic Mirroring captures the actual packet payload.

[!TIP] Use Traffic Mirroring for deep packet inspection (DPI), intrusion detection/prevention systems (IDS/IPS), or diagnosing complex application-level network issues.

Question

CloudWatch Anomaly Detection

Answer

A feature that applies machine learning algorithms to continuously analyze network and system metrics, creating a model of expected baseline behavior.

Instead of setting static thresholds (e.g., "Alert if NetworkIn > 500MB"), anomaly detection dynamically calculates standard deviations and triggers an alarm only when metrics fall outside the expected band.

| Threshold Type | Use Case |
| --- | --- |
| Static Threshold | Known hard limits (e.g., 90% disk space full) |
| Anomaly Detection | Fluctuating traffic patterns (e.g., unexpected spike in ELB requests or network traffic) |
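
The idea of an expected band can be sketched with a simple mean and standard deviation. This is a deliberately simplified stand-in: CloudWatch's actual model is trained on metric history and accounts for trends and seasonality, which this example does not attempt. The sample values and function names are made up.

```python
from statistics import mean, stdev


def expected_band(history, width=2.0):
    """Return (low, high) bounds as mean +/- width * standard deviation."""
    center = mean(history)
    spread = stdev(history)
    return center - width * spread, center + width * spread


def is_anomalous(value, history, width=2.0):
    """Flag a value only when it falls outside the expected band."""
    low, high = expected_band(history, width)
    return value < low or value > high


# Hypothetical NetworkIn samples in MB.
history = [100, 110, 95, 105, 102, 98, 107, 101]
print(is_anomalous(104, history), is_anomalous(150, history))
```

A wider band (a larger `width`) tolerates more variation and alarms less often.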

Amazon EBS Performance Metrics and Optimization (5 cards shown)

Question

VolumeQueueLength

Answer

A CloudWatch metric that measures the number of pending I/O requests for an EBS volume device.

[!TIP]

  • Transaction-intensive workloads (SSDs): Should maintain a low queue length and high available IOPS to minimize latency.
  • Throughput-intensive workloads (HDDs): Are less sensitive to latency and can actually benefit from a high queue length when performing large, sequential I/O operations.

Question

BurstBalance

Answer

A CloudWatch metric that tracks the remaining percentage of I/O credits in the burst bucket for gp2, st1, and sc1 volumes.

[!WARNING] Depletion of the burst bucket (a balance of 0%) results in the volume being throttled down to its baseline performance, which causes a sudden drop in IOPS and increased latency.
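
For a sense of scale, the sketch below estimates how long a gp2 volume can sustain its burst rate from a full bucket. The constants are assumptions not stated in this card: a baseline of 3 IOPS per GiB (minimum 100, maximum 16,000), a 3,000 IOPS burst rate, and a 5.4 million credit bucket. st1 and sc1 volumes use throughput-based credits and are not modeled here.

```python
GP2_BUCKET_CREDITS = 5_400_000  # assumed full-bucket I/O credit balance
GP2_BURST_IOPS = 3_000          # assumed burst rate


def gp2_baseline_iops(size_gib):
    """Baseline of 3 IOPS per GiB, clamped to the assumed 100..16,000 range."""
    return min(max(3 * size_gib, 100), 16_000)


def gp2_burst_seconds(size_gib, credits=GP2_BUCKET_CREDITS):
    """Seconds of sustained burst before the bucket empties, or None if no burst applies."""
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None  # baseline already meets the burst rate
    # Credits drain at (burst - baseline) per second, since the baseline keeps refilling.
    return credits / (GP2_BURST_IOPS - baseline)


print(gp2_baseline_iops(100), gp2_burst_seconds(100))
```

Under these assumptions, a 100 GiB volume driven at full burst empties its bucket in roughly half an hour, after which BurstBalance reads 0% and the volume falls back to 300 IOPS.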

Question

EBS-Optimized Instance

Answer

An Amazon EC2 feature that provides dedicated, predictable bandwidth between the EC2 instance and its attached Amazon EBS volumes.

[!NOTE] Enabling this setting prevents EBS volume storage traffic from contending with the instance's standard network traffic, mitigating a common source of poor storage performance.

Question

First-Access Latency Penalty (Snapshot Initialization)

Answer

The significant performance drop that occurs when a new EBS volume is created from a snapshot, because blocks are pulled down from Amazon S3 only when they are accessed for the first time.

Remediations:

  1. Amazon EBS Fast Snapshot Restore (FSR): Creates fully initialized volumes instantly (incurs additional cost).
  2. Manual Initialization: Use OS-level tools (like dd or fio) to manually read all blocks before putting the volume into production.

Question

VolumeReadOps & VolumeWriteOps

Answer

CloudWatch metrics that track the total number of read and write operations against an EBS volume over a specific time period.

These metrics help administrators identify if there are I/O size or volume throughput bottlenecks between the guest OS and the EBS volume.

[!TIP] By evaluating these operational metrics alongside bytes transferred (VolumeReadBytes / VolumeWriteBytes), you can define the appropriate I/O size for the application and calculate the exact total IOPS required when upgrading to Provisioned IOPS volume types.
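
The arithmetic behind that tip is short enough to sketch. The sample numbers are invented and assume the Sum statistic over a 300-second period; the function names are hypothetical.

```python
def average_io_size_kib(volume_bytes, volume_ops):
    """Average I/O size in KiB: bytes transferred divided by operation count."""
    if volume_ops == 0:
        return 0.0
    return volume_bytes / volume_ops / 1024


def average_iops(read_ops, write_ops, period_seconds):
    """Average IOPS over the period: total operations divided by elapsed seconds."""
    return (read_ops + write_ops) / period_seconds


# Hypothetical sums for one 300-second period.
read_ops, write_ops = 90_000, 30_000
read_bytes = 1_474_560_000

io_size = average_io_size_kib(read_bytes, read_ops)
iops = average_iops(read_ops, write_ops, 300)
print(io_size, iops)
```

Here reads average 16 KiB each and the volume sustains 400 IOPS, which are the two inputs needed when sizing a Provisioned IOPS volume.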

Amazon EventBridge: Routing, Enriching, and Delivering Events (5 cards shown)

Question

Amazon EventBridge

Answer

A serverless event bus service used to receive, filter, transform, route, and deliver events across AWS services and third-party applications.

[!TIP] Think of it as the central nervous system of your AWS architecture. It continuously receives events from sources (like AWS Security Hub) and routes them to specific targets in near real-time, enabling automated remediation and reducing manual human interaction.

Question

EventBridge Rule

Answer

A configuration that watches an event bus for specific incoming events and routes them to targets for processing.

Rules use event patterns (or schedules) to determine which events to catch.

[!NOTE] When creating rules in the console, you can use predefined patterns that automatically fill in the source and detail type. For example, a predefined pattern can easily capture all new compliance findings directly from AWS Security Hub.

Question

EventBridge Targets

Answer

The destination resources or endpoints that receive an event when an EventBridge rule matches.

A single rule can route an event to multiple targets simultaneously.

Common Targets for Automated Remediation Include:

  • AWS Lambda: Invoking functions for custom code execution
  • AWS Systems Manager Run Command: Running commands on Amazon EC2 instances
  • AWS Step Functions: Triggering a state machine for complex workflows
  • Amazon SNS / SQS: Pushing notifications or queuing messages for downstream processing

Question

Event Pattern Filtering

Answer

The process of defining JSON-based matchers within an EventBridge rule to precisely select incoming events based on their data payload.

By specifying exact filter values, you ensure that targets are only invoked for relevant events, cutting down on noise and cost.

Example: Filtering Security Hub findings by specifying finding attributes like AwsAccountId, Compliance.Status, or RecordState:

```json
{
  "source": ["aws.securityhub"],
  "detail-type": ["Security Hub Findings - Imported"],
  "detail": {
    "findings": {
      "Compliance": {
        "Status": ["FAILED"]
      }
    }
  }
}
```
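
To show how such a pattern selects events, here is a simplified matcher in Python. It handles exact-value matching only; content filters such as prefix or numeric matching are not covered, and its handling of arrays of objects is stricter than the real service. The sample event is a trimmed, hypothetical finding.

```python
def matches(pattern, event):
    """Return True if every field in the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: descend. If the event holds a list, any element may match.
            candidates = actual if isinstance(actual, list) else [actual]
            if not any(isinstance(c, dict) and matches(expected, c) for c in candidates):
                return False
        else:
            # Leaf: the pattern lists allowed values; the event value must be one of them.
            values = actual if isinstance(actual, list) else [actual]
            if not any(v in expected for v in values):
                return False
    return True


pattern = {
    "source": ["aws.securityhub"],
    "detail-type": ["Security Hub Findings - Imported"],
    "detail": {"findings": {"Compliance": {"Status": ["FAILED"]}}},
}


def finding_event(status):
    return {
        "source": "aws.securityhub",
        "detail-type": "Security Hub Findings - Imported",
        "detail": {"findings": [{"Compliance": {"Status": status}, "RecordState": "ACTIVE"}]},
    }


print(matches(pattern, finding_event("FAILED")), matches(pattern, finding_event("PASSED")))
```

Fields the pattern does not mention, such as `RecordState` here, are ignored, so adding more keys to the pattern narrows the match.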

Question

EventBridge Input Transformer

Answer

A feature used to customize, format, or enrich the payload of an event before EventBridge delivers it to a target.

It consists of two components:

  1. Input Path: Uses JSONPath to extract specific values from the original event payload and assign them to variables.
  2. Input Template: Uses those variables to construct a new data structure (e.g., converting a raw JSON error into a human-readable text string for an email notification).

[!TIP] Use the Input Transformer to enrich and deliver well-formatted events so the target service (like SNS) receives exactly what it needs, eliminating the need for an intermediary Lambda function just to parse JSON.
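
The two components can be imitated in a few lines of Python. This sketch assumes simple dotted paths of the form `$.a.b` (no arrays or wildcards) and `<name>` placeholders in the template; the event, variable names, and helper functions are hypothetical.

```python
def resolve_path(event, path):
    """Resolve a dotted path such as $.detail.state against a nested dict."""
    if not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    value = event
    for key in path[2:].split("."):
        value = value[key]
    return value


def transform(event, input_paths, template):
    """Extract variables via the input paths, then substitute them into the template."""
    text = template
    for name, path in input_paths.items():
        text = text.replace(f"<{name}>", str(resolve_path(event, path)))
    return text


event = {"detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"}}
input_paths = {"instance": "$.detail.instance-id", "state": "$.detail.state"}
template = "Instance <instance> is now <state>."

message = transform(event, input_paths, template)
print(message)
```

The target would receive only the short sentence rather than the full JSON event, which is what removes the need for a parsing Lambda function in between.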

Showing 30 of 1,200 flashcards.
