AWS Monitoring and Logging Solutions: Comprehensive Study Guide
Monitoring and logging solutions (for example, Amazon CloudWatch)
This guide covers the critical monitoring and logging architectures required for the AWS Certified Solutions Architect - Professional (SAP-C02) exam, focusing on Amazon CloudWatch and its integrated ecosystem.
Learning Objectives
By the end of this guide, you should be able to:
- Categorize monitoring activities into the four essential phases (Generation, Aggregation, Processing, and Storage).
- Distinguish between CloudWatch Metrics, Logs, and Alarms.
- Implement automated remediation strategies using EventBridge and SNS.
- Design cost-optimized log retention and analytics pipelines using S3 and Athena.
- Identify the correct service for tracing (X-Ray), compliance (Config), and API auditing (CloudTrail).
Key Terms & Glossary
- Metric Filter: A rule used to extract numerical data from log events (e.g., counting the occurrences of "ERROR" in a log stream).
- CloudWatch Synthetics: A service that uses "canaries" (scripts) to monitor endpoints 24/7, simulating user behavior.
- VPC Flow Logs: A feature that captures information about the IP traffic going to and from network interfaces in your VPC.
- CloudTrail: A service that provides a record of actions taken by a user, role, or an AWS service (the "Who, What, Where, When" of API calls).
- AWS Config: A service that provides a resource inventory and tracks configuration changes over time for compliance.
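As a rough illustration of the Metric Filter entry above, the sketch below builds the request for the CloudWatch Logs `PutMetricFilter` API, which counts log events containing "ERROR". The log group `/app/web`, the filter name, the metric name, and the namespace are all hypothetical placeholders. The AWS call sits in a separate function that is defined but not invoked, because it needs `boto3` and credentials.

```python
def build_error_metric_filter(log_group: str) -> dict:
    """Build keyword arguments for the CloudWatch Logs PutMetricFilter API."""
    return {
        "logGroupName": log_group,
        "filterName": "count-error-events",        # hypothetical name
        "filterPattern": "ERROR",                   # matches events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorCount",         # hypothetical metric name
                "metricNamespace": "StudyGuide/App",  # hypothetical namespace
                "metricValue": "1",                 # publish 1 for every matching event
                "defaultValue": 0.0,                # publish 0 when nothing matches
            }
        ],
    }


def apply_metric_filter(params: dict) -> None:
    """Send the filter to AWS. Needs boto3 and credentials; not called here."""
    import boto3

    boto3.client("logs").put_metric_filter(**params)


params = build_error_metric_filter("/app/web")  # hypothetical log group
print(params["metricTransformations"][0]["metricName"])
```

Once such a filter exists, the resulting metric can be graphed or alarmed on like any other CloudWatch metric.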
The "Big Idea"
Monitoring is not a passive activity; it is the feedback loop of the AWS Well-Architected Framework. In a professional architecture, monitoring serves three pillars: Reliability (detecting and fixing failures), Operational Excellence (automating responses), and Performance Efficiency (identifying bottlenecks). Without integrated logging and metrics, automation, the cornerstone of the cloud, has nothing to act on.
Formula / Concept Box
| Concept | Metric (Quantitative) | Log (Qualitative/Event) |
|---|---|---|
| Nature | Time-series data (numbers over time). | Discrete events (text records). |
| Storage | Up to 15 months (older data points are rolled up to coarser granularity). | Indefinite by default (until a retention policy is set). |
| Action | Used for Alarms and Auto Scaling. | Used for Root Cause Analysis (RCA). |
| Analysis Tool | CloudWatch Metrics / Dashboards. | CloudWatch Logs Insights / Athena. |
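To make the metric/log contrast in the table concrete, here is a minimal sketch of what each looks like as data: a metric datapoint shaped for the CloudWatch `PutMetricData` API and a log event shaped for one entry of the `logEvents` list in CloudWatch Logs `PutLogEvents`. The namespace, metric name, and log message are invented for illustration, and nothing is sent to AWS.

```python
from datetime import datetime, timezone


def build_metric_datapoint(items_in_cart: int, when: datetime) -> dict:
    """Keyword arguments for PutMetricData: a number at a point in time."""
    return {
        "Namespace": "StudyGuide/Shop",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "ItemsInCart",  # hypothetical metric name
                "Timestamp": when,
                "Value": float(items_in_cart),
                "Unit": "Count",
            }
        ],
    }


def build_log_event(message: str, when: datetime) -> dict:
    """One logEvents entry for PutLogEvents: free text with a timestamp."""
    return {
        "timestamp": int(when.timestamp() * 1000),  # milliseconds since the epoch
        "message": message,
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
metric = build_metric_datapoint(3, now)
event = build_log_event("ERROR payment gateway timeout for cart 42", now)
print(metric["MetricData"][0]["Value"], event["message"])
```

The metric carries only a value and a unit, which is what makes it suitable for alarms; the log event carries the context needed for root cause analysis.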
Hierarchical Outline
- Phase 1: Generation (Monitoring All Components)
- Standard Metrics: Default metrics from EC2 (CPU, Disk I/O), RDS, and Lambda.
- Custom Metrics: Application-level KPIs (e.g., number of items in a cart).
- External Monitoring: Using CloudWatch Synthetics to test endpoints from the outside-in.
- Phase 2: Aggregation (Defining Metrics)
- Metric Filters: Turning unstructured logs into actionable metrics.
- Cross-Account Observability: Aggregating metrics across a multi-account organization.
- Phase 3: Real-time Processing & Alarming
- Notifications: Using Amazon SNS to alert teams by email or SMS, or through integrations (for example, an HTTPS endpoint) with tools such as Slack or PagerDuty.
- Automation: Triggering Lambda functions or Systems Manager Automation for self-healing.
- Phase 4: Storage & Analytics
- Hot Storage: CloudWatch Logs Insights for quick SQL-like queries.
- Cold Storage: Exporting logs to Amazon S3 for long-term retention.
- Analytics: Using Amazon Athena to query logs directly on S3 using standard SQL.
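For the hot-storage tier in Phase 4, a CloudWatch Logs Insights query might be prepared as sketched below. This assumes the application writes JSON log lines with `clientIp` and `status` fields, which Logs Insights would then discover automatically; both field names and the log group are hypothetical. `run_query` is defined but not called, since it needs `boto3` and credentials.

```python
# Logs Insights query: top 10 client IPs producing 404 responses.
# Assumes JSON logs with hypothetical fields `clientIp` and `status`.
QUERY = """
fields @timestamp, clientIp, status
| filter status = 404
| stats count(*) as hits by clientIp
| sort hits desc
| limit 10
""".strip()


def build_start_query(log_group: str, end_epoch: int, window_seconds: int = 3600) -> dict:
    """Keyword arguments for the CloudWatch Logs StartQuery API (times in epoch seconds)."""
    return {
        "logGroupName": log_group,
        "startTime": end_epoch - window_seconds,
        "endTime": end_epoch,
        "queryString": QUERY,
    }


def run_query(params: dict) -> list:
    """Start the query and poll until it finishes. Not called here."""
    import time

    import boto3

    logs = boto3.client("logs")
    query_id = logs.start_query(**params)["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return result["results"]
        time.sleep(1)


params = build_start_query("/app/web", end_epoch=1704110400)  # hypothetical log group
print(params["startTime"], params["endTime"])
```

Logs Insights charges by the amount of data scanned, so a narrow time window like the one-hour default here keeps ad hoc queries cheap.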
Visual Anchors
- Figure: The Four-Phase Monitoring Pipeline (diagram not reproduced)
- Figure: CloudWatch Log Architecture (diagram not reproduced)
Definition-Example Pairs
- EventBridge Rule
- Definition: A rule on an Amazon EventBridge event bus (a serverless router for events from AWS services and applications) that matches incoming events against an event pattern and routes them to targets.
- Example: Creating a rule that detects an EC2 instance state change to "stopped" and automatically triggers a Lambda function to restart it.
- CloudWatch Logs Insights
- Definition: An interactive, pay-as-you-go log analytics service.
- Example: Running a query to find the top 10 IP addresses causing 404 errors in your Apache access logs within the last hour.
- Canary
- Definition: A configurable script that runs on a schedule to monitor your endpoints and APIs.
- Example: A Node.js script that logs into your web portal every 5 minutes to ensure the "Buy Now" button is functional.
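The EventBridge example above (restart an instance when it stops) could be sketched as follows. The event pattern uses the documented `aws.ec2` source and the "EC2 Instance State-change Notification" detail type. The `matches` helper is a deliberately simplified local stand-in for EventBridge matching (exact values only) so the pattern can be tried without AWS, and the instance ID is made up. `handler` is a hypothetical Lambda target and is not invoked here.

```python
import json

# Event pattern for the rule: EC2 instances entering the "stopped" state.
PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["stopped"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must exist and hold an allowed value."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


def handler(event: dict, context: object) -> None:
    """Hypothetical Lambda target: start the instance that just stopped."""
    import boto3

    instance_id = event["detail"]["instance-id"]
    boto3.client("ec2").start_instances(InstanceIds=[instance_id])


sample = {
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
}
print(json.dumps(PATTERN))  # the JSON string a rule would take as its EventPattern
print(matches(PATTERN, sample))
```

In practice a rule like this would usually be scoped further, for example by tag or instance ID, so that instances stopped on purpose are not restarted.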
Worked Examples
Scenario: Automating Remediation for Disk Space
Problem: A critical EC2 application fails when disk utilization hits 100%.
- Step 1 (Generation): Install the CloudWatch Agent on the EC2 instance to collect the `disk_used_percent` metric (not available by default).
- Step 2 (Alarming): Create a CloudWatch Alarm that triggers when `disk_used_percent > 80` for 5 minutes.
- Step 3 (Action): Set the Alarm target to an Amazon SNS topic.
- Step 4 (Remediation): Subscribe an AWS Lambda function to the SNS topic. The Lambda script identifies the instance and executes a Systems Manager (SSM) command to clear temporary logs or expand the EBS volume.
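Steps 2 through 4 can be sketched roughly as below. `build_disk_alarm` produces arguments for the CloudWatch `PutMetricAlarm` API, and `extract_instance_id` pulls the instance out of the alarm notification that SNS delivers to Lambda. Several things are assumptions: the agent is taken to publish to its default `CWAgent` namespace with only an `InstanceId` dimension (real agent configurations often add `path`, `device`, and `fstype`, and alarm dimensions must match exactly), the topic ARN is a placeholder, and the cleanup command is purely illustrative.

```python
import json


def build_disk_alarm(instance_id: str, topic_arn: str) -> dict:
    """Keyword arguments for PutMetricAlarm: disk_used_percent > 80 for 5 minutes."""
    return {
        "AlarmName": f"disk-used-high-{instance_id}",
        "Namespace": "CWAgent",  # default CloudWatch Agent namespace
        "MetricName": "disk_used_percent",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],  # assumed dimension set
        "Statistic": "Average",
        "Period": 300,           # one 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }


def extract_instance_id(sns_event: dict):
    """Read the InstanceId dimension from an alarm notification delivered via SNS."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    for dim in message["Trigger"]["Dimensions"]:
        if dim["name"] == "InstanceId":
            return dim["value"]
    return None


def handler(event: dict, context: object) -> None:
    """Hypothetical Lambda subscriber: run a cleanup command through SSM."""
    import boto3

    instance_id = extract_instance_id(event)
    if instance_id is None:
        return
    boto3.client("ssm").send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        # Illustrative cleanup only; adapt to the application's real temp files.
        Parameters={"commands": ["find /tmp -name '*.log' -mtime +1 -delete"]},
    )


alarm = build_disk_alarm("i-0123456789abcdef0", "arn:aws:sns:us-east-1:111122223333:disk-alerts")
sample_event = {"Records": [{"Sns": {"Message": json.dumps(
    {"Trigger": {"Dimensions": [{"name": "InstanceId", "value": "i-0123456789abcdef0"}]}}
)}}]}
print(extract_instance_id(sample_event))
```

The instance must be managed by Systems Manager (SSM Agent plus an instance profile) for `send_command` to reach it.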
Checkpoint Questions
- Which service would you use to find out which IAM user deleted an S3 bucket yesterday?
- What is the most cost-effective way to store 7 years of logs for regulatory compliance while allowing for occasional SQL queries?
- True or False: CloudWatch provides RAM utilization metrics for EC2 instances by default.
- How can you combine multiple metrics into a single alarm (e.g., high CPU AND high 5XX errors)?
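[!TIP] Answers: 1. CloudTrail; 2. S3 with Lifecycle Policies that transition older logs to S3 Glacier storage classes, using Athena for queries (objects archived in Glacier Flexible Retrieval or Deep Archive must be restored before they can be queried); 3. False (requires the CloudWatch Agent); 4. Create a Composite Alarm that combines the individual alarms with AND; Metric Math is the alternative when the metrics can be merged into a single expression.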
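For question 4, one way to require both conditions at once is a composite alarm whose rule references two existing alarms. The sketch below builds arguments for the CloudWatch `PutCompositeAlarm` API; the child alarm names `HighCPU` and `High5xx`, the composite alarm name, and the topic ARN are hypothetical, and the child alarms are assumed to exist already.

```python
def build_composite_alarm(cpu_alarm: str, errors_alarm: str, topic_arn: str) -> dict:
    """Keyword arguments for PutCompositeAlarm: fire only when both children are in ALARM."""
    return {
        "AlarmName": "high-cpu-and-5xx",  # hypothetical name
        "AlarmRule": f'ALARM("{cpu_alarm}") AND ALARM("{errors_alarm}")',
        "AlarmActions": [topic_arn],
    }


def apply_composite_alarm(params: dict) -> None:
    """Create the alarm in AWS. Needs boto3 and credentials; not called here."""
    import boto3

    boto3.client("cloudwatch").put_composite_alarm(**params)


composite = build_composite_alarm(
    "HighCPU", "High5xx", "arn:aws:sns:us-east-1:111122223333:ops-alerts"
)
print(composite["AlarmRule"])
```

Attaching the notification action only to the composite alarm, rather than to each child, is a common way to cut alert noise.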
Muddy Points & Cross-Refs
- CloudWatch vs. CloudTrail: CloudWatch monitors performance/health of resources; CloudTrail monitors API activity (who did what).
- CloudWatch vs. Config: CloudWatch is for metrics/logs; Config is for compliance and resource relationships.
- Data Retention: By default, logs in CloudWatch never expire. Always set a Retention Policy (e.g., 30 days) to avoid unnecessary costs, and export to S3 before the retention period lapses if long-term storage is needed.
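The retention advice above might translate into two API requests, sketched here as plain dictionaries: one for CloudWatch Logs `PutRetentionPolicy` and one for `CreateExportTask`, which copies a single day of logs to S3. The log group, bucket name, and key prefix layout are hypothetical, and the bucket is assumed to already grant CloudWatch Logs permission to write to it.

```python
from datetime import datetime, timedelta, timezone


def build_retention(log_group: str, days: int = 30) -> dict:
    """Keyword arguments for PutRetentionPolicy."""
    return {"logGroupName": log_group, "retentionInDays": days}


def build_export_task(log_group: str, bucket: str, day: datetime) -> dict:
    """Keyword arguments for CreateExportTask covering one UTC day (times in ms)."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    return {
        "taskName": f"export-{start:%Y-%m-%d}",
        "logGroupName": log_group,
        "fromTime": int(start.timestamp() * 1000),
        "to": int(end.timestamp() * 1000),
        "destination": bucket,
        "destinationPrefix": f"cloudwatch{log_group}/{start:%Y/%m/%d}",  # hypothetical layout
    }


def apply(retention: dict, export: dict) -> None:
    """Send both requests to AWS. Needs boto3 and credentials; not called here."""
    import boto3

    logs = boto3.client("logs")
    logs.put_retention_policy(**retention)
    logs.create_export_task(**export)


retention = build_retention("/app/web")
export = build_export_task("/app/web", "example-log-archive", datetime(2024, 1, 1))
print(export["taskName"], export["destinationPrefix"])
```

A date-based prefix like this one also makes it easier to partition the data later when querying it from S3 with Athena.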
Comparison Tables
| Feature | AWS X-Ray | CloudWatch Synthetics |
|---|---|---|
| Primary Goal | Trace requests across distributed systems. | Monitor endpoint health and UI flows. |
| Perspective | Inside-out (follows the code). | Outside-in (simulates the user). |
| Best For | Debugging latency in microservices. | Verifying site availability/uptime. |
| Implementation | Requires SDK/Code instrumentation. | Requires writing scripts (Node/Python). |