Mastering Observability: Logging, Monitoring, and Tracing
Describe differences between logging, monitoring, and observability
This guide covers the fundamental concepts of troubleshooting and optimization as defined in the AWS Certified Developer - Associate (DVA-C02) curriculum, specifically focusing on the differences between logging, monitoring, and observability.
Learning Objectives
After studying this guide, you should be able to:
- Define and differentiate between Logging, Monitoring, and Observability.
- Identify the "Three Pillars of Observability": Logs, Metrics, and Traces.
- Select the appropriate AWS service (CloudWatch vs. X-Ray) for specific troubleshooting scenarios.
- Understand the importance of structured logging and custom metrics (EMF).
Key Terms & Glossary
- Telemetry: The collection of measurements or other data at remote or inaccessible points and their transmission to receiving equipment for monitoring.
- Structured Logging: Providing logs in a predictable format (like JSON) to make them easily searchable and machine-readable.
- Cardinality: In monitoring, refers to the number of unique values for a specific dimension (e.g., a UserID has high cardinality; a Region has low cardinality).
- Annotation: Key-value pairs added to traces in AWS X-Ray used for indexing and filtering.
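- Metadata: Key-value pairs added to traces in AWS X-Ray that are NOT indexed. They are stored with the trace and visible when you inspect it, but they cannot be used in filter expressions.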
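To make the Structured Logging entry concrete, here is a minimal sketch that uses only Python's standard `logging` and `json` modules. The logger name `auth-service` and the fields `user_id` and `trace_id` are illustrative assumptions, not names required by any AWS API.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        # The field names below are hypothetical examples.
        for key in ("user_id", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("auth-service")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.error("failed login", extra={"user_id": "123", "trace_id": "abc"})

# Each line is machine-readable, so a query tool can filter on a field
# instead of pattern-matching free text.
parsed = json.loads(buffer.getvalue())
print(parsed["level"], parsed["user_id"])
```

Because `user_id` is a high-cardinality value, it fits well as a log field that you search on, whereas using it as a metric dimension would create one metric per user.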
The "Big Idea"
Traditional Monitoring tells you if a system is broken (e.g., "The CPU is at 99%"). Observability is the property of a system that allows you to understand why it is broken by looking at its external outputs (telemetry). It moves us from simple dashboards to deep, cross-service debugging using traces and logs.
Formula / Concept Box
| Concept | Primary Goal | AWS Tool | Example Data Point |
|---|---|---|---|
| Logging | Discrete event record | Amazon CloudWatch Logs | [ERROR] User 123 failed login |
| Monitoring | Aggregate health/perf | Amazon CloudWatch Metrics | CPUUtilization: 74% |
| Tracing | Request lifecycle | AWS X-Ray | Lambda -> DynamoDB (200ms) |
Hierarchical Outline
- I. Monitoring (The "What")
- CloudWatch Metrics: Quantitative data points over time.
- Alarms: Threshold-based notifications (e.g., SNS trigger if Errors > 5).
- Dashboards: Visual aggregations for high-level health.
- II. Logging (The "How")
- CloudWatch Logs: Storage and search of text-based events.
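- Insights: Querying logs with CloudWatch Logs Insights, whose purpose-built query language chains commands such as fields, filter, stats, and sort with pipes.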
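- EMF (Embedded Metric Format): Writing metrics as structured JSON log events that CloudWatch extracts into metrics asynchronously, which avoids a separate PutMetricData call per data point in high-throughput code.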
- III. Tracing (The "Where")
- AWS X-Ray: End-to-end view of a request across distributed systems.
- Service Map: Visual dependency graph of your architecture.
- Segments/Subsegments: Individual units of work within a trace.
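The EMF item in the outline can be illustrated with a small sketch that builds one EMF-shaped log event by hand. The layout (`_aws`, `Timestamp`, `CloudWatchMetrics`, `Namespace`, `Dimensions`, `Metrics`) follows the Embedded Metric Format specification as I understand it, so check the specification before relying on it. The helper `emf_record`, the namespace `DemoApp`, and the `Service` dimension are hypothetical.

```python
import json
import time


def emf_record(namespace, service, metric_name, value,
               unit="Milliseconds", timestamp_ms=None):
    """Build a dict shaped like a single EMF log event (illustrative helper)."""
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)  # EMF timestamps are epoch millis
    return {
        "_aws": {
            "Timestamp": timestamp_ms,
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # Each inner list names the keys that form one dimension set.
                    "Dimensions": [["Service"]],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # Dimension values and metric values live at the top level of the event.
        "Service": service,
        metric_name: value,
    }


record = emf_record("DemoApp", "checkout", "Latency", 187,
                    timestamp_ms=1700000000000)

# In a Lambda function, printing the JSON to stdout sends it to CloudWatch Logs.
print(json.dumps(record))
```

Keeping `Dimensions` to low-cardinality keys such as a service name, and logging high-cardinality values as ordinary fields, keeps the number of generated metrics manageable.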
Visual Anchors
Telemetry Data Flow
The Pillars of Observability
\begin{tikzpicture}
  \draw[thick, fill=blue!10, opacity=0.5] (0,0) circle (1.5) node[below=1.2cm] {Logs};
  \draw[thick, fill=red!10, opacity=0.5] (2,0) circle (1.5) node[below=1.2cm] {Metrics};
  \draw[thick, fill=green!10, opacity=0.5] (1,1.7) circle (1.5) node[above=1.2cm] {Traces};
  \node at (1,0.6) {\textbf{Observability}};
\end{tikzpicture}
Definition-Example Pairs
- Metric: A numeric value measured over time.
- Example: Tracking the `MemoryUtilization` of a Lambda function to determine if it needs more provisioned RAM.
- Log: An immutable, timestamped record of a discrete event.
- Example: A JSON entry in CloudWatch Logs showing the specific input parameters that caused a code exception.
- Trace: A representation of a series of related events (spans) that show the path of a request.
- Example: Identifying that a 5-second delay in an API call is specifically happening during the `PutItem` call to DynamoDB, not the Lambda execution itself.
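The trace example can be sketched as a toy data model: a tree of timed spans in which the slowest leaf points at the bottleneck. This is a simplified illustration, not the X-Ray segment document format, and the `Span` class, the `slowest_leaf` helper, and all timings are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One timed unit of work; times are milliseconds since the trace began."""
    name: str
    start_ms: int
    end_ms: int
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_leaf(span: Span) -> Span:
    """Return the childless span with the longest duration."""
    if not span.children:
        return span
    candidates = (slowest_leaf(child) for child in span.children)
    return max(candidates, key=lambda s: s.duration_ms)


# Hypothetical trace of one slow API call.
trace = Span("API Gateway", 0, 5300, [
    Span("Lambda handler", 100, 5200, [
        Span("validate_input", 100, 150),
        Span("DynamoDB PutItem", 150, 5150),
    ]),
])

bottleneck = slowest_leaf(trace)
print(bottleneck.name, bottleneck.duration_ms)
```

The Lambda handler span is long as well, but only because it contains the `PutItem` call. Looking at leaf spans separates the time a function spends waiting on a dependency from the time spent in its own code.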
Worked Examples
Scenario: Troubleshooting a "Slow" Microservice
- Monitoring: A CloudWatch Alarm triggers because the P99 Latency metric for an API Gateway exceeds 2 seconds.
- Observability (Tracing): You open the AWS X-Ray Service Map. You see a "red" circle on the Lambda icon. Clicking it reveals that the subsegment `RemoteDBQuery` is taking 1.8 seconds.
- Logging: You use CloudWatch Logs Insights to query the logs for that specific `TraceID`. You find the log entry: `"query": "SELECT * FROM users", "rows_returned": 1000000`.
- Resolution: The issue is an unoptimized query returning too much data. You add a `LIMIT` clause to the code.
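The alarm in the first step watches P99 latency rather than the average. The sketch below shows why, using the nearest-rank percentile method on made-up samples. CloudWatch computes percentile statistics in its own way, so treat this only as an illustration of the idea; `percentile_nearest_rank`, the sample values, and the 2000 ms threshold constant are assumptions.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest rank: the smallest rank covering at least p percent of samples.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 100 hypothetical request latencies in milliseconds: mostly fast, three slow.
latencies_ms = [120] * 97 + [900, 2400, 3100]

THRESHOLD_MS = 2000  # the 2-second threshold from the scenario

average = sum(latencies_ms) / len(latencies_ms)
p99 = percentile_nearest_rank(latencies_ms, 99)
alarm = p99 > THRESHOLD_MS

print(f"average={average:.1f} ms, p99={p99} ms, alarm={alarm}")
```

The average stays far below the threshold because the fast requests dominate it, while P99 exposes the slow tail that affected users actually experience.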
Checkpoint Questions
- Which service should you use to visualize the dependency relationship between your Lambda functions and SQS queues?
- What is the benefit of using Structured Logging (JSON) over plain text logs?
- If you want to be notified automatically when an error rate exceeds 5%, which AWS feature should you configure?
- Does AWS X-Ray help you find specific error messages, or the location of the error in the call chain?
[!TIP] For the DVA-C02 exam, remember: CloudWatch = Health/Performance/Logs, X-Ray = Distributed Tracing/Latency/Bottlenecks.
Answers
- AWS X-Ray (Service Map).
- It allows for easier searching, filtering, and automated analysis via CloudWatch Logs Insights.
- A CloudWatch Alarm on the error-rate metric (derived with a metric filter or metric math if needed), with an SNS topic as the alarm action.
- It primarily helps find the location/source of the error in the chain, though you can often link back to logs for the specific message.