Study Guide

AWS Observability: Interpreting Metrics, Logs, and Traces

Interpret application metrics, logs, and traces

This guide covers the essential skills for the AWS Certified Developer - Associate (DVA-C02) exam from Unit 4: Troubleshooting and Optimization. You will learn to navigate the "Three Pillars of Observability" within the AWS ecosystem.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between logging, monitoring (metrics), and tracing.
  • Query Amazon CloudWatch Logs using Logs Insights syntax.
  • Implement custom metrics using the CloudWatch Embedded Metric Format (EMF).
  • Interpret AWS X-Ray service maps and trace segments to perform Root Cause Analysis (RCA).
  • Identify performance bottlenecks and application defects through dashboard analysis.

Key Terms & Glossary

  • Metric: A time-ordered set of data points (e.g., CPU utilization) used for monitoring trends.
  • Log: A discrete record of an event that occurred within an application or system (e.g., a 404 error message).
  • Trace: A representation of the end-to-end journey of a single request through a distributed system.
  • Annotation (X-Ray): Key-value pairs used for indexing and filtering traces (searchable).
  • Metadata (X-Ray): Key-value pairs that provide additional context but are NOT indexed (not searchable).
  • Embedded Metric Format (EMF): A JSON specification used to instruct CloudWatch Logs to automatically extract metrics from log data.

The "Big Idea"

Observability is more than just "monitoring." While monitoring tells you if something is wrong (e.g., "The server is down"), observability allows you to understand why something is wrong by correlating data across different dimensions. In a microservices architecture, this correlation is the only way to find a single failing component among hundreds of interconnected services.

Formula / Concept Box

| Concept | Data Type | Primary Tool | Best Used For |
| --- | --- | --- | --- |
| Metrics | Numeric / aggregated | CloudWatch Metrics | High-level health, alarms, autoscaling |
| Logs | Text / structured JSON | CloudWatch Logs | Detailed event history, debugging specific errors |
| Traces | Segments / subsegments | AWS X-Ray | Latency analysis, dependency mapping, bottleneck identification |

> [!TIP]
> Use CloudWatch Alarms for metrics and CloudWatch Logs Insights for querying massive amounts of log data in seconds.

Hierarchical Outline

  1. Monitoring with Amazon CloudWatch
    • Standard Metrics: Default metrics provided by AWS services (e.g., Lambda Duration, S3 BucketSizeBytes).
    • Custom Metrics: Application-specific data points sent via PutMetricData or EMF.
    • CloudWatch Dashboards: Global views of health across multiple regions and accounts.
  2. Logging Strategies
    • Structured Logging: Using JSON instead of plain text to make logs machine-readable.
    • CloudWatch Logs Insights: A purpose-built query language to filter and aggregate log data.
    • Subscription Filters: Real-time delivery of logs to Kinesis, Lambda, or OpenSearch.
  3. Distributed Tracing with AWS X-Ray
    • Service Graph: A visual representation of service dependencies.
    • Sampling: Controlling the amount of data sent to X-Ray to manage costs.
    • Segments vs. Subsegments: A segment represents a compute resource; subsegments represent remote calls or local code blocks.
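As a concrete sketch of the custom-metrics path above, here is the shape of a single PutMetricData data point. The names and values are illustrative, and the boto3 call is shown only as a comment because it requires AWS credentials:

```python
# One custom metric data point, shaped as the PutMetricData API expects it.
# With boto3 installed and credentials configured, it would be published as:
#   boto3.client("cloudwatch").put_metric_data(
#       Namespace="RetailApp", MetricData=[data_point])
data_point = {
    "MetricName": "OrderValue",
    "Dimensions": [{"Name": "Region", "Value": "us-west-2"}],  # part of the metric's identity
    "Value": 45.99,
    "Unit": "None",
}
print(data_point["MetricName"])
```

Note that a metric with the same name but different dimensions is a different metric: `OrderValue` with `Region=us-west-2` and `OrderValue` with `Region=us-east-1` are tracked separately.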

Visual Anchors

Observability Data Flow

(Diagram not reproduced.)

X-Ray Trace Hierarchy

```latex
\begin{tikzpicture}[node distance=1.5cm,
    every node/.style={draw, rectangle, rounded corners, inner sep=5pt}]
  \node (seg)  [fill=blue!10] {Trace Segment (Lambda)};
  \node (sub1) [below left of=seg,  xshift=-1cm, fill=green!10] {Subsegment (DynamoDB)};
  \node (sub2) [below right of=seg, xshift=1cm,  fill=green!10] {Subsegment (S3)};
  \draw[->, thick] (seg) -- (sub1);
  \draw[->, thick] (seg) -- (sub2);
  \node[draw=none, below of=sub1, yshift=0.5cm] {\tiny Remote Call 1};
  \node[draw=none, below of=sub2, yshift=0.5cm] {\tiny Remote Call 2};
\end{tikzpicture}
```

Definition-Example Pairs

  • Structured Logging: The practice of formatting logs as JSON objects.
    • Example: Instead of logging "User 123 logged in", you log {"event": "Login", "userId": 123, "status": "success"}. This allows CloudWatch Logs Insights to group by userId instantly.
  • Sampling Rate: The percentage of requests that are recorded for tracing.
    • Example: Setting a sampling rate of 5% in X-Ray ensures you capture enough data to see trends without incurring the cost of tracing every single request in a high-traffic app.
  • Metric Dimensions: Name/value pairs that are part of a metric's identity.
    • Example: For a metric named ServerErrors, a dimension could be InstanceID=i-12345 or Region=us-east-1.
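The structured-logging pattern above can be sketched in a few lines of Python (the field names are illustrative):

```python
import json
import time

def log_event(event, **fields):
    # Emit one JSON object per line so CloudWatch Logs Insights can
    # auto-discover the fields and filter or group on them.
    record = {"timestamp": int(time.time() * 1000), "event": event, **fields}
    line = json.dumps(record)
    print(line)
    return line

line = log_event("Login", userId=123, status="success")
```

Because every line is a self-describing JSON object, a query like `stats count(*) by userId` works without writing any parsing rules.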

Worked Examples

Example 1: Finding 5xx Errors in CloudWatch Logs

You are seeing a spike in HTTP 500 errors but don't know which API endpoint is failing. You use Logs Insights:

Query:

```
fields @timestamp, @message, status
| filter status >= 500
| stats count(*) by bin(1m)
```

Explanation: This query selects the timestamp, message, and status code, filters for server errors (status >= 500), and counts them per one-minute bucket (bin(1m)). To pinpoint the failing endpoint, also group by the request path, e.g. stats count(*) by path, bin(1m) (the field name depends on your log format).
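The same filter-and-aggregate logic can be mimicked locally on exported JSON log lines; a stdlib-only sketch with hypothetical records:

```python
import json
from collections import Counter

# Hypothetical exported log lines (one JSON object per line), mirroring
# what the Logs Insights query does: keep status >= 500, then count.
lines = [
    '{"timestamp": "2024-01-01T00:00:01Z", "status": 200, "path": "/health"}',
    '{"timestamp": "2024-01-01T00:00:02Z", "status": 500, "path": "/orders"}',
    '{"timestamp": "2024-01-01T00:00:03Z", "status": 503, "path": "/orders"}',
]

errors = Counter(
    rec["path"] for rec in map(json.loads, lines) if rec["status"] >= 500
)
print(errors)  # Counter({'/orders': 2})
```

Grouping by path (rather than only by time) is what reveals *which* endpoint is producing the 5xx spike.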

Example 2: Implementing EMF (Embedded Metric Format)

You want to generate metrics for "OrderValue" without calling the PutMetricData API (which is rate-limited and adds a synchronous network call to your request path).

Implementation (a JSON log line that any runtime, including Node.js, can emit):

```json
{
  "_aws": {
    "Timestamp": 1611234567890,
    "CloudWatchMetrics": [
      {
        "Namespace": "RetailApp",
        "Dimensions": [["Region"]],
        "Metrics": [{ "Name": "OrderValue", "Unit": "None" }]
      }
    ]
  },
  "Region": "us-west-2",
  "OrderValue": 45.99,
  "User": "jdoe"
}
```

Explanation: By logging this JSON string to CloudWatch Logs, CloudWatch automatically extracts a metric named OrderValue under the RetailApp namespace.
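The EMF envelope can be generated with a small helper; a minimal Python sketch (the function and field names are illustrative, not an official SDK API):

```python
import json
import time

def emf_record(namespace, metric_name, value, unit="None", **dimensions):
    # Build a CloudWatch Embedded Metric Format envelope. Printing the
    # returned line to stdout inside Lambda sends it to CloudWatch Logs,
    # which extracts the metric automatically.
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        metric_name: value,
        **dimensions,
    }
    return json.dumps(record)

line = emf_record("RetailApp", "OrderValue", 45.99, Region="us-west-2")
print(line)
```

Extra keys outside `_aws` (like `User` in the payload above) are not turned into metrics, but remain queryable in Logs Insights, giving you a metric and its debugging context from a single log line.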

Checkpoint Questions

  1. Scenario: Your application is slow. X-Ray shows a high latency in a specific segment. How do you find the specific line of code or database query causing the delay?
    • Answer: Look at the Subsegments within that segment. If it's an external call (e.g., DynamoDB), check the metadata or annotations for the query parameters.
  2. True/False: CloudWatch Metrics are stored forever by default.
    • Answer: False. Metric data is kept for 15 months and rolled up as it ages: sub-minute (high-resolution) data expires after 3 hours, 1-minute data after 15 days, 5-minute data after 63 days, and 1-hour data after 15 months. (CloudWatch Logs retention is separate and configurable, from 1 day to never expire.)
  3. Problem: You need to filter X-Ray traces based on a custom UserTier (e.g., 'Gold', 'Silver'). Should you use an Annotation or Metadata?
    • Answer: Annotation. Annotations are indexed and allow you to filter traces in the X-Ray console using filter expressions.
  4. Metric Aggregation: If you want to see the minimum, maximum, and average latency of a Lambda function, which CloudWatch feature do you use?
    • Answer: Metric statistics. View the Minimum, Maximum, and Average statistics on the function's Duration metric; percentile statistics such as p90 or p99 are best for identifying outliers.
