Root Cause Analysis Mastery: Debugging Serverless Applications on AWS
Assist in a root cause analysis
This lab focuses on the critical DVA-C02 skill of Assisting in a Root Cause Analysis (RCA). You will act as a developer troubleshooting a failing serverless data pipeline. You'll move from identifying a failure in logs to tracing the execution path in X-Ray and eventually fixing the defect.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges to your AWS account.
Prerequisites
Before starting, ensure you have:
- An AWS Account with administrative access.
- AWS CLI configured with credentials (`aws configure`).
- Basic familiarity with Python and JSON.
- IAM permissions to create Lambda, S3, DynamoDB, and CloudWatch resources.
Learning Objectives
By the end of this lab, you will be able to:
- Query CloudWatch Logs using Log Insights to find specific application errors.
- Analyze X-Ray Traces to identify service integration bottlenecks and failures.
- Implement Custom Metrics using the CloudWatch Embedded Metric Format (EMF).
- Perform an RCA by correlating logs, traces, and metrics to identify a code defect.
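One of the objectives above is the CloudWatch Embedded Metric Format (EMF). The format is just structured JSON printed to standard output; CloudWatch extracts the declared metrics automatically, with no PutMetricData API calls. A minimal sketch of an EMF record (the namespace, dimension, and metric names here are illustrative, not taken from the lab stack):

```python
import json
import time

def emf_payload(metric_name: str, value: float, namespace: str = "ThumbnailLab") -> str:
    """Build a CloudWatch Embedded Metric Format record.

    Printing this JSON from a Lambda function causes CloudWatch to
    extract `metric_name` as a custom metric automatically.
    """
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [["FunctionName"]],
                    "Metrics": [{"Name": metric_name, "Unit": "Count"}],
                }
            ],
        },
        # Dimension values and metric values live at the root of the record
        "FunctionName": "ThumbnailFunction",
        metric_name: value,
    }
    return json.dumps(record)

# Inside a Lambda handler you would simply:
#   print(emf_payload("ProcessingErrors", 1))
```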
Architecture Overview
We are troubleshooting a "Thumbnail Processor" application. An image is uploaded to S3, triggering a Lambda function that logs metadata to DynamoDB. Currently, the metadata step is failing.
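The deployed function's exact source ships inside the stack template, but conceptually it looks something like the hypothetical sketch below (the helper, item keys, and table name are assumptions for illustration, not the lab's real code). The failure you will chase is of exactly this kind: the code asks DynamoDB for a table name that does not match what was deployed.

```python
def build_item(record: dict) -> dict:
    """Map one S3 event record to a DynamoDB item (plain-Python shape)."""
    s3_info = record["s3"]
    return {
        "ObjectKey": s3_info["object"]["key"],
        "Bucket": s3_info["bucket"]["name"],
        "SizeBytes": s3_info["object"]["size"],
    }

def handler(event, context):
    import boto3  # imported lazily so the pure helper above runs offline
    # If this name does not match the table CloudFormation actually created,
    # put_item raises ResourceNotFoundException -- the bug class in this lab.
    table = boto3.resource("dynamodb").Table("MetadataTable")
    for record in event["Records"]:
        table.put_item(Item=build_item(record))
```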
Step-by-Step Instructions
Step 1: Deploy the Faulty Infrastructure
We will deploy a CloudFormation stack that intentionally contains a bug in the Lambda code's integration with DynamoDB.
```shell
# Download the template locally first: `deploy` requires a local --template-file
# (it does not accept a URL, and --template-body belongs to create-stack)
curl -sSL -o template.yaml https://raw.githubusercontent.com/aws-samples/aws-serverless-workshops/master/Observability/template.yaml

aws cloudformation deploy \
  --stack-name brainybee-rca-lab \
  --template-file template.yaml \
  --capabilities CAPABILITY_IAM
```

**Console alternative**

Navigate to CloudFormation > Create stack > With new resources. Upload the downloaded template file and follow the wizard to name the stack `brainybee-rca-lab`.
Step 2: Trigger the Failure
Upload a test file to the newly created S3 bucket to trigger the Lambda function.
```shell
# Get the bucket name from stack outputs
BUCKET_NAME=$(aws cloudformation describe-stacks --stack-name brainybee-rca-lab --query "Stacks[0].Outputs[?OutputKey=='BucketName'].OutputValue" --output text)

# Create and upload a test file to trigger the function
echo "test data" > test-image.jpg
aws s3 cp test-image.jpg s3://$BUCKET_NAME/test-image.jpg
```

Step 3: Query Logs with CloudWatch Insights
The Lambda function failed silently; we need to find the specific error message.
```shell
# Start a query over the last 5 minutes of logs
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/lambda/ThumbnailFunction \
  --start-time $(date -d "5 minutes ago" +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /Error/ | sort @timestamp desc' \
  --query 'queryId' --output text)

# Give the query a few seconds to run, then fetch results
sleep 5
aws logs get-query-results --query-id $QUERY_ID
```

> [!TIP]
> In the CloudWatch Console, go to Logs Insights, select the log group, and run:
> `fields @timestamp, @message | filter @message like /Error/`
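The raw `get-query-results` JSON nests each result row as a list of field/value cells. A small helper (a sketch; the sample values below are illustrative, not real lab output) flattens each row into a plain mapping for easier scanning:

```python
def rows_from_query_results(response: dict) -> list[dict]:
    """Flatten a CloudWatch Logs get-query-results response.

    Each row arrives as a list of {"field": ..., "value": ...} pairs;
    this turns every row into a field -> value dict.
    """
    return [
        {cell["field"]: cell["value"] for cell in row}
        for row in response.get("results", [])
    ]

# Sample response trimmed to the fields this lab queries (values illustrative)
query_response = {
    "status": "Complete",
    "results": [[
        {"field": "@timestamp", "value": "2024-01-01 12:00:00.000"},
        {"field": "@message", "value": "Error: Requested resource not found"},
    ]],
}
```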
Step 4: Analyze the Trace in X-Ray
Logs show a `ResourceNotFoundException`. We need to see which service integration is actually failing.
```shell
# Get trace summaries for the last minute
aws xray get-trace-summaries \
  --start-time $(date -d "1 minute ago" +%s) \
  --end-time $(date +%s)
```

**Console alternative**

Navigate to CloudWatch > X-Ray traces > Service map. You will see a red circle around the connection between Lambda and DynamoDB, indicating a 400-series error.
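Programmatically, you can pick failing traces out of the `get-trace-summaries` response via its `HasError`/`HasFault` flags (X-Ray marks 4xx responses as errors and 5xx as faults). A sketch, with a trimmed, illustrative response:

```python
def failing_trace_ids(response: dict) -> list[str]:
    """Return IDs of traces whose requests saw errors (4xx) or faults (5xx)."""
    return [
        summary["Id"]
        for summary in response.get("TraceSummaries", [])
        if summary.get("HasError") or summary.get("HasFault")
    ]

# Trimmed sample of a get-trace-summaries response (IDs illustrative)
trace_response = {
    "TraceSummaries": [
        {"Id": "1-abc", "HasError": True, "HasFault": False},
        {"Id": "1-def", "HasError": False, "HasFault": False},
    ]
}
```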
Checkpoints
- S3 Upload: Is the file visible in `aws s3 ls s3://$BUCKET_NAME`?
- Logs: Did the Log Insights query return an error string like `Requested resource not found (Table: MetadataTable)`?
- Traces: Does the X-Ray Service Map show a failed node for DynamoDB?
Teardown
To avoid costs, delete the resources created during this lab.
```shell
# Empty the bucket first (CloudFormation cannot delete a non-empty bucket)
aws s3 rm s3://$BUCKET_NAME --recursive

# Delete the stack
aws cloudformation delete-stack --stack-name brainybee-rca-lab
```

Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| `AccessDenied` on S3 upload | IAM permissions missing | Ensure your CLI user has `s3:PutObject` for the bucket. |
| Query returns 0 results | Lambda hasn't logged yet | Wait 30 seconds for CloudWatch to ingest the logs and retry. |
| Stack deletion hangs | S3 bucket not empty | Manually delete files in the S3 bucket before deleting the stack. |
Challenge
Goal: Implement an automated monitor.
- Create a CloudWatch Metric Filter that looks for the word "Error" in the `/aws/lambda/ThumbnailFunction` log group.
- Assign this filter to a custom metric named `ProcessingErrors`.
- Create a CloudWatch Alarm that sends an SNS notification if `ProcessingErrors` > 0 for a 1-minute period.
Cost Estimate
- Lambda: Free tier (first 1M requests/mo).
- S3: $0.023 per GB (negligible for this lab).
- CloudWatch Logs: $0.50 per GB ingested.
- X-Ray: Free tier (first 100,000 traces/mo).
- Total Estimated Cost: < $0.05 (well within AWS Free Tier).
Concept Review
In this lab, we performed a standard Root Cause Analysis (RCA) flow. This flow typically narrows down from a broad symptom to a specific code or configuration defect.
Key Comparisons
| Tool | Primary Purpose | Best Used For... |
|---|---|---|
| CloudWatch Logs | Discrete event recording | Finding stack traces and specific error messages. |
| CloudWatch Metrics | Numerical aggregation | Detecting trends and triggering automated alarms. |
| AWS X-Ray | Distributed request tracing | Identifying high latency or failures in multi-service calls. |
| Log Insights | Log querying | Sifting through thousands of log lines using a purpose-built, pipe-based query syntax. |