Troubleshooting AWS Deployment Failures via Service Logs
Troubleshoot deployment failures by using service output logs
Deploying applications to AWS involves multiple moving parts. When a deployment fails, the root cause is often buried within logs. This guide focuses on identifying, locating, and interpreting service logs to perform root cause analysis (RCA) as required for the DVA-C02 exam.
Learning Objectives
- Identify the primary log locations for key AWS compute and deployment services.
- Differentiate between deployment events (infrastructure failures) and application logs (runtime failures).
- Perform log querying using Amazon CloudWatch Logs Insights to isolate specific errors.
- Interpret common error patterns in logs (e.g., IAM permission errors, timeouts, and syntax issues).
Key Terms & Glossary
- CloudWatch Logs Insights: A fully managed service to search and analyze log data using a purpose-built query language.
- Standard Error (stderr): The default stream where applications write error messages; captured automatically by AWS Lambda, and by ECS when the task definition configures a log driver such as awslogs.
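- Deployment Rollback: A process in which AWS reverts a service to its last known healthy state after a failed deployment. CloudFormation rolls back by default; CodeDeploy rolls back automatically only when that option is enabled.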
- Log Stream: A sequence of log events that share the same source (e.g., a specific instance of a Lambda function).
- Log Group: A group of log streams that share the same retention, monitoring, and access control settings.
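To make the Logs Insights and log group terms concrete, here is a minimal sketch that builds a query string and, optionally, runs it with boto3. The function names `build_error_query` and `run_query` are hypothetical, and the keywords are assumed to contain no regular-expression special characters.

```python
import time


def build_error_query(keywords=("ERROR", "CRITICAL"), limit=20):
    """Build a Logs Insights query that keeps events containing any keyword."""
    pattern = "|".join(keywords)
    return (
        "fields @timestamp, @logStream, @message\n"
        f"| filter @message like /{pattern}/\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


def run_query(log_group, query, minutes=60):
    """Run the query over the last `minutes` and return the result rows."""
    import boto3  # imported here so the query builder works without AWS access

    logs = boto3.client("logs")
    end = int(time.time())
    started = logs.start_query(
        logGroupName=log_group,
        startTime=end - minutes * 60,
        endTime=end,
        queryString=query,
    )
    # Poll until the query reaches a terminal status.
    while True:
        response = logs.get_query_results(queryId=started["queryId"])
        if response["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
            return response["results"]
        time.sleep(1)


print(build_error_query())
```

Because `@logStream` is included in the output fields, each matching event can be traced back to the stream (for example, a specific Lambda execution environment) that produced it.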
The "Big Idea"
Troubleshooting is the process of "peeling the onion." You start with the high-level orchestration event (e.g., a CloudFormation failure or a CodeDeploy error) to find where the process stopped, then dive into the specific service logs (e.g., Lambda execution logs or ECS container logs) to find why the application failed to start or run.
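The first layer of the onion can be sketched as a search for the earliest failed event in a list of CloudFormation stack events. This is a rough illustration, assuming events shaped like the `StackEvents` entries returned by boto3's `describe_stack_events`; the helper name `first_failure` and the sample events are hypothetical.

```python
def first_failure(stack_events):
    """Return the earliest event whose status ends in FAILED, or None."""
    failed = [e for e in stack_events if e["ResourceStatus"].endswith("FAILED")]
    return min(failed, key=lambda e: e["Timestamp"]) if failed else None


# Hypothetical events, newest first. boto3 returns datetime objects for
# Timestamp; ISO-8601 strings are used here because they sort the same way.
events = [
    {"Timestamp": "2024-01-01T10:05:00Z", "LogicalResourceId": "MyStack",
     "ResourceStatus": "UPDATE_ROLLBACK_IN_PROGRESS"},
    {"Timestamp": "2024-01-01T10:04:00Z", "LogicalResourceId": "Service",
     "ResourceStatus": "UPDATE_FAILED",
     "ResourceStatusReason": "Task failed to start"},
    {"Timestamp": "2024-01-01T10:00:00Z", "LogicalResourceId": "Service",
     "ResourceStatus": "UPDATE_IN_PROGRESS"},
]

failure = first_failure(events)
print(failure["LogicalResourceId"], failure["ResourceStatusReason"])
```

The earliest failure is usually the one worth reading, since later failures are often side effects of the rollback; its resource tells you which service's logs to open next.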
Formula / Concept Box
| Service | Primary Log Source | Common Failure Indicator |
|---|---|---|
| AWS Lambda | CloudWatch Logs: /aws/lambda/<name> | Task timed out, Process exited before completing |
| ECS/Fargate | CloudWatch Logs: (via awslogs driver) | Essential container in task exited, OOMKilled |
| API Gateway | Execution Logs / Access Logs | 502 Bad Gateway, 403 Forbidden |
| Elastic Beanstalk | /var/log/eb-engine.log (Amazon Linux 2 and later) or /var/log/eb-activity.log (older Amazon Linux AMI platforms) | Instance deployment failed, Health check failed |
| CloudFormation | Stack Events Tab | ROLLBACK_IN_PROGRESS, Access Denied |
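As a rough sketch, the failure indicators in the table can be encoded as a lookup that suggests which service a log line points to. The `INDICATORS` dictionary and `match_indicators` function are hypothetical names, and the case-insensitive substring matching is an assumption made for illustration.

```python
# Failure indicators copied from the table above, keyed by service.
INDICATORS = {
    "AWS Lambda": ["Task timed out", "Process exited before completing"],
    "ECS/Fargate": ["Essential container in task exited", "OOMKilled"],
    "API Gateway": ["502 Bad Gateway", "403 Forbidden"],
    "Elastic Beanstalk": ["Instance deployment failed", "Health check failed"],
    "CloudFormation": ["ROLLBACK_IN_PROGRESS", "Access Denied"],
}


def match_indicators(log_line):
    """Return (service, indicator) pairs whose indicator appears in the line."""
    lowered = log_line.lower()
    return [
        (service, indicator)
        for service, indicators in INDICATORS.items()
        for indicator in indicators
        if indicator.lower() in lowered
    ]


print(match_indicators("Task timed out after 3.00 seconds"))
```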
Hierarchical Outline
- Phase 1: Detection & Orchestration Logs
  - CloudFormation Events: Always check the "Events" tab first to see which resource failed to create or update.
  - CodeDeploy Deployment History: Look for Lifecycle Event failures (e.g., `BeforeInstall` or `ValidateService`).
- Phase 2: Compute Runtime Logs
  - AWS Lambda: Focus on the `REPORT` line for duration/memory and `ERROR` lines for stack traces.
  - Amazon ECS: Use the `awslogs` log driver to send container `stdout` and `stderr` to CloudWatch.
- Phase 3: Deep Dive Analysis
  - CloudWatch Logs Insights: Writing queries to filter for `ERROR` or `CRITICAL` keywords across multiple log streams.
  - X-Ray Traces: Identifying service integration issues where logs show a generic "500 Error" but the trace shows a downstream timeout.
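For the Lambda step in Phase 2, the duration and memory figures on a `REPORT` line can be pulled out with a regular expression. This is a minimal sketch; the sample line is hypothetical and `parse_report` is an illustrative name, not part of any AWS SDK.

```python
import re

# Matches the duration and memory fields of a Lambda REPORT line.
REPORT_FIELDS = re.compile(
    r"Duration: (?P<duration_ms>[\d.]+) ms.*?"
    r"Memory Size: (?P<memory_mb>\d+) MB.*?"
    r"Max Memory Used: (?P<max_used_mb>\d+) MB"
)


def parse_report(line):
    """Return duration and memory figures from a REPORT line, or None."""
    if not line.startswith("REPORT"):
        return None
    match = REPORT_FIELDS.search(line)
    if match is None:
        return None
    return {
        "duration_ms": float(match["duration_ms"]),
        "memory_mb": int(match["memory_mb"]),
        "max_used_mb": int(match["max_used_mb"]),
    }


# Hypothetical REPORT line (fields are tab-separated in real logs).
sample = (
    "REPORT RequestId: 1234\tDuration: 3000.00 ms\tBilled Duration: 3000 ms"
    "\tMemory Size: 128 MB\tMax Memory Used: 128 MB"
)
report = parse_report(sample)
print(report)
print("memory exhausted:", report["max_used_mb"] >= report["memory_mb"])
```

A duration equal to the configured timeout suggests a timeout, while maximum memory used equal to the configured memory size suggests the function ran out of memory.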
Visual Anchors
Troubleshooting Workflow
[Diagram not rendered: troubleshooting workflow]
Log Aggregation Architecture
[Diagram not rendered: log aggregation architecture]
Definition-Example Pairs
- Runtime Error: An error occurring while the program is executing, rather than during compilation.
- Example: A Lambda function failing because it tried to access a DynamoDB table that doesn't exist in the new environment.
- Liveness/Readiness Probe Failure: A signal that a container is not yet ready to receive traffic, leading to a deployment rollback.
- Example: An ECS Fargate task failing a health check because the security group blocks the ALB on port 80.
- Log Filter Pattern: A specific string or pattern used to search through logs.
- Example: Searching for `?ERROR ?Exception ?Fail` to catch most critical issues in a Java or Python app.
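To illustrate what the `?ERROR ?Exception ?Fail` pattern expresses, the sketch below emulates its "match any term" behaviour locally. It treats each term as a case-sensitive substring, which is an approximation of how the service matches terms, and `matches_any_term` is a hypothetical helper name.

```python
def matches_any_term(pattern, message):
    """Approximate a '?term1 ?term2' filter pattern as case-sensitive OR matching."""
    terms = [t[1:] for t in pattern.split() if t.startswith("?") and len(t) > 1]
    return any(term in message for term in terms)


pattern = "?ERROR ?Exception ?Fail"
print(matches_any_term(pattern, "java.lang.NullPointerException at Handler.run"))
print(matches_any_term(pattern, "error: lowercase is not matched"))
```

The second call returns `False` because matching is case-sensitive, so a lowercase `error` slips past a pattern that only lists `ERROR`.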
Worked Examples
Example 1: Lambda Initialization Failure
Scenario: A deployment via SAM completes, but the API returns 502 Bad Gateway.
- Check Logs: Navigate to `/aws/lambda/my-function`.
- Identify Error: Find a log entry: `Runtime.ImportModuleError: Unable to import module 'app': No module named 'requests'`.
- Root Cause: The `requests` library was not included in the deployment package (missing dependency).
- Fix: Update `requirements.txt` and rebuild the artifact.
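The "Identify Error" step can be sketched as a scan over log lines for the import error shown above. This assumes the message format from the example; `missing_modules` is a hypothetical helper name.

```python
import re

# Matches the Lambda import failure message shown in the worked example.
IMPORT_ERROR = re.compile(
    r"Runtime\.ImportModuleError: Unable to import module '(?P<handler>[^']+)': "
    r"No module named '(?P<missing>[^']+)'"
)


def missing_modules(log_lines):
    """Return the sorted, de-duplicated names of modules that failed to import."""
    found = set()
    for line in log_lines:
        match = IMPORT_ERROR.search(line)
        if match:
            found.add(match["missing"])
    return sorted(found)


lines = [
    "START RequestId: 1234",
    "Runtime.ImportModuleError: Unable to import module 'app': "
    "No module named 'requests'",
]
print(missing_modules(lines))
```

Each name returned is a candidate for `requirements.txt`, to be confirmed before rebuilding the artifact.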
Example 2: ECS Task Churn
Scenario: CloudFormation stays in UPDATE_IN_PROGRESS then rolls back.
- Check Events: CloudFormation says `Resource handler returned message: "Task failed to start"`.
- Check Container Logs: CloudWatch shows `exec /usr/bin/java: no such file or directory`.
- Root Cause: The Dockerfile entrypoint points to an incorrect path or the base image changed.
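Between the CloudFormation event and the container logs, the stopped task itself often carries a reason. The sketch below summarizes a response shaped like the one boto3's `describe_tasks` returns; the helper name `summarize_stopped_tasks` and all sample values (ARN, exit code, reasons) are hypothetical.

```python
def summarize_stopped_tasks(describe_tasks_response):
    """Return (task ARN, stopped reason, [(container, exit code, reason)]) per task."""
    summaries = []
    for task in describe_tasks_response.get("tasks", []):
        containers = [
            (c["name"], c.get("exitCode"), c.get("reason"))
            for c in task.get("containers", [])
        ]
        summaries.append((task["taskArn"], task.get("stoppedReason"), containers))
    return summaries


# Hypothetical response; a real one would come from an ECS describe_tasks call.
response = {
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:123456789012:task/demo/abc123",
            "stoppedReason": "Essential container in task exited",
            "containers": [{"name": "app", "exitCode": 1}],
        }
    ]
}

for arn, reason, containers in summarize_stopped_tasks(response):
    print(arn, reason, containers)
```

A non-zero exit code on the essential container is the cue to open that container's log stream in CloudWatch.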
Checkpoint Questions
- Where would you find logs if a Lambda function times out before writing any application-level logs?
- What is the benefit of using CloudWatch Logs Insights over the standard CloudWatch Logs console search?
- If a CodeDeploy deployment fails during the `ValidateService` hook, where should you look for logs?
- How can you differentiate between an IAM permission issue and a code syntax error in a log file?
Answers
- In the CloudWatch log stream for that function; look for the `REPORT` line, which will contain `Status: timeout`.
- Logs Insights supports a query language (`sort`, `filter`, `stats`, `limit`) and can search across multiple log groups simultaneously.
- Check the deployment logs on the instance (for EC2/on-premises deployments, where `ValidateService` runs) or, for Lambda deployments, the CloudWatch logs of the hook function.
- IAM issues usually contain strings like `AccessDenied` or `is not authorized to perform`, whereas code errors usually show language-specific exceptions and traces (e.g., `SyntaxError` in Python or `NullPointerException` in Java).
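The distinction in the last answer can be sketched as a small classifier over log lines. The marker strings come from the answer above; the `classify` function and its category labels are assumptions made for illustration, and real logs will need a longer marker list.

```python
import re

# Marker strings taken from the answer above.
IAM_MARKERS = re.compile(r"AccessDenied|is not authorized to perform")
CODE_MARKERS = re.compile(r"SyntaxError|NullPointerException")


def classify(line):
    """Label a log line as an IAM issue, a code error, or unknown."""
    if IAM_MARKERS.search(line):
        return "iam"
    if CODE_MARKERS.search(line):
        return "code"
    return "unknown"


print(classify(
    "User: arn:aws:iam::123456789012:user/dev is not authorized to perform: "
    "dynamodb:GetItem"
))
print(classify("SyntaxError: invalid syntax"))
```

IAM markers are checked first because a permission failure is often wrapped in a language-level exception, and the permission message is the more specific clue.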