Troubleshooting Deployment Failures: Using Service Output Logs

\nDeploying applications to AWS involves multiple moving parts. When a deployment fails, the root cause is often buried within logs. This guide focuses on identifying, locating, and interpreting service logs to perform root cause analysis (RCA) as required for the DVA-C02 exam.

Learning Objectives

Identify the primary log locations for key AWS compute and deployment services.
Differentiate between deployment events (infrastructure failures) and application logs (runtime failures).
Perform log querying using Amazon CloudWatch Logs Insights to isolate specific errors.
Interpret common error patterns in logs (e.g., IAM permission errors, timeouts, and syntax issues).

Key Terms & Glossary

CloudWatch Logs Insights: A fully managed service to search and analyze log data using a purpose-built query language.
Standard Error (stderr): The default stream where applications write error messages; captured automatically by AWS Lambda and ECS.
Deployment Rollback: An automatic process where AWS reverts a service to its last known healthy state after a failed deployment.
Log Stream: A sequence of log events that share the same source (e.g., a specific instance of a Lambda function).
Log Group: A group of log streams that share the same retention, monitoring, and access control settings.

The "Big Idea"\nTroubleshooting is the process of "Peeing the Onion." You start with the high-level orchestration event (e.g., a CloudFormation failure or a CodeDeploy error) to find where the process stopped, then dive into the specific service logs (e.g., Lambda execution logs or ECS container logs) to find why the application failed to start or run.

Formula / Concept Box

Service	Primary Log Source	Common Failure Indicator
AWS Lambda	CloudWatch Logs: `/aws/lambda/<name>`	`Task timed out`, `Process exited before completing`
ECS/Fargate	CloudWatch Logs: (via `awslogs` driver)	`Essential container in task exited`, `OOMKilled`
API Gateway	Execution Logs / Access Logs	`502 Bad Gateway`, `403 Forbidden`
Elastic Beanstalk	`/var/log/eb-activity.log`	`Instance deployment failed`, `Health check failed`
CloudFormation	Stack Events Tab	`ROLLBACK_IN_PROGRESS`, `Access Denied`

Hierarchical Outline

Phase 1: Detection & Orchestration Logs
- CloudFormation Events: Always check the "Events" tab first to see which resource failed to create or update.
- CodeDeploy Deployment History: Look for Lifecycle Event failures (e.g., BeforeInstall or ValidateService).
Phase 2: Compute Runtime Logs
- AWS Lambda: Focus on the REPORT line for duration/memory and ERROR lines for stack traces.
- Amazon ECS: Use the awslogs log driver to send container stdout and stderr to CloudWatch.
Phase 3: Deep Dive Analysis
- CloudWatch Logs Insights: Writing queries to filter for ERROR or CRITICAL keywords across multiple log streams.
- X-Ray Traces: Identifying service integration issues where logs show a generic "500 Error" but the trace shows a downstream timeout.

Visual Anchors

Troubleshooting Workflow

Loading Diagram...

Log Aggregation Architecture

Compiling TikZ diagram…

⏳

Running TeX engine…

This may take a few seconds

Definition-Example Pairs

Runtime Error: An error occurring while the program is executing, rather than during compilation.
- Example: A Lambda function failing because it tried to access a DynamoDB table that doesn't exist in the new environment.
Liveness/Readiness Probe Failure: A signal that a container is not yet ready to receive traffic, leading to a deployment rollback.
- Example: An ECS Fargate task failing a health check because the security group blocks the ALB on port 80.
Log Filter Pattern: A specific string or pattern used to search through logs.
- Example: Searching for "?ERROR ?Exception ?Fail" to catch most critical issues in a Java or Python app.

Worked Examples

Example 1: Lambda Initialization Failure

Scenario: A deployment via SAM completes, but the API returns 502 Bad Gateway.

Check Logs: Navigate to /aws/lambda/my-function.
Identify Error: Find a log entry: Runtime.ImportModuleError: Unable to import module 'app': No module named 'requests'.
Root Cause: The requests library was not included in the deployment package (Missing dependency).
Fix: Update requirements.txt and rebuild the artifact.

Example 2: ECS Task Churn

Scenario: CloudFormation stays in UPDATE_IN_PROGRESS then rolls back.

Check Events: CloudFormation says Resource handler returned message: "Task failed to start".
Check Container Logs: CloudWatch shows exec /usr/bin/java: no such file or directory.
Root Cause: The Dockerfile entrypoint points to an incorrect path or the base image changed.

Checkpoint Questions

Where would you find logs if a Lambda function times out before writing any application-level logs?
What is the benefit of using CloudWatch Logs Insights over the standard CloudWatch Logs console search?
If a CodeDeploy deployment fails during the ValidateService hook, where should you look for logs?
How can you differentiate between an IAM permission issue and a code syntax error in a log file?

▶Click for Answers

In the CloudWatch Log Stream for that function; look for the REPORT line, which will contain Status: timeout.
Logs Insights supports a query language (sort, filter, stats, limit) and can search across multiple log groups simultaneously.
Check the deployment logs on the instance (for EC2/On-Premise) or the Lambda logs (for Lambda deployments) for the hook function code.
IAM issues usually contain strings like AccessDenied or is not authorized to perform, whereas syntax errors usually show language-specific traces (e.g., SyntaxError or NullPointerException).

", "word_count": 945, "suggested_title": "Troubleshooting Deployments with AWS Logs" } ```

Troubleshooting Deployment Failures: Using Service Output Logs

Learning Objectives

Identify the primary log locations for key AWS compute and deployment services.
Differentiate between deployment events (infrastructure failures) and application logs (runtime failures).
Perform log querying using Amazon CloudWatch Logs Insights to isolate specific errors.
Interpret common error patterns in logs (e.g., IAM permission errors, timeouts, and syntax issues).

Key Terms & Glossary

CloudWatch Logs Insights: A fully managed service to search and analyze log data using a purpose-built query language.
Standard Error (stderr): The default stream where applications write error messages; captured automatically by AWS Lambda and ECS.
Deployment Rollback: An automatic process where AWS reverts a service to its last known healthy state after a failed deployment.
Log Stream: A sequence of log events that share the same source (e.g., a specific instance of a Lambda function).
Log Group: A group of log streams that share the same retention, monitoring, and access control settings.

The "Big Idea"\nTroubleshooting is the process of "Peeing the Onion." You start with the high-level orchestration event (e.g., a CloudFormation failure or a CodeDeploy error) to find where the process stopped, then dive into the specific service logs (e.g., Lambda execution logs or ECS container logs) to find why the application failed to start or run.

Formula / Concept Box

Service	Primary Log Source	Common Failure Indicator
AWS Lambda	CloudWatch Logs: `/aws/lambda/<name>`	`Task timed out`, `Process exited before completing`
ECS/Fargate	CloudWatch Logs: (via `awslogs` driver)	`Essential container in task exited`, `OOMKilled`
API Gateway	Execution Logs / Access Logs	`502 Bad Gateway`, `403 Forbidden`
Elastic Beanstalk	`/var/log/eb-activity.log`	`Instance deployment failed`, `Health check failed`
CloudFormation	Stack Events Tab	`ROLLBACK_IN_PROGRESS`, `Access Denied`

Hierarchical Outline

Phase 1: Detection & Orchestration Logs
- CloudFormation Events: Always check the "Events" tab first to see which resource failed to create or update.
- CodeDeploy Deployment History: Look for Lifecycle Event failures (e.g., BeforeInstall or ValidateService).
Phase 2: Compute Runtime Logs
- AWS Lambda: Focus on the REPORT line for duration/memory and ERROR lines for stack traces.
- Amazon ECS: Use the awslogs log driver to send container stdout and stderr to CloudWatch.
Phase 3: Deep Dive Analysis
- CloudWatch Logs Insights: Writing queries to filter for ERROR or CRITICAL keywords across multiple log streams.
- X-Ray Traces: Identifying service integration issues where logs show a generic "500 Error" but the trace shows a downstream timeout.

Visual Anchors

Troubleshooting Workflow

Loading Diagram...

Log Aggregation Architecture

Compiling TikZ diagram…

⏳

Running TeX engine…

This may take a few seconds

Definition-Example Pairs

Runtime Error: An error occurring while the program is executing, rather than during compilation.
- Example: A Lambda function failing because it tried to access a DynamoDB table that doesn't exist in the new environment.
Liveness/Readiness Probe Failure: A signal that a container is not yet ready to receive traffic, leading to a deployment rollback.
- Example: An ECS Fargate task failing a health check because the security group blocks the ALB on port 80.
Log Filter Pattern: A specific string or pattern used to search through logs.
- Example: Searching for "?ERROR ?Exception ?Fail" to catch most critical issues in a Java or Python app.

Worked Examples

Example 1: Lambda Initialization Failure

Scenario: A deployment via SAM completes, but the API returns 502 Bad Gateway.

Check Logs: Navigate to /aws/lambda/my-function.
Identify Error: Find a log entry: Runtime.ImportModuleError: Unable to import module 'app': No module named 'requests'.
Root Cause: The requests library was not included in the deployment package (Missing dependency).
Fix: Update requirements.txt and rebuild the artifact.

Example 2: ECS Task Churn

Scenario: CloudFormation stays in UPDATE_IN_PROGRESS then rolls back.

Check Events: CloudFormation says Resource handler returned message: "Task failed to start".
Check Container Logs: CloudWatch shows exec /usr/bin/java: no such file or directory.
Root Cause: The Dockerfile entrypoint points to an incorrect path or the base image changed.

Checkpoint Questions

Where would you find logs if a Lambda function times out before writing any application-level logs?
What is the benefit of using CloudWatch Logs Insights over the standard CloudWatch Logs console search?
If a CodeDeploy deployment fails during the ValidateService hook, where should you look for logs?
How can you differentiate between an IAM permission issue and a code syntax error in a log file?

▶Click for Answers

In the CloudWatch Log Stream for that function; look for the REPORT line, which will contain Status: timeout.
Logs Insights supports a query language (sort, filter, stats, limit) and can search across multiple log groups simultaneously.
Check the deployment logs on the instance (for EC2/On-Premise) or the Lambda logs (for Lambda deployments) for the hook function code.
IAM issues usually contain strings like AccessDenied or is not authorized to perform, whereas syntax errors usually show language-specific traces (e.g., SyntaxError or NullPointerException).

", "word_count": 945, "suggested_title": "Troubleshooting Deployments with AWS Logs" } ```

Troubleshooting AWS Deployment Failures via Service Logs

Troubleshooting Deployment Failures: Using Service Output Logs

Learning Objectives

Key Terms & Glossary

Formula / Concept Box

Hierarchical Outline

Visual Anchors

Troubleshooting Workflow

Log Aggregation Architecture

Definition-Example Pairs

Worked Examples

Example 1: Lambda Initialization Failure

Example 2: ECS Task Churn

Checkpoint Questions

Troubleshooting AWS Deployment Failures via Service Logs

Troubleshooting Deployment Failures: Using Service Output Logs

Learning Objectives

Key Terms & Glossary

Formula / Concept Box

Hierarchical Outline

Visual Anchors

Troubleshooting Workflow

Log Aggregation Architecture

Definition-Example Pairs

Worked Examples

Example 1: Lambda Initialization Failure

Example 2: ECS Task Churn

Checkpoint Questions