# Lab: Monitoring and Auditing AWS Data Pipelines

*Maintaining and Monitoring Data Pipelines*
This hands-on lab guides you through implementing a robust monitoring and alerting solution for a serverless data pipeline. You will learn to capture logs, create metric filters, and automate notifications when failures occur.
## Prerequisites
- An active AWS Account.
- AWS CLI installed and configured with Administrator access.
- Basic knowledge of Python and SQL.
- Familiarity with the AWS Management Console.
> [!IMPORTANT]
> Ensure your CLI is configured for a specific region (e.g., `us-east-1`) and use that region consistently throughout the lab.
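If you are unsure which region your CLI will target, you can check the configured default (assumes the AWS CLI and a default profile):

```bash
# Print the region configured for the default profile
aws configure get region
```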
## Learning Objectives
- Configure Amazon CloudWatch Logs to centralize pipeline execution data.
- Create a CloudWatch Metric Filter to detect specific error patterns (e.g., "ERROR").
- Set up Amazon SNS for automated real-time alerts.
- Utilize CloudWatch Logs Insights to perform log analysis for auditing.
## Architecture Overview

The pipeline under test is intentionally simple: a Lambda function (`pipeline-worker`) writes execution logs to CloudWatch Logs, a metric filter counts log events containing "ERROR", a CloudWatch alarm watches that count, and the alarm publishes to an SNS topic that emails you.

## Step-by-Step Instructions
### Step 1: Create an SNS Topic and Subscription
You need a notification channel to receive alerts when your pipeline fails.
```bash
# Create the SNS Topic
aws sns create-topic --name brainybee-pipeline-alerts

# Subscribe your email (replace <YOUR_REGION>, <YOUR_ACCOUNT_ID>, and <YOUR_EMAIL>)
aws sns subscribe \
  --topic-arn arn:aws:sns:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:brainybee-pipeline-alerts \
  --protocol email \
  --notification-endpoint <YOUR_EMAIL>
```

**Console alternative:**

- Navigate to SNS > Topics > Create topic.
- Name it `brainybee-pipeline-alerts`.
- In the topic view, click Create subscription.
- Select Email and enter your address.
> [!NOTE]
> Check your inbox and click Confirm Subscription in the email from AWS.
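You can also verify from the CLI: an unconfirmed subscription reports `PendingConfirmation` in place of a subscription ARN (replace the placeholders as above):

```bash
# List subscriptions on the topic; a confirmed one shows a full ARN
aws sns list-subscriptions-by-topic \
  --topic-arn arn:aws:sns:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:brainybee-pipeline-alerts
```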
### Step 2: Create a CloudWatch Log Group
Centralize your pipeline logs for monitoring.
```bash
# Pre-create the log group the Lambda function will write to
# (Lambda logs to /aws/lambda/<function-name> by default)
aws logs create-log-group --log-group-name /aws/lambda/pipeline-worker
```

### Step 3: Simulate a Pipeline Failure
We will use a Lambda function to simulate a data processing task that intermittently logs errors.
```bash
# Create an execution role for Lambda
# (Simplified for lab purposes)
aws iam create-role \
  --role-name lambda-monitor-role \
  --assume-role-policy-document '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "lambda.amazonaws.com"},"Action": "sts:AssumeRole"}]}'

# Attach logging permissions
aws iam attach-role-policy \
  --role-name lambda-monitor-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```

Create a file named `lambda_function.py`:
```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    print("START: Data processing task")
    # Simulate a failure
    logger.error("ERROR: Data validation failed for record ID 9921")
    return {"status": "complete"}
```

```bash
# Zip and deploy
zip function.zip lambda_function.py
aws lambda create-function --function-name pipeline-worker \
  --zip-file fileb://function.zip --handler lambda_function.lambda_handler \
  --runtime python3.12 --role arn:aws:iam::<YOUR_ACCOUNT_ID>:role/lambda-monitor-role

# Invoke to generate logs
aws lambda invoke --function-name pipeline-worker out.txt
```

### Step 4: Create a Metric Filter and Alarm
This step automates the detection of the word "ERROR" in your logs.
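To preview locally what such a filter will match, here is a quick sketch that reproduces the Step 3 handler and captures its log output in memory (nothing AWS-specific; the `preview` logger name exists only for this demo):

```python
import logging
from io import StringIO

# Capture log output in memory instead of sending it to CloudWatch Logs
logger = logging.getLogger("preview")
stream = StringIO()
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    print("START: Data processing task")
    logger.error("ERROR: Data validation failed for record ID 9921")
    return {"status": "complete"}

result = lambda_handler({}, None)
log_output = stream.getvalue()

# "ERROR" is the exact substring the metric filter below matches on
assert result == {"status": "complete"}
assert "ERROR" in log_output
```

A plain-text filter pattern like `"ERROR"` matches any log event containing that term, so ordinary Python `logging` output is enough to trigger it.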
```bash
# Create Metric Filter
aws logs put-metric-filter \
  --log-group-name /aws/lambda/pipeline-worker \
  --filter-name ErrorFilter \
  --filter-pattern "ERROR" \
  --metric-transformations metricName=ErrorCount,metricNamespace=PipelineMonitor,metricValue=1

# Create Alarm
aws cloudwatch put-metric-alarm \
  --alarm-name PipelineErrorAlarm \
  --metric-name ErrorCount \
  --namespace PipelineMonitor \
  --statistic Sum \
  --period 60 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:brainybee-pipeline-alerts
```

## Checkpoints
- SNS Confirmation: Do you have a green "Confirmed" status in the SNS Console?
- Log Discovery: Navigate to CloudWatch Logs > Log Groups > `/aws/lambda/pipeline-worker`. Can you see the "ERROR" message?
- Alarm State: In CloudWatch Alarms, is `PipelineErrorAlarm` in the `OK` or `ALARM` state? (Invoke the Lambda again if it stays in `OK`.)
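To cover the auditing objective with CloudWatch Logs Insights, run a query like the following against `/aws/lambda/pipeline-worker` in the Logs Insights console (a sketch of a typical error-audit query):

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
```

This returns the most recent error events with timestamps, which is useful when reconciling alarm history against the underlying log records.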
## Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| No email received | SNS Subscription not confirmed | Check spam folder and click confirm link. |
| Alarm stays in `INSUFFICIENT_DATA` | No logs matched the filter | Invoke the Lambda function 2-3 times to trigger the pattern. |
| Lambda fails to create | Role not yet propagated | Wait 10 seconds and retry the create-function command. |
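When diagnosing alarm issues, it can help to query the alarm's state directly from the CLI (assumes the alarm name from Step 4):

```bash
# Print just the alarm's current state (OK, ALARM, or INSUFFICIENT_DATA)
aws cloudwatch describe-alarms \
  --alarm-names PipelineErrorAlarm \
  --query 'MetricAlarms[0].StateValue' \
  --output text
```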
## Clean-Up / Teardown
> [!WARNING]
> Always delete lab resources to avoid unexpected AWS charges.
```bash
# Delete Alarm
aws cloudwatch delete-alarms --alarm-names PipelineErrorAlarm

# Delete Log Group
aws logs delete-log-group --log-group-name /aws/lambda/pipeline-worker

# Delete Lambda
aws lambda delete-function --function-name pipeline-worker

# Delete SNS Topic
aws sns delete-topic --topic-arn arn:aws:sns:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:brainybee-pipeline-alerts

# Delete IAM Role (detach policy first)
aws iam detach-role-policy --role-name lambda-monitor-role \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam delete-role --role-name lambda-monitor-role
```

## Cost Estimate
- CloudWatch: First 5GB of logs and 10 alarms are free tier eligible. < $0.10 for this lab.
- Lambda: First 1 million requests per month are free. $0.00 for this lab.
- SNS: First 1,000 emails per month are free. $0.00 for this lab.
## Stretch Challenge
Modify the Metric Filter to use a JSON filter pattern. If your Lambda logs in JSON format (e.g., `{"status": "ERROR", "code": 500}`), create a filter that only triggers an alarm if the code is 500.
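For a JSON filter pattern to match, the function must emit JSON-formatted log lines. A minimal sketch of the logging side (the `log_event` helper is hypothetical, not part of the lab code):

```python
import json

def log_event(status, code):
    # Emit one structured JSON log line; in Lambda, anything printed
    # to stdout becomes a CloudWatch Logs event
    line = json.dumps({"status": status, "code": code})
    print(line)
    return line

emitted = log_event("ERROR", 500)
assert json.loads(emitted) == {"status": "ERROR", "code": 500}
```

On the CloudWatch side, a JSON filter pattern of the form `{ ($.status = "ERROR") && ($.code = 500) }` would then match only those events; see the CloudWatch Logs filter-pattern syntax for details.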
## Concept Review

### Monitoring vs. Auditing
| Tool | Primary Use Case |
|---|---|
| CloudWatch Logs | Storing and searching application-level logs. |
| CloudWatch Alarms | Triggering actions based on metric thresholds. |
| AWS CloudTrail | Auditing API calls made by users or services. |
| Redshift System Tables | Troubleshooting data load errors (e.g., `STL_LOAD_ERRORS`). |