AWS Data Engineering: Extracting & Preparing Logs for Audits
This study guide covers the critical skills required for Tasks 3.3 and 4.4 of the AWS Certified Data Engineer – Associate (DEA-C01) exam, focusing on logging, monitoring, and auditability within data pipelines.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between AWS CloudTrail and Amazon CloudWatch Logs for auditing.
- Configure log extraction from various sources into centralized storage.
- Analyze logs using Amazon Athena, CloudWatch Logs Insights, and OpenSearch.
- Ensure log integrity and compliance for security audits.
- Troubleshoot pipeline performance using log-based insights.
Key Terms & Glossary
- AWS CloudTrail: A service that records API calls and user activity across your AWS account.
- Log Stream: A sequence of log events that share the same source (e.g., a specific EC2 instance or Lambda container).
- Log Group: A collection of log streams that share the same retention, monitoring, and access control settings.
- CloudTrail Lake: A managed data lake for capturing, storing, and querying activity logs using SQL.
- Log File Integrity Validation: A feature that uses cryptographic hashes to prove that CloudTrail logs have not been tampered with after delivery.
- SerDe (Serializer/Deserializer): A library Amazon Athena uses to tell the query engine how to interpret data in specific log formats (such as JSON or CSV).
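The hash-chaining idea behind Log File Integrity Validation can be sketched in a few lines of Python. This is a simplified illustration of the concept, not the actual CloudTrail digest-file format (which additionally signs each digest with a private key):

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_digest_chain(log_files):
    """Chain each log file's hash with the previous digest (simplified sketch)."""
    chain, prev = [], ""
    for content in log_files:
        digest = sha256_hex(content + prev.encode())
        chain.append(digest)
        prev = digest
    return chain


def validate(log_files, chain):
    """Recompute the chain; tampering with any file breaks every later digest."""
    return chain == build_digest_chain(log_files)


logs = [b'{"eventName":"CreateBucket"}', b'{"eventName":"DeleteBucket"}']
chain = build_digest_chain(logs)
assert validate(logs, chain)
assert not validate([logs[0], b'{"eventName":"tampered"}'], chain)
```

Because each digest incorporates the previous one, an attacker cannot silently delete or alter an old log file without invalidating every subsequent digest.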
The "Big Idea"
In data engineering, logs are the "black box" flight recorder of your pipeline. While monitoring tells you if a pipeline is broken (e.g., a high error rate), logging and auditing tell you why it broke and who touched it. For audits, the goal is traceability: being able to reconstruct the path of a data record from ingestion to consumption while identifying every API call that modified the environment.
Formula / Concept Box
| Concept | Metric / Rule | Key Syntax/Note |
|---|---|---|
| CloudTrail Retention | Default: 90 Days | Must create a "Trail" to store logs in S3 for longer periods. |
| CloudWatch Log Limit | Max Event Size: 256 KB | CloudTrail does not send events over 256 KB (e.g., a call launching 500 nodes) to CloudWatch Logs; they are still delivered to S3. |
| Athena Querying | Standard SQL | Uses External Tables pointing to S3 prefixes. |
| Log Retention | 1 day to 10 years | Configured at the Log Group level. |
Hierarchical Outline
- Log Sources
- AWS CloudTrail: Management vs. Data events (S3 object-level actions).
- CloudWatch Logs: Application logs (Lambda `print` statements, Glue logs).
- Service-Specific: Redshift (Connection/User logs), EMR (Step logs).
- Storage and Centralization
- Amazon S3: Cheap, long-term archival; target for CloudTrail and exported CW Logs.
- CloudWatch Log Groups: Active monitoring and real-time alerts.
- Analysis Tools
- Amazon Athena: Ad-hoc SQL analysis on S3-stored logs.
- CloudWatch Logs Insights: Built-in query language for rapid CW Log searches.
- Amazon OpenSearch: Advanced full-text search and visual dashboards (Kibana).
- Audit Readiness
- Integrity Validation: Ensuring logs aren't deleted or modified.
- Encryption: SSE-S3 or SSE-KMS for logs at rest.
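To make SSE-KMS encryption of audit logs enforceable rather than optional, a bucket policy can deny any unencrypted write. A sketch of such a policy (the bucket name `example-audit-logs` is a placeholder):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedLogWrites",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::example-audit-logs/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption": "aws:kms"
        }
      }
    }
  ]
}
```

Any `PutObject` that does not request SSE-KMS is rejected, so log deliveries cannot land unencrypted even by accident.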
Visual Anchors
The Log Processing Pipeline
Hierarchy of CloudWatch Logging
\begin{tikzpicture}[node distance=1.5cm, every node/.style={draw, rectangle, rounded corners, fill=blue!10, align=center}]
  \node (LG) {\textbf{Log Group}\\ (Policy/Retention)};
  \node (LS1) [below left of=LG, xshift=-1cm] {\textbf{Log Stream A}\\ (Instance 1)};
  \node (LS2) [below right of=LG, xshift=1cm] {\textbf{Log Stream B}\\ (Instance 2)};
  \node (E1) [below of=LS1] {Log Event 1};
  \node (E2) [below of=LS1, yshift=-0.8cm] {Log Event 2};
  \node (E3) [below of=LS2] {Log Event 1};
  \draw[->] (LG) -- (LS1);
  \draw[->] (LG) -- (LS2);
  \draw[->] (LS1) -- (E1);
  \draw[->] (LS1) -- (E2);
  \draw[->] (LS2) -- (E3);
\end{tikzpicture}
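The hierarchy in the diagram can be sketched as a tiny Python data model (illustrative only; the class and field names are mine, not an AWS API). Note where retention lives and where searches are scoped: at the group level, spanning all streams.

```python
from dataclasses import dataclass, field


@dataclass
class LogEvent:
    timestamp: int
    message: str


@dataclass
class LogStream:
    """One source, e.g. a single EC2 instance or Lambda container."""
    name: str
    events: list = field(default_factory=list)


@dataclass
class LogGroup:
    """Retention and access control live here, not on individual streams."""
    name: str
    retention_days: int
    streams: list = field(default_factory=list)

    def search(self, needle: str):
        # Search every stream in the group, the way Logs Insights does
        return [e for s in self.streams for e in s.events if needle in e.message]


group = LogGroup("/aws/lambda/etl", retention_days=30)
group.streams = [
    LogStream("instance-1", [LogEvent(1, "OK"), LogEvent(2, "ERROR: timeout")]),
    LogStream("instance-2", [LogEvent(3, "OK")]),
]
print(len(group.search("ERROR")))  # → 1
```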
Definition-Example Pairs
- Management Events: Provide visibility into control plane operations (e.g., `CreateBucket`, `TerminateInstances`).
  - Example: An auditor asks who deleted a production S3 bucket. You search CloudTrail Management events for the `DeleteBucket` API call.
- Data Events: Provide visibility into resource-level (data plane) operations (e.g., `PutObject`, `GetItem`).
  - Example: You need to see which IAM user downloaded a sensitive CSV from an S3 bucket. You enable Data Events for that specific bucket in CloudTrail.
- Metric Filters: Rules that turn log text into numeric data for graphing.
  - Example: Create a filter for the string "404" in web access logs to create a `404ErrorCount` metric and set an alarm.
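The Metric Filter example can be sketched in plain Python. This simulates what the filter does (count matching lines into a datapoint); it is not a CloudWatch API call, and the log lines are made up:

```python
import re

# Hypothetical web access log lines
ACCESS_LOGS = [
    "10.0.0.1 - GET /index.html 200",
    "10.0.0.2 - GET /missing.png 404",
    "10.0.0.3 - GET /gone.css 404",
]


def metric_filter(lines, pattern):
    """Count lines matching `pattern`, like a filter feeding a 404ErrorCount metric."""
    return sum(1 for line in lines if re.search(pattern, line))


count_404 = metric_filter(ACCESS_LOGS, r"\b404\b")
print(count_404)  # → 2
```

In CloudWatch, this datapoint would be emitted per evaluation period, and an alarm would fire when it crosses a threshold.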
Worked Examples
1. Programmatic Logging in AWS Lambda
To extract custom audit information from your pipeline, use Python's built-in `logging` module, which Lambda forwards to CloudWatch Logs automatically.

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # AUDIT: Log the identity of the invoker and the input
    logger.info(f"User Identity: {event.get('identity')}")
    logger.info(f"Processing Record ID: {event.get('record_id')}")
    try:
        # Data processing logic here
        return {"status": "success"}
    except Exception as e:
        logger.error(f"ERROR: Failed to process record {event.get('record_id')}. Reason: {e}")
        raise
```

2. Querying CloudTrail Logs with Athena
If your trail delivers CloudTrail logs to S3, you can find specific API failures using SQL:
```sql
SELECT
    eventtime,
    eventsource,
    eventname,
    sourceipaddress,
    errorcode,
    errormessage
FROM "sampledb"."cloudtrail_logs"
WHERE errorcode IS NOT NULL
  AND eventtime >= '2023-10-01T00:00:00Z'
LIMIT 10;
```

Checkpoint Questions
- Q: How can you verify that a CloudTrail log file has not been altered since it was delivered to S3?
- Q: Which service is best suited for searching gigabytes of application logs for a specific error string using a SQL-like syntax?
- Q: True or False: Redshift audit logging is enabled by default.
- Q: If an API call is larger than 256 KB, will it appear in CloudWatch Logs when streamed from CloudTrail?
Answers
- Use Log File Integrity Validation to check the cryptographic hashes.
- CloudWatch Logs Insights (or Amazon Athena if stored in S3).
- False. It must be explicitly enabled.
- No. CloudTrail does not send log events larger than 256 KB to CloudWatch Logs.
Comparison Tables
| Feature | CloudTrail | CloudWatch Logs | Amazon OpenSearch |
|---|---|---|---|
| Primary Use | "Who did what?" (Auditing) | "How is the app doing?" | Search/Log Analytics |
| Data Type | API Activity (JSON) | Application/System Logs | General Log Data |
| Query Method | CloudTrail Lake (SQL) | Logs Insights (Proprietary) | Query DSL / Dashboards |
| Default Retention | 90 Days (Event History) | Never expires (configurable) | Depends on disk/nodes |
Muddy Points
- CloudTrail vs. AWS Config: CloudTrail tracks events (the "action"). AWS Config tracks resource state (the "result"). If a bucket's policy changed, CloudTrail tells you who did it; Config tells you what it looked like before vs. after.
- CloudWatch Agent: Don't confuse the native CloudWatch integration (like Lambda logs) with the CloudWatch Agent. You must manually install the Agent on EC2 instances or on-premises servers to collect logs and system-level metrics (like RAM usage) that aren't natively captured.
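As a concrete illustration of the Agent setup, a minimal `amazon-cloudwatch-agent.json` fragment that ships a file-based application log to a log group might look like this (the file path and log group name are placeholders):

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/etl/app.log",
            "log_group_name": "/onprem/etl/app",
            "log_stream_name": "{hostname}"
          }
        ]
      }
    }
  }
}
```

The `{hostname}` placeholder makes each server write to its own log stream inside the shared log group, matching the Log Group/Log Stream hierarchy shown earlier.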