
AWS Data Engineering: Extracting & Preparing Logs for Audits

This study guide covers the critical skills required for Tasks 3.3 and 4.4 of the AWS Certified Data Engineer – Associate (DEA-C01) exam, focusing on logging, monitoring, and auditability within data pipelines.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between AWS CloudTrail and Amazon CloudWatch Logs for auditing.
  • Configure log extraction from various sources into centralized storage.
  • Analyze logs using Amazon Athena, CloudWatch Logs Insights, and OpenSearch.
  • Ensure log integrity and compliance for security audits.
  • Troubleshoot pipeline performance using log-based insights.

Key Terms & Glossary

  • AWS CloudTrail: A service that records API calls and user activity across the AWS infrastructure.
  • Log Stream: A sequence of log events that share the same source (e.g., a specific EC2 instance or Lambda container).
  • Log Group: A collection of log streams that share the same retention, monitoring, and access control settings.
  • CloudTrail Lake: A managed data lake for capturing, storing, and querying activity logs using SQL.
  • Log File Integrity Validation: A feature that uses cryptographic hashes to prove that CloudTrail logs have not been tampered with after delivery.
  • SerDe (Serializer/Deserializer): Used by Amazon Athena to tell the engine how to interpret data in specific log formats (such as JSON or CSV).

The "Big Idea"

In data engineering, logs are the "black box" flight recorder of your pipeline. While monitoring tells you if a pipeline is broken (e.g., a high error rate), logging and auditing tell you why it broke and who touched it. For audits, the goal is traceability: being able to reconstruct the path of a data record from ingestion to consumption while identifying every API call that modified the environment.
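That reconstruction can be walked programmatically: each CloudTrail log file is a JSON document whose `Records` array carries fields like `eventName`, `eventTime`, and `userIdentity`. A minimal sketch (the field names match the real CloudTrail record format; the sample record itself is fabricated for illustration):

```python
import json

# Fabricated sample in the shape of a CloudTrail log file body.
sample_records = {
    "Records": [
        {
            "eventTime": "2023-10-01T12:00:00Z",
            "eventName": "DeleteBucket",
            "eventSource": "s3.amazonaws.com",
            "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
            "sourceIPAddress": "198.51.100.7",
        }
    ]
}

def who_did_what(trail_json):
    """Return (actor, action, time) tuples from a CloudTrail log file body."""
    return [
        (r["userIdentity"].get("arn", "unknown"), r["eventName"], r["eventTime"])
        for r in trail_json["Records"]
    ]

for actor, action, when in who_did_what(sample_records):
    print(f"{when}: {actor} called {action}")
```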

Formula / Concept Box

Concept | Metric / Rule | Key Syntax / Note
--- | --- | ---
CloudTrail Retention | Default: 90 days | Must create a "Trail" to store logs in S3 for longer periods.
CloudWatch Log Limit | Max event size: 256 KB | Large API calls (e.g., launching 500 nodes) may exceed this and won't log.
Athena Querying | Standard SQL | Uses external tables pointing to S3 prefixes.
Log Retention | 1 day to 10 years | Configured at the Log Group level.
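The 256 KB event limit above can be checked before an event is shipped. An illustrative sketch (the real service also counts a small per-event overhead on top of the message bytes, ignored here):

```python
import json

MAX_EVENT_BYTES = 256 * 1024  # CloudWatch Logs per-event limit from the table above

def fits_cloudwatch(event: dict) -> bool:
    """Would this API event's JSON body fit within the 256 KB limit?"""
    return len(json.dumps(event).encode("utf-8")) <= MAX_EVENT_BYTES

small = {"eventName": "RunInstances", "count": 2}
huge = {"eventName": "RunInstances", "instances": ["i-0abc"] * 50000}

print(fits_cloudwatch(small))  # True
print(fits_cloudwatch(huge))   # False: ~500 KB of serialized JSON exceeds the limit
```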

Hierarchical Outline

  1. Log Sources
    • AWS CloudTrail: Management vs. Data events (S3 object-level actions).
    • CloudWatch Logs: Application logs (Lambda print statements, Glue logs).
    • Service-Specific: Redshift (Connection/User logs), EMR (Step logs).
  2. Storage and Centralization
    • Amazon S3: Cheap, long-term archival; target for CloudTrail and exported CW Logs.
    • CloudWatch Log Groups: Active monitoring and real-time alerts.
  3. Analysis Tools
    • Amazon Athena: Ad-hoc SQL analysis on S3-stored logs.
    • CloudWatch Logs Insights: Built-in query language for rapid CW Log searches.
    • Amazon OpenSearch: Advanced full-text search and visual dashboards (Kibana).
  4. Audit Readiness
    • Integrity Validation: Ensuring logs aren't deleted or modified.
    • Encryption: SSE-S3 or SSE-KMS for logs at rest.
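Integrity validation is, at its core, hash comparison: CloudTrail records a SHA-256 hash for each delivered log file in signed digest files. A simplified local sketch that checks only the hash (real validation, e.g. via the `aws cloudtrail validate-logs` CLI command, also verifies the digest's RSA signature):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Hash recorded at delivery time (in reality, stored in a signed digest file).
log_file_body = b'{"Records": []}'
expected_hash = sha256_hex(log_file_body)

def is_untampered(current_body: bytes, recorded_hash: str) -> bool:
    """Re-hash the log file as it exists now and compare with the recorded hash."""
    return sha256_hex(current_body) == recorded_hash

print(is_untampered(log_file_body, expected_hash))         # True
print(is_untampered(b'{"Records": [{}]}', expected_hash))  # False: file was modified
```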

Visual Anchors

The Log Processing Pipeline

[Diagram: log sources (CloudTrail, CloudWatch Logs, service logs) → centralized storage (S3, Log Groups) → analysis (Athena, Logs Insights, OpenSearch)]

Hierarchy of CloudWatch Logging

```latex
\begin{tikzpicture}[node distance=1.5cm,
    every node/.style={draw, rectangle, rounded corners, fill=blue!10, align=center}]
  \node (LG) {\textbf{Log Group}\\ (Policy/Retention)};
  \node (LS1) [below left of=LG, xshift=-1cm] {\textbf{Log Stream A}\\ (Instance 1)};
  \node (LS2) [below right of=LG, xshift=1cm] {\textbf{Log Stream B}\\ (Instance 2)};
  \node (E1) [below of=LS1] {Log Event 1};
  \node (E2) [below of=LS1, yshift=-0.8cm] {Log Event 2};
  \node (E3) [below of=LS2] {Log Event 1};

  \draw[->] (LG) -- (LS1);
  \draw[->] (LG) -- (LS2);
  \draw[->] (LS1) -- (E1);
  \draw[->] (LS1) -- (E2);
  \draw[->] (LS2) -- (E3);
\end{tikzpicture}
```

Definition-Example Pairs

  • Management Events: Provide visibility into control plane operations (e.g., CreateBucket, TerminateInstance).
    • Example: An auditor asks who deleted a production S3 bucket. You search CloudTrail Management events for the DeleteBucket API call.
  • Data Events: Provide visibility into resource-level operations (e.g., PutObject, GetItem).
    • Example: You need to see which IAM user downloaded a sensitive CSV from an S3 bucket. You enable Data Events for that specific bucket in CloudTrail.
  • Metric Filters: Rules that turn log text into numeric data for graphing.
    • Example: Create a filter for the string "404" in web access logs to create a 404ErrorCount metric and set an alarm.
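A metric filter's behavior can be simulated locally: scan each incoming log event for a pattern and emit a count. This sketch uses a plain substring match, which is simpler than the real filter-pattern syntax:

```python
# Fabricated web access log events, one per line.
access_log = [
    '192.0.2.1 - - "GET /index.html" 200 512',
    '192.0.2.2 - - "GET /missing.html" 404 128',
    '192.0.2.3 - - "GET /also-gone" 404 128',
]

def metric_filter_count(events, pattern):
    """Count matching events, like a metric filter feeding a CloudWatch metric."""
    return sum(1 for line in events if pattern in line)

error_404_count = metric_filter_count(access_log, "404")
print(error_404_count)  # 2 -> would be published as the 404ErrorCount metric
```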

Worked Examples

1. Programmatic Logging in AWS Lambda

To extract custom audit information from your pipeline, use Python's built-in logging library.

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    # AUDIT: Log the identity of the invoker and the input
    logger.info(f"User Identity: {event.get('identity')}")
    logger.info(f"Processing Record ID: {event.get('record_id')}")
    try:
        # Data processing logic here
        return {"status": "success"}
    except Exception as e:
        logger.error(f"ERROR: Failed to process record {event.get('record_id')}. Reason: {e}")
        raise
```

2. Querying CloudTrail Logs with Athena

If your CloudTrail trail delivers logs to S3, you can find specific API failures using SQL:

```sql
SELECT eventtime,
       eventsource,
       eventname,
       sourceipaddress,
       errorcode,
       errormessage
FROM "sampledb"."cloudtrail_logs"
WHERE errorcode IS NOT NULL
  AND eventtime >= '2023-10-01T00:00:00Z'
LIMIT 10;
```
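If the same logs are still in a CloudWatch Log Group rather than S3, CloudWatch Logs Insights offers a comparable search. A sketch using the built-in @timestamp and @message fields:

```
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
```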

Checkpoint Questions

  1. Q: How can you verify that a CloudTrail log file has not been altered since it was delivered to S3?
  2. Q: Which service is best suited for searching gigabytes of application logs in CloudWatch for a specific error string using a purpose-built query language?
  3. Q: True or False: Redshift audit logging is enabled by default.
  4. Q: If an API call is larger than 256 KB, will it appear in CloudWatch Logs when streamed from CloudTrail?
Click for Answers
  1. Use Log File Integrity Validation to check the cryptographic hashes.
  2. CloudWatch Logs Insights (or Amazon Athena if stored in S3).
  3. False. It must be explicitly enabled.
  4. No. CloudTrail does not send log events larger than 256 KB to CloudWatch Logs.

Comparison Tables

Feature | CloudTrail | CloudWatch Logs | Amazon OpenSearch
--- | --- | --- | ---
Primary Use | "Who did what?" (auditing) | "How is the app doing?" (monitoring) | Search / log analytics
Data Type | API activity (JSON) | Application/system logs | General log data
Query Method | CloudTrail Lake (SQL) | Logs Insights (proprietary) | Query DSL / dashboards
Default Retention | 90 days (Event history) | Indefinite (configurable) | Depends on disk/nodes

Muddy Points

  • CloudTrail vs. AWS Config: CloudTrail tracks events (the "action"). AWS Config tracks resource state (the "result"). If a bucket's policy changed, CloudTrail tells you who did it; Config tells you what it looked like before vs. after.
  • CloudWatch Agent: Don't confuse the native CloudWatch integration (like Lambda logs) with the CloudWatch Agent. You must manually install the Agent on EC2 instances or on-premises servers to collect logs and system-level metrics (like RAM usage) that aren't natively captured.
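The Agent reads a JSON configuration listing which files to ship and where. A minimal sketch of the logs section (the file path, group name, and stream name here are illustrative):

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/myapp/pipeline.log",
            "log_group_name": "my-pipeline-logs",
            "log_stream_name": "{instance_id}"
          }
        ]
      }
    }
  }
}
```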
