
AWS Certified Data Engineer: Verifying and Cleaning Data

Verify and clean data (for example, Lambda, Athena, QuickSight, Jupyter Notebooks, Amazon SageMaker Data Wrangler)

Data cleaning and verification are among the most critical steps in the data engineering lifecycle. Messy data—containing missing values, outliers, and inconsistencies—leads to "garbage in, garbage out" for downstream analytics and machine learning models. This guide covers how to use AWS-native tools like Lambda, Athena, and SageMaker Data Wrangler to ensure data quality.

Learning Objectives

After studying this guide, you should be able to:

  • Identify which AWS service is best suited for specific cleaning tasks (e.g., real-time vs. batch).
  • Configure AWS Lambda for event-driven data validation and enrichment.
  • Execute SQL-based cleaning and verification using Amazon Athena.
  • Utilize SageMaker Data Wrangler and Glue DataBrew for visual, no-code data preparation.
  • Implement interactive data exploration using Jupyter Notebooks and Spark.

Key Terms & Glossary

  • EDA (Exploratory Data Analysis): The process of using visualizations and statistics (histograms, scatterplots) to understand a dataset's characteristics before cleaning.
  • UDF (User Defined Function): Custom logic (often in Python or Java) used to extend the standard SQL capabilities of Athena or Redshift.
  • Data Profiling: Assessing the quality, structure, and content of data (e.g., identifying the percentage of null values in a column).
  • Federated Query: Using Athena to query data across different sources (like RDS or On-prem) without moving the data, often powered by Lambda connectors.
  • Data Wrangler: A feature of Amazon SageMaker that simplifies data cleaning with 300+ built-in transformations.
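The data-profiling idea in the glossary can be sketched in plain Python, independent of any AWS service. This is a minimal illustration with made-up sample records, not a real profiling tool:

```python
# Minimal data-profiling sketch: percentage of missing (None) values
# per column. The records below are illustrative sample data.
records = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": None, "email": "b@example.com"},
    {"user_id": "u3", "email": None},
]

def null_percentages(rows):
    """Return {column: percent of rows where the value is None}."""
    columns = rows[0].keys()
    return {
        col: 100.0 * sum(1 for r in rows if r[col] is None) / len(rows)
        for col in columns
    }

print(null_percentages(records))
```

Services like Glue DataBrew and Data Wrangler generate this kind of per-column statistic automatically as part of their profiling reports.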

The "Big Idea"

Data cleaning is a "task-dictates-the-tool" process. If you need real-time verification as files land in S3, Lambda is your go-to. If you need to perform ad-hoc SQL cleaning on petabytes of data, Athena is the choice. For data scientists needing a visual, repeatable workflow, SageMaker Data Wrangler or Glue DataBrew provides the most efficiency. High-quality data is the prerequisite for all reliable business intelligence (BI) and artificial intelligence (AI).

Formula / Concept Box

Tool Selection Logic | Best Use Case
AWS Lambda | Real-time, event-driven cleaning (single record or small batches).
Amazon Athena | Serverless SQL-based cleaning/verification on S3 data lakes.
SageMaker Data Wrangler | No-code visual cleaning for ML pipelines with anomaly detection.
Glue DataBrew | Visual data profiling and normalization for BI workloads.
Jupyter Notebooks | Complex, programmatic exploration using PySpark or Scala.

Hierarchical Outline

  1. Event-Driven Cleaning with AWS Lambda
    • Triggers: S3 ObjectCreated events initiate cleaning as soon as data is ingested.
    • UDFs: Extend SQL engines (Athena/Redshift) to handle complex logic (e.g., custom PII masking).
  2. SQL-Based Verification with Amazon Athena
    • Standard SQL: Use WHERE, COALESCE, and CASE statements to filter or fix nulls.
    • CTAS (Create Table As Select): Save cleaned versions of data back to S3 in optimized formats (Parquet).
    • Athena for Spark: Use notebooks for complex transformations that standard SQL cannot handle.
  3. Visual Data Preparation
    • SageMaker Data Wrangler: 300+ transformations; handles data drift and anomaly detection.
    • AWS Glue DataBrew: Over 250 built-in transformations; excellent for data profiling reports.
  4. Interactive Exploration
    • Jupyter Notebooks: Available in SageMaker, Glue, and EMR for prototyping cleaning scripts.
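The CTAS pattern from the outline can be sketched as a query-string builder. The table names and S3 bucket are hypothetical; the `WITH (format = 'PARQUET', external_location = ...)` clause is standard Athena CTAS syntax for writing results back to S3 in an optimized format:

```python
# Sketch: assembling an Athena CTAS statement that saves a cleaned
# copy of a table to S3 as Parquet. All names here are made up.
def build_ctas(target, source, s3_location):
    """Return a CTAS query string for Athena."""
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = 'PARQUET', external_location = '{s3_location}')\n"
        f"AS SELECT * FROM {source} WHERE user_id IS NOT NULL;"
    )

query = build_ctas("cleaned_events", "raw_events", "s3://my-bucket/cleaned/")
print(query)
```

In practice you would submit this string through the Athena console or API; the point is that a single CTAS both filters bad rows and converts the storage format.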

Visual Anchors

Data Cleaning Workflow


Lambda UDF Interaction in Athena


Definition-Example Pairs

  • Imputation: Replacing missing data with substituted values.
    • Example: Replacing null values in a "Temperature" column with the average temperature of the last 24 hours.
  • Standardization: Converting data to a common format.
    • Example: Converting all date strings from "MM/DD/YYYY" and "DD-MM-YY" to the ISO-8601 format "YYYY-MM-DD".
  • Deduplication: Identifying and removing duplicate records.
    • Example: Using a SQL ROW_NUMBER() function in Athena to find and delete duplicate transaction IDs.
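The standardization pair above can be sketched in a few lines of Python. Only the two formats mentioned in the example are handled; a real pipeline would need a fuller list:

```python
from datetime import datetime

# Sketch of the standardization example: normalize mixed date strings
# to ISO-8601. Only the two formats from the example are accepted.
FORMATS = ["%m/%d/%Y", "%d-%m-%y"]

def to_iso8601(raw):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(to_iso8601("03/15/2024"))  # -> 2024-03-15
print(to_iso8601("15-03-24"))    # -> 2024-03-15
```

The same try-each-format logic is what a custom transformation step in Data Wrangler or a Lambda enrichment function would apply row by row.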

Worked Examples

Example 1: Cleaning Missing Values with Athena SQL

Scenario: You have a CSV in S3 where the user_id is sometimes missing. You want to create a clean view.

sql
CREATE OR REPLACE VIEW cleaned_users AS
SELECT
    COALESCE(user_id, 'UNKNOWN') AS user_id,
    LOWER(email) AS normalized_email,
    signup_date
FROM raw_users_table
WHERE signup_date IS NOT NULL;

Example 2: Real-time Validation with Lambda

Scenario: A Lambda function is triggered by an S3 upload. It checks if the file is a CSV and if the header is correct.

python
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Read the first line to check the header (CSV assumed to be UTF-8)
    response = s3.get_object(Bucket=bucket, Key=key)
    header = response['Body'].readline().decode('utf-8')

    if "required_column" not in header:
        # Move the file to a 'quarantine' prefix and remove the original
        s3.copy_object(
            Bucket=bucket,
            CopySource={'Bucket': bucket, 'Key': key},
            Key=f"quarantine/{key}",
        )
        s3.delete_object(Bucket=bucket, Key=key)
        print(f"Invalid file {key}. Quarantined.")

Checkpoint Questions

  1. Which service allows you to use 300+ built-in transformations without writing code?
  2. How does Amazon Athena charge for its standard SQL queries?
  3. When should you use a Lambda UDF instead of standard SQL in Amazon Redshift or Athena?
  4. What is the difference between Glue DataBrew and SageMaker Data Wrangler?

Comparison Tables

Feature | Lambda | Athena | Data Wrangler | DataBrew
Interface | Code (Python/Node) | SQL / Notebooks | Visual / No-code | Visual / No-code
Scaling | Automatic per-event | Automatic per-query | Managed instance | Serverless
Primary Use | Real-time cleaning | Ad-hoc analytics | ML prep / EDA | BI prep / Profiling
Cost Model | Per request/duration | Per data scanned | Per hour (instance) | Per node-hour

Muddy Points & Cross-Refs

  • DataBrew vs. Data Wrangler: They are very similar. Focus on the consumer. If the task mentions "Data Scientists" or "ML Pipelines," lean towards Data Wrangler. If it mentions "Business Analysts" or "QuickSight," lean towards DataBrew.
  • Athena vs. Redshift Spectrum: Both query S3 via SQL. Use Athena for ad-hoc, serverless needs. Use Redshift Spectrum if you already have a Redshift cluster and want to join S3 data with local warehouse tables.
  • Lambda Limits: Remember that Lambda has a 15-minute timeout and 10GB memory limit. For cleaning massive individual files (e.g., a 100GB CSV), use AWS Glue or EMR instead.
