
AWS Certified Data Engineer: Verifying and Cleaning Data

Verify and clean data (for example, Lambda, Athena, QuickSight, Jupyter Notebooks, Amazon SageMaker Data Wrangler)

Data cleaning and verification are among the most critical steps in the data engineering lifecycle. Messy data—containing missing values, outliers, and inconsistencies—leads to "garbage in, garbage out" for downstream analytics and machine learning models. This guide covers how to use AWS-native tools like Lambda, Athena, and SageMaker Data Wrangler to ensure data quality.

Learning Objectives

After studying this guide, you should be able to:

  • Identify which AWS service is best suited for specific cleaning tasks (e.g., real-time vs. batch).
  • Configure AWS Lambda for event-driven data validation and enrichment.
  • Execute SQL-based cleaning and verification using Amazon Athena.
  • Utilize SageMaker Data Wrangler and Glue DataBrew for visual, no-code data preparation.
  • Implement interactive data exploration using Jupyter Notebooks and Spark.

Key Terms & Glossary

  • EDA (Exploratory Data Analysis): The process of using visualizations and statistics (histograms, scatterplots) to understand a dataset's characteristics before cleaning.
  • UDF (User Defined Function): Custom logic (often in Python or Java) used to extend the standard SQL capabilities of Athena or Redshift.
  • Data Profiling: Assessing the quality, structure, and content of data (e.g., identifying the percentage of null values in a column).
  • Federated Query: Using Athena to query data across different sources (like RDS or On-prem) without moving the data, often powered by Lambda connectors.
  • Data Wrangler: A feature of Amazon SageMaker that simplifies data cleaning with 300+ built-in transformations.
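The data-profiling idea in the glossary can be sketched in plain Python, independent of any AWS service. This is a minimal illustration with made-up sample records, not a real profiling tool:

```python
# Minimal data-profiling sketch: percentage of missing (None) values
# per column. The records below are illustrative sample data.
records = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": None, "email": "b@example.com"},
    {"user_id": "u3", "email": None},
]

def null_percentages(rows):
    """Return {column: percent of rows where the value is None}."""
    columns = rows[0].keys()
    return {
        col: 100.0 * sum(1 for r in rows if r[col] is None) / len(rows)
        for col in columns
    }

print(null_percentages(records))
```

Services like Glue DataBrew and Data Wrangler generate this kind of per-column statistic automatically as part of their profiling reports.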

The "Big Idea"

Data cleaning is a "task-dictates-the-tool" process. If you need real-time verification as files land in S3, Lambda is your go-to. If you need to perform ad-hoc SQL cleaning on petabytes of data, Athena is the choice. For data scientists needing a visual, repeatable workflow, SageMaker Data Wrangler or Glue DataBrew provides the most efficiency. High-quality data is the prerequisite for all reliable business intelligence (BI) and artificial intelligence (AI).

Formula / Concept Box

Tool Selection Logic | Best Use Case
AWS Lambda | Real-time, event-driven cleaning (single record or small batches).
Amazon Athena | Serverless SQL-based cleaning/verification on S3 data lakes.
SageMaker Data Wrangler | No-code visual cleaning for ML pipelines with anomaly detection.
Glue DataBrew | Visual data profiling and normalization for BI workloads.
Jupyter Notebooks | Complex, programmatic exploration using PySpark or Scala.

Hierarchical Outline

  1. Event-Driven Cleaning with AWS Lambda
    • Triggers: S3 ObjectCreated events initiate cleaning as soon as data is ingested.
    • UDFs: Extend SQL engines (Athena/Redshift) to handle complex logic (e.g., custom PII masking).
  2. SQL-Based Verification with Amazon Athena
    • Standard SQL: Use WHERE, COALESCE, and CASE statements to filter or fix nulls.
    • CTAS (Create Table As Select): Save cleaned versions of data back to S3 in optimized formats (Parquet).
    • Athena for Spark: Use notebooks for complex transformations that standard SQL cannot handle.
  3. Visual Data Preparation
    • SageMaker Data Wrangler: 300+ transformations; handles data drift and anomaly detection.
    • AWS Glue DataBrew: Over 250 built-in transformations; excellent for data profiling reports.
  4. Interactive Exploration
    • Jupyter Notebooks: Available in SageMaker, Glue, and EMR for prototyping cleaning scripts.
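The CTAS pattern from the outline can be sketched as a query-string builder. The table names and S3 bucket are hypothetical; the `WITH (format = 'PARQUET', external_location = ...)` clause is standard Athena CTAS syntax for writing results back to S3 in an optimized format:

```python
# Sketch: assembling an Athena CTAS statement that saves a cleaned
# copy of a table to S3 as Parquet. All names here are made up.
def build_ctas(target, source, s3_location):
    """Return a CTAS query string for Athena."""
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = 'PARQUET', external_location = '{s3_location}')\n"
        f"AS SELECT * FROM {source} WHERE user_id IS NOT NULL;"
    )

query = build_ctas("cleaned_events", "raw_events", "s3://my-bucket/cleaned/")
print(query)
```

In practice you would submit this string through the Athena console or API; the point is that a single CTAS both filters bad rows and converts the storage format.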

Visual Anchors

Data Cleaning Workflow


Lambda UDF Interaction in Athena


Definition-Example Pairs

  • Imputation: Replacing missing data with substituted values.
    • Example: Replacing null values in a "Temperature" column with the average temperature of the last 24 hours.
  • Standardization: Converting data to a common format.
    • Example: Converting all date strings from "MM/DD/YYYY" and "DD-MM-YY" to the ISO-8601 format "YYYY-MM-DD".
  • Deduplication: Identifying and removing duplicate records.
    • Example: Using a SQL ROW_NUMBER() function in Athena to find and delete duplicate transaction IDs.
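The standardization pair above can be sketched in a few lines of Python. Only the two formats mentioned in the example are handled; a real pipeline would need a fuller list:

```python
from datetime import datetime

# Sketch of the standardization example: normalize mixed date strings
# to ISO-8601. Only the two formats from the example are accepted.
FORMATS = ["%m/%d/%Y", "%d-%m-%y"]

def to_iso8601(raw):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(to_iso8601("03/15/2024"))  # -> 2024-03-15
print(to_iso8601("15-03-24"))    # -> 2024-03-15
```

The same try-each-format logic is what a custom transformation step in Data Wrangler or a Lambda enrichment function would apply row by row.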

Worked Examples

Example 1: Cleaning Missing Values with Athena SQL

Scenario: You have a CSV in S3 where the user_id is sometimes missing. You want to create a clean view.

sql
CREATE OR REPLACE VIEW cleaned_users AS
SELECT
    COALESCE(user_id, 'UNKNOWN') AS user_id,
    LOWER(email) AS normalized_email,
    signup_date
FROM raw_users_table
WHERE signup_date IS NOT NULL;

Example 2: Real-time Validation with Lambda

Scenario: A Lambda function is triggered by an S3 upload. It checks if the file is a CSV and if the header is correct.

python
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Read the first line to check the header (CSV assumed to be UTF-8)
    response = s3.get_object(Bucket=bucket, Key=key)
    header = response['Body'].readline().decode('utf-8')

    if "required_column" not in header:
        # Move the file to a 'quarantine' prefix and remove the original
        s3.copy_object(
            Bucket=bucket,
            CopySource={'Bucket': bucket, 'Key': key},
            Key=f"quarantine/{key}",
        )
        s3.delete_object(Bucket=bucket, Key=key)
        print(f"Invalid file {key}. Quarantined.")

Checkpoint Questions

  1. Which service allows you to use 300+ built-in transformations without writing code?
  2. How does Amazon Athena charge for its standard SQL queries?
  3. When should you use a Lambda UDF instead of standard SQL in Amazon Redshift or Athena?
  4. What is the difference between Glue DataBrew and SageMaker Data Wrangler?

Comparison Tables

Feature | Lambda | Athena | Data Wrangler | DataBrew
Interface | Code (Python/Node) | SQL / Notebooks | Visual / No-code | Visual / No-code
Scaling | Automatic per-event | Automatic per-query | Managed instance | Serverless
Primary Use | Real-time cleaning | Ad-hoc analytics | ML prep / EDA | BI prep / Profiling
Cost Model | Per request/duration | Per data scanned | Per hour (instance) | Per node-hour

Muddy Points & Cross-Refs

  • DataBrew vs. Data Wrangler: They are very similar. Focus on the consumer. If the task mentions "Data Scientists" or "ML Pipelines," lean towards Data Wrangler. If it mentions "Business Analysts" or "QuickSight," lean towards DataBrew.
  • Athena vs. Redshift Spectrum: Both query S3 via SQL. Use Athena for ad-hoc, serverless needs. Use Redshift Spectrum if you already have a Redshift cluster and want to join S3 data with local warehouse tables.
  • Lambda Limits: Remember that Lambda has a 15-minute timeout and 10GB memory limit. For cleaning massive individual files (e.g., a 100GB CSV), use AWS Glue or EMR instead.
