Hands-On Lab

Automating Data Quality Validation with AWS Glue and DQDL


Maintaining high-quality, reliable data is essential for accurate insights. As a data engineer, you must ensure the integrity, completeness, and accuracy of the data flowing through your pipelines. This lab demonstrates how to use AWS Glue Data Quality and the Data Quality Definition Language (DQDL) to profile, validate, and monitor your data at scale.

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges in your AWS account.

Prerequisites

  • AWS Account: Access to an AWS account with permissions for AWS Glue, S3, and IAM.
  • AWS CLI: Installed and configured with appropriate credentials.
  • IAM Permissions: Your user must have the AdministratorAccess policy or specific permissions for glue:*, s3:*, and cloudwatch:*.
  • Region: Use us-east-1 for this lab unless otherwise specified.

Learning Objectives

  • Define Data Quality Rules: Use DQDL to create assertions for completeness, consistency, and integrity.
  • Perform Data Profiling: Automatically generate rule recommendations from an existing dataset.
  • Integrate Quality Checks: Embed data quality checks within an AWS Glue ETL pipeline.
  • Monitor Outcomes: Inspect data quality scores and individual record validation results.

Architecture Overview

This lab builds a serverless pipeline that validates incoming raw data against a defined ruleset before it is used for downstream processing.


Cost Estimate

| Service | Estimated Cost (30 mins) | Free Tier Eligibility |
| --- | --- | --- |
| AWS Glue | ~$0.44 (1 DPU x 0.5 hr) | No (job runs only) |
| Amazon S3 | <$0.01 | Yes (5 GB storage) |
| CloudWatch Logs | <$0.01 | Yes (5 GB ingestion) |
| **Total** | **~$0.45** | |

Step-by-Step Instructions

Step 1: Prepare the S3 Environment

You need a bucket to store the sample raw data and the validation results.

```bash
# Generate a unique bucket name suffix
SUFFIX=$RANDOM
BUCKET_NAME="brainybee-lab-data-quality-$SUFFIX"

# Create the bucket
aws s3 mb s3://$BUCKET_NAME --region us-east-1
```
**Console alternative:** Navigate to **S3 > Create bucket**, name it `brainybee-lab-data-quality-<unique-id>`, and leave the other settings at their defaults.

Step 2: Upload Sample "Bad" Data

We will create a CSV file containing typical data quality issues: missing fields, invalid ages, and duplicate records.

```bash
cat <<EOF > user_data.csv
id,name,email,age
1,Alice,alice@example.com,25
2,Bob,,30
3,Charlie,charlie@example.com,invalid_age
4,David,david@example.com,-5
5,Eve,eve@example.com,40
5,Eve,eve@example.com,40
EOF

aws s3 cp user_data.csv s3://$BUCKET_NAME/raw/user_data.csv
```

Step 3: Run Data Quality Profiling

AWS Glue can analyze your data to suggest rules. We will first crawl the data into the Glue Data Catalog.

  1. Create a Glue Database:

     ```bash
     aws glue create-database --database-input '{"Name": "dq_lab_db"}'
     ```
  2. Create a Table via Crawler: Navigate to AWS Glue Console > Crawlers > Create Crawler.
    • Name: user_data_crawler
    • Data Source: s3://<YOUR_BUCKET_NAME>/raw/
    • IAM Role: Create a new role (e.g., AWSGlueServiceRole-DQ).
    • Target Database: dq_lab_db.
    • Click Run Crawler.
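If you prefer the CLI, the crawler can be created and started with commands along these lines (a sketch: the role ARN is a placeholder for the role you created in the console, and `$BUCKET_NAME` comes from Step 1):

```shell
# Create a crawler over the raw/ prefix (replace the role ARN with your own)
aws glue create-crawler \
  --name user_data_crawler \
  --role arn:aws:iam::<ACCOUNT_ID>:role/AWSGlueServiceRole-DQ \
  --database-name dq_lab_db \
  --targets "{\"S3Targets\": [{\"Path\": \"s3://$BUCKET_NAME/raw/\"}]}"

# Start the crawl, then poll until the state returns to READY
aws glue start-crawler --name user_data_crawler
aws glue get-crawler --name user_data_crawler --query 'Crawler.State'
```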

Step 4: Apply DQDL Ruleset

Once the table exists, we will define the Data Quality Definition Language (DQDL) rules.

[!TIP] DQDL rules are declarative. Instead of writing code, you specify what "good" data looks like.

The Ruleset:

```dqdl
Rules = [
    RowCount > 0,
    Completeness "email" > 0.95,
    ColumnDataType "age" = "INTEGER",
    ColumnValues "age" > 0,
    IsUnique "id"
]
```
  1. In the Glue Console, go to Data Catalog > Tables > user_data.
  2. Select the Data Quality tab.
  3. Click Create Data Quality Ruleset.
  4. Paste the DQDL ruleset above and click Save.
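The same ruleset can also be registered from the CLI (a sketch: the ruleset name `user_data_ruleset` is an arbitrary choice, and the table name must match what the crawler created in your catalog):

```shell
aws glue create-data-quality-ruleset \
  --name user_data_ruleset \
  --target-table '{"TableName": "user_data", "DatabaseName": "dq_lab_db"}' \
  --ruleset 'Rules = [
      RowCount > 0,
      Completeness "email" > 0.95,
      ColumnDataType "age" = "INTEGER",
      ColumnValues "age" > 0,
      IsUnique "id"
  ]'
```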

Step 5: Execute and Inspect the Quality Check

Run the ruleset and observe how the "bad" data impacts the quality score.

  1. Select the ruleset you created and click Run.
  2. Wait for the status to change from Starting to Succeeded.
  3. Observe the Data Quality Score.
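The evaluation run can also be launched and inspected from the CLI (a sketch: the role ARN and the ruleset name `user_data_ruleset` are placeholders matching earlier steps):

```shell
# Start an evaluation run against the catalog table
RUN_ID=$(aws glue start-data-quality-ruleset-evaluation-run \
  --data-source '{"GlueTable": {"DatabaseName": "dq_lab_db", "TableName": "user_data"}}' \
  --role arn:aws:iam::<ACCOUNT_ID>:role/AWSGlueServiceRole-DQ \
  --ruleset-names user_data_ruleset \
  --query RunId --output text)

# Poll the run status; once Succeeded, the response lists result IDs
aws glue get-data-quality-ruleset-evaluation-run --run-id "$RUN_ID"

# Fetch the overall score and per-rule outcomes for a result ID:
# aws glue get-data-quality-result --result-id <RESULT_ID>
```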

Checkpoints

| Verification Step | Expected Result |
| --- | --- |
| `RowCount > 0` | PASSED (6 rows detected) |
| `Completeness "email" > 0.95` | FAILED (1 of 6 emails missing; ~83% complete, below the 0.95 threshold) |
| `ColumnDataType "age" = "INTEGER"` | FAILED ("invalid_age" is a string, not an integer) |
| `ColumnValues "age" > 0` | FAILED (negative age -5 detected) |
| `IsUnique "id"` | FAILED (duplicate ID 5 detected) |

Concept Review

AWS Glue Data Quality is built on the Deequ framework, an open-source library developed by Amazon.

Raw Data Ingestion → AWS Glue DQ (DQDL Rules) → Verification Result (Pass/Fail Score)

Data Quality Dimensions

| Dimension | Description | DQDL Example |
| --- | --- | --- |
| Completeness | Checks for null or empty values | `Completeness "column" > 0.9` |
| Consistency | Checks if data types match expectations | `ColumnDataType "column" = "DATE"` |
| Accuracy | Checks if values fall within range | `ColumnValues "age" between 1 and 120` |
| Integrity | Checks for uniqueness and primary keys | `IsUnique "id"` |

Troubleshooting

| Issue | Possible Cause | Fix |
| --- | --- | --- |
| Rule run fails | IAM role lacks S3 permissions | Ensure `AWSGlueServiceRole` has `s3:GetObject` and `s3:PutObject` for the bucket. |
| 0% data quality score | Schema mismatch | Verify the crawler correctly identified column headers in the CSV. |
| Connection timeout | VPC configuration | If running in a private subnet, ensure S3 Gateway Endpoints are configured. |

Challenge

Task: Modify the ETL pipeline.

  1. Update the DQDL ruleset to include a check that ensures the name column has no leading or trailing whitespace (Hint: use ColumnLength).
  2. Configure the Glue Job to Fail the job if the Data Quality Score is below 80%.
**Hint:** In the Glue Visual Job editor, use the "Evaluate Data Quality" transform. Under "Action", select "Fail job when data quality fails".

Teardown

To avoid ongoing costs, delete the resources created in this lab:

```bash
# Delete the S3 bucket and all contents
aws s3 rb s3://$BUCKET_NAME --force

# Delete the Glue database
aws glue delete-database --name dq_lab_db

# Delete the crawler
aws glue delete-crawler --name user_data_crawler
```
