# Automating Data Quality Validation with AWS Glue and DQDL
Maintaining high-quality, reliable data is essential for accurate insights. As a data engineer, you must ensure the integrity, completeness, and accuracy of the data flowing through your pipelines. This lab demonstrates how to use AWS Glue Data Quality and the Data Quality Definition Language (DQDL) to profile, validate, and monitor your data at scale.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges in your AWS account.
## Prerequisites
- AWS Account: Access to an AWS account with permissions for AWS Glue, S3, and IAM.
- AWS CLI: Installed and configured with appropriate credentials.
- IAM Permissions: Your user must have the `AdministratorAccess` policy or specific permissions for `glue:*`, `s3:*`, and `cloudwatch:*`.
- Region: Use `us-east-1` for this lab unless otherwise specified.
## Learning Objectives
- Define Data Quality Rules: Use DQDL to create assertions for completeness, consistency, and integrity.
- Perform Data Profiling: Automatically generate rule recommendations from an existing dataset.
- Integrate Quality Checks: Embed data quality checks within an AWS Glue ETL pipeline.
- Monitor Outcomes: Inspect data quality scores and individual record validation results.
## Architecture Overview
This lab builds a serverless pipeline that validates incoming raw data against a defined ruleset before it is used for downstream processing.
## Cost Estimate
| Service | Estimated Cost (30 mins) | Free Tier Eligibility |
|---|---|---|
| AWS Glue | ~$0.44 (1 DPU x 0.5hr) | No (Job runs only) |
| Amazon S3 | <$0.01 | Yes (5GB Storage) |
| CloudWatch Logs | <$0.01 | Yes (5GB Ingestion) |
| Total | ~$0.45 | — |
## Step-by-Step Instructions

### Step 1: Prepare the S3 Environment
You need a bucket to store the sample raw data and the validation results.
```bash
# Generate a unique bucket name suffix
SUFFIX=$RANDOM
BUCKET_NAME="brainybee-lab-data-quality-$SUFFIX"

# Create the bucket
aws s3 mb s3://$BUCKET_NAME --region us-east-1
```

<details>
<summary>Console alternative</summary>

Navigate to the Amazon S3 console and choose Create bucket. Name it `brainybee-lab-data-quality-<unique-id>` and leave other settings as default.
</details>
### Step 2: Upload Sample "Bad" Data
We will create a CSV file containing typical data quality issues: missing fields, invalid ages, and duplicate records.
```bash
cat <<EOF > user_data.csv
id,name,email,age
1,Alice,alice@example.com,25
2,Bob,,30
3,Charlie,charlie@example.com,invalid_age
4,David,david@example.com,-5
5,Eve,eve@example.com,40
5,Eve,eve@example.com,40
EOF

aws s3 cp user_data.csv s3://$BUCKET_NAME/raw/user_data.csv
```

### Step 3: Run Data Quality Profiling
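Before (or after) uploading, you can preview these quality issues locally with standard Unix tools. This is an optional sanity check, not part of the lab's required steps:

```shell
# Recreate the sample file if it is not already present (same content as above)
cat <<EOF > user_data.csv
id,name,email,age
1,Alice,alice@example.com,25
2,Bob,,30
3,Charlie,charlie@example.com,invalid_age
4,David,david@example.com,-5
5,Eve,eve@example.com,40
5,Eve,eve@example.com,40
EOF

# Data rows, excluding the header (expect 6)
awk 'NR > 1' user_data.csv | wc -l

# Rows with an empty email field (expect 1: Bob)
awk -F, 'NR > 1 && $3 == ""' user_data.csv | wc -l

# Rows whose age is not a non-negative integer (expect 2: invalid_age and -5)
awk -F, 'NR > 1 && $4 !~ /^[0-9]+$/' user_data.csv | wc -l

# Duplicated ids (expect: 5)
awk -F, 'NR > 1 {print $1}' user_data.csv | sort | uniq -d
```

Each of these checks corresponds to one of the DQDL rules applied later in the lab.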
AWS Glue can analyze your data to suggest rules. We will first crawl the data into the Glue Data Catalog.
- Create a Glue Database:

  ```bash
  aws glue create-database --database-input '{"Name": "dq_lab_db"}'
  ```

- Create a Table via Crawler: navigate to AWS Glue Console > Crawlers > Create Crawler.
  - Name: `user_data_crawler`
  - Data Source: `s3://<YOUR_BUCKET_NAME>/raw/`
  - IAM Role: Create a new role (e.g., `AWSGlueServiceRole-DQ`).
  - Target Database: `dq_lab_db`
- Click Run Crawler.
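If you prefer the terminal, the crawler can also be created and started with the AWS CLI. This is a sketch that assumes the IAM role `AWSGlueServiceRole-DQ` already exists and that `BUCKET_NAME` is still set from Step 1:

```shell
# Create the crawler against the raw data prefix (role must already exist)
aws glue create-crawler \
  --name user_data_crawler \
  --role AWSGlueServiceRole-DQ \
  --database-name dq_lab_db \
  --targets "{\"S3Targets\": [{\"Path\": \"s3://$BUCKET_NAME/raw/\"}]}"

# Start it; the user_data table should appear in dq_lab_db when it finishes
aws glue start-crawler --name user_data_crawler
```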
### Step 4: Apply DQDL Ruleset
Once the table exists, we will define the Data Quality Definition Language (DQDL) rules.
> [!TIP]
> DQDL rules are declarative. Instead of writing code, you specify what "good" data looks like.
The Ruleset:

```
Rules = [
    RowCount > 0,
    Completeness "email" > 0.95,
    ColumnDataType "age" = "INTEGER",
    ColumnValues "age" > 0,
    IsUnique "id"
]
```

- In the Glue Console, go to Data Catalog > Tables > user_data.
- Select the Data Quality tab.
- Click Create Data Quality Ruleset.
- Paste the DQDL ruleset above and click Save.
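The same ruleset can be registered from the CLI with `create-data-quality-ruleset`. The ruleset name `user_data_ruleset` below is our own choice, not mandated by the lab:

```shell
aws glue create-data-quality-ruleset \
  --name user_data_ruleset \
  --target-table '{"TableName": "user_data", "DatabaseName": "dq_lab_db"}' \
  --ruleset 'Rules = [
    RowCount > 0,
    Completeness "email" > 0.95,
    ColumnDataType "age" = "INTEGER",
    ColumnValues "age" > 0,
    IsUnique "id"
  ]'
```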
### Step 5: Execute and Inspect the Quality Check
Run the ruleset and observe how the "bad" data impacts the quality score.
- Select the ruleset you created and click Run.
- Wait for the status to change from `Starting` to `Succeeded`.
- Observe the Data Quality Score.
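The evaluation can also be started and polled from the CLI. This sketch assumes the ruleset was saved as `user_data_ruleset` (as in the CLI alternative above) and that the crawler's role can read the bucket; replace the account ID placeholder with your own:

```shell
# Start the evaluation against the cataloged table and capture the run id
RUN_ID=$(aws glue start-data-quality-ruleset-evaluation-run \
  --data-source '{"GlueTable": {"DatabaseName": "dq_lab_db", "TableName": "user_data"}}' \
  --role arn:aws:iam::<ACCOUNT_ID>:role/AWSGlueServiceRole-DQ \
  --ruleset-names user_data_ruleset \
  --query RunId --output text)

# Poll until Status is SUCCEEDED, then look up the result ids it reports
aws glue get-data-quality-ruleset-evaluation-run --run-id "$RUN_ID"
```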
## Checkpoints
| Verification Step | Expected Result |
|---|---|
| Check RowCount rule | PASSED (6 rows detected) |
| Check Completeness "email" | FAILED (1 of 6 emails missing, so completeness ≈ 0.83, below the 0.95 threshold) |
| Check ColumnDataType "age" | FAILED ("invalid_age" is a string, not an integer) |
| Check IsUnique "id" | FAILED (duplicate id 5 detected) |
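The Completeness arithmetic can be reproduced locally. This is an optional illustration using awk on the sample data from Step 2; the 0.95 threshold comes from the ruleset:

```shell
# Completeness of "email" = non-empty values / total data rows
awk -F, 'NR > 1 { total++; if ($3 != "") ok++ }
         END { printf "%.3f\n", ok / total }' <<EOF
id,name,email,age
1,Alice,alice@example.com,25
2,Bob,,30
3,Charlie,charlie@example.com,invalid_age
4,David,david@example.com,-5
5,Eve,eve@example.com,40
5,Eve,eve@example.com,40
EOF
# prints 0.833 -- below 0.95, so the Completeness "email" > 0.95 rule fails
```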
## Concept Review
AWS Glue Data Quality is built on the Deequ framework, an open-source library developed by Amazon.
```
Raw Data Ingestion ──▶ AWS Glue DQ (DQDL Rules) ──▶ Verification Result (Pass/Fail Score)
```
### Data Quality Dimensions
| Dimension | Description | DQDL Example |
|---|---|---|
| Completeness | Checks for null or empty values | Completeness "column" > 0.9 |
| Consistency | Checks if data types match expectations | ColumnDataType "column" = "DATE" |
| Accuracy | Checks if values fall within range | ColumnValues "age" between 1 and 120 |
| Integrity | Checks for uniqueness and primary keys | IsUnique "id" |
## Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| Rule Run Fails | IAM Role lacks S3 permissions | Ensure AWSGlueServiceRole has s3:GetObject and s3:PutObject for the bucket. |
| 0% Data Quality Score | Schema Mismatch | Verify the crawler correctly identified column headers in the CSV. |
| Connection Timeout | VPC Configuration | If running in a private subnet, ensure S3 Gateway Endpoints are configured. |
## Challenge
Task: Modify the ETL pipeline.
- Update the DQDL ruleset to include a check that ensures the `name` column has no leading or trailing whitespace (Hint: use `ColumnLength`).
- Configure the Glue Job to fail the run if the Data Quality Score is below 80%.
<details>
<summary>Show Hint</summary>

In the Glue Visual Job editor, use the "Evaluate Data Quality" transform. Under "Action", select "Fail job when data quality fails".
</details>
## Teardown
To avoid ongoing costs, delete the resources created in this lab:
```bash
# Delete the S3 Bucket and all contents
aws s3 rb s3://$BUCKET_NAME --force

# Delete the Glue Database
aws glue delete-database --name dq_lab_db

# Delete the Crawler
aws glue delete-crawler --name user_data_crawler
```