# Automating Data Quality Validation with AWS Glue and DQDL
Maintaining high-quality, reliable data is essential for accurate insights. As a data engineer, you must ensure the integrity, completeness, and accuracy of the data flowing through your pipelines. This lab demonstrates how to use AWS Glue Data Quality and the Data Quality Definition Language (DQDL) to profile, validate, and monitor your data at scale.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges in your AWS account.
## Prerequisites
- AWS Account: Access to an AWS account with permissions for AWS Glue, S3, and IAM.
- AWS CLI: Installed and configured with appropriate credentials.
- IAM Permissions: Your user must have the `AdministratorAccess` policy or specific permissions for `glue:*`, `s3:*`, and `cloudwatch:*`.
- Region: Use `us-east-1` for this lab unless otherwise specified.
## Learning Objectives
- Define Data Quality Rules: Use DQDL to create assertions for completeness, consistency, and integrity.
- Perform Data Profiling: Automatically generate rule recommendations from an existing dataset.
- Integrate Quality Checks: Embed data quality checks within an AWS Glue ETL pipeline.
- Monitor Outcomes: Inspect data quality scores and individual record validation results.
## Architecture Overview
This lab builds a serverless pipeline that validates incoming raw data against a defined ruleset before it is used for downstream processing.
## Cost Estimate
| Service | Estimated Cost (30 mins) | Free Tier Eligibility |
|---|---|---|
| AWS Glue | ~$0.44 (1 DPU x 0.5hr) | No (Job runs only) |
| Amazon S3 | <$0.01 | Yes (5GB Storage) |
| CloudWatch Logs | <$0.01 | Yes (5GB Ingestion) |
| Total | ~$0.45 | — |
## Step-by-Step Instructions

### Step 1: Prepare the S3 Environment
You need a bucket to store the sample raw data and the validation results.
```bash
# Generate a unique bucket name suffix
SUFFIX=$RANDOM
BUCKET_NAME="brainybee-lab-data-quality-$SUFFIX"

# Create the bucket
aws s3 mb s3://$BUCKET_NAME --region us-east-1
```

<details>
<summary>Console alternative</summary>

Navigate to the Amazon S3 console and choose Create bucket. Name it `brainybee-lab-data-quality-<unique-id>` and leave other settings as default.
</details>
### Step 2: Upload Sample "Bad" Data
We will create a CSV file containing typical data quality issues: missing fields, invalid ages, and duplicate records.
```bash
cat <<EOF > user_data.csv
id,name,email,age
1,Alice,alice@example.com,25
2,Bob,,30
3,Charlie,charlie@example.com,invalid_age
4,David,david@example.com,-5
5,Eve,eve@example.com,40
5,Eve,eve@example.com,40
EOF

aws s3 cp user_data.csv s3://$BUCKET_NAME/raw/user_data.csv
```

### Step 3: Run Data Quality Profiling
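Before (or after) uploading, you can preview these quality issues locally with standard Unix tools. This is an optional sanity check, not part of the lab's required steps:

```shell
# Recreate the sample file if it is not already present (same content as above)
cat <<EOF > user_data.csv
id,name,email,age
1,Alice,alice@example.com,25
2,Bob,,30
3,Charlie,charlie@example.com,invalid_age
4,David,david@example.com,-5
5,Eve,eve@example.com,40
5,Eve,eve@example.com,40
EOF

# Data rows, excluding the header (expect 6)
awk 'NR > 1' user_data.csv | wc -l

# Rows with an empty email field (expect 1: Bob)
awk -F, 'NR > 1 && $3 == ""' user_data.csv | wc -l

# Rows whose age is not a non-negative integer (expect 2: invalid_age and -5)
awk -F, 'NR > 1 && $4 !~ /^[0-9]+$/' user_data.csv | wc -l

# Duplicated ids (expect: 5)
awk -F, 'NR > 1 {print $1}' user_data.csv | sort | uniq -d
```

Each of these checks corresponds to one of the DQDL rules applied later in the lab.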
AWS Glue can analyze your data to suggest rules. We will first crawl the data into the Glue Data Catalog.
- Create a Glue Database:

  ```bash
  aws glue create-database --database-input '{"Name": "dq_lab_db"}'
  ```

- Create a Table via Crawler: navigate to AWS Glue Console > Crawlers > Create Crawler.
  - Name: `user_data_crawler`
  - Data Source: `s3://<YOUR_BUCKET_NAME>/raw/`
  - IAM Role: Create a new role (e.g., `AWSGlueServiceRole-DQ`).
  - Target Database: `dq_lab_db`
- Click Run Crawler.
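If you prefer the terminal, the crawler can also be created and started with the AWS CLI. This is a sketch that assumes the IAM role `AWSGlueServiceRole-DQ` already exists and that `BUCKET_NAME` is still set from Step 1:

```shell
# Create the crawler against the raw data prefix (role must already exist)
aws glue create-crawler \
  --name user_data_crawler \
  --role AWSGlueServiceRole-DQ \
  --database-name dq_lab_db \
  --targets "{\"S3Targets\": [{\"Path\": \"s3://$BUCKET_NAME/raw/\"}]}"

# Start it; the user_data table should appear in dq_lab_db when it finishes
aws glue start-crawler --name user_data_crawler
```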
### Step 4: Apply DQDL Ruleset
Once the table exists, we will define the Data Quality Definition Language (DQDL) rules.
> [!TIP]
> DQDL rules are declarative. Instead of writing code, you specify what "good" data looks like.
The Ruleset:

```
Rules = [
    RowCount > 0,
    Completeness "email" > 0.95,
    ColumnDataType "age" = "INTEGER",
    ColumnValues "age" > 0,
    IsUnique "id"
]
```

- In the Glue Console, go to Data Catalog > Tables > user_data.
- Select the Data Quality tab.
- Click Create Data Quality Ruleset.
- Paste the DQDL ruleset above and click Save.
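The same ruleset can be registered from the CLI with `create-data-quality-ruleset`. The ruleset name `user_data_ruleset` below is our own choice, not mandated by the lab:

```shell
aws glue create-data-quality-ruleset \
  --name user_data_ruleset \
  --target-table '{"TableName": "user_data", "DatabaseName": "dq_lab_db"}' \
  --ruleset 'Rules = [
    RowCount > 0,
    Completeness "email" > 0.95,
    ColumnDataType "age" = "INTEGER",
    ColumnValues "age" > 0,
    IsUnique "id"
  ]'
```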
### Step 5: Execute and Inspect the Quality Check
Run the ruleset and observe how the "bad" data impacts the quality score.
- Select the ruleset you created and click Run.
- Wait for the status to change from `Starting` to `Succeeded`.
- Observe the Data Quality Score.
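The evaluation can also be started and polled from the CLI. This sketch assumes the ruleset was saved as `user_data_ruleset` (as in the CLI alternative above) and that the crawler's role can read the bucket; replace the account ID placeholder with your own:

```shell
# Start the evaluation against the cataloged table and capture the run id
RUN_ID=$(aws glue start-data-quality-ruleset-evaluation-run \
  --data-source '{"GlueTable": {"DatabaseName": "dq_lab_db", "TableName": "user_data"}}' \
  --role arn:aws:iam::<ACCOUNT_ID>:role/AWSGlueServiceRole-DQ \
  --ruleset-names user_data_ruleset \
  --query RunId --output text)

# Poll until Status is SUCCEEDED, then look up the result ids it reports
aws glue get-data-quality-ruleset-evaluation-run --run-id "$RUN_ID"
```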
## Checkpoints
| Verification Step | Expected Result |
|---|---|
| Check RowCount rule | PASSED (6 rows detected) |
| Check Completeness "email" | FAILED (1 of 6 emails missing, so completeness ≈ 0.83, below the 0.95 threshold) |
| Check ColumnDataType "age" | FAILED ("invalid_age" is a string, not an integer) |
| Check IsUnique "id" | FAILED (duplicate id 5 detected) |
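The Completeness arithmetic can be reproduced locally. This is an optional illustration using awk on the sample data from Step 2; the 0.95 threshold comes from the ruleset:

```shell
# Completeness of "email" = non-empty values / total data rows
awk -F, 'NR > 1 { total++; if ($3 != "") ok++ }
         END { printf "%.3f\n", ok / total }' <<EOF
id,name,email,age
1,Alice,alice@example.com,25
2,Bob,,30
3,Charlie,charlie@example.com,invalid_age
4,David,david@example.com,-5
5,Eve,eve@example.com,40
5,Eve,eve@example.com,40
EOF
# prints 0.833 -- below 0.95, so the Completeness "email" > 0.95 rule fails
```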
## Concept Review
AWS Glue Data Quality is built on the Deequ framework, an open-source library developed by Amazon.
```
Raw Data Ingestion ──▶ AWS Glue DQ (DQDL Rules) ──▶ Verification Result (Pass/Fail Score)
```
### Data Quality Dimensions
| Dimension | Description | DQDL Example |
|---|---|---|
| Completeness | Checks for null or empty values | Completeness "column" > 0.9 |
| Consistency | Checks if data types match expectations | ColumnDataType "column" = "DATE" |
| Accuracy | Checks if values fall within range | ColumnValues "age" between 1 and 120 |
| Integrity | Checks for uniqueness and primary keys | IsUnique "id" |
## Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| Rule Run Fails | IAM Role lacks S3 permissions | Ensure AWSGlueServiceRole has s3:GetObject and s3:PutObject for the bucket. |
| 0% Data Quality Score | Schema Mismatch | Verify the crawler correctly identified column headers in the CSV. |
| Connection Timeout | VPC Configuration | If running in a private subnet, ensure S3 Gateway Endpoints are configured. |
## Challenge
Task: Modify the ETL pipeline.
- Update the DQDL ruleset to include a check that ensures the `name` column has no leading or trailing whitespace (Hint: use `ColumnLength`).
- Configure the Glue Job to fail the run if the Data Quality Score is below 80%.
<details>
<summary>Show Hint</summary>

In the Glue Visual Job editor, use the "Evaluate Data Quality" transform. Under "Action", select "Fail job when data quality fails".
</details>
## Teardown
To avoid ongoing costs, delete the resources created in this lab:
```bash
# Delete the S3 Bucket and all contents
aws s3 rb s3://$BUCKET_NAME --force

# Delete the Glue Database
aws glue delete-database --name dq_lab_db

# Delete the Crawler
aws glue delete-crawler --name user_data_crawler
```