Mastering Data Quality Rules: AWS Glue Data Quality & DataBrew
Define data quality rules (for example, DataBrew)
Learning Objectives
After studying this guide, you should be able to:
- Define data quality rules using the Data Quality Definition Language (DQDL).
- Distinguish between visual data profiling in AWS Glue DataBrew and programmatic checks in AWS Glue Data Quality.
- Construct composite rules using logical operators to validate complex datasets.
- Interpret row-level outcomes to identify and filter "bad" data during ETL processes.
- Implement automated data quality checks within AWS Glue Studio and via infrastructure as code.
Key Terms & Glossary
- DQDL (Data Quality Definition Language): A case-sensitive domain-specific language used to write data quality rules in AWS Glue.
- Ruleset: A collection of one or more DQDL rules grouped together to evaluate a dataset.
- Data Profiling: The process of analyzing a dataset to generate statistics (like mean, null count, or correlation) used to recommend quality rules.
- In-Transit Validation: Running data quality checks during the execution of an ETL job before the data is written to a target.
- rowLevelOutcomes: An output option in Glue Studio that provides detailed pass/fail status for every individual record in a dataset.
The "Big Idea"
Data quality is the "immune system" of a data pipeline. Without it, your data lake becomes a "data swamp" where decisions are based on incomplete, inaccurate, or stale information. AWS provides two primary tools: DataBrew (for visual, profile-based cleaning) and Glue Data Quality (for rule-based, automated validation). By defining precise rules in DQDL, engineers can proactively intercept bad data before it pollutes downstream analytics.
Formula / Concept Box
| Concept | Syntax / Structure |
|---|---|
| DQDL Basic Format | `<RuleType> <Parameter> <Expression>` |
| Ruleset Container | `Rules = [ Rule1, Rule2 ]` |
| Composite Logic | `(Rule1) and (Rule2)` or `(Rule1) or (Rule2)` |
| Row-Level Result | `DataQualityEvaluationResult` (Passed/Failed) |
[!IMPORTANT] DQDL is case-sensitive.
`iscomplete` is invalid; it must be `IsComplete`.
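To illustrate the basic format from the concept box, here is a minimal ruleset; the column names are hypothetical:

```
Rules = [
    IsComplete "order_id",
    Completeness "email" > 0.95
]
```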
Hierarchical Outline
- AWS Glue DataBrew
  - Visual Profiling: Automated discovery of data patterns.
  - Rule Recommendations: Suggested checks based on the dataset profile.
  - Customized Rules: Manual enforcement of business-specific constraints.
- AWS Glue Data Quality (DQDL)
  - Core Rule Types:
    - Completeness: `IsComplete`, `Completeness` (checking for NULLs).
    - Uniqueness: `IsUnique`, `Uniqueness`, `IsPrimaryKey`.
    - Integrity: `ReferentialIntegrity`, `DatasetMatch`, `SchemaMatch`.
    - Statistical: `Mean`, `StandardDeviation`, `RowCount`.
  - Custom Logic: `CustomSql` for requirements not covered by built-in types.
- Integration & Automation
  - Glue Studio: Adding data quality nodes to visual ETL jobs.
  - API/SDK: Programmatic access for custom applications.
  - IaC: Deploying rulesets via AWS CloudFormation.
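The IaC path above can be sketched as a CloudFormation resource. The ruleset name, database, and table below are placeholders, so treat this as a template rather than a drop-in snippet:

```yaml
Resources:
  OrdersRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: orders-ruleset            # placeholder name
      Ruleset: >-
        Rules = [
          IsComplete "order_id",
          ColumnValues "fare_amount" between 1 and 100
        ]
      TargetTable:
        DatabaseName: sales_db        # hypothetical Glue database
        TableName: orders             # hypothetical Glue table
```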
Visual Anchors
Data Quality Workflow in ETL
DQDL Syntax Anatomy
```latex
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, rounded corners, align=center,
                       inner sep=5pt, fill=blue!10}]
  \node (type) {\textbf{RuleType} \\ e.g., IsUnique};
  \node (param) [right of=type, xshift=1.5cm] {\textbf{Parameter} \\ e.g., "user\_id"};
  \node (expr) [right of=param, xshift=1.5cm] {\textbf{Expression} \\ e.g., > 0.95};
  \draw[->, thick] (type) -- (param);
  \draw[->, thick] (param) -- (expr);
  \node[draw=none, fill=none, below of=param, yshift=1cm]
    {\small \textit{Syntax: <RuleType> <Parameter> <Expression>}};
\end{tikzpicture}
```
Definition-Example Pairs
- ColumnCorrelation: Checks the statistical relationship between two columns.
  - Example: Verifying that `Unit_Price` and `Total_Cost` have a high correlation to ensure no calculation errors occurred.
- DataFreshness: Compares a date column against the current time.
  - Example: Ensuring that the `last_updated` timestamp is no more than 24 hours old for a daily sales report.
- IsPrimaryKey: A composite check for both uniqueness and completeness.
  - Example: Validating that `transaction_id` is never null and never repeats in the database.
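The three rule types above could be written in DQDL as follows; the column names come from the examples, and the thresholds are illustrative:

```
Rules = [
    ColumnCorrelation "Unit_Price" "Total_Cost" > 0.8,
    DataFreshness "last_updated" <= 24 hours,
    IsPrimaryKey "transaction_id"
]
```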
Worked Examples
Example 1: Validating a Taxi Dataset
Scenario: You need to ensure that 90% of taxi rides have passengers and that the trip ID is unique and complete.
DQDL Solution:

```
Rules = [
    Completeness "passenger_count" > 0.90,
    (IsComplete "trip_id") and (IsUnique "trip_id")
]
```

Example 2: Range and Custom SQL
Scenario: You need to check that `fare_amount` is between 1 and 100, and use a custom query for a specific business-logic check.
DQDL Solution:

```
Rules = [
    ColumnValues "fare_amount" between 1 and 100,
    CustomSql "SELECT count(*) FROM primary WHERE fare_amount < 0" = 0
]
```

Comparison Tables
| Feature | AWS Glue DataBrew | AWS Glue Data Quality |
|---|---|---|
| Primary Interface | Visual/No-code Console | Code-based (DQDL) |
| Best For | Data Analysts & Scientists | Data Engineers & Developers |
| Integration | Profile Jobs & Recipes | ETL Jobs & Glue Data Catalog |
| Complexity | Drag-and-drop transformations | Programmatic, SQL-like rules |
| Scale | Best for interactive cleaning | Best for high-volume automated pipelines |
Checkpoint Questions
- What is the main difference between `IsComplete` and `Completeness`? (Answer: `IsComplete` checks that 100% of the data is present, while `Completeness` lets you set a percentage threshold, e.g., > 0.95.)
- True or False: Logical operators `and` and `or` can be combined in a single DQDL expression. (Answer: False, based on current documentation limitations mentioned in this study guide.)
- What are the two output options provided by the Data Quality node in Glue Studio? (Answer: `rowLevelOutcomes` and `ruleOutcomes`.)
- Why is the `rowLevelOutcomes` option useful for downstream processing? (Answer: It adds diagnostic columns to each record, allowing you to filter out the failed rows while keeping the passing ones.)
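To make that last answer concrete, here is a small plain-Python sketch of the idea behind row-level filtering. The `DataQualityEvaluationResult` column name matches the Glue Studio node's output, but the records themselves are invented:

```python
# Simulate records as they might look after a Data Quality node has
# appended the DataQualityEvaluationResult diagnostic column.
records = [
    {"trip_id": "t-001", "passenger_count": 2, "DataQualityEvaluationResult": "Passed"},
    {"trip_id": "t-002", "passenger_count": 0, "DataQualityEvaluationResult": "Failed"},
    {"trip_id": "t-003", "passenger_count": 1, "DataQualityEvaluationResult": "Passed"},
]

# Route passing rows onward and quarantine failing rows for inspection,
# mirroring the branch you would build from a rowLevelOutcomes output.
good_rows = [r for r in records if r["DataQualityEvaluationResult"] == "Passed"]
bad_rows = [r for r in records if r["DataQualityEvaluationResult"] == "Failed"]
```

In a real ETL job this branch would feed `good_rows` to the target and write `bad_rows` to a quarantine location for review.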
Muddy Points & Cross-Refs
- Case Sensitivity: One of the most common errors is writing `isUnique` instead of `IsUnique`. Always double-check the documentation for the exact casing.
- Logical Operator Limitations: While you can use `and` or `or`, you cannot nest them (e.g., `(A and B) or C`). For complex logic, consider using multiple separate rules or a `CustomSql` rule.
- Further Study: For more on deploying these rules via code, look into the `AWS::Glue::DataQualityRuleset` resource in the CloudFormation documentation.