Mastering Data Quality Rules: AWS Glue Data Quality & DataBrew
Define data quality rules (for example, DataBrew)
Learning Objectives
After studying this guide, you should be able to:
- Define data quality rules using the Data Quality Definition Language (DQDL).
- Distinguish between visual data profiling in AWS Glue DataBrew and programmatic checks in AWS Glue Data Quality.
- Construct composite rules using logical operators to validate complex datasets.
- Interpret row-level outcomes to identify and filter "bad" data during ETL processes.
- Implement automated data quality checks within AWS Glue Studio and via infrastructure as code.
Key Terms & Glossary
- DQDL (Data Quality Definition Language): A case-sensitive domain-specific language used to write data quality rules in AWS Glue.
- Ruleset: A collection of one or more DQDL rules grouped together to evaluate a dataset.
- Data Profiling: The process of analyzing a dataset to generate statistics (like mean, null count, or correlation) used to recommend quality rules.
- In-Transit Validation: Running data quality checks during the execution of an ETL job before the data is written to a target.
- rowLevelOutcomes: An output option in Glue Studio that provides detailed pass/fail status for every individual record in a dataset.
The "Big Idea"
Data quality is the "immune system" of a data pipeline. Without it, your data lake becomes a "data swamp" where decisions are based on incomplete, inaccurate, or stale information. AWS provides two primary tools: DataBrew (for visual, profile-based cleaning) and Glue Data Quality (for rule-based, automated validation). By defining precise rules in DQDL, engineers can proactively intercept bad data before it pollutes downstream analytics.
Formula / Concept Box
| Concept | Syntax / Structure |
|---|---|
| DQDL Basic Format | `<RuleType> <Parameter> <Expression>` |
| Ruleset Container | `Rules = [ Rule1, Rule2 ]` |
| Composite Logic | `(Rule1) and (Rule2)` or `(Rule1) or (Rule2)` |
| Row-Level Result | `DataQualityEvaluationResult` (Passed/Failed) |
[!IMPORTANT] DQDL is case-sensitive.
`iscomplete` is invalid; it must be `IsComplete`.
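To illustrate the basic format from the concept box, here is a minimal ruleset; the column names are hypothetical:

```
Rules = [
    IsComplete "order_id",
    Completeness "email" > 0.95
]
```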
Hierarchical Outline
- AWS Glue DataBrew
  - Visual Profiling: Automated discovery of data patterns.
  - Rule Recommendations: Suggested checks based on the dataset profile.
  - Customized Rules: Manual enforcement of business-specific constraints.
- AWS Glue Data Quality (DQDL)
  - Core Rule Types:
    - Completeness: `IsComplete`, `Completeness` (checking for NULLs).
    - Uniqueness: `IsUnique`, `Uniqueness`, `IsPrimaryKey`.
    - Integrity: `ReferentialIntegrity`, `DatasetMatch`, `SchemaMatch`.
    - Statistical: `Mean`, `StandardDeviation`, `RowCount`.
  - Custom Logic: `CustomSql` for requirements not covered by built-in types.
- Integration & Automation
  - Glue Studio: Adding data quality nodes to visual ETL jobs.
  - API/SDK: Programmatic access for custom applications.
  - IaC: Deploying rulesets via AWS CloudFormation.
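The IaC path above can be sketched as a CloudFormation resource. The ruleset name, database, and table below are placeholders, so treat this as a template rather than a drop-in snippet:

```yaml
Resources:
  OrdersRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: orders-ruleset            # placeholder name
      Ruleset: >-
        Rules = [
          IsComplete "order_id",
          ColumnValues "fare_amount" between 1 and 100
        ]
      TargetTable:
        DatabaseName: sales_db        # hypothetical Glue database
        TableName: orders             # hypothetical Glue table
```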
Visual Anchors
Data Quality Workflow in ETL
DQDL Syntax Anatomy
```latex
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, rounded corners, align=center,
                       inner sep=5pt, fill=blue!10}]
  \node (type) {\textbf{RuleType} \\ e.g., IsUnique};
  \node (param) [right of=type, xshift=1.5cm] {\textbf{Parameter} \\ e.g., "user\_id"};
  \node (expr) [right of=param, xshift=1.5cm] {\textbf{Expression} \\ e.g., > 0.95};
  \draw[->, thick] (type) -- (param);
  \draw[->, thick] (param) -- (expr);
  \node[draw=none, fill=none, below of=param, yshift=1cm]
    {\small \textit{Syntax: <RuleType> <Parameter> <Expression>}};
\end{tikzpicture}
```
Definition-Example Pairs
- ColumnCorrelation: Checks the statistical relationship between two columns.
  - Example: Verifying that `Unit_Price` and `Total_Cost` have a high correlation to ensure no calculation errors occurred.
- DataFreshness: Compares a date column against the current time.
  - Example: Ensuring that the `last_updated` timestamp is no more than 24 hours old for a daily sales report.
- IsPrimaryKey: A composite check for both uniqueness and completeness.
  - Example: Validating that `transaction_id` is never null and never repeats in the database.
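The three rule types above could be written in DQDL as follows; the column names come from the examples, and the thresholds are illustrative:

```
Rules = [
    ColumnCorrelation "Unit_Price" "Total_Cost" > 0.8,
    DataFreshness "last_updated" <= 24 hours,
    IsPrimaryKey "transaction_id"
]
```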
Worked Examples
Example 1: Validating a Taxi Dataset
Scenario: You need to ensure that 90% of taxi rides have passengers and that the trip ID is unique and complete.
DQDL Solution:

```
Rules = [
    Completeness "passenger_count" > 0.90,
    (IsComplete "trip_id") and (IsUnique "trip_id")
]
```

Example 2: Range and Custom SQL
Scenario: You need to check that `fare_amount` is between 1 and 100, and use a custom query for a specific business-logic check.
DQDL Solution:

```
Rules = [
    ColumnValues "fare_amount" between 1 and 100,
    CustomSql "SELECT count(*) FROM primary WHERE fare_amount < 0" = 0
]
```

Comparison Tables
| Feature | AWS Glue DataBrew | AWS Glue Data Quality |
|---|---|---|
| Primary Interface | Visual/No-code Console | Code-based (DQDL) |
| Best For | Data Analysts & Scientists | Data Engineers & Developers |
| Integration | Profile Jobs & Recipes | ETL Jobs & Glue Data Catalog |
| Complexity | Drag-and-drop transformations | Programmatic, SQL-like rules |
| Scale | Best for interactive cleaning | Best for high-volume automated pipelines |
Checkpoint Questions
- What is the main difference between `IsComplete` and `Completeness`? (Answer: `IsComplete` checks that 100% of the data is present, while `Completeness` lets you set a percentage threshold, e.g., > 0.95.)
- True or False: Logical operators `and` and `or` can be combined in a single DQDL expression. (Answer: False, based on current documentation limitations mentioned in this study guide.)
- What are the two output options provided by the Data Quality node in Glue Studio? (Answer: `rowLevelOutcomes` and `ruleOutcomes`.)
- Why is the `rowLevelOutcomes` option useful for downstream processing? (Answer: It adds diagnostic columns to each record, allowing you to filter out the failed rows while keeping the passing ones.)
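To make that last answer concrete, here is a small plain-Python sketch of the idea behind row-level filtering. The `DataQualityEvaluationResult` column name matches the Glue Studio node's output, but the records themselves are invented:

```python
# Simulate records as they might look after a Data Quality node has
# appended the DataQualityEvaluationResult diagnostic column.
records = [
    {"trip_id": "t-001", "passenger_count": 2, "DataQualityEvaluationResult": "Passed"},
    {"trip_id": "t-002", "passenger_count": 0, "DataQualityEvaluationResult": "Failed"},
    {"trip_id": "t-003", "passenger_count": 1, "DataQualityEvaluationResult": "Passed"},
]

# Route passing rows onward and quarantine failing rows for inspection,
# mirroring the branch you would build from a rowLevelOutcomes output.
good_rows = [r for r in records if r["DataQualityEvaluationResult"] == "Passed"]
bad_rows = [r for r in records if r["DataQualityEvaluationResult"] == "Failed"]
```

In a real ETL job this branch would feed `good_rows` to the target and write `bad_rows` to a quarantine location for review.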
Muddy Points & Cross-Refs
- Case Sensitivity: One of the most common errors is writing `isUnique` instead of `IsUnique`. Always double-check the documentation for the exact casing.
- Logical Operator Limitations: While you can use `and` or `or`, you cannot nest them (e.g., `(A and B) or C`). For complex logic, consider using multiple separate rules or a `CustomSql` rule.
- Further Study: For more on deploying these rules via code, look into the `AWS::Glue::DataQualityRuleset` resource in the CloudFormation documentation.