Data Quality Engineering on AWS: Checks and Validation
Run data quality checks while processing the data (for example, checking for empty fields)
[!IMPORTANT] Data quality is not a one-time event but a continuous process. Running data quality checks "in-transit" prevents corrupted or incomplete data from reaching downstream data lakes or warehouses, where the cost of remediation is significantly higher.
Learning Objectives
After studying this guide, you should be able to:
- Define and implement DQDL (Data Quality Definition Language) rules within AWS Glue jobs.
- Differentiate between row-level and rule-level outcomes in AWS Glue Data Quality.
- Utilize AWS Glue DataBrew for visual data profiling and cleansing.
- Explain the role of the Amazon Deequ library for large-scale Spark-based data validation.
- Configure automated alerts and retry mechanisms for data quality failures using AWS Step Functions and SNS.
Key Terms & Glossary
- DQDL (Data Quality Definition Language): A domain-specific language used to define rules for evaluating the quality of data in AWS Glue.
- Completeness: A metric measuring the percentage of non-null or non-blank values in a specific column.
- Data Profiling: The process of examining source data to understand its structure, content, and quality statistics (e.g., mean, max, null count).
- Deequ: An open-source library built on top of Apache Spark by Amazon for defining "unit tests for data."
- Data Skew: A condition where data is unevenly distributed across partitions, often leading to performance bottlenecks in distributed systems like EMR or Glue.
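The Completeness metric above is simple enough to compute by hand. A minimal plain-Python sketch (the `completeness` helper is hypothetical, not a Glue API):

```python
def completeness(values):
    """Fraction of non-null, non-blank values in a column (0.0 to 1.0)."""
    if not values:
        return 0.0
    valid = [v for v in values if v is not None and str(v).strip() != ""]
    return len(valid) / len(values)

# 3 of 4 values are populated
print(completeness(["a", None, "b", "c"]))  # 0.75
```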
The "Big Idea"
In modern data engineering, Data Quality in Transit is the golden standard. Instead of ingesting everything and cleaning it later (which leads to "data swamps"), data engineers insert validation gates directly into the ETL pipeline. This ensures that only data meeting specific "contracts" (e.g., no empty IDs, valid date ranges) enters the curated zone. If data fails, it is routed to a Dead Letter Queue (DLQ) for manual inspection, preserving the integrity of the primary data lake.
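The gate pattern above can be sketched without any AWS services: records that satisfy every rule flow to the curated zone, the rest go to quarantine (a DLQ stand-in). The rule predicates and record shape here are hypothetical:

```python
def validation_gate(records, rules):
    """Route each record to 'curated' or 'quarantine' based on rule checks.

    `rules` is a list of predicates; a record passes only if all rules hold.
    Local illustration of the gate pattern, not an AWS API.
    """
    curated, quarantine = [], []
    for record in records:
        target = curated if all(rule(record) for rule in rules) else quarantine
        target.append(record)
    return curated, quarantine

rules = [
    lambda r: r.get("id") not in (None, ""),     # no empty IDs
    lambda r: r.get("passenger_count", 0) >= 1,  # at least one passenger
]
records = [{"id": "a1", "passenger_count": 2}, {"id": "", "passenger_count": 1}]
curated, quarantine = validation_gate(records, rules)
print(len(curated), len(quarantine))  # 1 1
```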
Formula / Concept Box
| Concept | Implementation Detail |
|---|---|
| DQDL Rule Format | RuleType "ColumnName" <Operator> <Value> (e.g., IsComplete "user_id") |
| Glue DQ Outputs | `rowLevelOutcomes`: appends pass/fail status to every record; `ruleOutcomes`: summarizes which rules passed/failed for the whole set |
| Logical Operators | Rules can be combined using AND and OR (e.g., (IsComplete "id") AND (IsUnique "id")) |
| Deequ Metric Types | AnalysisRunner (Computes metrics) vs VerificationSuite (Checks metrics against thresholds) |
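The two output shapes in the table can be illustrated with a small local simulation. This is plain Python, not the Glue SDK; the result-column name is modeled on what Glue's row-level output appends:

```python
def evaluate(records, rules):
    """Mimic Glue DQ outputs: per-record pass/fail plus a per-rule summary.

    `rules` maps rule name -> predicate. Local illustration only; a real
    Glue job emits these via its data-quality evaluation transform.
    """
    row_level = []
    rule_outcomes = {name: True for name in rules}
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        status = "Failed" if failed else "Passed"
        row_level.append({**record, "DataQualityEvaluationResult": status})
        for name in failed:
            rule_outcomes[name] = False
    return row_level, rule_outcomes

rules = {"IsComplete user_id": lambda r: r.get("user_id") is not None}
rows, summary = evaluate([{"user_id": 1}, {"user_id": None}], rules)
print(rows[1]["DataQualityEvaluationResult"], summary)
```

Note how one failing record marks only its own row in `row_level`, but flips the whole rule to failed in the summary.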
Hierarchical Outline
- I. AWS Glue Data Quality
- DQDL Implementation: Writing declarative rules for schema, row counts, and column values.
- In-Transit Validation: Integrating checks directly into Glue Studio or Spark scripts.
- Output Nodes: Utilizing `rowLevelOutcomes` to filter out bad records.
- II. Visual Data Cleansing (DataBrew)
- Profiling: Generating over 40 metadata statistics automatically.
- Cleansing Recipes: Applying reusable transformations (e.g., filling missing values with the mean).
- III. Advanced Validation (Deequ & SageMaker)
- Deequ for Spark: Programmatic validation using `VerificationSuite` for complex distributions.
- SageMaker Model Monitor: Detecting Data Drift by comparing production data against a baseline.
- IV. Automation & Error Handling
- Step Functions: Retrying failed tasks or routing to DLQ.
- CloudWatch Alarms: Triggering notifications via SNS when quality scores drop below a threshold.
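The outline's DataBrew cleansing example (filling missing values with the column mean) can be sketched in plain Python; `fill_missing_with_mean` is a hypothetical helper mirroring the recipe step, not a DataBrew API:

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values.

    Mirrors DataBrew's 'fill with mean' recipe step in plain Python.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return values
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# The None is replaced by mean(10, 20, 30) = 20
print(fill_missing_with_mean([10, None, 20, 30]))
```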
Visual Anchors
Data Quality Pipeline Flow
Data Validation Gate Logic
```latex
\begin{tikzpicture}[node distance=2cm, auto]
  \draw[thick, fill=blue!10] (0,0) rectangle (3,1.5) node[pos=.5] {Incoming Data};
  \draw[->, thick] (3,0.75) -- (5,0.75);
  \draw[thick, fill=red!10] (5,0.75) circle (1cm) node {DQ Gate};
  \draw[->, thick] (6,0.75) -- (8,1.5) node[right] {Verified (Pass)};
  \draw[->, thick] (6,0.75) -- (8,0) node[right] {Quarantine (Fail)};
  \draw[dashed] (5,-0.5) -- (5,-1.5) node[below] {DQDL Rules Engine};
\end{tikzpicture}
```
Definition-Example Pairs
- Rule: Completeness
- Definition: Checks if 100% of the data in a column is non-null.
- Example: Ensuring the `order_id` field is never blank in a transaction dataset.
- Rule: ColumnCorrelation
- Definition: Measures the statistical relationship between two numerical columns.
- Example: Checking if `shipping_weight` and `shipping_cost` have a strong positive correlation to detect billing anomalies.
- Rule: DataFreshness
- Definition: Evaluates the difference between the current time and the max value in a date column.
- Example: Failing a job if the newest record in the `last_updated` column is older than 24 hours.
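The DataFreshness rule's logic is easy to reproduce locally. A hedged sketch of the check (the `is_fresh` helper is hypothetical, not the DQDL engine):

```python
from datetime import datetime, timedelta, timezone

def is_fresh(timestamps, max_age_hours=24, now=None):
    """DataFreshness-style check: newest timestamp must be within max_age_hours."""
    now = now or datetime.now(timezone.utc)
    if not timestamps:
        return False  # an empty column is treated as stale
    return (now - max(timestamps)) <= timedelta(hours=max_age_hours)

now = datetime(2024, 1, 2, tzinfo=timezone.utc)
stale = [datetime(2023, 12, 30, tzinfo=timezone.utc)]  # 3 days old
fresh = [datetime(2024, 1, 1, 12, tzinfo=timezone.utc)]  # 12 hours old
print(is_fresh(stale, now=now), is_fresh(fresh, now=now))  # False True
```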
Worked Examples
Example 1: Defining a DQDL Rule Set
Suppose you are processing a New York Taxi dataset. You want to ensure that rides are realistic.
Rule Set:
```
Rules = [
    RowCount > 1000,
    IsComplete "vendor_id",
    ColumnValues "passenger_count" > 0,
    (Mean "trip_distance") < 50
]
```

Explanation: This ensures the dataset isn't empty, every ride has a vendor ID, every ride carries at least one passenger, and the average trip isn't an impossible distance (outlier detection).
Example 2: Using Deequ in PySpark
```python
# PyDeequ is the Python interface to Deequ; it requires an active SparkSession.
# Assumes `spark` (SparkSession) and `dataset_df` (DataFrame) are already defined.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite

# Define the check suite
check = (
    Check(spark, CheckLevel.Error, "Data Quality Check")
    .isComplete("email")
    .isUnique("user_id")
    .hasMin("age", lambda x: x >= 18)
)

# Run the suite
result = (
    VerificationSuite(spark)
    .onData(dataset_df)
    .addCheck(check)
    .run()
)
```

Explanation: This programmatic approach allows for complex logic, such as ensuring every user is at least 18, directly within a distributed Spark environment.
Checkpoint Questions
- What is the difference between `rowLevelOutcomes` and `ruleOutcomes` in AWS Glue?
- Which AWS service provides a visual interface for data profiling and over 250 built-in transformations?
- How does DQDL handle composite rules?
- Why might a data engineer use Amazon Deequ instead of standard SQL `WHERE` clauses for quality checks?
Comparison Tables
| Feature | AWS Glue Data Quality | AWS Glue DataBrew | Amazon Deequ |
|---|---|---|---|
| User Persona | Data Engineer (Code/Studio) | Data Analyst (Visual/No-code) | Developer (Scala/Python) |
| Best For | In-transit ETL validation | Exploratory profiling & cleaning | Unit testing Spark DataFrames |
| Language | DQDL (Declarative) | Visual UI / Recipes | Spark API |
| Integration | Native to Glue Jobs | Standalone or Glue flow | EMR / Glue / SageMaker |
Muddy Points & Cross-Refs
- Row-Level Failures vs. Job Failures: A common point of confusion is whether a DQ failure stops the ETL job. By default, it does NOT. You must configure the "Action" in Glue Studio to "Fail the job" if the DQ score is below a certain percentage. Otherwise, the job will complete, but with marked records.
- Data Skew: While not a "quality" issue in terms of accuracy, skew affects "processing quality." Use skew-mitigation mechanisms (like salting keys) to fix performance. Cross-reference with Unit 1: Data Transformation.
- Auditing: For compliance, always export logs of DQ results to S3 for long-term storage and auditability. Cross-reference with Unit 4: Data Security and Governance.
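The salting technique mentioned above spreads a single hot key across several partitions by appending a bounded random suffix. A minimal sketch (the `salt_key` helper is hypothetical; in a real join, the small side must be replicated across the same salt range):

```python
import random

def salt_key(key, num_salts=4, rng=random):
    """Spread a hot key across num_salts partitions via a salt suffix."""
    return f"{key}#{rng.randrange(num_salts)}"

rng = random.Random(0)  # seeded for reproducibility
salted = {salt_key("hot_customer", rng=rng) for _ in range(100)}
# One hot key now maps to up to 4 distinct partition keys
print(sorted(salted))
```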