
Mastering Data Quality Rules: AWS Glue Data Quality & DataBrew

Exam objective: Define data quality rules (for example, DataBrew)


Learning Objectives

After studying this guide, you should be able to:

  • Define data quality rules using the Data Quality Definition Language (DQDL).
  • Distinguish between visual data profiling in AWS Glue DataBrew and programmatic checks in AWS Glue Data Quality.
  • Construct composite rules using logical operators to validate complex datasets.
  • Interpret row-level outcomes to identify and filter "bad" data during ETL processes.
  • Implement automated data quality checks within AWS Glue Studio and via infrastructure as code.

Key Terms & Glossary

  • DQDL (Data Quality Definition Language): A case-sensitive domain-specific language used to write data quality rules in AWS Glue.
  • Ruleset: A collection of one or more DQDL rules grouped together to evaluate a dataset.
  • Data Profiling: The process of analyzing a dataset to generate statistics (like mean, null count, or correlation) used to recommend quality rules.
  • In-Transit Validation: Running data quality checks during the execution of an ETL job before the data is written to a target.
  • rowLevelOutcomes: An output option in Glue Studio that provides detailed pass/fail status for every individual record in a dataset.

The "Big Idea"

Data quality is the "immune system" of a data pipeline. Without it, your data lake becomes a "data swamp" where decisions are based on incomplete, inaccurate, or stale information. AWS provides two primary tools: DataBrew (for visual, profile-based cleaning) and Glue Data Quality (for rule-based, automated validation). By defining precise rules in DQDL, engineers can proactively intercept bad data before it pollutes downstream analytics.

Formula / Concept Box

| Concept | Syntax / Structure |
| --- | --- |
| DQDL basic format | `<RuleType> <Parameter> <Expression>` |
| Ruleset container | `Rules = [ Rule1, Rule2 ]` |
| Composite logic | `(Rule 1) and (Rule 2)` or `(Rule 1) or (Rule 2)` |
| Row-level result | `DataQualityEvaluationResult` (Passed/Failed) |

[!IMPORTANT] DQDL is case-sensitive. iscomplete is invalid; it must be IsComplete.
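A minimal ruleset showing the structure above, with correct casing throughout (the column names and allowed values are illustrative, not from any particular dataset):

```sql
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "status" in ["NEW", "SHIPPED", "DELIVERED"]
]
```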

Hierarchical Outline

  • AWS Glue DataBrew
    • Visual Profiling: Automated discovery of data patterns.
    • Rule Recommendations: Suggested checks based on the dataset profile.
    • Customized Rules: Manual enforcement of business-specific constraints.
  • AWS Glue Data Quality (DQDL)
    • Core Rule Types:
      • Completeness: IsComplete, Completeness (checking for NULLs).
      • Uniqueness: IsUnique, Uniqueness, IsPrimaryKey.
      • Integrity: ReferentialIntegrity, DatasetMatch, SchemaMatch.
      • Statistical: Mean, StandardDeviation, RowCount.
    • Custom Logic: CustomSql for requirements not covered by built-in rule types.
  • Integration & Automation
    • Glue Studio: Adding data quality nodes to visual ETL jobs.
    • API/SDK: Programmatic access for custom applications.
    • IaC: Deploying rulesets via AWS CloudFormation.
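As a sketch of the statistical rule types listed in the outline (the thresholds here are illustrative placeholders, not recommendations):

```sql
Rules = [
    RowCount > 1000,
    Mean "fare_amount" between 5 and 40,
    StandardDeviation "fare_amount" < 15
]
```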

Visual Anchors

Data Quality Workflow in ETL


DQDL Syntax Anatomy

```latex
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, rounded corners, inner sep=5pt,
                       fill=blue!10, align=center}]
  \node (type)                              {\textbf{RuleType} \\ e.g., IsUnique};
  \node (param) [right of=type,  xshift=1.5cm] {\textbf{Parameter} \\ e.g., "user_id"};
  \node (expr)  [right of=param, xshift=1.5cm] {\textbf{Expression} \\ e.g., > 0.95};

  \draw[->, thick] (type)  -- (param);
  \draw[->, thick] (param) -- (expr);

  \node[draw=none, fill=none, below of=param, yshift=1cm]
    {\small \textit{Syntax: <RuleType> <Parameter> <Expression>}};
\end{tikzpicture}
```

Definition-Example Pairs

  • ColumnCorrelation: Checks the statistical relationship between two columns.
    • Example: Verifying that Unit_Price and Total_Cost have a high correlation to ensure no calculation errors occurred.
  • DataFreshness: Compares a date column against the current time.
    • Example: Ensuring that the last_updated timestamp is no more than 24 hours old for a daily sales report.
  • IsPrimaryKey: A composite check for both uniqueness and completeness.
    • Example: Validating that transaction_id is never null and never repeats in the database.
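The three checks above can be written as one DQDL ruleset; the column names and thresholds follow the examples, and the correlation cutoff of 0.8 is an illustrative choice:

```sql
Rules = [
    ColumnCorrelation "Unit_Price" "Total_Cost" > 0.8,
    DataFreshness "last_updated" <= 24 hours,
    IsPrimaryKey "transaction_id"
]
```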

Worked Examples

Example 1: Validating a Taxi Dataset

Scenario: You need to ensure that passenger_count is populated for at least 90% of taxi rides, and that trip_id is both complete and unique.

DQDL Solution:

```sql
Rules = [
    Completeness "passenger_count" > 0.90,
    (IsComplete "trip_id") and (IsUnique "trip_id")
]
```

Example 2: Range and Custom SQL

Scenario: You need to check if the fare_amount is between 1 and 100, and use a custom query for a specific business logic check.

DQDL Solution:

```sql
Rules = [
    ColumnValues "fare_amount" between 1 and 100,
    CustomSql "SELECT count(*) FROM primary WHERE fare_amount < 0" = 0
]
```

Comparison Tables

| Feature | AWS Glue DataBrew | AWS Glue Data Quality |
| --- | --- | --- |
| Primary interface | Visual/no-code console | Code-based (DQDL) |
| Best for | Data analysts & scientists | Data engineers & developers |
| Integration | Profile jobs & recipes | ETL jobs & Glue Data Catalog |
| Complexity | Drag-and-drop transformations | Programmatic, SQL-like rules |
| Scale | Best for interactive cleaning | Best for high-volume automated pipelines |

Checkpoint Questions

  1. What is the main difference between IsComplete and Completeness? (Answer: IsComplete checks if 100% of data is present, while Completeness allows you to set a percentage threshold, e.g., > 0.95)
  2. True or False: The logical operators and and or can be mixed within a single DQDL expression. (Answer: False. You can combine rules with either and or or, but not both in one expression; use separate rules or a CustomSql rule for more complex logic.)
  3. What are the two output options provided by the Data Quality node in Glue Studio? (Answer: rowLevelOutcomes and ruleOutcomes)
  4. Why is the rowLevelOutcomes option useful for downstream processing? (Answer: It adds diagnostic columns to each record, allowing you to filter out specifically failed rows while keeping the passing ones.)
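To make the rowLevelOutcomes idea concrete, here is a plain-Python sketch. In a real Glue Studio job the records would be rows of a DynamicFrame carrying the DataQualityEvaluationResult column added by the Evaluate Data Quality node; the sample records below are invented purely for illustration, but the routing logic is the same.

```python
# Hypothetical records as they might look AFTER a Data Quality node with
# rowLevelOutcomes enabled: each row carries a pass/fail diagnostic column.
records = [
    {"trip_id": "t-001", "passenger_count": 2, "DataQualityEvaluationResult": "Passed"},
    {"trip_id": "t-002", "passenger_count": 0, "DataQualityEvaluationResult": "Failed"},
    {"trip_id": "t-003", "passenger_count": 1, "DataQualityEvaluationResult": "Passed"},
]

# Route passing rows to the target and failing rows to a quarantine set,
# so "bad" data never reaches downstream analytics.
passed = [r for r in records if r["DataQualityEvaluationResult"] == "Passed"]
failed = [r for r in records if r["DataQualityEvaluationResult"] == "Failed"]

print(len(passed), len(failed))  # prints: 2 1
```

In a Spark-based Glue job the same split would typically be a `filter` on the diagnostic column rather than a Python list comprehension.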

Muddy Points & Cross-Refs

  • Case Sensitivity: One of the most common errors is writing isUnique instead of IsUnique. Always double-check the documentation for the exact casing.
  • Logical Operator Limitations: You can join rules with and or with or, but you cannot mix the two operators in one expression (e.g., (A and B) or C is invalid). For complex logic, use multiple separate rules or a CustomSql rule.
  • Further Study: For more on deploying these rules via code, look into the AWS::Glue::DataQualityRuleset resource in CloudFormation documentation.
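For the IaC route, a ruleset deployment might be sketched as the following CloudFormation fragment. The resource name, database, and table are placeholders; consult the AWS::Glue::DataQualityRuleset reference for the authoritative property list.

```yaml
Resources:
  TaxiRidesRuleset:
    Type: AWS::Glue::DataQualityRuleset
    Properties:
      Name: taxi-rides-ruleset
      Description: Completeness and uniqueness checks for the rides table
      Ruleset: >-
        Rules = [
          IsComplete "trip_id",
          IsUnique "trip_id",
          Completeness "passenger_count" > 0.90
        ]
      TargetTable:
        DatabaseName: taxi_db
        TableName: rides
```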
