Data Consistency and Quality with AWS Glue DataBrew
Investigate data consistency (for example, DataBrew)
Maintaining high data quality and consistency is a pillar of the AWS Certified Data Engineer – Associate exam. This guide focuses on AWS Glue DataBrew, a visual data preparation tool that enables data engineers and analysts to clean, normalize, and investigate data without writing code.
Learning Objectives
After studying this guide, you should be able to:
- Define the role of AWS Glue DataBrew in the AWS data ecosystem.
- Describe data sampling techniques used to investigate large datasets.
- Explain how to identify and remediate data consistency issues (e.g., missing values, duplicates).
- Implement Data Quality Rules using the visual interface and DQDL.
- Differentiate between Profile Jobs and Recipe Jobs.
Key Terms & Glossary
- Dataset: A pointer to your data source (S3, Redshift, Glue Data Catalog, or JDBC).
- Project: The workspace where you define transformations and apply recipes to a sample of data.
- Recipe: A sequence of data transformation steps (e.g., filter, join, pivot) that can be saved and reused.
- Profile: A report generated by a Profile Job that provides over 40 statistics about a dataset (e.g., correlations, missing values, distribution).
- Imputation: The process of replacing missing data with substituted values (Mean, Median, Mode, or KNN).
- DQDL (Data Quality Definition Language): A declarative language used to define business rules for data validation.
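To make DQDL concrete, here is a minimal sketch of a ruleset held as a plain string. The column names (`customer_id`, `price`) and the rule thresholds are hypothetical examples, not from this guide; the boto3 call in the trailing comment is one way such a ruleset can be registered with Glue Data Quality.

```python
# Hedged sketch: a minimal DQDL ruleset for AWS Glue Data Quality.
# Column names and thresholds below are hypothetical illustrations.
dqdl_ruleset = """
Rules = [
    IsComplete "customer_id",
    ColumnValues "price" > 0,
    Uniqueness "customer_id" > 0.99
]
"""

# In practice this string would be registered with, e.g.:
#   boto3.client("glue").create_data_quality_ruleset(
#       Name="orders-ruleset", Ruleset=dqdl_ruleset)
# or evaluated inline with the EvaluateDataQuality transform in a Glue job.
print(dqdl_ruleset.strip())
```

Each rule is declarative: Glue evaluates it against the dataset and reports pass/fail per rule, which is what the Checkpoint Questions below contrast with DataBrew's visual rule definition.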
The "Big Idea"
In modern data engineering, the "Garbage In, Garbage Out" principle remains the greatest risk. Data consistency ensures that data remains coherent and logically sound throughout the pipeline. AWS Glue DataBrew democratizes this process, moving data preparation from specialized Spark code to a visual, auditable, and reproducible workflow, allowing for rapid iteration and high-trust data lakes.
Formula / Concept Box
| Concept | Application/Formula | Key Tool |
|---|---|---|
| Success Metrics | VerificationResult.successMetricsAsDataFrame | Deequ / Glue Data Quality |
| Data Skew | Identifying uneven distribution across partitions. | DataBrew Profiling |
| Deduplication | FLAG_DUPLICATE_ROWS function | DataBrew Recipe |
| Consistency | Enforcing a stable schema contract between producers and consumers. | Glue Schema Registry |
Hierarchical Outline
- I. Data Exploration & Profiling
- Data Profiling: Generates visual statistics; identifies anomalies and schema inconsistencies.
- Data Sampling: Techniques to work with subsets (First rows, Random, Stratified) for performance.
- II. Transformations & Recipes
- Visual Interface: Over 250 built-in transformations (Filter, Pivot, Join).
- Recipes: Versioned sets of steps; can be applied to entire datasets via a Job.
- III. Handling Inconsistency
- Missing Values: Filling with constants, aggregates, or using ML-based imputation (KNN).
- Duplicates: Flagging and removing exact matches using the FLAG_DUPLICATE_ROWS function.
- Outliers: Identifying and capping/removing values outside specific ranges.
- IV. Data Quality & Automation
- Rulesets: Custom validation checks (e.g., "Column A must not be NULL").
- PII Identification: Built-in masking and encryption for sensitive data compliance (GDPR/CCPA).
- Monitoring: Integration with CloudWatch for alarms and CloudTrail for audit trails.
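The profiling branch of the outline can be sketched locally: the snippet below computes, on a toy in-memory dataset, the kind of per-column statistics (row count, missing percentage, mean, median, min/max) that a DataBrew Profile Job reports among its 40+ metrics. The `price` field and its values are hypothetical.

```python
# Hedged sketch: per-column statistics of the kind a Profile Job reports,
# computed locally on a hypothetical toy dataset.
from statistics import mean, median

rows = [
    {"price": 10.0}, {"price": 12.0}, {"price": None},
    {"price": 11.0}, {"price": 250.0},  # 250.0 would surface as an outlier
]

values = [r["price"] for r in rows if r["price"] is not None]
profile = {
    "row_count": len(rows),
    "missing_pct": 100 * (len(rows) - len(values)) / len(rows),
    "mean": mean(values),
    "median": median(values),
    "min": min(values),
    "max": max(values),
}
print(profile)
```

Note how a single extreme value drags the mean (70.75) far from the median (11.5); this gap is exactly the signal profiling uses to flag outliers and skew.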
Visual Anchors
- [Figure: DataBrew Operational Workflow]
- [Figure: Data Sampling Concepts]
Definition-Example Pairs
- Data Skew: A condition where data is not distributed evenly across partitions or values.
- Example: A sales dataset where 90% of transactions occur in 1 out of 50 states, causing processing bottlenecks for that specific partition.
- Recipe Imputation: Using statistical methods to fill "holes" in data.
- Example: A temperature sensor dataset has missing values for 2:00 PM; DataBrew fills them with the Median value of that day to maintain continuity.
- Sensitive Data Masking: Obscuring PII to ensure compliance.
- Example: Automatically replacing all characters in a "Social Security Number" column with * while keeping the last four digits visible.
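The masking example above can be mimicked in a few lines. This is only a local sketch of the behaviour; in DataBrew itself, masking is a built-in PII transformation, not user code, and the function name `mask_ssn` is hypothetical.

```python
# Hedged sketch mimicking the masking behaviour described above: replace all
# but the last four digits of an SSN-like string with "*".
def mask_ssn(ssn: str, visible: int = 4) -> str:
    digits = ssn.replace("-", "")          # drop separator dashes
    hidden = "*" * (len(digits) - visible)  # mask everything but the tail
    return hidden + digits[-visible:]

print(mask_ssn("123-45-6789"))  # → *****6789
```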
Worked Examples
Example 1: Handling Missing Numerical Values
Scenario: You have an S3 dataset with a Price column where some rows are empty. You need to ensure the average is not skewed by zero-values.
- Open DataBrew Project: Connect to the S3 bucket.
- Profile: Run a Profile Job to see that 5% of Price values are missing.
- Transform: Select the Price column.
- Action: Choose "Fill missing values" -> "Numerical aggregates" -> "Mean".
- Apply: Add to recipe. The missing rows now contain the average price of the remaining valid rows.
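The arithmetic behind the "Mean" fill step can be sketched as follows, assuming a hypothetical list of prices with gaps. DataBrew does this via the recipe action; the code only illustrates the result you should expect.

```python
# Hedged sketch of the "Fill missing values" -> "Mean" recipe step,
# applied to a hypothetical Price column with two gaps.
from statistics import mean

prices = [19.99, None, 24.50, None, 21.01]

# Compute the mean of the valid rows only, then substitute it for each gap.
fill = round(mean(p for p in prices if p is not None), 2)
filled = [fill if p is None else p for p in prices]
print(fill, filled)  # → 21.83 [19.99, 21.83, 24.5, 21.83, 21.01]
```

Because the gaps are filled with the existing average rather than zero, the column mean is unchanged, which is exactly the "not skewed by zero-values" requirement in the scenario.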
Example 2: Flagging Duplicates
Scenario: A CRM export has resulted in duplicate customer records.
- Function: Use the FLAG_DUPLICATE_ROWS function in DataBrew.
- Result: A new column is generated. The first occurrence of a record is marked False, and subsequent exact matches are marked True.
- Filter: Add a step to the recipe to "Delete rows where Flag_Duplicate is True".
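The flag-then-filter semantics can be reproduced locally to check your understanding: first occurrence False, later exact matches True, then drop the flagged rows. The helper name `flag_duplicates` and the sample records are hypothetical; DataBrew's own FLAG_DUPLICATE_ROWS does this inside the recipe.

```python
# Hedged sketch reproducing the FLAG_DUPLICATE_ROWS -> filter pattern locally.
def flag_duplicates(rows):
    seen = set()
    flags = []
    for row in rows:
        key = tuple(sorted(row.items()))  # exact-match key over all columns
        flags.append(key in seen)         # True only for repeat occurrences
        seen.add(key)
    return flags

records = [
    {"name": "Ada",   "email": "ada@example.com"},
    {"name": "Ada",   "email": "ada@example.com"},   # exact duplicate
    {"name": "Grace", "email": "grace@example.com"},
]
flags = flag_duplicates(records)
deduped = [r for r, dup in zip(records, flags) if not dup]
print(flags, len(deduped))  # → [False, True, False] 2
```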
Checkpoint Questions
- What is the difference between a Profile Job and a Recipe Job?
- Which service would you use to define data quality rules in a code-based (ETL) format versus a visual format?
- How does DataBrew handle PII (Personally Identifiable Information)?
- True or False: DataBrew transformations are applied to the entire dataset immediately upon selection in a Project.
Answers
- A Profile Job analyzes the data to provide statistics (discovery); a Recipe Job applies transformations to the data (processing).
- AWS Glue Data Quality (DQDL) for code/scripts; DataBrew for visual rule definition.
- It uses built-in transformations for masking, encryption, and identification of sensitive patterns.
- False. Transformations in a Project are applied to a sample of data for preview; a Job must be run to apply them to the full dataset.
Comparison Tables
| Feature | AWS Glue DataBrew | AWS Glue Studio | AWS Lambda |
|---|---|---|---|
| User Persona | Analysts / Non-coders | Data Engineers | Developers |
| Interface | No-code, Visual | Low-code, Visual/Script | Code-only |
| Complexity | 250+ pre-built functions | Custom Spark/Python | Custom logic (short bursts) |
| Scaling | Serverless (managed) | Serverless (DPUs) | Serverless (Memory/Time) |
Muddy Points & Cross-Refs
- Sampling Bias: Be careful when using "First rows" for profiling. If your data is sorted by date, you might only see data from 2010 and miss schema changes that happened in 2023. Always consider Random Sampling for a representative profile.
- Integration: DataBrew recipes can be called as part of AWS Step Functions or AWS Glue Workflows for full automation.
- Cost: DataBrew is charged per session (for projects) and per node-hour (for jobs). Monitoring expensive long-running jobs in CloudWatch is essential.
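The sampling-bias point can be demonstrated with a toy, date-sorted dataset: profiling only the first N rows never sees the late-arriving 2023 records, while a random sample draws from the whole range. The year values and sample sizes are hypothetical.

```python
# Hedged sketch of "First rows" sampling bias on a hypothetical date-sorted
# dataset: 900 old records followed by 100 new ones.
import random

years = [2010] * 900 + [2023] * 100  # sorted by date; new schema arrives late

first_n = years[:100]                # "First rows" sample: only 2010 records
random.seed(7)                       # seeded for reproducibility
rand_n = random.sample(years, 100)   # random sample spans the full dataset

print(sorted(set(first_n)), sorted(set(rand_n)))
```

The first-rows sample contains no 2023 record at all, so any 2023 schema change would be invisible to the profile, which is why random sampling is recommended above.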
> [!TIP]
> For the exam, remember that DataBrew is the "go-to" answer for visual data preparation and whenever "non-technical personas" or "no-code" solutions are mentioned.