AWS Data Engineering: Addressing Changes to Data Characteristics
Address changes to the characteristics of data
This guide covers Task 2.4.2 of the AWS Certified Data Engineer – Associate (DEA-C01) exam. It focuses on how data engineers manage the evolving nature of data, including schema drift, structural changes, and lifecycle management within the AWS ecosystem.
Learning Objectives
By the end of this guide, you will be able to:
- Define Schema Evolution and identify strategies for handling Schema Drift.
- Configure AWS Glue Crawlers to automatically detect and update metadata.
- Differentiate between AWS SCT (schema conversion) and AWS DMS (data migration).
- Implement data lifecycle policies in Amazon S3 and Amazon DynamoDB to manage data aging.
- Establish Data Lineage to track changes across the data environment.
Key Terms & Glossary
- Schema Drift: The phenomenon where source data systems change their structure (e.g., adding/removing columns) without notifying downstream consumers.
- Data Catalog: A persistent metadata store (like AWS Glue Data Catalog) that provides a unified view of data across various sources.
- Partition Projection: A technique in Amazon Athena that speeds up query processing of highly partitioned tables by calculating partition information from configuration rather than from S3 metadata.
- TTL (Time to Live): A mechanism in DynamoDB that automatically deletes items from a table after a specific timestamp to reduce storage costs.
- DQDL (Data Quality Definition Language): A declarative language used in AWS Glue to define rules for validating data quality.
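Schema drift, as defined above, can be made concrete as a diff of column sets. The following is a minimal sketch with hypothetical column names; a real pipeline would read the cataloged columns from the Glue Data Catalog and the incoming columns from the new batch's header.

```python
# Minimal sketch: detect schema drift by diffing the cataloged columns
# against the columns observed in a new batch. Column names are hypothetical.
def detect_drift(catalog_columns, incoming_columns):
    """Return (added, removed) column names relative to the catalog."""
    catalog = set(catalog_columns)
    incoming = set(incoming_columns)
    return sorted(incoming - catalog), sorted(catalog - incoming)

added, removed = detect_drift(
    ["user_id", "event_time", "channel"],
    ["user_id", "event_time", "channel", "promo_code"],
)
# added == ["promo_code"], removed == []
```

A check like this, run before loading, turns silent drift into an explicit signal that can trigger a crawler run or an alert.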
The "Big Idea"
In a modern data architecture, change is the only constant. Data characteristics—such as its schema, volume, and velocity—evolve over time. A Data Engineer's primary responsibility is to build resilient pipelines that can gracefully handle these changes without manual intervention. This involves balancing automated discovery (Glue Crawlers) with rigid governance (Lake Formation) and cost-optimized storage (S3 Lifecycle).
Formula / Concept Box
| Concept | Tool / Rule | Impact |
|---|---|---|
| Schema Updates | AWS Glue Crawler UpdateTable | Automatically adds new columns to the Data Catalog. |
| Structural Mapping | AWS Schema Conversion Tool (SCT) | Converts source database schemas to a different target engine (e.g., Oracle to Aurora). |
| Data Aging | S3 Lifecycle Policies | Automates transitions: S3 Standard → S3 Glacier → Expiration. |
| Item Expiration | DynamoDB TTL | Deletes data based on an epoch timestamp attribute without using RCU/WCU. |
Hierarchical Outline
- I. Schema Evolution & Management
- AWS Glue Data Catalog: The central metadata repository for AWS Lake House architectures.
- Glue Crawlers: Automate schema discovery; can be configured to add new columns or mark deleted columns as deprecated.
- Schema Versioning: Keeping history of schema changes to ensure backward compatibility for Athena/Redshift Spectrum queries.
- II. Addressing Structural Changes
- AWS SCT: Used for heterogeneous migrations; transforms schema, functions, and stored procedures.
- AWS DMS: Performs the actual data movement; can handle simple schema changes during replication.
- III. Managing Data Characteristics over Time
- S3 Versioning: Protects against accidental deletes and allows rollbacks to previous states of data.
  - Partitioning Strategies: Using date-based partitioning (e.g., year=2023/month=10/day=24) to optimize query performance as data grows.
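The date-based partitioning layout mentioned above can be generated programmatically. A minimal sketch, using only the standard library, that yields Hive-style daily prefixes for a date range:

```python
# Minimal sketch: generate Hive-style date partition prefixes of the form
# year=YYYY/month=MM/day=DD, one per day in an inclusive date range.
from datetime import date, timedelta

def partition_prefixes(start, end):
    """Yield a date-partition prefix for each day from start to end."""
    day = start
    while day <= end:
        yield f"year={day.year}/month={day.month:02d}/day={day.day:02d}"
        day += timedelta(days=1)

prefixes = list(partition_prefixes(date(2023, 10, 24), date(2023, 10, 26)))
# ['year=2023/month=10/day=24', 'year=2023/month=10/day=25', 'year=2023/month=10/day=26']
```

Writers can append such a prefix to the S3 key so that Athena queries filtered on year/month/day scan only the matching folders.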
Visual Anchors
Data Cataloging Workflow
S3 Lifecycle Transition Logic
\begin{tikzpicture}[node distance=2cm]
  \node (start) [draw, rectangle, rounded corners] {Object Created (S3 Standard)};
  \node (ia) [draw, rectangle, below of=start, rounded corners] {30 Days: S3 Standard-IA};
  \node (glacier) [draw, rectangle, below of=ia, rounded corners] {90 Days: S3 Glacier};
  \node (end) [draw, circle, below of=glacier, fill=red!20] {365 Days: Expired};
  \draw [->, thick] (start) -- (ia);
  \draw [->, thick] (ia) -- (glacier);
  \draw [->, thick] (glacier) -- (end);
  \node [right of=ia, xshift=2cm] {\tiny Lower Cost / Frequent Access};
  \node [right of=glacier, xshift=2cm] {\tiny Archive / Long-term Storage};
\end{tikzpicture}
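The transition logic in the diagram maps directly onto an S3 lifecycle configuration. A minimal sketch of the request body in the shape accepted by boto3's put_bucket_lifecycle_configuration; the bucket name, rule ID, and prefix are hypothetical, and no AWS call is made here.

```python
# Minimal sketch: S3 lifecycle rule implementing
# Standard -> Standard-IA (30d) -> Glacier (90d) -> Expired (365d).
# Rule ID and prefix are hypothetical placeholders.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "age-out-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it would look like (not executed here):
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle_configuration)
```

Each transition's Days value is measured from object creation, matching the cumulative ages shown in the diagram.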
Definition-Example Pairs
- Term: Heterogeneous Migration
- Definition: Moving data between different database engines where the schema must be converted.
- Example: Migrating an on-premises Microsoft SQL Server database to an Amazon Aurora PostgreSQL cluster using AWS SCT to rewrite the SQL syntax.
- Term: Data Lineage
- Definition: A visual map of the data's journey, showing where it originated and how it was transformed.
- Example: Using Amazon SageMaker ML Lineage Tracking to see which specific S3 dataset was used to train a specific version of an AI model.
Worked Examples
Example 1: Handling Added Columns in a CSV Batch
Scenario: A marketing team adds a promo_code column to their daily CSV upload in S3. Your Athena queries are failing because the Data Catalog doesn't know about this column.
Solution:
- Run the AWS Glue Crawler assigned to that S3 path.
- Set the crawler configuration to "Update the table definition in the data catalog" for any schema changes.
- The Crawler detects the new column and updates the metadata. Athena can now query the new column immediately, without manual ALTER TABLE commands.
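The crawler behavior in this example is controlled by its SchemaChangePolicy. A minimal sketch of the request in the shape of boto3's glue.create_crawler; the crawler name, IAM role, database, and S3 path are hypothetical placeholders, and the call itself is shown commented out.

```python
# Minimal sketch: Glue Crawler configured to add new columns (like promo_code)
# to the catalog table and to deprecate, not drop, removed columns.
# All names, ARNs, and paths below are hypothetical.
crawler_request = {
    "Name": "marketing-csv-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "marketing",
    "Targets": {"S3Targets": [{"Path": "s3://marketing-bucket/daily-uploads/"}]},
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",    # add newly detected columns
        "DeleteBehavior": "DEPRECATE_IN_DATABASE", # keep removed columns, flagged deprecated
    },
}

# glue = boto3.client("glue")
# glue.create_crawler(**crawler_request)
```

UPDATE_IN_DATABASE corresponds to the console option "Update the table definition in the data catalog" referenced in the steps above.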
Example 2: Optimizing DynamoDB Storage Costs
Scenario: A gaming app stores temporary session data in DynamoDB. This data is only needed for 24 hours.
Solution:
- Add a TimeToLive attribute to each item (format: Unix epoch time).
- Enable TTL on the DynamoDB table, selecting that attribute.
- Result: DynamoDB automatically deletes the sessions within 48 hours of expiration, and these deletes do not consume Write Capacity Units (WCU).
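The steps above can be sketched as follows. The item shape and attribute names (session_id, TimeToLive) are hypothetical; the key point is that DynamoDB TTL expects the expiry as Unix epoch seconds in a numeric attribute.

```python
# Minimal sketch: stamp each session item with a TTL attribute set to a
# Unix epoch time 24 hours ahead. Attribute names are hypothetical.
from datetime import datetime, timedelta, timezone

def session_item(session_id, now=None):
    """Build a session item whose TimeToLive expires 24 hours from now."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(hours=24)
    return {
        "session_id": session_id,
        "TimeToLive": int(expires.timestamp()),  # TTL must be epoch seconds
    }

item = session_item("abc123", now=datetime(2024, 1, 1, tzinfo=timezone.utc))
# item["TimeToLive"] == 1704153600  (2024-01-02T00:00:00Z)
```

With TTL enabled on the table and pointed at the TimeToLive attribute, no application code is needed to clean up expired sessions.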
Checkpoint Questions
- What is the difference between AWS SCT and AWS DMS regarding schema changes?
- How does Partition Projection improve performance for highly partitioned data in S3?
- Which S3 feature allows you to recover a file that was overwritten by a script with incorrect data?
- When should you use AWS Glue DataBrew instead of a Glue ETL script?
Comparison Tables
| Feature | AWS Glue Crawler | AWS SCT |
|---|---|---|
| Primary Purpose | Metadata Discovery (S3/RDS/NoSQL) | Schema Conversion (Database-to-Database) |
| Target Output | Glue Data Catalog Tables | SQL DDL Scripts / Converted Schema |
| Handling Change | Detects schema drift automatically | Manual re-run for structural redesigns |
| Use Case | Populating Data Lakes | Database Migrations |
Muddy Points & Cross-Refs
- Crawler vs. Manual Entry: If your schema is extremely stable and you want to prevent unauthorized changes, manual entry is better. Crawlers are best for evolving datasets.
- Partitioning vs. Indexing: In Redshift, use Sort Keys for performance; in S3/Athena, use Partitions (folders) to limit the amount of data scanned.
- S3 Versioning vs. Backup: Versioning is for immediate recovery of specific objects; AWS Backup is for cross-region disaster recovery and compliance-level snapshots.