AWS Data Engineering: Addressing Changes to Data Characteristics
Address changes to the characteristics of data
This guide covers Task 2.4.2 of the AWS Certified Data Engineer – Associate (DEA-C01) exam. It focuses on how data engineers manage the evolving nature of data, including schema drift, structural changes, and lifecycle management within the AWS ecosystem.
Learning Objectives
By the end of this guide, you will be able to:
- Define Schema Evolution and identify strategies for handling Schema Drift.
- Configure AWS Glue Crawlers to automatically detect and update metadata.
- Differentiate between AWS SCT (schema conversion) and AWS DMS (data migration).
- Implement data lifecycle policies in Amazon S3 and Amazon DynamoDB to manage data aging.
- Establish Data Lineage to track changes across the data environment.
Key Terms & Glossary
- Schema Drift: The phenomenon where source data systems change their structure (e.g., adding/removing columns) without notifying downstream consumers.
- Data Catalog: A persistent metadata store (like AWS Glue Data Catalog) that provides a unified view of data across various sources.
- Partition Projection: A technique in Amazon Athena that speeds up query processing of highly partitioned tables by calculating partition information from configuration rather than from S3 metadata.
- TTL (Time to Live): A mechanism in DynamoDB that automatically deletes items from a table after a specific timestamp to reduce storage costs.
- DQDL (Data Quality Definition Language): A declarative language used in AWS Glue to define rules for validating data quality.
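Schema drift, as defined above, can be made concrete as a diff of column sets. The following is a minimal sketch with hypothetical column names; a real pipeline would read the cataloged columns from the Glue Data Catalog and the incoming columns from the new batch's header.

```python
# Minimal sketch: detect schema drift by diffing the cataloged columns
# against the columns observed in a new batch. Column names are hypothetical.
def detect_drift(catalog_columns, incoming_columns):
    """Return (added, removed) column names relative to the catalog."""
    catalog = set(catalog_columns)
    incoming = set(incoming_columns)
    return sorted(incoming - catalog), sorted(catalog - incoming)

added, removed = detect_drift(
    ["user_id", "event_time", "channel"],
    ["user_id", "event_time", "channel", "promo_code"],
)
# added == ["promo_code"], removed == []
```

A check like this, run before loading, turns silent drift into an explicit signal that can trigger a crawler run or an alert.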
The "Big Idea"
In a modern data architecture, change is the only constant. Data characteristics—such as its schema, volume, and velocity—evolve over time. A Data Engineer's primary responsibility is to build resilient pipelines that can gracefully handle these changes without manual intervention. This involves balancing automated discovery (Glue Crawlers) with rigid governance (Lake Formation) and cost-optimized storage (S3 Lifecycle).
Formula / Concept Box
| Concept | Tool / Rule | Impact |
|---|---|---|
| Schema Updates | AWS Glue Crawler UpdateTable | Automatically adds new columns to the Data Catalog. |
| Structural Mapping | AWS Schema Conversion Tool (SCT) | Converts source database schemas to a different target engine (e.g., Oracle to Aurora). |
| Data Aging | S3 Lifecycle Policies | Automates transitions: S3 Standard → S3 Glacier → Expiration. |
| Item Expiration | DynamoDB TTL | Deletes data based on an epoch timestamp attribute without using RCU/WCU. |
Hierarchical Outline
- I. Schema Evolution & Management
- AWS Glue Data Catalog: The central metadata repository for AWS Lake House architectures.
- Glue Crawlers: Automate schema discovery; can be configured to add new columns or mark deleted columns as deprecated.
- Schema Versioning: Keeping history of schema changes to ensure backward compatibility for Athena/Redshift Spectrum queries.
- II. Addressing Structural Changes
- AWS SCT: Used for heterogeneous migrations; transforms schema, functions, and stored procedures.
- AWS DMS: Performs the actual data movement; can handle simple schema changes during replication.
- III. Managing Data Characteristics over Time
- S3 Versioning: Protects against accidental deletes and allows rollbacks to previous states of data.
  - Partitioning Strategies: Using date-based partitioning (e.g., year=2023/month=10/day=24) to optimize query performance as data grows.
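The date-based partitioning layout mentioned above can be generated programmatically. A minimal sketch, using only the standard library, that yields Hive-style daily prefixes for a date range:

```python
# Minimal sketch: generate Hive-style date partition prefixes of the form
# year=YYYY/month=MM/day=DD, one per day in an inclusive date range.
from datetime import date, timedelta

def partition_prefixes(start, end):
    """Yield a date-partition prefix for each day from start to end."""
    day = start
    while day <= end:
        yield f"year={day.year}/month={day.month:02d}/day={day.day:02d}"
        day += timedelta(days=1)

prefixes = list(partition_prefixes(date(2023, 10, 24), date(2023, 10, 26)))
# ['year=2023/month=10/day=24', 'year=2023/month=10/day=25', 'year=2023/month=10/day=26']
```

Writers can append such a prefix to the S3 key so that Athena queries filtered on year/month/day scan only the matching folders.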
Visual Anchors
Data Cataloging Workflow
S3 Lifecycle Transition Logic
\begin{tikzpicture}[node distance=2cm]
  \node (start) [draw, rectangle, rounded corners] {Object Created (S3 Standard)};
  \node (ia) [draw, rectangle, below of=start, rounded corners] {30 Days: S3 Standard-IA};
  \node (glacier) [draw, rectangle, below of=ia, rounded corners] {90 Days: S3 Glacier};
  \node (end) [draw, circle, below of=glacier, fill=red!20] {365 Days: Expired};
  \draw [->, thick] (start) -- (ia);
  \draw [->, thick] (ia) -- (glacier);
  \draw [->, thick] (glacier) -- (end);
  \node [right of=ia, xshift=2cm] {\tiny Lower Cost / Frequent Access};
  \node [right of=glacier, xshift=2cm] {\tiny Archive / Long-term Storage};
\end{tikzpicture}
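The transition logic in the diagram maps directly onto an S3 lifecycle configuration. A minimal sketch of the request body in the shape accepted by boto3's put_bucket_lifecycle_configuration; the bucket name, rule ID, and prefix are hypothetical, and no AWS call is made here.

```python
# Minimal sketch: S3 lifecycle rule implementing
# Standard -> Standard-IA (30d) -> Glacier (90d) -> Expired (365d).
# Rule ID and prefix are hypothetical placeholders.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "age-out-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it would look like (not executed here):
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle_configuration)
```

Each transition's Days value is measured from object creation, matching the cumulative ages shown in the diagram.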
Definition-Example Pairs
- Term: Heterogeneous Migration
- Definition: Moving data between different database engines where the schema must be converted.
- Example: Migrating an on-premises Microsoft SQL Server database to an Amazon Aurora PostgreSQL cluster using AWS SCT to rewrite the SQL syntax.
- Term: Data Lineage
- Definition: A visual map of the data's journey, showing where it originated and how it was transformed.
- Example: Using Amazon SageMaker ML Lineage Tracking to see which specific S3 dataset was used to train a specific version of an AI model.
Worked Examples
Example 1: Handling Added Columns in a CSV Batch
Scenario: A marketing team adds a promo_code column to their daily CSV upload in S3. Your Athena queries are failing because the Data Catalog doesn't know about this column.
Solution:
- Run the AWS Glue Crawler assigned to that S3 path.
- Set the crawler configuration to "Update the table definition in the data catalog" for any schema changes.
- The Crawler detects the new column and updates the metadata. Athena can now query the new column immediately, without manual ALTER TABLE commands.
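The crawler behavior in this example is controlled by its SchemaChangePolicy. A minimal sketch of the request in the shape of boto3's glue.create_crawler; the crawler name, IAM role, database, and S3 path are hypothetical placeholders, and the call itself is shown commented out.

```python
# Minimal sketch: Glue Crawler configured to add new columns (like promo_code)
# to the catalog table and to deprecate, not drop, removed columns.
# All names, ARNs, and paths below are hypothetical.
crawler_request = {
    "Name": "marketing-csv-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "marketing",
    "Targets": {"S3Targets": [{"Path": "s3://marketing-bucket/daily-uploads/"}]},
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",    # add newly detected columns
        "DeleteBehavior": "DEPRECATE_IN_DATABASE", # keep removed columns, flagged deprecated
    },
}

# glue = boto3.client("glue")
# glue.create_crawler(**crawler_request)
```

UPDATE_IN_DATABASE corresponds to the console option "Update the table definition in the data catalog" referenced in the steps above.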
Example 2: Optimizing DynamoDB Storage Costs
Scenario: A gaming app stores temporary session data in DynamoDB. This data is only needed for 24 hours.
Solution:
- Add a TimeToLive attribute to each item (format: Unix epoch time).
- Enable TTL on the DynamoDB table, selecting that attribute.
- Result: DynamoDB automatically deletes the sessions within 48 hours of expiration, and these deletes do not consume Write Capacity Units (WCU).
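The steps above can be sketched as follows. The item shape and attribute names (session_id, TimeToLive) are hypothetical; the key point is that DynamoDB TTL expects the expiry as Unix epoch seconds in a numeric attribute.

```python
# Minimal sketch: stamp each session item with a TTL attribute set to a
# Unix epoch time 24 hours ahead. Attribute names are hypothetical.
from datetime import datetime, timedelta, timezone

def session_item(session_id, now=None):
    """Build a session item whose TimeToLive expires 24 hours from now."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(hours=24)
    return {
        "session_id": session_id,
        "TimeToLive": int(expires.timestamp()),  # TTL must be epoch seconds
    }

item = session_item("abc123", now=datetime(2024, 1, 1, tzinfo=timezone.utc))
# item["TimeToLive"] == 1704153600  (2024-01-02T00:00:00Z)
```

With TTL enabled on the table and pointed at the TimeToLive attribute, no application code is needed to clean up expired sessions.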
Checkpoint Questions
- What is the difference between AWS SCT and AWS DMS regarding schema changes?
- How does Partition Projection improve performance for highly partitioned data in S3?
- Which S3 feature allows you to recover a file that was overwritten by a script with incorrect data?
- When should you use AWS Glue DataBrew instead of a Glue ETL script?
Comparison Tables
| Feature | AWS Glue Crawler | AWS SCT |
|---|---|---|
| Primary Purpose | Metadata Discovery (S3/RDS/NoSQL) | Schema Conversion (Database-to-Database) |
| Target Output | Glue Data Catalog Tables | SQL DDL Scripts / Converted Schema |
| Handling Change | Detects schema drift automatically | Manual re-run for structural redesigns |
| Use Case | Populating Data Lakes | Database Migrations |
Muddy Points & Cross-Refs
- Crawler vs. Manual Entry: If your schema is extremely stable and you want to prevent unauthorized changes, manual entry is better. Crawlers are best for evolving datasets.
- Partitioning vs. Indexing: In Redshift, use Sort Keys for performance; in S3/Athena, use Partitions (folders) to limit the amount of data scanned.
- S3 Versioning vs. Backup: Versioning is for immediate recovery of specific objects; AWS Backup is for cross-region disaster recovery and compliance-level snapshots.