AWS Data Engineering: Addressing Changes to Data Characteristics

Address changes to the characteristics of data

This guide covers Task 2.4.2 of the AWS Certified Data Engineer – Associate (DEA-C01) exam. It focuses on how data engineers manage the evolving nature of data, including schema drift, structural changes, and lifecycle management within the AWS ecosystem.

Learning Objectives

By the end of this guide, you will be able to:

  • Define Schema Evolution and identify strategies for handling Schema Drift.
  • Configure AWS Glue Crawlers to automatically detect and update metadata.
  • Differentiate between tools used for schema conversion like AWS SCT and AWS DMS.
  • Implement data lifecycle policies in Amazon S3 and Amazon DynamoDB to manage data aging.
  • Establish Data Lineage to track changes across the data environment.

Key Terms & Glossary

  • Schema Drift: The phenomenon where source data systems change their structure (e.g., adding/removing columns) without notifying downstream consumers.
  • Data Catalog: A persistent metadata store (like AWS Glue Data Catalog) that provides a unified view of data across various sources.
  • Partition Projection: A technique in Amazon Athena that speeds up queries on heavily partitioned tables by computing partition values from configuration rather than retrieving partition metadata from the Glue Data Catalog.
  • TTL (Time to Live): A mechanism in DynamoDB that automatically deletes items from a table after a specific timestamp to reduce storage costs.
  • DQDL (Data Quality Definition Language): A declarative language used in AWS Glue to define rules for validating data quality.
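To make the DQDL entry concrete, here is a minimal sketch of a ruleset as the string AWS Glue Data Quality expects. ColumnCount, IsComplete, and Uniqueness are standard DQDL rule types; the column name "order_id" and the thresholds are illustrative assumptions.

```python
# A minimal DQDL ruleset string. The rule types are real DQDL constructs;
# the column name "order_id" and thresholds are illustrative assumptions.
ruleset = """Rules = [
    ColumnCount >= 5,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99
]"""

# The ruleset would be registered against a table with something like:
# glue = boto3.client("glue")
# glue.create_data_quality_ruleset(Name="orders-dq", Ruleset=ruleset, TargetTable=...)
```

The first rule doubles as a schema-drift guard: if an upstream change drops columns, the evaluation fails before bad data propagates downstream.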

The "Big Idea"

In a modern data architecture, change is the only constant. Data characteristics—such as its schema, volume, and velocity—evolve over time. A Data Engineer's primary responsibility is to build resilient pipelines that can gracefully handle these changes without manual intervention. This involves balancing automated discovery (Glue Crawlers) with rigid governance (Lake Formation) and cost-optimized storage (S3 Lifecycle).

Formula / Concept Box

| Concept | Tool / Rule | Impact |
| --- | --- | --- |
| Schema updates | AWS Glue Crawler (UpdateTable) | Automatically adds new columns to the Data Catalog. |
| Structural mapping | AWS Schema Conversion Tool (SCT) | Converts source database schemas to a different target engine (e.g., Oracle to Aurora). |
| Data aging | S3 Lifecycle policies | Automates transitions: S3 Standard → S3 Glacier → Expiration. |
| Item expiration | DynamoDB TTL | Deletes data based on an epoch timestamp attribute without consuming RCU/WCU. |
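The S3 lifecycle row above maps directly to a lifecycle configuration. A minimal sketch, assuming a bucket named "my-data-lake" and a "raw/" prefix (both illustrative); the dictionary shape matches what boto3's put_bucket_lifecycle_configuration accepts, and the actual API call is left as a comment.

```python
# Lifecycle rule mirroring the concept table:
# Standard -> Standard-IA at 30 days -> Glacier at 90 days -> expire at 365 days.
# Bucket name and prefix are illustrative assumptions.
lifecycle_config = {
    "Rules": [
        {
            "ID": "age-out-raw-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake", LifecycleConfiguration=lifecycle_config
# )
```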

Hierarchical Outline

  • I. Schema Evolution & Management
    • AWS Glue Data Catalog: The central metadata repository for AWS Lake House architectures.
    • Glue Crawlers: Automate schema discovery; can be configured to add new columns or mark deleted columns as deprecated.
    • Schema Versioning: Keeping history of schema changes to ensure backward compatibility for Athena/Redshift Spectrum queries.
  • II. Addressing Structural Changes
    • AWS SCT: Used for heterogeneous migrations; transforms schema, functions, and stored procedures.
    • AWS DMS: Performs the actual data movement; can handle simple schema changes during replication.
  • III. Managing Data Characteristics over Time
    • S3 Versioning: Protects against accidental deletes and allows rollbacks to previous states of data.
    • Partitioning Strategies: Using date-based partitioning (year=2023/month=10/day=24) to optimize query performance as data grows.

Visual Anchors

Data Cataloging Workflow


S3 Lifecycle Transition Logic

```latex
\begin{tikzpicture}[node distance=2cm]
  \node (start)   [draw, rectangle, rounded corners] {Object Created (S3 Standard)};
  \node (ia)      [draw, rectangle, rounded corners, below of=start] {30 Days: S3 Standard-IA};
  \node (glacier) [draw, rectangle, rounded corners, below of=ia] {90 Days: S3 Glacier};
  \node (end)     [draw, circle, fill=red!20, below of=glacier] {365 Days: Expired};

  \draw [->, thick] (start) -- (ia);
  \draw [->, thick] (ia) -- (glacier);
  \draw [->, thick] (glacier) -- (end);

  \node [right of=ia, xshift=2cm] {\tiny Lower Cost / Frequent Access};
  \node [right of=glacier, xshift=2cm] {\tiny Archive / Long-term Storage};
\end{tikzpicture}
```

Definition-Example Pairs

  • Term: Heterogeneous Migration
    • Definition: Moving data between different database engines where the schema must be converted.
    • Example: Migrating an on-premises Microsoft SQL Server database to an Amazon Aurora PostgreSQL cluster using AWS SCT to rewrite the SQL syntax.
  • Term: Data Lineage
    • Definition: A visual map of the data's journey, showing where it originated and how it was transformed.
    • Example: Using Amazon SageMaker ML Lineage Tracking to see which specific S3 dataset was used to train a specific version of an AI model.

Worked Examples

Example 1: Handling Added Columns in a CSV Batch

Scenario: A marketing team adds a promo_code column to their daily CSV upload in S3. Athena queries that reference the new column fail because the Data Catalog doesn't know about it yet. Solution:

  1. Run the AWS Glue Crawler assigned to that S3 path.
  2. Set the crawler configuration to "Update the table definition in the data catalog" for any schema changes.
  3. The crawler detects the new column and updates the table metadata. Athena can then query the new column immediately, without manual ALTER TABLE statements.
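The crawler setting in step 2 corresponds to Glue's SchemaChangePolicy. A minimal sketch, assuming a crawler named "marketing-csv-crawler" (illustrative); UPDATE_IN_DATABASE and DEPRECATE_IN_DATABASE are the real enum values for boto3's create_crawler/update_crawler, and the calls themselves are left as comments.

```python
# SchemaChangePolicy matching step 2 above. The behavior values are the real
# Glue enum values; the crawler name is an illustrative assumption.
schema_change_policy = {
    "UpdateBehavior": "UPDATE_IN_DATABASE",    # add newly detected columns (e.g., promo_code)
    "DeleteBehavior": "DEPRECATE_IN_DATABASE", # keep removed columns, marked deprecated
}

# glue = boto3.client("glue")
# glue.update_crawler(Name="marketing-csv-crawler", SchemaChangePolicy=schema_change_policy)
# glue.start_crawler(Name="marketing-csv-crawler")
```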

Example 2: Optimizing DynamoDB Storage Costs

Scenario: A gaming app stores temporary session data in DynamoDB. This data is only needed for 24 hours. Solution:

  1. Add a TimeToLive attribute to each item (format: Unix Epoch time).
  2. Enable TTL on the DynamoDB table, selecting that attribute.
  3. Result: DynamoDB automatically deletes expired sessions (typically within 48 hours of expiration), and these deletes do not consume Write Capacity Units (WCUs).

Checkpoint Questions

  1. What is the difference between AWS SCT and AWS DMS regarding schema changes?
  2. How does Partition Projection improve performance for highly partitioned data in S3?
  3. Which S3 feature allows you to recover a file that was overwritten by a script with incorrect data?
  4. When should you use AWS Glue DataBrew instead of a Glue ETL script?

Comparison Tables

| Feature | AWS Glue Crawler | AWS SCT |
| --- | --- | --- |
| Primary purpose | Metadata discovery (S3/RDS/NoSQL) | Schema conversion (database-to-database) |
| Target output | Glue Data Catalog tables | SQL DDL scripts / converted schema |
| Handling change | Detects schema drift automatically | Manual re-run for structural redesigns |
| Use case | Populating data lakes | Database migrations |

Muddy Points & Cross-Refs

  • Crawler vs. Manual Entry: If your schema is extremely stable and you want to prevent unauthorized changes, manual entry is better. Crawlers are best for evolving datasets.
  • Partitioning vs. Indexing: In Redshift, use Sort Keys for performance; in S3/Athena, use Partitions (folders) to limit the amount of data scanned.
  • S3 Versioning vs. Backup: Versioning is for immediate recovery of specific objects; AWS Backup is for cross-region disaster recovery and compliance-level snapshots.
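The "S3 Versioning vs. Backup" recovery flow can be sketched as: list the object's versions, pick the newest non-current one (the state before the bad overwrite), and copy it back on top. Only the selection logic runs here; the boto3 calls are left as comments, and the bucket/key names are illustrative. In real list_object_versions responses, LastModified is a datetime, which sorts the same way as the integers used below.

```python
def previous_version_id(versions: list[dict]) -> str:
    """Pick the newest non-current version: the object's state before the overwrite."""
    older = [v for v in versions if not v["IsLatest"]]
    older.sort(key=lambda v: v["LastModified"], reverse=True)
    return older[0]["VersionId"]

# resp = s3.list_object_versions(Bucket="my-bucket", Prefix="data.csv")
# s3.copy_object(
#     Bucket="my-bucket", Key="data.csv",
#     CopySource={"Bucket": "my-bucket", "Key": "data.csv",
#                 "VersionId": previous_version_id(resp["Versions"])},
# )
```

Copying the old version forward (rather than deleting the latest) preserves the full version history, including the bad write, for auditing.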
