Establishing Data Lineage with AWS Tools

Data lineage provides a chronological record of data's journey, from its origin through various transformations to its final consumption. In the AWS ecosystem, this is critical for governance, auditing, and debugging.

Learning Objectives

Define the core components of data lineage: provenance, transformation history, and downstream usage.
Identify appropriate AWS services for different lineage use cases (e.g., Business vs. ML-specific).
Configure SageMaker ML Lineage Tracking to ensure model reproducibility.
Explain the role of Amazon DataZone and SageMaker Catalog in unified metadata management.

Key Terms & Glossary

Data Lineage: The process of tracking the flow of data over time, providing visibility into where data originated and what happened to it.
Provenance: The documentation of the origin and source of a specific data asset.
Metadata: Data that provides information about other data (e.g., schema, author, creation date).
Immutable Tracking: A system where records of data movements cannot be changed after being written, essential for compliance.
OpenLineage: An open standard for lineage metadata collection that Amazon DataZone is compatible with.
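
To make the OpenLineage term concrete, here is a minimal sketch of what a lineage run event can look like when built by hand. The job name, namespace, producer URL, and the spec version in `schemaURL` are illustrative assumptions. The `post_event` helper assumes a boto3 `datazone` client and its `post_lineage_event` call, so check the current API reference before relying on it.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example-namespace"):
    """Build a minimal OpenLineage-style COMPLETE run event as a dict."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        # Hypothetical producer identifier; the spec version below is an assumption.
        "producer": "https://example.com/hypothetical-producer",
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


def post_event(datazone_client, domain_id, event):
    """Send the event to a DataZone domain (assumed boto3 'datazone' client)."""
    return datazone_client.post_lineage_event(
        domainIdentifier=domain_id,
        event=json.dumps(event).encode("utf-8"),  # the event is sent as bytes
    )
```

Passing the client in as an argument keeps the event-building logic testable without AWS credentials.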

The "Big Idea"

[!IMPORTANT] Data lineage is the "Audit Trail" for information. Just as a financial audit tracks every dollar, data lineage tracks every byte. It transforms data from a "black box" into a transparent, trustworthy asset. Without lineage, data-driven decisions are built on a foundation of mystery; with it, every insight is defensible and reproducible.

Formula / Concept Box

Use Case	Primary Service	Key Strength
Business Visibility	Amazon DataZone	API-driven, OpenLineage compatible, business glossary integration.
ML Governance	SageMaker ML Lineage	Tracks artifacts (data, models) and actions (training, deployment).
Technical ETL	AWS Glue + Spline	Captures runtime lineage from Spark jobs at the column level.
Cross-Account Discovery	SageMaker Catalog	Centralized discovery for lakehouses and federated sources.

Hierarchical Outline

Amazon DataZone Lineage
- OpenLineage Integration: Compatible with open standards for cross-platform visibility.
- Visual Discovery: Graphical interface for navigating relationships between assets.
- Scope: Covers assets in the business data catalog, the subscribers of those assets, and activity outside the catalog that is submitted programmatically through the API.
Amazon SageMaker ML Lineage Tracking
- Artifact Tracking: Stores information on data preparation, training, and model artifacts.
- Reproducibility: Essential for auditing model performance and regulatory compliance.
- Integration: Natively works with SageMaker Pipelines to automate lineage capture.
Amazon SageMaker Catalog
- Unified Metadata: Integrates with Lake Formation for fine-grained access control.
- Automation: Uses GenAI to add business context to table attributes automatically.
- Storage Support: Handles Redshift, S3 table buckets, and federated sources like Snowflake.
Custom Lineage Solutions
- Graph-Based: Using AWS Glue (ETL) + Spline (Agent) + Amazon Neptune (Graph Database).
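
SageMaker Pipelines records lineage automatically, but the artifact tracking in the outline can also be done by hand. The sketch below assumes a boto3 `sagemaker` client and uses its `create_artifact` and `add_association` calls. The artifact names and the choice of the `ContributedTo` association type are illustrative assumptions.

```python
def record_dataset_to_model(sm_client, dataset_uri, model_uri):
    """Register a dataset and a model as lineage artifacts and link them."""
    dataset = sm_client.create_artifact(
        ArtifactName="training-dataset",  # hypothetical name
        Source={"SourceUri": dataset_uri},
        ArtifactType="DataSet",
    )
    model = sm_client.create_artifact(
        ArtifactName="trained-model",  # hypothetical name
        Source={"SourceUri": model_uri},
        ArtifactType="Model",
    )
    # Link the two artifacts: the dataset contributed to the model.
    sm_client.add_association(
        SourceArn=dataset["ArtifactArn"],
        DestinationArn=model["ArtifactArn"],
        AssociationType="ContributedTo",
    )
    return dataset["ArtifactArn"], model["ArtifactArn"]
```

In practice the function would be called with `boto3.client("sagemaker")` and real S3 URIs. Taking the client as a parameter lets the same code run against a stub in tests.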

Visual Anchors

ML Workflow Lineage Flowchart


Custom Lineage Architecture


Definition-Example Pairs

Entity: A specific object in the lineage (e.g., a file, a job, or a model).
- Example: A CSV file stored in s3://raw-data-bucket/sales.csv is an entity.
Action: A transformation step that links two entities.
- Example: An AWS Glue job that converts that CSV into Parquet format is the action.
Relationship: The connection established between entities and actions.
- Example: The Parquet file "is derived from" the CSV file via the Glue job.
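
The entity, action, and relationship definitions above can be modeled with a small in-memory graph. This is a teaching sketch, not how any AWS service stores lineage. The Glue job name and the curated bucket path are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    # Maps each derived entity to a list of (source_entity, action) pairs.
    derived_from: dict = field(default_factory=dict)

    def record(self, source, action, target):
        """Record that `target` is derived from `source` via `action`."""
        self.derived_from.setdefault(target, []).append((source, action))

    def upstream(self, entity):
        """Return every ancestor of `entity`, nearest first."""
        seen, queue = [], [entity]
        while queue:
            current = queue.pop(0)
            for source, _action in self.derived_from.get(current, []):
                if source not in seen:
                    seen.append(source)
                    queue.append(source)
        return seen


graph = LineageGraph()
graph.record(
    "s3://raw-data-bucket/sales.csv",          # entity (source)
    "glue-csv-to-parquet",                     # action (hypothetical job name)
    "s3://curated-data-bucket/sales.parquet",  # entity (hypothetical target)
)
print(graph.upstream("s3://curated-data-bucket/sales.parquet"))
# prints ['s3://raw-data-bucket/sales.csv']
```

A dedicated graph database such as Amazon Neptune plays the role of `derived_from` at scale, where traversals span many jobs and datasets.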

Worked Examples

Scenario: Auditing a Biased ML Model

Identify the Issue: A production model is showing unexpected bias in predictions.
Trace with SageMaker ML Lineage: Use the QueryLineage API (exposed in the SageMaker Python SDK as the LineageQuery class, which returns a LineageQueryResult) to find the Model Artifact.
Backtrack to Training: Trace the artifact back to the specific training job ID.
Locate Input Data: Identify the exact S3 URI of the training dataset used for that specific job run.
Root Cause Analysis: Inspect the training data and find that a specific demographic was underrepresented, which points to the likely source of the bias.
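
The tracing steps in this scenario can be sketched with the boto3 `sagemaker` client's `query_lineage` and `describe_artifact` calls. This assumes the dataset was registered as an artifact of type `DataSet` and that results fit in one page (pagination via `NextToken` is omitted). The function name and the depth limit are illustrative.

```python
def find_training_dataset_uris(sm_client, model_artifact_arn, max_depth=10):
    """Walk upstream from a model artifact and return dataset source URIs."""
    response = sm_client.query_lineage(
        StartArns=[model_artifact_arn],
        Direction="Ascendants",  # follow lineage toward the model's inputs
        IncludeEdges=False,
        Filters={"LineageTypes": ["Artifact"], "Types": ["DataSet"]},
        MaxDepth=max_depth,
    )
    uris = []
    for vertex in response.get("Vertices", []):
        # Look up each dataset artifact to get the URI it was registered with.
        detail = sm_client.describe_artifact(ArtifactArn=vertex["Arn"])
        uris.append(detail["Source"]["SourceUri"])
    return uris
```

Called with `boto3.client("sagemaker")` and the ARN of the model artifact under investigation, this would return the S3 URIs to inspect during root cause analysis.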

Checkpoint Questions

Which service is best suited for providing a non-technical business user with a view of data provenance? (Answer: Amazon DataZone)
How does SageMaker ML Lineage Tracking assist with regulatory compliance? (Answer: By providing a reproducible, auditable record of every step from data prep to model deployment)
What technology combination is used for custom, highly-connected lineage graphs in AWS? (Answer: AWS Glue, Spline, and Amazon Neptune)

Comparison Tables

Feature	SageMaker Catalog	Amazon DataZone	AWS Glue Data Catalog
Primary User	ML Engineers / Data Scientists	Business Data Consumers	Data Engineers
Key Content	ML Models, AI Workflows	Business Assets, Glossary	Technical Schemas, Partitions
Governance	Integrated with Lake Formation	Subscription-based workflow	IAM & Lake Formation
Automation	GenAI for attributes	Automated asset discovery	Glue Crawlers

Muddy Points & Cross-Refs

SageMaker Catalog vs. DataZone: Users often confuse these. Tip: If the task involves ML model governance and AI lifecycle, think SageMaker Catalog. If it involves business units sharing data across the organization with a common glossary, think DataZone.
Spline Agent: Note that Spline is not an AWS service; it is an open-source tool often used with AWS Glue to capture granular Spark metadata that Glue doesn't natively expose in its standard catalog.