AWS Data Governance Frameworks and Sharing Patterns

This guide covers the architectural patterns and AWS services used to manage, govern, and share data across accounts and organizations, specifically focused on the AWS Certified Data Engineer - Associate (DEA-C01) curriculum.

Learning Objectives

Differentiate between Centralized (Hub-and-Spoke) and Data Mesh governance models.
Identify the correct AWS services for cross-account and B2B data sharing (Lake Formation, DataZone, Clean Rooms).
Explain the four core principles of a Data Mesh.
Describe mechanisms for live data sharing without ETL, such as Redshift Data Sharing and Federated Queries.
Implement strategies for data sovereignty and PII protection.

Key Terms & Glossary

Data Mesh: A decentralized architectural pattern where data is treated as a product and managed by specific business domains rather than a central team.
PII (Personally Identifiable Information): Sensitive data that can identify an individual; discovered with tools like Amazon Macie and protected with Lake Formation column-, row-, and cell-level filtering.
Data Sovereignty: The concept that data is subject to the laws and governance structures of the nation or region in which it is collected.
Zero-ETL: An AWS approach in which managed integrations move data between services, or let one service query another directly, so that teams do not have to build and maintain their own ETL pipelines.
Publisher-Subscriber Workflow: A governance pattern where data producers "publish" assets to a catalog, and consumers "subscribe" to gain authorized access.

The "Big Idea"

Modern data engineering is shifting from Centralized Monoliths (one team managing a massive lake) to Federated Governance. The goal is to empower business units to own their data while providing a central "Guardrail" layer that ensures compliance, security, and discoverability across the entire organization.

Formula / Concept Box

Feature	Centralized (Hub-and-Spoke)	Data Mesh (Decentralized)
Ownership	Central IT / Data Team	Business Units (Domains)
Scalability	High, but the central team can become a bottleneck	Very High (work is distributed across domains)
Tooling	Lake Formation (Central)	DataZone + Lake Formation (Local)
Data Sharing	Cross-account grants	Publisher/Subscriber model

Hierarchical Outline

I. Governance Frameworks
- Centralized (Hub-and-Spoke): A central account holds the Data Lake; consumer accounts request access via Lake Formation.
- Data Mesh: Decentralized domains that own and publish their data; commonly paired with Amazon DataZone for a unified business catalog.
II. Data Sharing Patterns
- Single Account: Managed via IAM and Lake Formation fine-grained access control (FGAC).
- Cross-Account: Uses Lake Formation cross-account grants or Redshift Data Sharing.
- Cross-Organization (B2B): Uses AWS Clean Rooms for privacy-safe collaboration.
III. Compliance & Protection
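- PII Masking: Using Amazon Macie to discover PII and Lake Formation data filters to hide the affected columns, rows, or cells from unauthorized principals.
- Data Residency: Region restrictions (for example, policies that deny requests to non-approved Regions) keep data within legal boundaries; S3 Lifecycle policies govern how long the data is retained there.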
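
One way to picture the Region restriction in the outline above is a deny-style guardrail keyed on the `aws:RequestedRegion` condition key. The sketch below only builds the policy document locally; the approved Region list, the `Sid`, and the choice to scope the statement to S3 actions are assumptions made for illustration, not a recommended production policy.

```python
import json

# Hypothetical list of Regions approved for storing this data.
APPROVED_REGIONS = ["eu-central-1", "eu-west-1"]


def build_region_guardrail(approved_regions):
    """Return a policy document that denies S3 actions outside the approved Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyS3OutsideApprovedRegions",  # illustrative name
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(approved_regions)
                    }
                },
            }
        ],
    }


def is_region_allowed(policy, region):
    """Local helper that mirrors the Deny condition for a quick sanity check."""
    condition = policy["Statement"][0]["Condition"]["StringNotEquals"]
    return region in condition["aws:RequestedRegion"]


policy = build_region_guardrail(APPROVED_REGIONS)
print(json.dumps(policy, indent=2))
```

A document like this would typically be attached as a service control policy at the organization level, so that individual accounts cannot opt out of the guardrail.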

Visual Anchors

[Diagram: Data Sharing Architectures (not rendered in this copy)]

[Diagram: Data Governance Lifecycle (not rendered in this copy)]

Definition-Example Pairs

Federated Query: Querying data across different sources (S3, RDS, Aurora) without moving it to a central warehouse.
- Example: An Amazon Redshift cluster querying live customer data sitting in an Amazon Aurora PostgreSQL database using an External Schema.
Data Lineage: The trackable history of data from its origin through all transformations.
- Example: Using Amazon SageMaker ML Lineage Tracking to see which S3 dataset was used to train a specific model version.
Fine-Grained Access Control (FGAC): Restricting data access at the row or column level rather than the whole table.
- Example: A policy in Lake Formation that allows an HR analyst to see the employee_name column but hides the salary column.
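
The FGAC example above can be sketched as a column-level grant: the analyst receives SELECT on the table with the salary column excluded. The code only builds keyword arguments shaped like Lake Formation's GrantPermissions request; the role ARN, database name, and table name are hypothetical, and nothing is sent to AWS.

```python
def build_column_grant(principal_arn, database, table, hidden_columns):
    """Build kwargs for a SELECT grant that excludes the given columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant every column except the ones listed here.
                "ColumnWildcard": {"ExcludedColumnNames": list(hidden_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


grant = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/HrAnalyst",  # hypothetical role
    database="hr",              # hypothetical Glue database
    table="employees",          # hypothetical table
    hidden_columns=["salary"],
)
print(grant)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Excluding columns, rather than listing the allowed ones, means newly added non-sensitive columns become visible automatically; listing allowed columns is the stricter choice when new columns may contain PII.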

Worked Examples

Goal: Share a specific schema from a Producer Redshift cluster to a Consumer cluster in a different account without copying data.

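On Producer Cluster: Create a datashare and add objects.
    CREATE DATASHARE sales_share;
    ALTER DATASHARE sales_share ADD SCHEMA public;
    ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;
Authorize Consumer: Grant access to the consumer's AWS Account ID. For cross-account shares, a producer administrator must also authorize the datashare, and a consumer administrator must associate it, before the consumer can use it.
    GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '123456789012';
On Consumer Cluster: Create a local database from the share. Cross-account shares also need the producer's namespace; the value below is a placeholder.
    -- Replace the placeholder with the producer cluster's namespace GUID.
    CREATE DATABASE sales_db FROM DATASHARE sales_share OF ACCOUNT '987654321098' NAMESPACE '<producer-namespace-guid>';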
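
As a small aid for repeating these steps across environments, the sketch below renders the producer and consumer statements from parameters and rejects malformed account IDs. The function name and the namespace placeholder are assumptions for illustration; identifiers are interpolated as-is, so the helper should only be fed trusted values.

```python
import re


def datashare_statements(share, schema, consumer_db,
                         consumer_account, producer_account, producer_namespace):
    """Return (producer_sql, consumer_sql) for a cross-account Redshift datashare."""
    for account in (consumer_account, producer_account):
        if not re.fullmatch(r"\d{12}", account):
            raise ValueError(f"AWS account IDs are 12 digits, got {account!r}")
    producer_sql = [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD ALL TABLES IN SCHEMA {schema};",
        f"GRANT USAGE ON DATASHARE {share} TO ACCOUNT '{consumer_account}';",
    ]
    consumer_sql = [
        f"CREATE DATABASE {consumer_db} FROM DATASHARE {share} "
        f"OF ACCOUNT '{producer_account}' NAMESPACE '{producer_namespace}';",
    ]
    return producer_sql, consumer_sql


producer_sql, consumer_sql = datashare_statements(
    share="sales_share",
    schema="public",
    consumer_db="sales_db",
    consumer_account="123456789012",
    producer_account="987654321098",
    producer_namespace="<producer-namespace-guid>",  # placeholder, not a real GUID
)
for statement in producer_sql + consumer_sql:
    print(statement)
```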

Checkpoint Questions

Which AWS service is best suited for sharing data with a third-party partner while ensuring neither party can see the other's raw data? (Answer: AWS Clean Rooms)
What are the four pillars of a Data Mesh? (Answer: Domain-driven ownership, Data as a product, Federated governance, Self-serve data platform)
How can you automatically identify PII in an S3 Data Lake? (Answer: Use Amazon Macie to run sensitive data discovery on the S3 buckets, then use the findings to decide which tables and columns to restrict in Lake Formation)
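
The Macie answer above can be sketched as a one-time classification job over the data lake buckets. The code only builds keyword arguments shaped like Macie's CreateClassificationJob request; the account ID, bucket names, and job name are hypothetical, and nothing is sent to AWS.

```python
import uuid


def build_macie_job(account_id, buckets, name):
    """Build kwargs for a one-time Macie sensitive data discovery job."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        "name": name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(buckets)}
            ]
        },
    }


job = build_macie_job(
    account_id="111122223333",                     # hypothetical account
    buckets=["example-data-lake-raw"],             # hypothetical bucket
    name="pii-scan-data-lake",                     # hypothetical job name
)
print(job)

# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("macie2").create_classification_job(**job)
```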

Comparison Tables

Tool	Primary Purpose	Sharing Scope
AWS Lake Formation	Centralized security & permissions	Single/Multi-account
Amazon DataZone	Business-level cataloging & governance	Cross-domain/Data Mesh
AWS Clean Rooms	Privacy-safe data collaboration	B2B / Cross-Org
AWS Glue Data Catalog	Technical metadata repository	Intra-account/Shared Lake

Muddy Points & Cross-Refs

Lake Formation vs. IAM: Remember that Lake Formation complements IAM. IAM controls API access (Can I call StartCrawler?), while Lake Formation controls data access (Can I see Table_A?).
Data Sovereignty vs. Data Residency: Sovereignty is about legal jurisdiction (for example, GDPR or CCPA), while residency is the physical location where the data is stored (the AWS Region). Cross-Region replication must be carefully managed so that copies do not leave the permitted jurisdiction.
Cross-Reference: See "Unit 2: Data Store Management" for details on how S3 Lifecycle policies support the "Retire" phase of governance.
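
The Lake Formation vs. IAM point above can be made concrete with a toy model in which a read succeeds only when both layers agree. The dictionaries below merely stand in for IAM policies and Lake Formation grants; the principals, database, and table names are hypothetical, and this is not how AWS evaluates policies internally.

```python
# Toy stand-in for IAM: which API actions each principal may call.
IAM_ALLOWED_ACTIONS = {
    "analyst": {"glue:GetTable", "lakeformation:GetDataAccess"},
    "operator": {"glue:StartCrawler"},
}

# Toy stand-in for Lake Formation: table-level grants per principal.
LF_TABLE_GRANTS = {
    "analyst": {("sales", "Table_A"): {"SELECT"}},
}


def can_call_api(principal, action):
    """IAM layer: may this principal call the API action at all?"""
    return action in IAM_ALLOWED_ACTIONS.get(principal, set())


def can_read_table(principal, database, table):
    """Data layer: needs the IAM data-access action AND a Lake Formation grant."""
    has_api_access = can_call_api(principal, "lakeformation:GetDataAccess")
    grants = LF_TABLE_GRANTS.get(principal, {}).get((database, table), set())
    return has_api_access and "SELECT" in grants


print(can_call_api("operator", "glue:StartCrawler"))   # IAM question
print(can_read_table("operator", "sales", "Table_A"))  # data question
print(can_read_table("analyst", "sales", "Table_A"))
```

In this model the operator can start crawlers but cannot read Table_A, which is the distinction the muddy point is drawing.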
