AWS Data Governance Frameworks and Sharing Patterns
Describe data governance frameworks and data sharing patterns
This guide covers the architectural patterns and AWS services used to manage, govern, and share data across accounts and organizations, specifically focused on the AWS Certified Data Engineer - Associate (DEA-C01) curriculum.
Learning Objectives
- Differentiate between Centralized, Hub-and-Spoke, and Data Mesh governance models.
- Identify the correct AWS services for cross-account and B2B data sharing (Lake Formation, DataZone, Clean Rooms).
- Explain the four core principles of a Data Mesh.
- Describe mechanisms for live data sharing without ETL, such as Redshift Data Sharing and Federated Queries.
- Implement strategies for data sovereignty and PII protection.
Key Terms & Glossary
- Data Mesh: A decentralized architectural pattern where data is treated as a product and managed by specific business domains rather than a central team.
- PII (Personally Identifiable Information): Sensitive data that can identify an individual; discovered with tools like Amazon Macie and restricted with Lake Formation column-, row-, and cell-level filters.
- Data Sovereignty: The principle that data is subject to the laws and governance structures of the nation or region in which it is collected.
- Zero-ETL: A set of AWS-managed integrations (for example, Amazon Aurora to Amazon Redshift) that replicate data or query it in place, so teams do not have to build and maintain custom ETL pipelines.
- Publisher-Subscriber Workflow: A governance pattern where data producers "publish" assets to a catalog, and consumers "subscribe" to gain authorized access.
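The publisher-subscriber workflow in the glossary can be pictured with a toy in-memory model. This is only a sketch of the idea: the `Catalog` class and its methods are hypothetical names invented for illustration and are not an Amazon DataZone API.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog; every name here is hypothetical, not a DataZone API."""
    assets: dict = field(default_factory=dict)    # asset name -> owning domain
    requests: dict = field(default_factory=dict)  # (consumer, asset) -> status

    def publish(self, domain: str, asset: str) -> None:
        # A producer domain lists an asset so consumers can discover it.
        self.assets[asset] = domain

    def subscribe(self, consumer: str, asset: str) -> None:
        # A consumer requests access; nothing is granted yet.
        if asset not in self.assets:
            raise KeyError(f"{asset} is not published")
        self.requests[(consumer, asset)] = "PENDING"

    def approve(self, domain: str, consumer: str, asset: str) -> None:
        # Only the owning domain may approve a pending request for its asset.
        if self.assets.get(asset) != domain:
            raise PermissionError(f"{domain} does not own {asset}")
        if (consumer, asset) not in self.requests:
            raise KeyError(f"{consumer} has not requested {asset}")
        self.requests[(consumer, asset)] = "APPROVED"

    def can_read(self, consumer: str, asset: str) -> bool:
        return self.requests.get((consumer, asset)) == "APPROVED"


catalog = Catalog()
catalog.publish("sales", "orders")
catalog.subscribe("marketing", "orders")
print(catalog.can_read("marketing", "orders"))  # False: request is still pending
catalog.approve("sales", "marketing", "orders")
print(catalog.can_read("marketing", "orders"))  # True: the owning domain approved
```

The point to take away is that access is granted by the data owner per subscription request, not handed out by a central team.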
The "Big Idea"
Modern data engineering is shifting from Centralized Monoliths (one team managing a massive lake) to Federated Governance. The goal is to empower business units to own their data while providing a central "Guardrail" layer that ensures compliance, security, and discoverability across the entire organization.
Formula / Concept Box
| Feature | Centralized (Hub-and-Spoke) | Data Mesh (Decentralized) |
|---|---|---|
| Ownership | Central IT / Data Team | Business Units (Domains) |
| Scalability | High, but the central team can become a bottleneck | Very High (Distributed) |
| Tooling | Lake Formation (Central) | DataZone + Lake Formation (Local) |
| Data Sharing | Cross-account grants | Publisher/Subscriber model |
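As a rough illustration of the "Cross-account grants" cell, the sketch below builds the keyword arguments for the Lake Formation `GrantPermissions` API, where the principal is the consumer's AWS account ID. The account IDs, database, and table names are placeholders chosen for this example, and the actual API call is left as a comment because it needs boto3 and data lake administrator credentials.

```python
def build_cross_account_grant(producer_account, consumer_account,
                              database, table, permissions=("SELECT",)):
    """Build kwargs for Lake Formation GrantPermissions (cross-account table grant)."""
    return {
        # For a cross-account grant the principal is the consumer's account ID.
        "Principal": {"DataLakePrincipalIdentifier": consumer_account},
        "Resource": {
            "Table": {
                "CatalogId": producer_account,  # the Data Catalog lives in the producer account
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": list(permissions),
        "PermissionsWithGrantOption": [],  # consumer may not re-grant in this sketch
    }


# Placeholder account IDs and names, assumed for illustration only.
grant = build_cross_account_grant("111122223333", "444455556666", "sales_db", "orders")
print(grant["Principal"])

# To apply it (assumes boto3 is installed and credentials are configured):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
```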
Hierarchical Outline
- I. Governance Frameworks
- Centralized (Hub-and-Spoke): A central account holds the Data Lake; consumer accounts request access via Lake Formation.
- Data Mesh: Decentralized domains own and publish their data; typically uses Amazon DataZone for a unified business catalog, with Lake Formation enforcing permissions in each domain.
- II. Data Sharing Patterns
- Single Account: Managed via IAM and Lake Formation fine-grained access control (FGAC).
- Cross-Account: Uses Lake Formation cross-account grants or Redshift Data Sharing.
- Cross-Organization (B2B): Uses AWS Clean Rooms for privacy-safe collaboration.
- III. Compliance & Protection
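- PII Protection: Using Amazon Macie to discover PII and Lake Formation column-, row-, and cell-level filters to restrict who can read it.
- Data Residency: Region restrictions (for example, service control policies keyed on `aws:RequestedRegion`) and tightly scoped S3 replication keep data within legal boundaries; S3 Lifecycle policies handle retention and deletion.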
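One common way to express the Region restrictions mentioned in the outline is a service control policy that denies requests outside an allow-list of Regions. The sketch below only builds and prints such a policy document. The Region list and the short set of exempted global services are assumptions for illustration; a production policy would need a fuller exemption list.

```python
import json


def build_region_guardrail(allowed_regions):
    """Return an SCP-style policy denying requests outside the allowed Regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                # Global services are exempted; this short list is illustrative only.
                "NotAction": ["iam:*", "organizations:*", "sts:*", "support:*"],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(allowed_regions)
                    }
                },
            }
        ],
    }


# Assumed example: data must stay in two EU Regions.
policy = build_region_guardrail({"eu-central-1", "eu-west-1"})
print(json.dumps(policy, indent=2))
```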
Visual Anchors
Data Sharing Architectures
Data Governance Lifecycle
\begin{tikzpicture}[node distance=2cm, auto]
  \node [circle, draw, fill=blue!10] (collect) {Collect};
  \node [circle, draw, fill=green!10, right of=collect, xshift=1.5cm] (catalog) {Catalog};
  \node [circle, draw, fill=orange!10, right of=catalog, xshift=1.5cm] (share) {Share};
  \node [circle, draw, fill=red!10, below of=catalog] (archive) {Retire};
  \draw [->, thick] (collect) -- (catalog);
  \draw [->, thick] (catalog) -- (share);
  \draw [->, thick] (share) -- (archive);
  \draw [->, thick, dashed] (archive) -- (collect);
  \node [below of=collect, yshift=1cm] {\tiny Source Data};
  \node [below of=share, yshift=1cm] {\tiny Redshift/Lake Formation};
\end{tikzpicture}
Definition-Example Pairs
- Federated Query: Querying data across different sources (S3, RDS, Aurora) without moving it to a central warehouse.
- Example: An Amazon Redshift cluster querying live customer data sitting in an Amazon Aurora PostgreSQL database using an External Schema.
- Data Lineage: The trackable history of data from its origin through all transformations.
- Example: Using Amazon SageMaker ML Lineage Tracking to see which S3 dataset was used to train a specific model version.
- Fine-Grained Access Control (FGAC): Restricting data access at the row or column level rather than the whole table.
- Example: A policy in Lake Formation that allows an HR analyst to see the `employee_name` column but hides the `salary` column.
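The HR analyst example can be sketched as a Lake Formation data cells filter that keeps all rows but excludes one column. The code below builds the request for the `CreateDataCellsFilter` API and simulates its effect on a column list locally. The catalog ID, database, table, and filter name are hypothetical, and `visible_columns` is a helper written for this example, not part of any AWS SDK.

```python
def build_hr_filter(catalog_id, database, table):
    """Build kwargs for Lake Formation CreateDataCellsFilter (hide the salary column)."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": "hr_analyst_no_salary",           # hypothetical filter name
            "RowFilter": {"AllRowsWildcard": {}},      # no row restriction
            "ColumnWildcard": {"ExcludedColumnNames": ["salary"]},
        }
    }


def visible_columns(all_columns, filter_request):
    """Local simulation: which columns remain after the filter's exclusions."""
    excluded = set(filter_request["TableData"]["ColumnWildcard"]["ExcludedColumnNames"])
    return [c for c in all_columns if c not in excluded]


hr_filter = build_hr_filter("111122223333", "hr_db", "employees")
print(visible_columns(["employee_id", "employee_name", "salary"], hr_filter))
# -> ['employee_id', 'employee_name']

# To create the filter (assumes boto3 and suitable credentials):
#   import boto3
#   boto3.client("lakeformation").create_data_cells_filter(**hr_filter)
```

The analyst would then be granted `SELECT` on the filter rather than on the whole table.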
Worked Examples
Scenario: Setting up Amazon Redshift Data Sharing
Goal: Share a specific schema from a Producer Redshift cluster to a Consumer cluster in a different account without copying data.
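- On Producer Cluster: Create a datashare and add objects.
  ```sql
  CREATE DATASHARE sales_share;
  ALTER DATASHARE sales_share ADD SCHEMA public;
  ALTER DATASHARE sales_share ADD ALL TABLES IN SCHEMA public;
  ```
- Authorize Consumer: Grant access to the consumer's AWS account ID (here `123456789012`). For cross-account shares, a producer administrator must also authorize the datashare, and a consumer administrator must associate it, before it can be used.
  ```sql
  GRANT USAGE ON DATASHARE sales_share TO ACCOUNT '123456789012';
  ```
- On Consumer Cluster: Create a local database from the share, referencing the producer's account ID and namespace.
  ```sql
  -- '987654321098' is the producer account; replace the placeholder with the producer's namespace GUID.
  CREATE DATABASE sales_db FROM DATASHARE sales_share OF ACCOUNT '987654321098' NAMESPACE '<producer-namespace-guid>';
  ```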
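Cross-account datashares also involve an authorization step on the producer side and an association step on the consumer side, which happen outside SQL. The sketch below builds the arguments for the Redshift `AuthorizeDataShare` and `AssociateDataShareConsumer` APIs, reusing the account IDs from the scenario. The Region and the all-zero namespace GUID are placeholders assumed for illustration.

```python
def datashare_arn(region, producer_account, namespace_guid, share_name):
    """Compose the datashare ARN from the producer's Region, account, and namespace."""
    return (f"arn:aws:redshift:{region}:{producer_account}:"
            f"datashare:{namespace_guid}/{share_name}")


def authorize_request(arn, consumer_account):
    # kwargs for redshift.authorize_data_share, run by a producer administrator
    return {"DataShareArn": arn, "ConsumerIdentifier": consumer_account}


def associate_request(arn):
    # kwargs for redshift.associate_data_share_consumer, run by a consumer administrator
    return {"DataShareArn": arn, "AssociateEntireAccount": True}


# Placeholder Region and namespace GUID; account IDs follow the scenario above.
arn = datashare_arn("us-east-1", "987654321098",
                    "00000000-0000-0000-0000-000000000000", "sales_share")
print(authorize_request(arn, "123456789012"))
print(associate_request(arn))

# To apply them (assumes boto3 and credentials in the respective accounts):
#   import boto3
#   boto3.client("redshift").authorize_data_share(**authorize_request(arn, "123456789012"))
#   boto3.client("redshift").associate_data_share_consumer(**associate_request(arn))
```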
Checkpoint Questions
- Which AWS service is best suited for sharing data with a third-party partner while ensuring neither party can see the other's raw data? (Answer: AWS Clean Rooms)
- What are the four pillars of a Data Mesh? (Answer: Domain-driven ownership, Data as a product, Federated governance, Self-serve data platform)
- How can you automatically identify PII in an S3 Data Lake? (Answer: Use Amazon Macie to scan objects and integrate with Lake Formation)
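For the Macie answer, a one-time sensitive data discovery job is the usual starting point. The sketch below builds the arguments for the Macie `CreateClassificationJob` API. The account ID, bucket name, and job name are hypothetical, and the call itself is left as a comment because it requires boto3 and an account with Macie enabled.

```python
import uuid


def build_macie_job(account_id, buckets, name="pii-discovery"):
    """Build kwargs for Macie CreateClassificationJob (one-time scan of S3 buckets)."""
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        "name": name,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(buckets)}
            ]
        },
    }


# Hypothetical account and bucket, assumed for illustration only.
job = build_macie_job("111122223333", ["example-data-lake-raw"])
print(job["s3JobDefinition"])

# To start the job (assumes boto3 and Macie enabled in the account):
#   import boto3
#   boto3.client("macie2").create_classification_job(**job)
```

Findings from such a job can then inform which columns or rows to restrict with Lake Formation permissions.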
Comparison Tables
| Tool | Primary Purpose | Sharing Scope |
|---|---|---|
| AWS Lake Formation | Centralized security & permissions | Single/Multi-account |
| Amazon DataZone | Business-level cataloging & governance | Cross-domain/Data Mesh |
| AWS Clean Rooms | Privacy-safe data collaboration | B2B / Cross-Org |
| AWS Glue Data Catalog | Technical metadata repository | Intra-account/Shared Lake |
Muddy Points & Cross-Refs
- Lake Formation vs. IAM: Remember that Lake Formation complements IAM. IAM controls API access (Can I call `StartCrawler`?), while Lake Formation controls data access (Can I see `Table_A`?).
- Data Sovereignty vs. Data Residency: Sovereignty is about legal jurisdiction (GDPR/CCPA), while residency is the physical location (Region). Cross-Region replication must be carefully managed to satisfy sovereignty.
- Cross-Reference: See "Unit 2: Data Store Management" for details on how S3 Lifecycle policies support the "Retire" phase of governance.