Mastering AWS Data Schemas: Redshift, DynamoDB, and Lake Formation
Design schemas for Amazon Redshift, DynamoDB, and Lake Formation
This study guide covers the architectural principles and schema design patterns for the primary data storage and governance services in AWS: Amazon Redshift (OLAP), Amazon DynamoDB (NoSQL), and AWS Lake Formation (Data Lake Governance).
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between Star and Snowflake schemas for Redshift analytical workloads.
- Select appropriate distribution styles and sort keys for Redshift table optimization.
- Design DynamoDB schemas based on specific application access patterns using Partition and Sort keys.
- Implement fine-grained access control and centralized metadata management using AWS Lake Formation.
- Apply schema evolution techniques using AWS Glue and the Schema Conversion Tool (SCT).
Key Terms & Glossary
- OLAP (Online Analytical Processing): Databases optimized for complex queries and data analysis rather than transactional updates (e.g., Redshift).
- Denormalization: The process of combining tables to reduce joins, typically used in Redshift to improve read performance.
- GSI (Global Secondary Index): An index with a partition key and a sort key that can be different from those on the base DynamoDB table.
- TTL (Time to Live): A DynamoDB feature that automatically deletes items after a specific timestamp to manage data lifecycle.
- Data Catalog: A persistent metadata store that contains table definitions and job statistics (e.g., AWS Glue Data Catalog).
- PII (Personally Identifiable Information): Sensitive data that requires masking or restricted access, often managed via Lake Formation.
The "Big Idea"
In the AWS ecosystem, schema design is driven by the access pattern, not just the data structure. In Redshift, we optimize for massive aggregations over billions of rows. In DynamoDB, we optimize for single-digit millisecond responses to specific queries. In Lake Formation, we focus on decoupling storage (S3) from security, creating a unified governance layer that works across multiple analytical tools.
Formula / Concept Box
| Feature | Amazon Redshift | Amazon DynamoDB | AWS Lake Formation |
|---|---|---|---|
| Primary Use | Complex Analytics / Data Warehousing | High-performance NoSQL / Apps | Data Lake Governance / Security |
| Schema Strategy | Star / Snowflake (Structured) | Schema-less (Access Pattern Driven) | Metadata Mapping (S3-based) |
| Optimization | Distribution & Sort Keys | Partition & Sort Keys (GSI/LSI) | Partition Projection / Blueprints |
| Scaling | Provisioned/Serverless Clusters | Provisioned/On-Demand Capacity | Scalable S3 Storage |
Hierarchical Outline
- Amazon Redshift: Analytical Schema Design
  - Dimensional Modeling: Fact tables (quantitative) vs. Dimension tables (descriptive).
  - Schema Patterns:
    - Star Schema: Highly denormalized, fewer joins, faster reads.
    - Snowflake Schema: Normalized dimensions; saves space but increases join complexity.
  - Physical Optimization:
    - Distribution Styles: KEY, ALL, EVEN, and AUTO.
    - Sort Keys: Compound (hierarchical) vs. Interleaved (equal weight).
    - Compression: Columnar storage encodings (ZSTD, LZO) to reduce I/O.
- Amazon DynamoDB: NoSQL Modeling
  - Key Selection: Partition Key (PK) for distribution; Sort Key (SK) for range queries.
  - Indexing: GSIs for cross-partition queries; LSIs for same-partition alternate sorts.
  - Lifecycle: Using TTL to automate data expiration and reduce costs.
- AWS Lake Formation: Governance & Discovery
  - Centralized Metadata: Using Glue Crawlers to discover schemas from S3/RDS.
  - Security: Column-level and cell-level permissions; PII identification with Macie.
  - Schema Evolution: Handling changing data structures with Glue Crawlers; Partition Projection covers new partitions without catalog updates.
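The column-level permission idea from the Security bullet can be sketched as a request payload. This is a minimal sketch, assuming the request shape of the boto3 Lake Formation `grant_permissions` call. The role ARN, database, table, and column names are hypothetical, and the code only builds the request without contacting AWS.

```python
# Sketch: build (but do not send) a column-level SELECT grant for Lake Formation.
# Every name below (account ID, role, database, table, columns) is hypothetical.

def build_column_grant(principal_arn, database, table, excluded_columns):
    """Return keyword arguments shaped for lakeformation.grant_permissions."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Grant every column except the listed PII columns.
                "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
            }
        },
        "Permissions": ["SELECT"],
    }


grant = build_column_grant(
    principal_arn="arn:aws:iam::111122223333:role/AnalystRole",
    database="sales_db",
    table="customers",
    excluded_columns=["email", "ssn"],
)
print(grant["Resource"]["TableWithColumns"]["ColumnWildcard"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("lakeformation").grant_permissions(**grant)`; check the current API reference before relying on the exact field names.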
Visual Anchors
[Figure: Schema Selection Flowchart]
[Figure: Redshift Distribution Styles]
Definition-Example Pairs
- Fact Table: A table containing the quantitative metrics of a business process.
  - Example: A `Sales` table in Redshift containing `price`, `quantity`, and `timestamp`.
- Dimension Table: A table containing descriptive attributes that provide context to facts.
  - Example: A `Product` table containing `category`, `color`, and `brand` names.
- Partition Projection: An Athena feature, configured through table properties in the Glue Data Catalog, that calculates partition values and locations from that configuration instead of retrieving partition metadata from the catalog.
  - Example: Using a date pattern in S3 (e.g., `year=2023/month=10`) to speed up Athena queries without frequent metadata updates.
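To make the Partition Projection entry concrete, the sketch below enumerates the S3 prefixes implied by a set of projection table properties. The property keys follow Athena's `projection.*` and `storage.location.template` naming. The bucket, prefix, and ranges are hypothetical, and the enumeration logic is a simplified local imitation, not Athena's implementation.

```python
# Sketch: derive partition locations from table properties instead of
# looking them up in a catalog. Bucket name and ranges are hypothetical.
from itertools import product

table_properties = {
    "projection.enabled": "true",
    "projection.year.type": "integer",
    "projection.year.range": "2023,2024",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "storage.location.template": "s3://example-bucket/events/year=${year}/month=${month}/",
}


def projected_locations(props):
    """Enumerate the S3 prefixes implied by the integer projection settings."""
    def values(column):
        low, high = (int(v) for v in props[f"projection.{column}.range"].split(","))
        digits = int(props.get(f"projection.{column}.digits", "0"))
        # Zero-pad to the configured width, e.g. month 3 -> "03".
        return [str(n).zfill(digits) for n in range(low, high + 1)]

    template = props["storage.location.template"]
    return [
        template.replace("${year}", y).replace("${month}", m)
        for y, m in product(values("year"), values("month"))
    ]


locations = projected_locations(table_properties)
print(len(locations))   # 2 years x 12 months = 24
print(locations[9])     # s3://example-bucket/events/year=2023/month=10/
```

Because the locations are computed from configuration, a new month of data under the same prefix pattern needs no catalog update as long as it falls inside the configured ranges.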
Worked Examples
Example 1: Redshift Table Design for Performance
Scenario: You have a Transactions table (1 billion rows) and a Users table (1 million rows). You frequently join them on user_id.
Solution:
- Distribution Style: Set `DISTSTYLE KEY` on `user_id` for both tables. This ensures rows with the same `user_id` reside on the same compute node slice, eliminating redistribution ("shuffling") during joins.
- Sort Key: Set `COMPOUND SORTKEY(transaction_date, user_id)` on `Transactions`. This speeds up queries filtered by date.
- Compression: Use `ENCODE AUTO` to let Redshift choose the column encodings (such as ZSTD or LZO).
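The three choices above can be written out as DDL. The sketch below renders the statements as strings so the table attributes are visible in one place. The column names and types are assumptions made for illustration, and the output should be checked against the current Redshift `CREATE TABLE` reference before use.

```python
# Sketch: render CREATE TABLE statements that reflect the choices above.
# Column names and types are assumptions made for illustration only.

def create_table_sql(name, columns, distkey, sortkeys):
    """Return a Redshift-style CREATE TABLE statement as a string."""
    column_sql = ",\n    ".join(f"{col} {ctype}" for col, ctype in columns)
    return (
        f"CREATE TABLE {name} (\n    {column_sql}\n)\n"
        "DISTSTYLE KEY\n"
        f"DISTKEY({distkey})\n"
        f"COMPOUND SORTKEY({', '.join(sortkeys)})\n"
        "ENCODE AUTO;"
    )


users_ddl = create_table_sql(
    "users",
    [("user_id", "BIGINT"), ("signup_date", "DATE")],
    distkey="user_id",
    sortkeys=["user_id"],
)
transactions_ddl = create_table_sql(
    "transactions",
    [("transaction_id", "BIGINT"), ("user_id", "BIGINT"),
     ("transaction_date", "DATE"), ("amount", "DECIMAL(12,2)")],
    distkey="user_id",
    sortkeys=["transaction_date", "user_id"],
)
print(transactions_ddl)
```

Both tables share `user_id` as the distribution key, which is what lets the join run without moving rows between slices.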
Example 2: DynamoDB Design for an IoT Application
Scenario: You need to store sensor data. You always query by SensorID and want to see the most recent data first.
Solution:
- Partition Key: `SensorID` (to distribute load across partitions).
- Sort Key: `Timestamp` (to allow range queries and sorting).
- Query Pattern: A `Query` operation where `PK = SensorID` and `ScanIndexForward=False` to get the latest readings first.
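The design above can be sketched as request parameters plus a local stand-in for the query. The dictionaries follow the shape of the low-level DynamoDB `CreateTable` and `Query` requests. The table name, sensor IDs, and the `temp_c` attribute are hypothetical, and `local_query` only imitates the ordering behaviour in memory rather than calling AWS.

```python
# Sketch: request shapes for the sensor table, plus a local imitation of the
# newest-first query. Table and attribute values are hypothetical.

create_table_params = {
    "TableName": "SensorReadings",
    "AttributeDefinitions": [
        {"AttributeName": "SensorID", "AttributeType": "S"},
        {"AttributeName": "Timestamp", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "SensorID", "KeyType": "HASH"},    # partition key
        {"AttributeName": "Timestamp", "KeyType": "RANGE"},  # sort key
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

query_params = {
    "TableName": "SensorReadings",
    "KeyConditionExpression": "SensorID = :sid",
    "ExpressionAttributeValues": {":sid": {"S": "sensor-1"}},
    "ScanIndexForward": False,  # return items in descending sort-key order
    "Limit": 2,
}


def local_query(items, sensor_id, scan_index_forward=True, limit=None):
    """Imitate Query semantics: one partition, ordered by the sort key."""
    matches = sorted(
        (item for item in items if item["SensorID"] == sensor_id),
        key=lambda item: item["Timestamp"],
        reverse=not scan_index_forward,
    )
    return matches if limit is None else matches[:limit]


readings = [
    {"SensorID": "sensor-1", "Timestamp": 1700000000, "temp_c": 21.5},
    {"SensorID": "sensor-2", "Timestamp": 1700000030, "temp_c": 19.0},
    {"SensorID": "sensor-1", "Timestamp": 1700000060, "temp_c": 21.9},
    {"SensorID": "sensor-1", "Timestamp": 1700000120, "temp_c": 22.4},
]
latest = local_query(readings, "sensor-1", scan_index_forward=False, limit=2)
print([r["Timestamp"] for r in latest])  # [1700000120, 1700000060]
```

One detail worth checking if the sort key is ever used inside an expression: `Timestamp` appears on DynamoDB's reserved-word list, so it would need an `ExpressionAttributeNames` alias there.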
Checkpoint Questions
- What is the main benefit of a Star Schema over a Snowflake Schema in Amazon Redshift?
- Which Redshift distribution style should be used for a small lookup table to avoid data movement during joins?
- In DynamoDB, what is the difference between a Global Secondary Index (GSI) and a Local Secondary Index (LSI)?
- How does AWS Lake Formation simplify cross-account data sharing compared to standard S3 bucket policies?
Comparison Tables
Normalization vs. Denormalization (OLAP context)
| Attribute | Normalized (Snowflake) | Denormalized (Star) |
|---|---|---|
| Storage Efficiency | High (less redundancy) | Lower (duplicated data) |
| Query Complexity | High (multiple joins) | Low (fewer joins) |
| Read Performance | Slower | Faster |
| Maintainability | Easier (single source of truth) | Harder (updates in multiple places) |
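The query-complexity row of the table can be seen in a small runnable sketch. Here Python's built-in `sqlite3` stands in for Redshift purely to show the join shapes. The table names and sample rows are invented, and nothing about performance on a real cluster should be inferred from it.

```python
# Sketch: the same revenue-by-category question against a star-style and a
# snowflake-style dimension. sqlite3 is a stand-in; all data is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: category is stored directly on the product dimension.
CREATE TABLE dim_product_star (product_id INTEGER PRIMARY KEY, brand TEXT, category TEXT);
-- Snowflake: category is normalized into its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product_snow (product_id INTEGER PRIMARY KEY, brand TEXT, category_id INTEGER);
CREATE TABLE fact_sales (product_id INTEGER, quantity INTEGER, price REAL);

INSERT INTO dim_product_star VALUES (1, 'Acme', 'Tools'), (2, 'Bolt', 'Tools'), (3, 'Cozy', 'Home');
INSERT INTO dim_category VALUES (10, 'Tools'), (20, 'Home');
INSERT INTO dim_product_snow VALUES (1, 'Acme', 10), (2, 'Bolt', 10), (3, 'Cozy', 20);
INSERT INTO fact_sales VALUES (1, 2, 10.0), (2, 1, 5.0), (3, 4, 2.5), (1, 1, 10.0);
""")

star_sql = """
SELECT p.category, SUM(f.quantity * f.price) AS revenue
FROM fact_sales f
JOIN dim_product_star p ON p.product_id = f.product_id
GROUP BY p.category ORDER BY p.category
"""

snowflake_sql = """
SELECT c.category, SUM(f.quantity * f.price) AS revenue
FROM fact_sales f
JOIN dim_product_snow p ON p.product_id = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.category ORDER BY c.category
"""

star_rows = conn.execute(star_sql).fetchall()
snowflake_rows = conn.execute(snowflake_sql).fetchall()
print(star_rows)                    # [('Home', 10.0), ('Tools', 35.0)]
print(star_rows == snowflake_rows)  # True
```

Both queries return the same answer; the snowflake version needs one extra join, while the star version repeats the category text on every product row.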
Muddy Points & Cross-Refs
- Compound vs. Interleaved Sort Keys: Choose Compound for hierarchical filters (e.g., `City > Store > Date`). Choose Interleaved if you filter by any combination of columns with equal frequency.
- GSI vs. LSI: An LSI can only be created at table creation and shares the table's PK. A GSI can be created at any time and can have a completely different PK/SK.
- Glue vs. Lake Formation: Think of Glue as the engine (crawlers, ETL, catalog) and Lake Formation as the security layer (permissions, sharing, PII masking) on top of that engine.
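The GSI vs. LSI point above shows up in where each index is declared. The sketch below builds request parameters shaped like the DynamoDB `CreateTable` and `UpdateTable` calls. The `Orders` table, its attributes, and the index names are hypothetical, and nothing is sent to AWS.

```python
# Sketch: an LSI declared with the table, a GSI added afterwards.
# Table, attribute, and index names are hypothetical.

create_table_params = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "CustomerID", "AttributeType": "S"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
        {"AttributeName": "OrderTotal", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "CustomerID", "KeyType": "HASH"},
        {"AttributeName": "OrderDate", "KeyType": "RANGE"},
    ],
    # LSI: same partition key as the table, alternate sort key,
    # and it has to be part of the CreateTable request.
    "LocalSecondaryIndexes": [
        {
            "IndexName": "ByTotal",
            "KeySchema": [
                {"AttributeName": "CustomerID", "KeyType": "HASH"},
                {"AttributeName": "OrderTotal", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    "BillingMode": "PAY_PER_REQUEST",
}

# GSI: can be added later via UpdateTable, with a different partition key.
add_gsi_params = {
    "TableName": "Orders",
    "AttributeDefinitions": [
        {"AttributeName": "OrderStatus", "AttributeType": "S"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexUpdates": [
        {
            "Create": {
                "IndexName": "ByStatus",
                "KeySchema": [
                    {"AttributeName": "OrderStatus", "KeyType": "HASH"},
                    {"AttributeName": "OrderDate", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "KEYS_ONLY"},
            }
        }
    ],
}


def partition_key(key_schema):
    """Return the attribute name that plays the HASH (partition key) role."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")


table_pk = partition_key(create_table_params["KeySchema"])
lsi_pk = partition_key(create_table_params["LocalSecondaryIndexes"][0]["KeySchema"])
gsi_pk = partition_key(
    add_gsi_params["GlobalSecondaryIndexUpdates"][0]["Create"]["KeySchema"]
)
print(table_pk == lsi_pk, table_pk == gsi_pk)  # True False
```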
[!TIP] When preparing for the exam, remember: COPY is the most efficient way to bulk-load data into Redshift, and UNLOAD is the most efficient way to export query results. Both run in parallel against Amazon S3, so treat S3 as the default staging area for these operations.