AWS Data Store Selection & Configuration Guide
Configure the appropriate storage services for specific access patterns and requirements (for example, Amazon Redshift, Amazon EMR, Lake Formation, Amazon RDS, DynamoDB)
This guide covers the critical task of selecting and configuring storage services for the AWS Certified Data Engineer Associate exam. It focuses on aligning specific AWS services with workload requirements, access patterns, and cost-performance trade-offs.
Learning Objectives
After studying this guide, you should be able to:
- Select the optimal AWS storage service (RDS, DynamoDB, Redshift, EMR, or S3) based on specific use cases.
- Configure storage services to meet performance requirements (e.g., distribution keys in Redshift).
- Implement fine-grained access control and centralized management using AWS Lake Formation.
- Identify the appropriate data structures and indexing types for various access patterns.
Key Terms & Glossary
- OLTP (Online Transactional Processing): Systems designed for high volumes of small, fast transactions (e.g., Amazon RDS).
- OLAP (Online Analytical Processing): Systems designed for complex queries and data analysis on large datasets (e.g., Amazon Redshift).
- Key-Value Store: A NoSQL database type that uses a simple key-value method to store data (e.g., Amazon DynamoDB).
- Data Lake: A centralized repository that allows you to store all your structured and unstructured data at any scale (centered on Amazon S3).
- HNSW (Hierarchical Navigable Small World): A graph-based algorithm used for efficient vector similarity search.
The "Big Idea"
In AWS Data Engineering, there is no "one-size-fits-all" storage. The core architectural challenge is decoupling storage from compute (using S3 as a foundation) while choosing the right purpose-built database engine for the specific access pattern. Success requires balancing ACID compliance (RDS) against massive horizontal scalability (DynamoDB) or petabyte-scale analytics (Redshift).
Formula / Concept Box
| Feature | Amazon Redshift | Amazon DynamoDB | Amazon RDS | AWS Lake Formation |
|---|---|---|---|---|
| Primary Use | Data Warehousing/OLAP | NoSQL/Key-Value | Relational/OLTP | Governance/Data Lake |
| Scaling | Vertical & Horizontal | Seamless Horizontal | Vertical (Read Replicas) | Management Layer |
| Access Pattern | Complex Joins/Aggregates | High-frequency Key lookups | Standard SQL/Transactions | Cross-service Permissioning |
Hierarchical Outline
- I. Relational Data Stores
- Amazon RDS: Managed SQL databases (MySQL, PostgreSQL, Oracle). Best for ACID compliance and traditional applications.
- Amazon Aurora: High-performance relational database; supports HNSW indexing in PostgreSQL for vector data.
- II. Non-Relational Data Stores
- Amazon DynamoDB: Fully managed NoSQL. Uses TTL (Time to Live) for lifecycle management and DAX for microsecond latency.
- Amazon EMR: Distributed processing (Spark/Hive). Uses S3 (EMRFS) as the persistent storage layer.
- III. Analytical Data Stores
- Amazon Redshift: Columnar storage. Requires configuration of Distribution Styles (Auto, Even, Key, All) and Sort Keys.
- Redshift Spectrum: Queries data directly in S3 without loading it into Redshift tables.
- IV. Management & Governance
- AWS Lake Formation: Simplifies building data lakes. Manages fine-grained access control (column-level permissions).
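The Redshift configuration called out above (distribution style plus sort key) can be sketched as DDL. The table and column names (`sales`, `customer_id`, `sale_date`) are hypothetical; the statement is built here only as a string, since in practice you would submit it through a Redshift client:

```python
# Sketch: Redshift table DDL combining a distribution key and a sort key.
# Table and column names are hypothetical examples, not from any real schema.
ddl = """
CREATE TABLE sales (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    amount       DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate each customer's rows for joins
SORTKEY (sale_date);    -- range-restricted scans on date predicates
""".strip()

print(ddl.splitlines()[0])
```

DISTKEY controls which node stores a row; SORTKEY controls the order within each node's blocks, which lets Redshift skip blocks on range predicates.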
Visual Anchors
Storage Selection Decision Tree
Lakehouse Architecture Diagram
```latex
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, minimum width=3cm, minimum height=1cm, align=center}]
  % Define positions
  \node (S3) {Amazon S3 \\ (Data Lake Foundation)};
  \node (LF) [above=of S3, fill=gray!20] {AWS Lake Formation \\ (Governance/Security)};
  \node (RS) [right=of S3] {Amazon Redshift \\ (Data Warehouse)};
  \node (EMR) [left=of S3] {Amazon EMR \\ (Big Data Processing)};
  % Draw connections
  \draw[<->, thick] (S3) -- (RS) node[midway, above] {Spectrum};
  \draw[<->, thick] (S3) -- (EMR) node[midway, above] {EMRFS};
  \draw[->, dashed] (LF) -- (S3);
  \draw[->, dashed] (LF) -- (RS);
  \draw[->, dashed] (LF) -- (EMR);
  \node[draw=none, below=0.5cm of S3] {\textbf{Centralized Metadata \& Security Management}};
\end{tikzpicture}
```
Definition-Example Pairs
- Distribution Key (Redshift): A column used to distribute data across compute nodes.
  - Example: Using `customer_id` as a distribution key to ensure all data for one customer resides on the same node, speeding up joins on that ID.
- Time to Live (TTL) (DynamoDB): A mechanism to automatically delete items after a specific timestamp.
  - Example: Automatically deleting session tokens after 24 hours to reduce storage costs.
- Partitioning (S3/Glue): Organizing data into folders based on values like date or region.
  - Example: Storing logs in `s3://my-bucket/year=2023/month=10/day=27/` to limit the data scanned by Athena.
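The Hive-style partition prefix from the last example can be built programmatically. A minimal sketch (the bucket name is a placeholder; the `year=`/`month=`/`day=` keys are what Athena and the Glue catalog would register as partition columns):

```python
from datetime import date

# Sketch: building a Hive-style S3 partition prefix for daily log data.
# Zero-padding month/day keeps prefixes lexicographically sortable.
def partition_prefix(bucket: str, d: date) -> str:
    return f"s3://{bucket}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix("my-bucket", date(2023, 10, 27)))
# s3://my-bucket/year=2023/month=10/day=27/
```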
Worked Examples
Problem 1: IoT Sensor Data Scale-Out
Scenario: A company generates millions of small JSON files daily from IoT sensors. They need high-performance querying and cost-effective storage.
Solution:
- Storage: Store data in Amazon S3.
- Format: Convert files to Apache Parquet (columnar) using AWS Glue to improve query performance and reduce scan costs.
- Access: Use Amazon Athena or Redshift Spectrum for SQL queries directly on S3.
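Why the Parquet conversion in this solution reduces scan cost: a columnar layout lets the query engine read only the columns a query references, while row-oriented JSON forces a scan of every full record. A rough simulation (the records and field sizes are made up, and JSON string length stands in for bytes scanned):

```python
import json

# Toy IoT records. Row-oriented storage scans whole records; a columnar
# format like Parquet reads only the referenced column(s).
records = [
    {"sensor_id": i, "temp_c": 20.0 + i, "humidity": 40, "firmware": "v1.2.3"}
    for i in range(1000)
]

# Proxy for bytes scanned by "SELECT temp_c FROM ..." under each layout.
row_scan = sum(len(json.dumps(r)) for r in records)            # full records
col_scan = sum(len(json.dumps(r["temp_c"])) for r in records)  # one column

print(row_scan > col_scan)  # True: the columnar scan touches far less data
```

Real Parquet adds compression and dictionary encoding on top of column pruning, so the gap in practice is usually larger than this toy comparison suggests.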
Problem 2: Global Retail Session Management
Scenario: A retail app needs to store user shopping carts with sub-10ms response times and unpredictable traffic spikes.
Solution:
- Service: Amazon DynamoDB.
- Configuration: Use On-demand capacity mode to handle unpredictable traffic and DynamoDB Accelerator (DAX) if microsecond read performance is required.
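The item written for this pattern can be sketched as follows. The table, key, and attribute names (`Carts`, `pk`, `sk`, `expires_at`) are hypothetical, and the boto3 call is shown only as a comment because it needs live AWS credentials; DynamoDB's TTL feature simply expects an epoch-seconds number in whichever attribute you designate:

```python
import time

# Sketch: a shopping-cart session item for DynamoDB with a TTL attribute.
# Attribute names are illustrative; TTL must be an epoch timestamp in
# seconds, stored in the numeric attribute chosen when enabling TTL.
def cart_item(user_id: str, cart: list, ttl_seconds: int = 24 * 3600) -> dict:
    return {
        "pk": f"USER#{user_id}",   # partition key: one key lookup per user
        "sk": "CART",              # sort key distinguishes item types
        "items": cart,
        "expires_at": int(time.time()) + ttl_seconds,  # auto-expiry via TTL
    }

item = cart_item("u-42", ["sku-1", "sku-2"])
# Not executed here (requires credentials):
#   boto3.resource("dynamodb").Table("Carts").put_item(Item=item)
print(item["pk"])  # USER#u-42
```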
Checkpoint Questions
- Which service allows you to define column-level permissions for data stored in S3 and accessed via Athena?
- What Redshift feature allows you to query data in S3 without moving it into the Redshift cluster?
- Which DynamoDB feature helps manage data lifecycle by automatically deleting old records?
- When should you choose Amazon RDS over Amazon Redshift?
> [!NOTE]
> Answers: 1. AWS Lake Formation. 2. Redshift Spectrum. 3. TTL (Time to Live). 4. Use RDS for transactional (OLTP) workloads; use Redshift for analytical (OLAP) workloads.
Comparison Tables
Redshift Distribution Styles
| Style | Description | Best Use Case |
|---|---|---|
| KEY | Distributes rows based on values in one column. | Joining large tables on that column. |
| ALL | A copy of the entire table is on every node. | Small dimension tables. |
| EVEN | Round-robin distribution across nodes. | Tables that aren't joined or have no clear key. |
| AUTO | Redshift decides based on table size. | Default for new tables. |
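The KEY-vs-EVEN trade-off in the table above can be sketched by simulating where rows land. The hash function and four-node cluster are illustrative only, not Redshift's actual internals:

```python
import hashlib
from collections import defaultdict

NODES = 4  # illustrative cluster size

def key_node(value: str) -> int:
    """KEY style (illustrative): hash the distribution column value."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % NODES

rows = [("cust-1", i) for i in range(5)] + [("cust-2", i) for i in range(5)]

# KEY distribution: all rows sharing a customer_id land on one node, so
# joins on customer_id need no cross-node shuffling.
placement = defaultdict(set)
for customer_id, _ in rows:
    placement[customer_id].add(key_node(customer_id))
print(all(len(nodes) == 1 for nodes in placement.values()))  # True

# EVEN distribution (round-robin) spreads rows regardless of key, which
# balances storage but forces redistribution at join time.
even_nodes = {i % NODES for i, _ in enumerate(rows)}
print(even_nodes)  # {0, 1, 2, 3}
```

This also shows the KEY-style risk: if one key value dominates the data, its node becomes a hotspot (skew), which is when EVEN or AUTO is the safer choice.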
Muddy Points & Cross-Refs
- Athena vs. Redshift Spectrum: Both query S3. Use Athena for ad-hoc, serverless queries. Use Redshift Spectrum if you already have a Redshift cluster and want to join S3 data with local Redshift tables.
- HNSW vs. IVF: These are vector index types. HNSW is generally faster and more accurate for high-dimensional data (common in AI/ML apps using Aurora PostgreSQL).
- Lake Formation vs. IAM: IAM manages who can access a service; Lake Formation manages what data (tables/rows/columns) they can see inside the data lake.