AWS Data Store Selection & Configuration Guide
Configure the appropriate storage services for specific access patterns and requirements (for example, Amazon Redshift, Amazon EMR, Lake Formation, Amazon RDS, DynamoDB)
This guide covers the critical task of selecting and configuring storage services for the AWS Certified Data Engineer Associate exam. It focuses on aligning specific AWS services with workload requirements, access patterns, and cost-performance trade-offs.
Learning Objectives
After studying this guide, you should be able to:
- Select the optimal AWS storage service (RDS, DynamoDB, Redshift, EMR, or S3) based on specific use cases.
- Configure storage services to meet performance requirements (e.g., distribution keys in Redshift).
- Implement fine-grained access control and centralized management using AWS Lake Formation.
- Identify the appropriate data structures and indexing types for various access patterns.
Key Terms & Glossary
- OLTP (Online Transactional Processing): Systems designed for high volumes of small, fast transactions (e.g., Amazon RDS).
- OLAP (Online Analytical Processing): Systems designed for complex queries and data analysis on large datasets (e.g., Amazon Redshift).
- Key-Value Store: A NoSQL database type that uses a simple key-value method to store data (e.g., Amazon DynamoDB).
- Data Lake: A centralized repository that allows you to store all your structured and unstructured data at any scale (centered on Amazon S3).
- HNSW (Hierarchical Navigable Small World): A graph-based algorithm used for efficient vector similarity search.
The "Big Idea"
In AWS Data Engineering, there is no "one-size-fits-all" storage. The core architectural challenge is decoupling storage from compute (using S3 as a foundation) while choosing the right purpose-built database engine for the specific access pattern. Success requires balancing ACID compliance (RDS) against massive horizontal scalability (DynamoDB) or petabyte-scale analytics (Redshift).
Formula / Concept Box
| Feature | Amazon Redshift | Amazon DynamoDB | Amazon RDS | AWS Lake Formation |
|---|---|---|---|---|
| Primary Use | Data Warehousing/OLAP | NoSQL/Key-Value | Relational/OLTP | Governance/Data Lake |
| Scaling | Vertical & Horizontal | Seamless Horizontal | Vertical (Read Replicas) | Management Layer |
| Access Pattern | Complex Joins/Aggregates | High-frequency Key lookups | Standard SQL/Transactions | Cross-service Permissioning |
Hierarchical Outline
- I. Relational Data Stores
- Amazon RDS: Managed SQL databases (MySQL, PostgreSQL, Oracle). Best for ACID compliance and traditional applications.
- Amazon Aurora: High-performance relational database; supports HNSW indexing in PostgreSQL for vector data.
- II. Non-Relational Data Stores
- Amazon DynamoDB: Fully managed NoSQL. Uses TTL (Time to Live) for lifecycle management and DAX for microsecond latency.
- Amazon EMR: Distributed processing (Spark/Hive). Uses S3 (EMRFS) as the persistent storage layer.
- III. Analytical Data Stores
- Amazon Redshift: Columnar storage. Requires configuration of Distribution Styles (Auto, Even, Key, All) and Sort Keys.
- Redshift Spectrum: Queries data directly in S3 without loading it into Redshift tables.
- IV. Management & Governance
- AWS Lake Formation: Simplifies building data lakes. Manages fine-grained access control (column-level permissions).
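The Redshift configuration called out above (distribution style plus sort key) can be sketched as DDL. The table and column names (`sales`, `customer_id`, `sale_date`) are hypothetical; the statement is built here only as a string, since in practice you would submit it through a Redshift client:

```python
# Sketch: Redshift table DDL combining a distribution key and a sort key.
# Table and column names are hypothetical examples, not from any real schema.
ddl = """
CREATE TABLE sales (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    amount       DECIMAL(10, 2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate each customer's rows for joins
SORTKEY (sale_date);    -- range-restricted scans on date predicates
""".strip()

print(ddl.splitlines()[0])
```

DISTKEY controls which node stores a row; SORTKEY controls the order within each node's blocks, which lets Redshift skip blocks on range predicates.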
Visual Anchors
Storage Selection Decision Tree
Lakehouse Architecture Diagram
```latex
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, minimum width=3cm, minimum height=1cm, align=center}]
  % Define positions
  \node (S3) {Amazon S3 \\ (Data Lake Foundation)};
  \node (LF) [above=of S3, fill=gray!20] {AWS Lake Formation \\ (Governance/Security)};
  \node (RS) [right=of S3] {Amazon Redshift \\ (Data Warehouse)};
  \node (EMR) [left=of S3] {Amazon EMR \\ (Big Data Processing)};
  % Draw connections
  \draw[<->, thick] (S3) -- (RS) node[midway, above] {Spectrum};
  \draw[<->, thick] (S3) -- (EMR) node[midway, above] {EMRFS};
  \draw[->, dashed] (LF) -- (S3);
  \draw[->, dashed] (LF) -- (RS);
  \draw[->, dashed] (LF) -- (EMR);
  \node[draw=none, below=0.5cm of S3] {\textbf{Centralized Metadata \& Security Management}};
\end{tikzpicture}
```
Definition-Example Pairs
- Distribution Key (Redshift): A column used to distribute data across compute nodes.
  - Example: Using `customer_id` as a distribution key to ensure all data for one customer resides on the same node, speeding up joins on that ID.
- Time to Live (TTL) (DynamoDB): A mechanism to automatically delete items after a specific timestamp.
  - Example: Automatically deleting session tokens after 24 hours to reduce storage costs.
- Partitioning (S3/Glue): Organizing data into folders based on values like date or region.
  - Example: Storing logs in `s3://my-bucket/year=2023/month=10/day=27/` to limit the data scanned by Athena.
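The Hive-style partition prefix from the last example can be built programmatically. A minimal sketch (the bucket name is a placeholder; the `year=`/`month=`/`day=` keys are what Athena and the Glue catalog would register as partition columns):

```python
from datetime import date

# Sketch: building a Hive-style S3 partition prefix for daily log data.
# Zero-padding month/day keeps prefixes lexicographically sortable.
def partition_prefix(bucket: str, d: date) -> str:
    return f"s3://{bucket}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

print(partition_prefix("my-bucket", date(2023, 10, 27)))
# s3://my-bucket/year=2023/month=10/day=27/
```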
Worked Examples
Problem 1: IoT Sensor Data Scale-Out
Scenario: A company generates millions of small JSON files daily from IoT sensors. They need high-performance querying and cost-effective storage.
Solution:
- Storage: Store data in Amazon S3.
- Format: Convert files to Apache Parquet (columnar) using AWS Glue to improve query performance and reduce scan costs.
- Access: Use Amazon Athena or Redshift Spectrum for SQL queries directly on S3.
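Why the Parquet conversion in this solution reduces scan cost: a columnar layout lets the query engine read only the columns a query references, while row-oriented JSON forces a scan of every full record. A rough simulation (the records and field sizes are made up, and JSON string length stands in for bytes scanned):

```python
import json

# Toy IoT records. Row-oriented storage scans whole records; a columnar
# format like Parquet reads only the referenced column(s).
records = [
    {"sensor_id": i, "temp_c": 20.0 + i, "humidity": 40, "firmware": "v1.2.3"}
    for i in range(1000)
]

# Proxy for bytes scanned by "SELECT temp_c FROM ..." under each layout.
row_scan = sum(len(json.dumps(r)) for r in records)            # full records
col_scan = sum(len(json.dumps(r["temp_c"])) for r in records)  # one column

print(row_scan > col_scan)  # True: the columnar scan touches far less data
```

Real Parquet adds compression and dictionary encoding on top of column pruning, so the gap in practice is usually larger than this toy comparison suggests.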
Problem 2: Global Retail Session Management
Scenario: A retail app needs to store user shopping carts with sub-10ms response times and unpredictable traffic spikes.
Solution:
- Service: Amazon DynamoDB.
- Configuration: Use On-demand capacity mode to handle unpredictable traffic and DynamoDB Accelerator (DAX) if microsecond read performance is required.
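The item written for this pattern can be sketched as follows. The table, key, and attribute names (`Carts`, `pk`, `sk`, `expires_at`) are hypothetical, and the boto3 call is shown only as a comment because it needs live AWS credentials; DynamoDB's TTL feature simply expects an epoch-seconds number in whichever attribute you designate:

```python
import time

# Sketch: a shopping-cart session item for DynamoDB with a TTL attribute.
# Attribute names are illustrative; TTL must be an epoch timestamp in
# seconds, stored in the numeric attribute chosen when enabling TTL.
def cart_item(user_id: str, cart: list, ttl_seconds: int = 24 * 3600) -> dict:
    return {
        "pk": f"USER#{user_id}",   # partition key: one key lookup per user
        "sk": "CART",              # sort key distinguishes item types
        "items": cart,
        "expires_at": int(time.time()) + ttl_seconds,  # auto-expiry via TTL
    }

item = cart_item("u-42", ["sku-1", "sku-2"])
# Not executed here (requires credentials):
#   boto3.resource("dynamodb").Table("Carts").put_item(Item=item)
print(item["pk"])  # USER#u-42
```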
Checkpoint Questions
- Which service allows you to define column-level permissions for data stored in S3 and accessed via Athena?
- What Redshift feature allows you to query data in S3 without moving it into the Redshift cluster?
- Which DynamoDB feature helps manage data lifecycle by automatically deleting old records?
- When should you choose Amazon RDS over Amazon Redshift?
> [!NOTE]
> Answers: 1. AWS Lake Formation. 2. Redshift Spectrum. 3. TTL (Time to Live). 4. Use RDS for transactional (OLTP) workloads; use Redshift for analytical (OLAP) workloads.
Comparison Tables
Redshift Distribution Styles
| Style | Description | Best Use Case |
|---|---|---|
| KEY | Distributes rows based on values in one column. | Joining large tables on that column. |
| ALL | A copy of the entire table is on every node. | Small dimension tables. |
| EVEN | Round-robin distribution across nodes. | Tables that aren't joined or have no clear key. |
| AUTO | Redshift decides based on table size. | Default for new tables. |
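The KEY-vs-EVEN trade-off in the table above can be sketched by simulating where rows land. The hash function and four-node cluster are illustrative only, not Redshift's actual internals:

```python
import hashlib
from collections import defaultdict

NODES = 4  # illustrative cluster size

def key_node(value: str) -> int:
    """KEY style (illustrative): hash the distribution column value."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % NODES

rows = [("cust-1", i) for i in range(5)] + [("cust-2", i) for i in range(5)]

# KEY distribution: all rows sharing a customer_id land on one node, so
# joins on customer_id need no cross-node shuffling.
placement = defaultdict(set)
for customer_id, _ in rows:
    placement[customer_id].add(key_node(customer_id))
print(all(len(nodes) == 1 for nodes in placement.values()))  # True

# EVEN distribution (round-robin) spreads rows regardless of key, which
# balances storage but forces redistribution at join time.
even_nodes = {i % NODES for i, _ in enumerate(rows)}
print(even_nodes)  # {0, 1, 2, 3}
```

This also shows the KEY-style risk: if one key value dominates the data, its node becomes a hotspot (skew), which is when EVEN or AUTO is the safer choice.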
Muddy Points & Cross-Refs
- Athena vs. Redshift Spectrum: Both query S3. Use Athena for ad-hoc, serverless queries. Use Redshift Spectrum if you already have a Redshift cluster and want to join S3 data with local Redshift tables.
- HNSW vs. IVF: These are vector index types. HNSW is generally faster and more accurate for high-dimensional data (common in AI/ML apps using Aurora PostgreSQL).
- Lake Formation vs. IAM: IAM manages who can access a service; Lake Formation manages what data (tables/rows/columns) they can see inside the data lake.