AWS Data Store Selection: Cost and Performance Optimization
Implement the appropriate storage services for specific cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, Amazon DynamoDB, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka [Amazon MSK])
This guide covers the critical task of selecting and implementing the appropriate AWS storage services to meet specific business requirements for cost, performance, and access patterns, as required for the AWS Certified Data Engineer - Associate (DEA-C01) exam.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between OLTP and OLAP workloads and select the correct AWS service for each.
- Choose between Amazon Kinesis Data Streams and Amazon MSK based on operational overhead and integration needs.
- Identify use cases for Amazon Redshift, Amazon EMR, and AWS Lake Formation in a data lake/warehouse architecture.
- Select the most cost-effective S3 storage class based on data access frequency.
- Understand when to use specialized indexing like HNSW for vector data in Amazon Aurora.
Key Terms & Glossary
- OLTP (Online Transactional Processing): Databases optimized for frequent, small transactions (e.g., Amazon RDS).
- OLAP (Online Analytical Processing): Databases optimized for complex queries and large-scale data analysis (e.g., Amazon Redshift).
- HNSW (Hierarchical Navigable Small World): A high-performance graph-based indexing algorithm used for vector searches in Amazon Aurora PostgreSQL (via the pgvector extension).
- Cold vs. Hot Storage: "Hot" storage is for frequently accessed data with low latency (S3 Standard), while "Cold" storage is for archival data (S3 Glacier).
- Provisioned vs. On-Demand: Capacity modes where you either pre-allocate resources for a fixed cost or pay per request for variable workloads.
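The provisioned-versus-on-demand trade-off above comes down to simple break-even arithmetic. The sketch below uses assumed placeholder prices (check the current AWS pricing page for real figures); the point is the reasoning, not the exact numbers.

```python
# Illustrative comparison of DynamoDB On-Demand vs. Provisioned capacity cost.
# The prices below are ASSUMED placeholder figures, not current AWS pricing.

ON_DEMAND_PER_MILLION_WRITES = 1.25   # assumed USD per 1M write request units
PROVISIONED_WCU_PER_HOUR = 0.00065    # assumed USD per WCU-hour

def monthly_cost(avg_writes_per_sec: float) -> tuple[float, float]:
    """Return (on_demand_usd, provisioned_usd) for a steady write rate."""
    seconds = 30 * 24 * 3600                      # ~1 month
    writes = avg_writes_per_sec * seconds
    on_demand = writes / 1_000_000 * ON_DEMAND_PER_MILLION_WRITES
    # Provisioned: one WCU sustains one 1 KB write per second, billed hourly.
    provisioned = avg_writes_per_sec * PROVISIONED_WCU_PER_HOUR * 30 * 24
    return on_demand, provisioned

od, prov = monthly_cost(100)  # steady 100 writes/sec
```

For steady traffic, provisioned wins; spiky traffic flips the result, because provisioned capacity must be sized for the peak while on-demand bills only for requests actually served.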
The "Big Idea"
In AWS Data Engineering, there is no "one-size-fits-all" storage. Success depends on balancing the Iron Triangle of Data Storage: Performance (Latency/Throughput), Cost (Storage/Compute), and Operational Effort (Managed vs. Self-managed). Selecting the wrong service (e.g., using RDS for big data analytics) leads to ballooning costs and poor user experience.
Formula / Concept Box
| Concept | Metric / Rule of Thumb |
|---|---|
| S3 Cost Optimization | Move data to S3 Glacier Instant Retrieval if it is accessed roughly once per quarter but still requires millisecond retrieval. |
| DynamoDB Scaling | Use On-Demand for unpredictable spikes; use Provisioned for steady-state workloads. |
| Redshift Performance | Use Distribution Keys to minimize data movement across nodes during joins. |
| Streaming Choice | Choose Kinesis for AWS-native, easy setup; choose MSK for Kafka compatibility and migration. |
Hierarchical Outline
- I. Transactional Storage (OLTP)
- Amazon RDS/Aurora: Structured data, ACID compliance, complex joins.
- Amazon DynamoDB: NoSQL, key-value, millisecond response at any scale.
- II. Analytical Storage (OLAP)
- Amazon Redshift: Petabyte-scale warehousing, columnar storage.
- Amazon EMR: Big data processing (Spark/Hadoop) using EMRFS on S3.
- III. Streaming Storage
- Amazon Kinesis Data Streams: Low-latency ingestion for real-time AWS apps.
- Amazon MSK: Managed Kafka for open-source ecosystem compatibility.
- IV. Governance & Organization
- AWS Lake Formation: Centralized permissions and management for S3 data lakes.
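The outline above can be condensed into a toy selection helper. The rules below are deliberately coarse study-guide heuristics (and the trait names are invented for illustration), not an official AWS decision procedure.

```python
def pick_store(workload: str, needs: set[str]) -> str:
    """Map workload traits to the services in the outline above.
    Heuristic only: real selection weighs cost, scale, and ops effort."""
    if workload == "oltp":
        # Relational features (joins, ACID across tables) -> RDS/Aurora.
        return "Amazon RDS/Aurora" if "joins" in needs else "Amazon DynamoDB"
    if workload == "olap":
        # Need for custom Spark/Hadoop control -> EMR; else warehouse.
        return "Amazon EMR" if "spark" in needs else "Amazon Redshift"
    if workload == "streaming":
        # Kafka API compatibility -> MSK; AWS-native simplicity -> Kinesis.
        return "Amazon MSK" if "kafka" in needs else "Amazon Kinesis Data Streams"
    # Default: land data in the lake under centralized governance.
    return "Amazon S3 + AWS Lake Formation"
```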
Visual Anchors
Storage Selection Decision Tree
Performance vs. Cost Mapping
\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (8,0) node[right] {Latency (Speed)};
  \draw[->] (0,0) -- (0,6) node[above] {Storage Cost};
  \draw[fill=blue!20] (1,5) circle (0.5) node {DDB};
  \draw[fill=green!20] (3,4) circle (0.6) node {Aurora};
  \draw[fill=orange!20] (5,2) circle (0.8) node {Redshift};
  \draw[fill=red!20] (7,1) circle (1.0) node {S3};
  \node at (4,-1) {High Speed $\rightarrow$ High Capacity / Low Cost};
\end{tikzpicture}
Definition-Example Pairs
- Federated Query: Querying data across different sources without moving it.
- Example: Using Redshift Federated Query to join live sales data in RDS with historical logs in Redshift.
- Materialized View: Pre-computed results of a query stored for performance.
- Example: A dashboard that aggregates millions of rows into a daily total; the Redshift Materialized View updates incrementally to save compute.
- TTL (Time to Live): Automatic deletion of items after a specific time.
- Example: DynamoDB TTL automatically deleting 30-day-old session tokens to keep the table size small and costs low.
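The session-token example above hinges on DynamoDB TTL reading a Number attribute containing an epoch-seconds timestamp. A minimal sketch of building such an item (attribute names here are hypothetical):

```python
import time

def session_item(token: str, ttl_days: int = 30) -> dict:
    """Build a DynamoDB item (low-level attribute-value format) whose
    'expires_at' attribute can serve as the table's TTL attribute.
    TTL requires epoch seconds stored as a Number ("N") attribute."""
    return {
        "pk": {"S": f"SESSION#{token}"},           # hypothetical key schema
        "expires_at": {"N": str(int(time.time()) + ttl_days * 86400)},
    }
```

DynamoDB deletes expired items in the background (typically within a couple of days of expiry), so treat TTL as cost hygiene, not a hard consistency guarantee.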
Worked Examples
Example 1: IoT Sensor Data
Scenario: A factory generates 5 million small JSON files daily from IoT sensors. They need to perform ad-hoc SQL queries and keep data for 5 years.
- Solution: Ingest via Kinesis Data Firehose → convert to Parquet → store in Amazon S3. Use Amazon Athena for queries and S3 Lifecycle Policies to move data to Glacier Deep Archive after 1 year.
- Why: Parquet reduces query costs in Athena; S3 provides the cheapest long-term storage.
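The lifecycle step of this solution can be sketched as a rule in the JSON shape that boto3's `put_bucket_lifecycle_configuration` expects. Bucket name, rule ID, and prefix below are hypothetical; the API call itself is shown commented out so the shape can be checked offline.

```python
# S3 lifecycle rule for Example 1: archive sensor data after one year.
lifecycle = {
    "Rules": [
        {
            "ID": "sensor-data-archive",        # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "sensors/"},   # hypothetical key prefix
            "Transitions": [
                # DEEP_ARCHIVE is the cheapest class; retrievals take hours.
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
            ],
        }
    ]
}

# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-iot-bucket", LifecycleConfiguration=lifecycle)
```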
Example 2: Real-time Leaderboard
Scenario: A gaming app requires sub-10ms response times to read/write player scores for a global leaderboard.
- Solution: Use Amazon DynamoDB with DynamoDB Accelerator (DAX).
- Why: DAX provides an in-memory cache for DynamoDB, ensuring microsecond latency even during traffic spikes.
Checkpoint Questions
- Which service is best suited for migrating an existing on-premises Apache Kafka cluster with minimal code changes?
- To minimize costs for data that is rarely accessed but must be available immediately when requested, which S3 storage class should you use?
- What is the advantage of using Redshift Spectrum instead of loading all data into Redshift local storage?
Answers
- Amazon MSK (Managed Streaming for Apache Kafka).
- S3 Glacier Instant Retrieval.
- It allows you to query data directly from Amazon S3, saving on provisioned Redshift storage costs and allowing for a "Lakehouse" architecture.
Comparison Tables
| Feature | Amazon Kinesis Data Streams | Amazon MSK |
|---|---|---|
| Management | Fully Managed (Serverless available) | Managed, but requires cluster sizing |
| Ecosystem | AWS-Native (IAM, CloudWatch) | Apache Kafka (Standard APIs) |
| Retention | Up to 365 days | Configurable (unlimited with Tiered Storage) |
| Best For | New AWS-only apps | Migrating Kafka or complex open-source needs |
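One routing detail worth knowing alongside this comparison: Kinesis assigns each record to a shard by MD5-hashing its partition key into a 128-bit hash key space, with each shard owning a contiguous range. A sketch of that mapping (assuming shards evenly split the space, as `CreateStream` does by default):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Mimic Kinesis record routing: MD5-hash the UTF-8 partition key
    into the 128-bit hash key space, then index the owning shard,
    assuming an even split of the space across shards."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_key * num_shards // 2 ** 128
```

This is why a skewed partition key (e.g., one hot device ID) creates a hot shard: all of that key's records land on the same shard regardless of shard count.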
Muddy Points & Cross-Refs
- Redshift vs. Athena: Use Redshift for complex, heavy-duty reporting and frequent queries. Use Athena for ad-hoc exploration of S3 data where you don't want to manage a cluster.
- EMR vs. Glue: Use EMR if you need deep control over the Spark/Hadoop environment or specific versions. Use AWS Glue for serverless, event-driven ETL.
- HNSW Indexing: This is a niche but important exam topic. Remember it's for Vector Data (AI/ML) specifically within Aurora PostgreSQL.
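For recognition purposes, the HNSW index in Aurora PostgreSQL is created through the pgvector extension's `USING hnsw` DDL. Table and column names below are hypothetical; the statements are held as a string so no database is needed to inspect them.

```python
# pgvector DDL for an HNSW index on Aurora PostgreSQL.
# 'documents' / 'embedding' are hypothetical names; vector_l2_ops selects
# Euclidean (L2) distance as the similarity operator class.
HNSW_DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE INDEX idx_docs_embedding
    ON documents
    USING hnsw (embedding vector_l2_ops);
"""
```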