AWS Data Store Selection: Cost and Performance Optimization

Implement the appropriate storage services for specific cost and performance requirements (for example, Amazon Redshift, Amazon EMR, AWS Lake Formation, Amazon RDS, Amazon DynamoDB, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka [Amazon MSK])

This guide covers the critical task of selecting and implementing the appropriate AWS storage services to meet specific business requirements for cost, performance, and access patterns, as required for the AWS Certified Data Engineer - Associate (DEA-C01) exam.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between OLTP and OLAP workloads and select the correct AWS service for each.
  • Choose between Amazon Kinesis Data Streams and Amazon MSK based on operational overhead and integration needs.
  • Identify use cases for Amazon Redshift, Amazon EMR, and AWS Lake Formation in a data lake/warehouse architecture.
  • Select the most cost-effective S3 storage class based on data access frequency.
  • Understand when to use specialized indexing like HNSW for vector data in Amazon Aurora.

Key Terms & Glossary

  • OLTP (Online Transactional Processing): Databases optimized for frequent, small transactions (e.g., Amazon RDS).
  • OLAP (Online Analytical Processing): Databases optimized for complex queries and large-scale data analysis (e.g., Amazon Redshift).
  • HNSW (Hierarchical Navigable Small World): A high-performance graph-based indexing algorithm used for vector search in Amazon Aurora PostgreSQL.
  • Cold vs. Hot Storage: "Hot" storage is for frequently accessed data with low latency (S3 Standard), while "Cold" storage is for archival data (S3 Glacier).
  • Provisioned vs. On-Demand: Capacity modes where you either pre-allocate resources for a fixed cost or pay per request for variable workloads.

The "Big Idea"

In AWS Data Engineering, there is no "one-size-fits-all" storage. Success depends on balancing the Iron Triangle of Data Storage: Performance (Latency/Throughput), Cost (Storage/Compute), and Operational Effort (Managed vs. Self-managed). Selecting the wrong service (e.g., using RDS for big data analytics) leads to ballooning costs and poor user experience.

Formula / Concept Box

| Concept | Metric / Rule of Thumb |
| --- | --- |
| S3 Cost Optimization | Move data to Glacier Instant Retrieval if accessed less than once per month but still requiring millisecond latency. |
| DynamoDB Scaling | Use On-Demand for unpredictable spikes; use Provisioned for steady-state workloads. |
| Redshift Performance | Use Distribution Keys to minimize data movement across nodes during joins. |
| Streaming Choice | Choose Kinesis for AWS-native, easy setup; choose MSK for Kafka compatibility and migration. |
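
The DynamoDB scaling rule of thumb above can be sketched as a small helper. The 3x peak-to-average threshold is an illustrative assumption, not an AWS guideline; the return values are the real `BillingMode` strings the DynamoDB API accepts.

```python
def choose_capacity_mode(avg_rcu: float, peak_rcu: float) -> str:
    """Illustrative heuristic: spiky traffic favors On-Demand,
    steady traffic favors Provisioned capacity.
    The 3x ratio threshold is an assumption for demonstration."""
    if avg_rcu == 0 or peak_rcu / avg_rcu > 3:
        return "PAY_PER_REQUEST"  # DynamoDB On-Demand billing mode
    return "PROVISIONED"

print(choose_capacity_mode(avg_rcu=100, peak_rcu=1500))  # → PAY_PER_REQUEST
print(choose_capacity_mode(avg_rcu=400, peak_rcu=500))   # → PROVISIONED
```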

Hierarchical Outline

  • I. Transactional Storage (OLTP)
    • Amazon RDS/Aurora: Structured data, ACID compliance, complex joins.
    • Amazon DynamoDB: NoSQL, key-value, millisecond response at any scale.
  • II. Analytical Storage (OLAP)
    • Amazon Redshift: Petabyte-scale warehousing, columnar storage.
    • Amazon EMR: Big data processing (Spark/Hadoop) using EMRFS on S3.
  • III. Streaming Storage
    • Amazon Kinesis Data Streams: Low-latency ingestion for real-time AWS apps.
    • Amazon MSK: Managed Kafka for open-source ecosystem compatibility.
  • IV. Governance & Organization
    • AWS Lake Formation: Centralized permissions and management for S3 data lakes.

Visual Anchors

Storage Selection Decision Tree

(Diagram placeholder: decision tree for selecting a storage service.)

Performance vs. Cost Mapping

\begin{tikzpicture}[scale=0.8]
  \draw[->] (0,0) -- (8,0) node[right] {Latency (Speed)};
  \draw[->] (0,0) -- (0,6) node[above] {Storage Cost};
  \draw[fill=blue!20]   (1,5) circle (0.5) node {DDB};
  \draw[fill=green!20]  (3,4) circle (0.6) node {Aurora};
  \draw[fill=orange!20] (5,2) circle (0.8) node {Redshift};
  \draw[fill=red!20]    (7,1) circle (1.0) node {S3};
  \node at (4,-1) {High Speed $\leftarrow \dots \rightarrow$ High Capacity/Low Cost};
\end{tikzpicture}

Definition-Example Pairs

  • Federated Query: Querying data across different sources without moving it.
    • Example: Using Redshift Federated Query to join live sales data in RDS with historical logs in Redshift.
  • Materialized View: Pre-computed results of a query stored for performance.
    • Example: A dashboard that aggregates millions of rows into a daily total; the Redshift Materialized View updates incrementally to save compute.
  • TTL (Time to Live): Automatic deletion of items after a specific time.
    • Example: DynamoDB TTL automatically deleting 30-day-old session tokens to keep the table size small and costs low.
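
The TTL example can be sketched in Python. DynamoDB expects the TTL attribute to hold an epoch-seconds timestamp; the attribute name `expires_at` and table name `Sessions` are placeholders and must match the table's TTL configuration.

```python
import time

def with_ttl(item: dict, days: int = 30, attr: str = "expires_at") -> dict:
    """Attach a TTL attribute: DynamoDB deletes the item shortly after
    this epoch-seconds timestamp passes. The attribute name is a
    placeholder and must match the table's TTL setting."""
    item = dict(item)  # avoid mutating the caller's dict
    item[attr] = int(time.time()) + days * 86400
    return item

token = with_ttl({"session_id": "abc123"}, days=30)

# Enabling TTL on the table itself (boto3 call, not executed here):
# boto3.client("dynamodb").update_time_to_live(
#     TableName="Sessions",
#     TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
# )
```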

Worked Examples

Example 1: IoT Sensor Data

Scenario: A factory generates 5 million small JSON files daily from IoT sensors. They need to perform ad-hoc SQL queries and keep data for 5 years.

  • Solution: Ingest via Kinesis Data Firehose → convert to Parquet → store in Amazon S3. Use Amazon Athena for queries and S3 Lifecycle Policies to move data to Glacier Deep Archive after 1 year.
  • Why: Parquet reduces query costs in Athena; S3 provides the cheapest long-term storage.
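
The lifecycle rule from this example can be sketched as the configuration dict that boto3's `put_bucket_lifecycle_configuration` accepts. Bucket name, rule ID, and the `iot/` prefix are placeholders.

```python
# One rule: transition objects under the (assumed) "iot/" prefix to
# Glacier Deep Archive 365 days after creation.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-iot-after-1-year",
            "Status": "Enabled",
            "Filter": {"Prefix": "iot/"},
            "Transitions": [
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

# Applying it (boto3 call, not executed here; bucket name is a placeholder):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="factory-sensor-data",
#     LifecycleConfiguration=lifecycle_config,
# )
```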

Example 2: Real-time Leaderboard

Scenario: A gaming app requires sub-10ms response times to read/write player scores for a global leaderboard.

  • Solution: Use Amazon DynamoDB with DynamoDB Accelerator (DAX).
  • Why: DAX provides an in-memory cache for DynamoDB, ensuring microsecond latency even during traffic spikes.
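
A leaderboard read can be sketched as the parameters for a DynamoDB `Query`. The table name, partition key `board_id`, and the score-sorted GSI `score-index` are hypothetical; `ScanIndexForward=False` (a real Query parameter) returns items in descending sort-key order, i.e. highest scores first.

```python
def top_scores_request(board_id: str, limit: int = 10) -> dict:
    """Build Query parameters for the top N scores on one leaderboard.
    Table and index names are placeholders for this sketch."""
    return {
        "TableName": "Leaderboard",
        "IndexName": "score-index",
        "KeyConditionExpression": "board_id = :b",
        "ExpressionAttributeValues": {":b": {"S": board_id}},
        "ScanIndexForward": False,  # descending by sort key (score)
        "Limit": limit,
    }

params = top_scores_request("global")
# With DAX, swap the boto3 client for the DAX client; the Query API is
# the same shape:
# boto3.client("dynamodb").query(**params)   # not executed here
```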

Checkpoint Questions

  1. Which service is best suited for migrating an existing on-premises Apache Kafka cluster with minimal code changes?
  2. To minimize costs for data that is rarely accessed but must be available immediately when requested, which S3 storage class should you use?
  3. What is the advantage of using Redshift Spectrum instead of loading all data into Redshift local storage?
Answers:
  1. Amazon MSK (Managed Streaming for Apache Kafka).
  2. S3 Glacier Instant Retrieval.
  3. It allows you to query data directly from Amazon S3, saving on provisioned Redshift storage costs and allowing for a "Lakehouse" architecture.

Comparison Tables

| Feature | Amazon Kinesis Data Streams | Amazon MSK |
| --- | --- | --- |
| Management | Fully managed (serverless available) | Managed, but requires cluster sizing |
| Ecosystem | AWS-native (IAM, CloudWatch) | Apache Kafka (standard APIs) |
| Retention | Up to 365 days | Configurable (unlimited with Tiered Storage) |
| Best For | New AWS-only apps | Migrating Kafka or complex open-source needs |
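
On the Kinesis side, ingestion comes down to the `PutRecord` API's three required parameters, sketched here as a builder; the stream name and payload are placeholders. Records sharing a partition key land on the same shard, which preserves their relative order.

```python
import json

def kinesis_put_record(stream: str, payload: dict, partition_key: str) -> dict:
    """Build the parameters for the Kinesis Data Streams PutRecord API.
    Stream name and payload here are placeholders."""
    return {
        "StreamName": stream,
        "Data": json.dumps(payload).encode("utf-8"),  # Data must be bytes
        "PartitionKey": partition_key,
    }

params = kinesis_put_record("clicks", {"user": "u1", "page": "/home"}, "u1")
# boto3.client("kinesis").put_record(**params)   # not executed here
```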

Muddy Points & Cross-Refs

  • Redshift vs. Athena: Use Redshift for complex, heavy-duty reporting and frequent queries. Use Athena for ad-hoc exploration of S3 data where you don't want to manage a cluster.
  • EMR vs. Glue: Use EMR if you need deep control over the Spark/Hadoop environment or specific versions. Use AWS Glue for serverless, event-driven ETL.
  • HNSW Indexing: This is a niche but important exam topic. Remember it's for Vector Data (AI/ML) specifically within Aurora PostgreSQL.
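
For the HNSW point, the index is created through the pgvector extension's DDL. A sketch, with a hypothetical `embeddings` table and `embedding` column; `m` and `ef_construction` are pgvector's HNSW build parameters, shown at their documented defaults.

```python
# SQL for creating an HNSW index on Aurora PostgreSQL with pgvector.
# Table and column names are hypothetical; run via any PostgreSQL client.
create_index_sql = """
CREATE INDEX ON embeddings
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""
```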
