
Curriculum Overview: Selecting Optimal Data Stores (AWS DEA-C01)

This curriculum is designed to prepare data engineers to choose, configure, and manage the most effective storage solutions on AWS. Based on the AWS Certified Data Engineer – Associate (DEA-C01) objectives, this guide focuses on balancing performance requirements with cost-efficiency across the data lifecycle.


Prerequisites

Before beginning this module, candidates should possess the following foundational knowledge:

  • General IT Experience: 2–3 years of experience in data engineering, including building and maintaining ETL pipelines.
  • Cloud Fundamentals: 1–2 years of hands-on experience with AWS core services (Compute, Networking, and IAM).
  • Technical Skills:
    • Familiarity with SQL, Python, or Scala for data manipulation.
    • Basic understanding of Data Lakes and unstructured vs. structured data.
    • Understanding of Networking & Security concepts, including VPCs and encryption.

Module Breakdown

This curriculum is divided into four primary modules, moving from fundamental storage characteristics to automated lifecycle management.

| Module | Focus Area | Difficulty |
| --- | --- | --- |
| 1. Storage Archetypes | Object, Block, and File storage differences | Beginner |
| 2. Hot Data Solutions | RDS, DynamoDB, ElastiCache, and EBS | Intermediate |
| 3. Cold Data & Archival | S3 Standard-IA, Glacier, and One Zone-IA | Intermediate |
| 4. Lifecycle Strategy | S3 Lifecycle policies and Intelligent-Tiering | Advanced |

Module Objectives

Module 1: Cloud Storage Infrastructure

  • Objective: Differentiate between Object, Block, and File storage.
  • Key Skill: Select the correct infrastructure based on shared access vs. low-latency local requirements.

Module 2: High-Performance (Hot) Data Stores

  • Objective: Implement storage services for sub-second latency needs.
  • Key Skill: Configure Amazon DynamoDB for NoSQL workloads and Amazon RDS for structured relational data.
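The DynamoDB side of this skill can be sketched as a table definition for a key-value workload. The table name, attribute names, and billing mode below are illustrative assumptions, not values prescribed by the curriculum.

```python
# Sketch: a hypothetical DynamoDB table spec for a low-latency NoSQL workload.
# All names here (ShoppingCarts, user_id, item_id) are illustrative assumptions.
cart_table_spec = {
    "TableName": "ShoppingCarts",
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "item_id", "AttributeType": "S"},
    ],
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},   # partition key
        {"AttributeName": "item_id", "KeyType": "RANGE"},  # sort key
    ],
    "BillingMode": "PAY_PER_REQUEST",  # on-demand capacity, no provisioning
}

# With AWS credentials configured, the spec could be applied via boto3:
# import boto3
# boto3.client("dynamodb").create_table(**cart_table_spec)
```

On-demand billing (`PAY_PER_REQUEST`) avoids capacity planning, which suits spiky transactional traffic; provisioned capacity is the alternative when throughput is predictable.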

Module 3: Cold Storage & Cost Tiers

  • Objective: Categorize data based on access frequency (frequent, infrequent, rare).
  • Key Skill: Utilize S3 Glacier Instant Retrieval vs. Flexible Retrieval based on recovery time objectives (RTO).
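The RTO trade-off shows up concretely when restoring from Glacier Flexible Retrieval, which (unlike Instant Retrieval) requires an explicit restore request. The day count, tier choice, and bucket/key names below are illustrative assumptions.

```python
# Sketch: a restore request for an object archived in S3 Glacier
# Flexible Retrieval. Values are illustrative, not prescribed.
restore_request = {
    "Days": 7,  # how long the restored copy remains available
    "GlacierJobParameters": {
        # Retrieval tiers trade cost against RTO:
        # "Expedited" (minutes), "Standard" (hours), "Bulk" (cheapest, slowest)
        "Tier": "Standard"
    },
}

# With AWS credentials configured:
# import boto3
# boto3.client("s3").restore_object(
#     Bucket="example-archive-bucket",
#     Key="logs/2022/01.gz",
#     RestoreRequest=restore_request,
# )
```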

Module 4: Automated Data Management

  • Objective: Automate the movement of data to reduce operational overhead.
  • Key Skill: Write S3 Lifecycle Policies to transition objects to lower-cost tiers automatically after XX days of inactivity.
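This key skill can be sketched as a lifecycle configuration of the kind passed to S3. The rule ID, prefix, transition day counts, and target storage classes below are illustrative assumptions, not values from the curriculum.

```python
# Sketch: a hypothetical S3 Lifecycle rule that tiers objects down over time
# and eventually expires them. Day counts and prefix are illustrative.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-logs",           # hypothetical rule name
            "Filter": {"Prefix": "logs/"},    # applies only under this prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
                {"Days": 90, "StorageClass": "GLACIER"},      # cold tier
            ],
            "Expiration": {"Days": 365},      # delete after one year
        }
    ]
}

# With AWS credentials configured:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket",
#     LifecycleConfiguration=lifecycle_config,
# )
```

Once applied, S3 evaluates the rule daily; no scheduled jobs or manual moves are needed, which is exactly the operational overhead this module aims to eliminate.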

Visual Anchors

  • Decision Logic for Data Storage (diagram)
  • Data Lifecycle Flow (diagram)

Success Metrics

To master this curriculum, students must be able to meet the following benchmarks:

  1. Cost Efficiency: Reduce storage costs by at least 30% by correctly applying S3 Intelligent-Tiering to unpredictable datasets.
  2. Performance SLAs: Achieve single-digit millisecond latency for transactional applications using DynamoDB or ElastiCache.
  3. Durability Compliance: Design architectures that provide 99.999999999% (11 nines) of durability for critical data using Amazon S3.
  4. Operational Excellence: Replace manual data movement with S3 Lifecycle Policies and DynamoDB TTL (Time to Live) to automate data expiration.
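The DynamoDB TTL half of benchmark 4 can be sketched as follows; the table name and TTL attribute name are illustrative assumptions.

```python
# Sketch: enabling DynamoDB TTL so items expire automatically once a
# per-item epoch timestamp passes. Names are illustrative assumptions.
import time

ttl_spec = {
    "TableName": "ShoppingCarts",  # hypothetical table
    "TimeToLiveSpecification": {
        "Enabled": True,
        "AttributeName": "expires_at",  # item attribute holding epoch seconds
    },
}

# A TTL value an item could be written with: expire ~24 hours from now.
item_ttl = int(time.time()) + 24 * 3600

# With AWS credentials configured:
# import boto3
# boto3.client("dynamodb").update_time_to_live(**ttl_spec)
```

TTL deletion is a background process, so expired items may persist briefly before removal; it should be treated as a cost-control mechanism, not a hard consistency guarantee.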

Real-World Application

[!TIP] Scenario: E-Commerce Platform

  • Hot Storage: Use Amazon DynamoDB for user shopping carts to ensure lightning-fast checkout experiences.
  • Warm Storage: Store invoice PDFs from the last 6 months in S3 Standard-IA; users rarely check them, but expect immediate access if they do.
  • Cold Storage: Move transaction logs older than 2 years to S3 Glacier Deep Archive to satisfy tax regulations at the lowest possible cost.
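The warm-storage step above can also be handled at write time rather than via a later transition: objects can be uploaded directly into Standard-IA. The bucket and key names below are illustrative assumptions.

```python
# Sketch: uploading an invoice PDF straight to S3 Standard-IA at write time.
# Bucket and key names are illustrative assumptions.
upload_args = {
    "Bucket": "example-invoices",
    "Key": "invoices/2024/06/INV-1001.pdf",
    "StorageClass": "STANDARD_IA",  # infrequent access, immediate retrieval
}

# With AWS credentials configured and pdf_bytes holding the file contents:
# import boto3
# boto3.client("s3").put_object(Body=pdf_bytes, **upload_args)
```

Writing directly to Standard-IA avoids paying Standard-class rates for the first 30 days, but note that IA classes carry per-GB retrieval charges and a minimum storage duration.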

Comparison Table: Block vs. File vs. Object

| Feature | Block (EBS) | File (EFS/FSx) | Object (S3) |
| --- | --- | --- | --- |
| Access | Attached to one instance | Shared (multiple instances) | Web-based (Internet) |
| Scalability | Fixed volume size | Elastic / auto-scaling | Virtually unlimited |
| Best For | Databases, ERP systems | Team shares, ML training | Data lakes, backups |
| Metadata | Minimal | Basic file system properties | Rich / custom metadata |

Checkpoint Questions

Which S3 class is best for data that can be easily recreated but requires low-cost infrequent access?

S3 One Zone-IA. It offers lower costs than Standard-IA by storing data in a single Availability Zone, making it ideal for non-critical, reproducible data.

What is the primary benefit of S3 Intelligent-Tiering?

It automatically moves data between frequent and infrequent access tiers based on changing access patterns without operational overhead or retrieval fees.
