# Curriculum Overview: Performing Data Ingestion (AWS DEA-C01)
This curriculum provides a comprehensive roadmap for mastering Task 1.1: Perform Data Ingestion as defined in the AWS Certified Data Engineer - Associate (DEA-C01) exam. It focuses on the ability to collect, move, and integrate data from diverse sources—including on-premises, SaaS, and real-time streams—into the AWS ecosystem.
## Prerequisites
Before starting this module, students should possess the following foundational knowledge:
- Cloud Fundamentals: Basic understanding of AWS global infrastructure (Regions, Availability Zones).
- Data Structures: Familiarity with structured (SQL), semi-structured (JSON/CSV), and unstructured data.
- Networking Basics: Understanding of IP addressing, APIs (REST), and basic security concepts like allowlists.
- Database Concepts: Knowledge of CRUD operations and the difference between OLTP and OLAP systems.
## Module Breakdown
| Module | Focus Area | Key Services | Difficulty |
|---|---|---|---|
| 1. Ingestion Patterns | Real-time vs. Batch vs. Near Real-time | Kinesis, S3, Glue | Beginner |
| 2. Streaming Architecture | High-velocity event processing | MSK, Kinesis Data Streams | Advanced |
| 3. Batch & SaaS Ingestion | Scheduled loads and API integration | AppFlow, Glue, Lambda | Intermediate |
| 4. Hybrid & Third-Party | On-premises and Marketplace data | DataSync, Data Exchange | Intermediate |
| 5. Triggering & Scaling | Event-driven architecture and throttling | EventBridge, S3 Notifications | Intermediate |
## Visualizing the Ingestion Landscape
Each source type maps to a primary AWS entry point:
- Real-time streams → Kinesis Data Streams, Amazon MSK
- SaaS applications → Amazon AppFlow
- On-premises file systems → AWS DataSync
- Third-party datasets → AWS Data Exchange
- Event-driven triggers → EventBridge, S3 Event Notifications
## Learning Objectives per Module
### Module 1: Fundamental Patterns
- Differentiate between Batch (periodic loads, high throughput) and Streaming (continuous flow, low latency) requirements.
- Identify when Zero-ETL integrations are the optimal choice for cost and simplicity.
### Module 2: Streaming Ingestion (Skills 1.1.1, 1.1.10)
- Configure producers and consumers for Amazon Kinesis and Amazon MSK.
- Manage Fan-in (multiple sources to one stream) and Fan-out (one stream to multiple consumers) patterns.
- Explain Replayability: the ability to re-process data from a specific point in time in a stream.
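Replayability can be illustrated with a minimal in-memory sketch (the event shape here is hypothetical; a real Kinesis consumer would instead request a shard iterator of type `AT_TIMESTAMP`):

```python
from datetime import datetime, timezone

# Hypothetical in-memory "stream": (arrival_time, payload) pairs, in order.
stream = [
    (datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), {"click": "home"}),
    (datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), {"click": "cart"}),
    (datetime(2024, 1, 1, 12, 9, tzinfo=timezone.utc), {"click": "pay"}),
]

def replay_from(stream, start_time):
    """Re-process every record at or after start_time -- the essence of
    replaying a stream from a specific point in time."""
    return [payload for ts, payload in stream if ts >= start_time]

# Replaying from 12:05 re-delivers the last two events.
late = replay_from(stream, datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc))
```

Because the stream retains records for a retention window, a consumer that crashes or deploys a bug fix can rewind and re-read, rather than losing data.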
### Module 3: Batch and API Ingestion (Skills 1.1.2, 1.1.4)
- Use Amazon AppFlow to ingest data from SaaS applications (e.g., Salesforce, Slack).
- Implement AWS Lambda to consume Data APIs and push results to S3 or DynamoDB.
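A common shape for consuming a Data API from Lambda is draining token-based pagination before staging results. The sketch below assumes a hypothetical response shape (`items` plus an optional `next_token`); in a real handler, `records` would then be serialized and written to S3 with boto3's `put_object`:

```python
def fetch_all(fetch_page, max_pages=100):
    """Drain a token-paginated Data API: call fetch_page(token) until the
    API stops returning a next_token.
    Assumed (hypothetical) page shape: {"items": [...], "next_token": str|None}.
    """
    items, token = [], None
    for _ in range(max_pages):  # guard against runaway pagination
        page = fetch_page(token)
        items.extend(page["items"])
        token = page.get("next_token")
        if token is None:
            break
    return items

# Fake two-page API standing in for a real SaaS endpoint.
pages = {None: {"items": [1, 2], "next_token": "p2"},
         "p2": {"items": [3], "next_token": None}}
records = fetch_all(lambda t: pages[t])
```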
### Module 4: Connectivity and Security (Skills 1.1.8, 1.1.9)
- Implement Throttling and handle rate limits for high-demand services like DynamoDB and RDS.
- Configure IP Allowlists to secure connections between AWS and external data sources.
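One way to stay under a downstream rate limit (for example, provisioned DynamoDB throughput) is a client-side token bucket. This is a minimal sketch of the pattern, not any particular SDK's implementation:

```python
import time

class TokenBucket:
    """Client-side throttle: allow at most `rate` requests per second,
    with bursts up to `capacity` tokens."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, then spend one.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)  # 5 req/s, burst of 2
decisions = [bucket.allow() for _ in range(3)]  # third call exceeds the burst
```

Requests denied by the bucket are deferred locally instead of being rejected by the service, which keeps retries (and costs) down.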
## Comparison: Streaming vs. Batch Latency
| Pattern | Typical Latency | Representative Services |
|---|---|---|
| Streaming | Milliseconds to seconds | Kinesis Data Streams, MSK |
| Near real-time | Seconds to minutes (buffered micro-batches) | Kinesis Data Firehose |
| Batch | Minutes to hours (scheduled loads) | AWS Glue, S3 |
## Success Metrics
To demonstrate mastery of this curriculum, the student must be able to:
- Architecture Selection: Given a scenario (e.g., "Analyze clickstream data with < 1s delay"), select the correct service (Kinesis Data Streams).
- Configuration Mastery: Correctly choose between S3 Event Notifications and EventBridge for triggering downstream Lambda functions.
- Efficiency: Explain why AWS DataSync is superior to traditional CLI copies for moving large on-premises datasets (parallelism, metadata preservation).
- Error Handling: Design a pipeline that implements exponential backoff to handle service throttling.
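The backoff metric above can be sketched with the "full jitter" variant that AWS SDKs commonly apply to throttling errors: each retry waits a random time in `[0, min(cap, base * 2**attempt)]`. The function names and defaults here are illustrative:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff: random delay between 0 and an
    exponentially growing ceiling, capped at `cap` seconds."""
    return [rng() * min(cap, base * (2 ** attempt))
            for attempt in range(max_retries)]

# Deterministic rng for illustration: always pick the ceiling.
delays = backoff_delays(max_retries=4, rng=lambda: 1.0)
# Ceilings double each attempt: 0.5, 1.0, 2.0, 4.0 seconds.
```

The jitter spreads retries from many clients apart in time, avoiding the synchronized "thundering herd" that fixed delays would cause.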
## Real-World Application
> [!IMPORTANT]
> Data Ingestion is the "front door" of any data platform. If ingestion fails or introduces poor data, the entire downstream analytics pipeline becomes unreliable.
- Hybrid Cloud Strategy: Large enterprises use AWS DataSync to maintain a unified view across on-premises legacy systems and modern cloud warehouses.
- IoT and Smart Cities: Real-time ingestion via Amazon MSK allows cities to process sensor data from thousands of traffic lights simultaneously to optimize traffic flow.
- Market Insight: Financial firms use AWS Data Exchange to automatically subscribe to and integrate daily stock market trends directly into their S3 data lakes without writing custom scrapers.
### Sample Interview Question
Scenario: You need to ingest daily 100GB CSV dumps from an on-premises SFTP server into Amazon S3. Which service should you use?
Answer: AWS DataSync. It is specifically designed for online data transfer between on-premises and AWS, handles scheduling, and only copies changed files after the initial seed, making it more efficient than manual scripts.