# Curriculum Overview: Performing Data Ingestion (AWS DEA-C01)
This curriculum provides a comprehensive roadmap for mastering Task 1.1: Perform Data Ingestion as defined in the AWS Certified Data Engineer - Associate (DEA-C01) exam. It focuses on the ability to collect, move, and integrate data from diverse sources—including on-premises, SaaS, and real-time streams—into the AWS ecosystem.
## Prerequisites
Before starting this module, students should possess the following foundational knowledge:
- Cloud Fundamentals: Basic understanding of AWS global infrastructure (Regions, Availability Zones).
- Data Structures: Familiarity with structured (SQL), semi-structured (JSON/CSV), and unstructured data.
- Networking Basics: Understanding of IP addressing, APIs (REST), and basic security concepts like allowlists.
- Database Concepts: Knowledge of CRUD operations and the difference between OLTP and OLAP systems.
## Module Breakdown
| Module | Focus Area | Key Services | Difficulty |
|---|---|---|---|
| 1. Ingestion Patterns | Real-time vs. Batch vs. Near Real-time | Kinesis, S3, Glue | Beginner |
| 2. Streaming Architecture | High-velocity event processing | MSK, Kinesis Data Streams | Advanced |
| 3. Batch & SaaS Ingestion | Scheduled loads and API integration | AppFlow, Glue, Lambda | Intermediate |
| 4. Hybrid & Third-Party | On-premises and Marketplace data | DataSync, Data Exchange | Intermediate |
| 5. Triggering & Scaling | Event-driven architecture and throttling | EventBridge, S3 Notifications | Intermediate |
## Visualizing the Ingestion Landscape
Each source type maps to a primary AWS entry point:
- Real-time streams → Kinesis Data Streams, Amazon MSK
- SaaS applications → Amazon AppFlow
- On-premises file systems → AWS DataSync
- Third-party datasets → AWS Data Exchange
- Event-driven triggers → EventBridge, S3 Event Notifications
## Learning Objectives per Module
### Module 1: Fundamental Patterns
- Differentiate between Batch (periodic loads, high throughput) and Streaming (continuous flow, low latency) requirements.
- Identify when Zero-ETL integrations are the optimal choice for cost and simplicity.
### Module 2: Streaming Ingestion (Skills 1.1.1, 1.1.10)
- Configure producers and consumers for Amazon Kinesis and Amazon MSK.
- Manage Fan-in (multiple sources to one stream) and Fan-out (one stream to multiple consumers) patterns.
- Explain Replayability: the ability to re-process data from a specific point in time in a stream.
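Replayability can be illustrated with a minimal in-memory sketch (the event shape here is hypothetical; a real Kinesis consumer would instead request a shard iterator of type `AT_TIMESTAMP`):

```python
from datetime import datetime, timezone

# Hypothetical in-memory "stream": (arrival_time, payload) pairs, in order.
stream = [
    (datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), {"click": "home"}),
    (datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), {"click": "cart"}),
    (datetime(2024, 1, 1, 12, 9, tzinfo=timezone.utc), {"click": "pay"}),
]

def replay_from(stream, start_time):
    """Re-process every record at or after start_time -- the essence of
    replaying a stream from a specific point in time."""
    return [payload for ts, payload in stream if ts >= start_time]

# Replaying from 12:05 re-delivers the last two events.
late = replay_from(stream, datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc))
```

Because the stream retains records for a retention window, a consumer that crashes or deploys a bug fix can rewind and re-read, rather than losing data.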
### Module 3: Batch and API Ingestion (Skills 1.1.2, 1.1.4)
- Use Amazon AppFlow to ingest data from SaaS applications (e.g., Salesforce, Slack).
- Implement AWS Lambda to consume Data APIs and push results to S3 or DynamoDB.
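A common shape for consuming a Data API from Lambda is draining token-based pagination before staging results. The sketch below assumes a hypothetical response shape (`items` plus an optional `next_token`); in a real handler, `records` would then be serialized and written to S3 with boto3's `put_object`:

```python
def fetch_all(fetch_page, max_pages=100):
    """Drain a token-paginated Data API: call fetch_page(token) until the
    API stops returning a next_token.
    Assumed (hypothetical) page shape: {"items": [...], "next_token": str|None}.
    """
    items, token = [], None
    for _ in range(max_pages):  # guard against runaway pagination
        page = fetch_page(token)
        items.extend(page["items"])
        token = page.get("next_token")
        if token is None:
            break
    return items

# Fake two-page API standing in for a real SaaS endpoint.
pages = {None: {"items": [1, 2], "next_token": "p2"},
         "p2": {"items": [3], "next_token": None}}
records = fetch_all(lambda t: pages[t])
```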
### Module 4: Connectivity and Security (Skills 1.1.8, 1.1.9)
- Implement Throttling and handle rate limits for high-demand services like DynamoDB and RDS.
- Configure IP Allowlists to secure connections between AWS and external data sources.
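One way to stay under a downstream rate limit (for example, provisioned DynamoDB throughput) is a client-side token bucket. This is a minimal sketch of the pattern, not any particular SDK's implementation:

```python
import time

class TokenBucket:
    """Client-side throttle: allow at most `rate` requests per second,
    with bursts up to `capacity` tokens."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self):
        # Refill tokens proportionally to elapsed time, then spend one.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)  # 5 req/s, burst of 2
decisions = [bucket.allow() for _ in range(3)]  # third call exceeds the burst
```

Requests denied by the bucket are deferred locally instead of being rejected by the service, which keeps retries (and costs) down.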
## Comparison: Streaming vs. Batch Latency
| Pattern | Typical Latency | Representative Services |
|---|---|---|
| Streaming | Milliseconds to seconds | Kinesis Data Streams, MSK |
| Near real-time | Seconds to minutes (buffered micro-batches) | Kinesis Data Firehose |
| Batch | Minutes to hours (scheduled loads) | AWS Glue, S3 |
## Success Metrics
To demonstrate mastery of this curriculum, the student must be able to:
- Architecture Selection: Given a scenario (e.g., "Analyze clickstream data with < 1s delay"), select the correct service (Kinesis Data Streams).
- Configuration Mastery: Correctly choose between S3 Event Notifications and EventBridge for triggering downstream Lambda functions.
- Efficiency: Explain why AWS DataSync is superior to traditional CLI copies for moving large on-premises datasets (parallelism, metadata preservation).
- Error Handling: Design a pipeline that implements exponential backoff to handle service throttling.
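The backoff metric above can be sketched with the "full jitter" variant that AWS SDKs commonly apply to throttling errors: each retry waits a random time in `[0, min(cap, base * 2**attempt)]`. The function names and defaults here are illustrative:

```python
import random

def backoff_delays(max_retries=5, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff: random delay between 0 and an
    exponentially growing ceiling, capped at `cap` seconds."""
    return [rng() * min(cap, base * (2 ** attempt))
            for attempt in range(max_retries)]

# Deterministic rng for illustration: always pick the ceiling.
delays = backoff_delays(max_retries=4, rng=lambda: 1.0)
# Ceilings double each attempt: 0.5, 1.0, 2.0, 4.0 seconds.
```

The jitter spreads retries from many clients apart in time, avoiding the synchronized "thundering herd" that fixed delays would cause.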
## Real-World Application
> [!IMPORTANT]
> Data Ingestion is the "front door" of any data platform. If ingestion fails or introduces poor data, the entire downstream analytics pipeline becomes unreliable.
- Hybrid Cloud Strategy: Large enterprises use AWS DataSync to maintain a unified view across on-premises legacy systems and modern cloud warehouses.
- IoT and Smart Cities: Real-time ingestion via Amazon MSK allows cities to process sensor data from thousands of traffic lights simultaneously to optimize traffic flow.
- Market Insight: Financial firms use AWS Data Exchange to automatically subscribe to and integrate daily stock market trends directly into their S3 data lakes without writing custom scrapers.
### Sample Interview Question
Scenario: You need to ingest daily 100GB CSV dumps from an on-premises SFTP server into Amazon S3. Which service should you use?
Answer: AWS DataSync. It is specifically designed for online data transfer between on-premises and AWS, handles scheduling, and only copies changed files after the initial seed, making it more efficient than manual scripts.