
Reading Data from Streaming Sources: AWS Data Engineer Study Guide

Read data from streaming sources (for example, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon Redshift)

Reading Data from Streaming Sources

This guide covers the core mechanisms, services, and best practices for ingesting and consuming real-time data within the AWS ecosystem, specifically tailored for the AWS Certified Data Engineer – Associate (DEA-C01) exam.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between Amazon Kinesis Data Streams (KDS) and Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  • Explain how Change Data Capture (CDC) works using Amazon DynamoDB Streams and AWS DMS.
  • Configure Amazon Redshift Streaming Ingestion for low-latency analytics.
  • Identify the appropriate consumer for specific latency and transformation requirements (e.g., AWS Glue Streaming, AWS Lambda).
  • Manage streaming performance through sharding, fan-out, and throttling.

Key Terms & Glossary

  • Producer: An application or service that sends data records to a stream (e.g., Kinesis Producer Library, AWS IoT Core).
  • Consumer: An application or service that processes data from a stream (e.g., AWS Lambda, Kinesis Client Library).
  • Shard: The base throughput unit of a Kinesis data stream. Each shard provides a fixed capacity of 1 MB/sec (or 1,000 records/sec) ingress and 2 MB/sec egress.
  • Partition: In MSK/Kafka, a unit of parallelism similar to a Kinesis shard.
  • Change Data Capture (CDC): A process that identifies and captures changes made to a database (inserts, updates, deletes) and delivers them to a downstream target.
  • Checkpointing: The process of recording the last successfully processed record in a stream to ensure processing can resume after a failure.
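The per-shard limits above drive stream sizing: a stream needs enough shards to cover whichever of the byte rate or the record rate is the bottleneck. A minimal sketch in plain Python (no AWS calls; the 1 MB/sec and 1,000 records/sec figures are the per-shard write limits from the glossary):

```python
import math

def required_shards(ingress_mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate shards needed: each shard accepts 1 MB/sec or 1,000 records/sec
    of writes, whichever limit is hit first."""
    by_bytes = math.ceil(ingress_mb_per_sec / 1.0)     # 1 MB/sec per shard
    by_records = math.ceil(records_per_sec / 1000.0)   # 1,000 records/sec per shard
    return max(by_bytes, by_records, 1)

# 4.5 MB/sec of small records at 6,200 records/sec: record count is the bottleneck.
print(required_shards(4.5, 6200))  # → 7
```

Note that the record-count limit often dominates for small payloads, which is why batching records on the producer side (as the Kinesis Producer Library does) reduces shard count.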

The "Big Idea"

Streaming data ingestion is about decoupling data producers from data consumers. In a traditional batch system, the system waits for a file to land; in a streaming system, data flows continuously. The "Big Idea" is to enable near-real-time insights by reducing the time from data generation to data availability in a warehouse or lake, allowing businesses to react to events (like fraud or stock price shifts) as they happen.

Formula / Concept Box

| Concept | Kinesis Data Streams | Amazon MSK |
| --- | --- | --- |
| Scalability | Shard-based (auto scaling available) | Partition-based (manual/auto scaling) |
| Retention | 24 hours (default), up to 365 days | Configurable (unlimited storage via Tiered Storage) |
| Delivery | At-least-once (standard) | At-least-once (standard) / exactly-once (Kafka API) |
| Throttling | ProvisionedThroughputExceededException | Resource-based (CPU/memory/disk) |
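When producers exceed a shard's write limits, Kinesis returns ProvisionedThroughputExceededException; the usual remedies are more shards, a better-distributed partition key, or retries with exponential backoff (the Kinesis Producer Library retries internally). A minimal sketch of the backoff pattern — the exception class and put function here are local stand-ins, not the real boto3 objects:

```python
import time

class ProvisionedThroughputExceededException(Exception):
    """Stand-in for the real botocore error class; illustrative only."""

def put_with_backoff(put_fn, record, max_attempts=5, base_delay=0.01):
    """Retry a record put with exponential backoff when throttled."""
    for attempt in range(max_attempts):
        try:
            return put_fn(record)
        except ProvisionedThroughputExceededException:
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms, ...
    raise RuntimeError(f"still throttled after {max_attempts} attempts")

# Simulated stream endpoint that throttles the first two attempts.
attempts = {"count": 0}
def flaky_put(record):
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ProvisionedThroughputExceededException()
    return "accepted"

result = put_with_backoff(flaky_put, b"clickstream-payload")
print(result)  # → accepted
```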

Hierarchical Outline

  1. Streaming Storage Layer
    • Amazon Kinesis Data Streams (KDS): Serverless, AWS-native, deep integration with AWS ecosystem.
    • Amazon Managed Streaming for Apache Kafka (MSK): Managed open-source Kafka, ideal for migrations or existing Kafka-based codebases.
  2. Change Data Capture (CDC) Sources
    • DynamoDB Streams: Captures item-level changes; integrates directly with Kinesis or Lambda.
    • AWS Database Migration Service (DMS): Replicates transactional database logs (RDS, On-prem) to Kinesis/MSK.
  3. Consumption & Transformation
    • AWS Lambda: Best for simple, per-record transformations or small batches.
    • AWS Glue Streaming: Spark-based; best for complex ETL, join operations, and schema management.
    • Amazon Redshift Streaming Ingestion: Least operational overhead for loading Kinesis/MSK data directly into Redshift.
    • Amazon Data Firehose: Fully managed "delivery stream" to S3, Redshift, or OpenSearch.

Visual Anchors

High-Level Streaming Architecture

Producers (applications, IoT, CDC sources) → streaming storage (Kinesis Data Streams / Amazon MSK) → consumers (Lambda, Glue Streaming, Firehose, Redshift Streaming Ingestion) → destinations (S3, Redshift, OpenSearch).

Kinesis Shard Structure

```latex
\begin{tikzpicture}[node distance=1cm, every node/.style={draw, rectangle, minimum height=1cm}]
  \node (S1) {Shard 1 (Partition Key Range A)};
  \node (S2) [below of=S1] {Shard 2 (Partition Key Range B)};
  \node (S3) [below of=S2] {Shard 3 (Partition Key Range C)};

  \draw [->, thick] (-3, -1) -- node[above, draw=none] {Producers (PUT Records)} (S1.west);
  \draw [->, thick] (-3, -1) -- (S2.west);
  \draw [->, thick] (-3, -1) -- (S3.west);
  \draw [->, thick] (S1.east) -- node[above, draw=none] {Consumers (GET Records)} (4, -1);
  \draw [->, thick] (S2.east) -- (4, -1);
  \draw [->, thick] (S3.east) -- (4, -1);
\end{tikzpicture}
```
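The routing in the diagram above is deterministic: Kinesis takes the MD5 hash of the partition key and maps it into a 128-bit hash-key space, which the shards divide into contiguous ranges. A simplified sketch assuming evenly split ranges (real streams can have uneven ranges after splits and merges):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via MD5, assuming the 128-bit
    hash-key space is split evenly across shards (the default for a new stream)."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)

# All records for one key land on the same shard, preserving per-key ordering.
print(shard_for_key("user-42", 3) == shard_for_key("user-42", 3))  # → True
```

This is why a low-cardinality or skewed partition key creates "hot shards": every record with the same key hashes to the same shard, regardless of how many shards exist.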

Definition-Example Pairs

  • Shard Iterators: A pointer that specifies the position in a shard from which a consumer starts reading (e.g., TRIM_HORIZON, LATEST, AT_SEQUENCE_NUMBER).
    • Example: Using TRIM_HORIZON to read from the oldest retained record in the stream to catch up on history after a consumer outage.
  • Enhanced Fan-out: A feature that gives each registered consumer a dedicated 2 MB/sec of read throughput per shard, instead of sharing the standard 2 MB/sec shard limit.
    • Example: Multiple independent applications (e.g., one for fraud detection, one for real-time dashboarding) reading from the same Kinesis stream without competing for throughput.
  • MSK Connect: A feature of Amazon MSK to run Kafka Connect connectors.
    • Example: Continuously ingesting data from a MongoDB database into an MSK topic without writing custom producer code.
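The difference between iterator types is easiest to see with a toy model. This is not the Kinesis API — just an in-memory "shard" illustrating where each iterator type starts reading:

```python
def get_iterator(stream, iterator_type):
    """Toy model of GetShardIterator semantics on an in-memory 'shard'."""
    if iterator_type == "TRIM_HORIZON":
        return 0                # start at the oldest retained record
    if iterator_type == "LATEST":
        return len(stream)      # only records that arrive afterwards
    raise ValueError(iterator_type)

stream = ["evt-1", "evt-2", "evt-3"]        # records already in the shard
trim = get_iterator(stream, "TRIM_HORIZON")
latest = get_iterator(stream, "LATEST")
stream.append("evt-4")                      # arrives after both iterators were taken

print(stream[trim:])    # → ['evt-1', 'evt-2', 'evt-3', 'evt-4']
print(stream[latest:])  # → ['evt-4']
```

In practice a consumer records its position (checkpointing, per the glossary) so that after a failure it resumes from a sequence number rather than re-reading from TRIM_HORIZON.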

Worked Examples

Example 1: Loading Kinesis into Redshift with "Least Operational Overhead"

Scenario: An e-commerce company wants to ingest clickstream data from Kinesis into Redshift for SQL analysis.

Solution:

  1. Create an External Schema in Redshift pointing to the Kinesis Stream.
  2. Create a Materialized View in Redshift that selects from the Kinesis stream.
  3. Redshift consumes the data directly via the Kinesis API.
  4. Why? No intermediate Lambda or Firehose is required, reducing latency and infrastructure management.
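The three steps above boil down to two DDL statements. A sketch of that SQL held in Python strings — the schema, view, and stream names and the IAM role ARN are placeholders, and the exact DDL should be checked against the Redshift streaming ingestion documentation:

```python
# Placeholder names throughout: role ARN, schema, stream, and view are illustrative.
external_schema_sql = """
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftStreamingRole';
"""

materialized_view_sql = """
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."clickstream";
"""
```

Querying clickstream_mv then returns near-real-time stream data with no Lambda, Firehose, or intermediate S3 staging.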

Example 2: Handling DynamoDB Changes

Scenario: A profile update in a DynamoDB table needs to trigger an email notification.

Solution:

  1. Enable DynamoDB Streams on the table (New Image or New and Old Images).
  2. Create an AWS Lambda function and map the DynamoDB Stream as the event source.
  3. The Lambda function receives a batch of records representing the changes and calls Amazon SES to send the email.
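The Lambda in steps 2–3 receives records in DynamoDB's attribute-value format. A runnable sketch of the handler: the send_email parameter stands in for an Amazon SES call (e.g., via boto3) and is injected so the example runs locally, and the event shape is trimmed to the fields used:

```python
def handler(event, send_email=print):
    """Sketch of the notification Lambda. `send_email` is a stand-in for
    an Amazon SES call, injected so this runs without AWS."""
    notified = []
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue  # only react to profile updates, not inserts/deletes
        new_image = record["dynamodb"]["NewImage"]
        email = new_image["email"]["S"]   # DynamoDB attribute-value format
        send_email(email)
        notified.append(email)
    return notified

sample_event = {
    "Records": [
        {"eventName": "MODIFY",
         "dynamodb": {"NewImage": {"email": {"S": "user@example.com"}}}},
        {"eventName": "INSERT",
         "dynamodb": {"NewImage": {"email": {"S": "new@example.com"}}}},
    ]
}
print(handler(sample_event, send_email=lambda addr: None))
# → ['user@example.com']
```

Note the stream view type matters here: the handler reads NewImage, which is only present when the stream is configured with New Image or New and Old Images.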

Checkpoint Questions

  1. What is the default data retention period for a Kinesis Data Stream?
  2. Which service would you use to move data from an on-premises Oracle database to Kinesis in near-real-time?
  3. How many MB/sec of read throughput does a single Kinesis Shard provide to all shared consumers combined?
  4. When should you choose Amazon MSK over Kinesis Data Streams?
Answers
  1. 24 hours.
  2. AWS Database Migration Service (DMS) with Change Data Capture (CDC).
  3. 2 MB/sec.
  4. Choose MSK if you have an existing Kafka codebase, require open-source compatibility, or need exactly-once processing semantics.

Comparison Tables

Comparison: Kinesis Data Streams vs. Amazon Data Firehose

| Feature | Kinesis Data Streams | Amazon Data Firehose |
| --- | --- | --- |
| Purpose | Real-time stream storage | Fully managed delivery to sinks |
| Latency | < 1 second | 60 seconds (minimum) or 1 MB buffer |
| Coding | Requires custom consumer code | Zero-code (configuration only) |
| Scaling | Manual/auto shard management | Fully automated scaling |

Muddy Points & Cross-Refs

  • Shard Splitting/Merging: This is how you scale KDS. Splitting a shard increases capacity; merging shards decreases cost. Both require careful handling of partition keys.
  • Throttling vs. Lag: Throttling (ProvisionedThroughputExceeded) happens at the producer/stream level. Consumer Lag happens when the consumer cannot keep up with the incoming data velocity (often visible in CloudWatch metrics).
  • Cross-Reference: For more on how to transform this data after reading it, see the AWS Glue ETL & Spark study guide.

[!IMPORTANT] For the exam, always look for the phrase "least operational overhead." If the target is Redshift or S3, Amazon Data Firehose or Redshift Streaming Ingestion are usually the preferred answers over custom Lambda code.
