
Reading Data from Streaming Sources: AWS Data Engineer Study Guide

Read data from streaming sources (for example, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka [Amazon MSK], Amazon DynamoDB Streams, AWS Database Migration Service [AWS DMS], AWS Glue, Amazon Redshift)

Reading Data from Streaming Sources

This guide covers the core mechanisms, services, and best practices for ingesting and consuming real-time data within the AWS ecosystem, specifically tailored for the AWS Certified Data Engineer – Associate (DEA-C01) exam.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between Amazon Kinesis Data Streams (KDS) and Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  • Explain how Change Data Capture (CDC) works using Amazon DynamoDB Streams and AWS DMS.
  • Configure Amazon Redshift Streaming Ingestion for low-latency analytics.
  • Identify the appropriate consumer for specific latency and transformation requirements (e.g., AWS Glue Streaming, AWS Lambda).
  • Manage streaming performance through sharding, fan-out, and throttling.

Key Terms & Glossary

  • Producer: An application or service that sends data records to a stream (e.g., Kinesis Producer Library, AWS IoT Core).
  • Consumer: An application or service that processes data from a stream (e.g., AWS Lambda, Kinesis Client Library).
  • Shard: The base throughput unit of a Kinesis data stream. Each shard provides a fixed capacity of 1 MB/sec (or 1,000 records/sec) ingress and 2 MB/sec egress.
  • Partition: In MSK/Kafka, a unit of parallelism similar to a Kinesis shard.
  • Change Data Capture (CDC): A process that identifies and captures changes made to a database (inserts, updates, deletes) and delivers them to a downstream target.
  • Checkpointing: The process of recording the last successfully processed record in a stream to ensure processing can resume after a failure.
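The per-shard limits above drive stream sizing: a stream needs enough shards to cover whichever of the byte rate or the record rate is the bottleneck. A minimal sketch in plain Python (no AWS calls; the 1 MB/sec and 1,000 records/sec figures are the per-shard write limits from the glossary):

```python
import math

def required_shards(ingress_mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate shards needed: each shard accepts 1 MB/sec or 1,000 records/sec
    of writes, whichever limit is hit first."""
    by_bytes = math.ceil(ingress_mb_per_sec / 1.0)     # 1 MB/sec per shard
    by_records = math.ceil(records_per_sec / 1000.0)   # 1,000 records/sec per shard
    return max(by_bytes, by_records, 1)

# 4.5 MB/sec of small records at 6,200 records/sec: record count is the bottleneck.
print(required_shards(4.5, 6200))  # → 7
```

Note that the record-count limit often dominates for small payloads, which is why batching records on the producer side (as the Kinesis Producer Library does) reduces shard count.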

The "Big Idea"

Streaming data ingestion is about decoupling data producers from data consumers. In a traditional batch system, the system waits for a file to land; in a streaming system, data flows continuously. The "Big Idea" is to enable near-real-time insights by reducing the time from data generation to data availability in a warehouse or lake, allowing businesses to react to events (like fraud or stock price shifts) as they happen.

Formula / Concept Box

| Concept | Kinesis Data Streams | Amazon MSK |
| --- | --- | --- |
| Scalability | Shard-based (auto scaling available) | Partition-based (manual/auto scaling) |
| Retention | 24 hours (default), up to 365 days | Configurable (unlimited storage via Tiered Storage) |
| Delivery | At-least-once (standard) | At-least-once (standard) / exactly-once (Kafka API) |
| Throttling | ProvisionedThroughputExceededException | Resource-based (CPU/memory/disk) |
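When producers exceed a shard's write limits, Kinesis returns ProvisionedThroughputExceededException; the usual remedies are more shards, a better-distributed partition key, or retries with exponential backoff (the Kinesis Producer Library retries internally). A minimal sketch of the backoff pattern — the exception class and put function here are local stand-ins, not the real boto3 objects:

```python
import time

class ProvisionedThroughputExceededException(Exception):
    """Stand-in for the real botocore error class; illustrative only."""

def put_with_backoff(put_fn, record, max_attempts=5, base_delay=0.01):
    """Retry a record put with exponential backoff when throttled."""
    for attempt in range(max_attempts):
        try:
            return put_fn(record)
        except ProvisionedThroughputExceededException:
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms, ...
    raise RuntimeError(f"still throttled after {max_attempts} attempts")

# Simulated stream endpoint that throttles the first two attempts.
attempts = {"count": 0}
def flaky_put(record):
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ProvisionedThroughputExceededException()
    return "accepted"

result = put_with_backoff(flaky_put, b"clickstream-payload")
print(result)  # → accepted
```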

Hierarchical Outline

  1. Streaming Storage Layer
    • Amazon Kinesis Data Streams (KDS): Serverless, AWS-native, deep integration with AWS ecosystem.
    • Amazon Managed Streaming for Apache Kafka (MSK): Managed open-source Kafka, ideal for migrations or existing Kafka-based codebases.
  2. Change Data Capture (CDC) Sources
    • DynamoDB Streams: Captures item-level changes; integrates directly with Kinesis or Lambda.
    • AWS Database Migration Service (DMS): Replicates transactional database logs (RDS, On-prem) to Kinesis/MSK.
  3. Consumption & Transformation
    • AWS Lambda: Best for simple, per-record transformations or small batches.
    • AWS Glue Streaming: Spark-based; best for complex ETL, join operations, and schema management.
    • Amazon Redshift Streaming Ingestion: Least operational overhead for loading Kinesis/MSK data directly into Redshift.
    • Amazon Data Firehose: Fully managed "delivery stream" to S3, Redshift, or OpenSearch.

Visual Anchors

High-Level Streaming Architecture

Producers (applications, IoT, CDC sources) → streaming storage (Kinesis Data Streams / Amazon MSK) → consumers (Lambda, Glue Streaming, Firehose, Redshift Streaming Ingestion) → destinations (S3, Redshift, OpenSearch).

Kinesis Shard Structure

```latex
\begin{tikzpicture}[node distance=1cm, every node/.style={draw, rectangle, minimum height=1cm}]
  \node (S1) {Shard 1 (Partition Key Range A)};
  \node (S2) [below of=S1] {Shard 2 (Partition Key Range B)};
  \node (S3) [below of=S2] {Shard 3 (Partition Key Range C)};

  \draw [->, thick] (-3, -1) -- node[above, draw=none] {Producers (PUT Records)} (S1.west);
  \draw [->, thick] (-3, -1) -- (S2.west);
  \draw [->, thick] (-3, -1) -- (S3.west);
  \draw [->, thick] (S1.east) -- node[above, draw=none] {Consumers (GET Records)} (4, -1);
  \draw [->, thick] (S2.east) -- (4, -1);
  \draw [->, thick] (S3.east) -- (4, -1);
\end{tikzpicture}
```
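The routing in the diagram above is deterministic: Kinesis takes the MD5 hash of the partition key and maps it into a 128-bit hash-key space, which the shards divide into contiguous ranges. A simplified sketch assuming evenly split ranges (real streams can have uneven ranges after splits and merges):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via MD5, assuming the 128-bit
    hash-key space is split evenly across shards (the default for a new stream)."""
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hash_value // range_size, num_shards - 1)

# All records for one key land on the same shard, preserving per-key ordering.
print(shard_for_key("user-42", 3) == shard_for_key("user-42", 3))  # → True
```

This is why a low-cardinality or skewed partition key creates "hot shards": every record with the same key hashes to the same shard, regardless of how many shards exist.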

Definition-Example Pairs

  • Shard Iterators: A pointer that specifies the position in a shard from which a consumer starts reading (e.g., TRIM_HORIZON, LATEST, AT_SEQUENCE_NUMBER).
    • Example: Using TRIM_HORIZON to read from the oldest retained record in the stream to catch up on history after a consumer outage.
  • Enhanced Fan-out: A feature that gives each registered consumer a dedicated 2 MB/sec of read throughput per shard, instead of sharing the standard 2 MB/sec shard limit.
    • Example: Multiple independent applications (e.g., one for fraud detection, one for real-time dashboarding) reading from the same Kinesis stream without competing for throughput.
  • MSK Connect: A feature of Amazon MSK to run Kafka Connect connectors.
    • Example: Continuously ingesting data from a MongoDB database into an MSK topic without writing custom producer code.
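The difference between iterator types is easiest to see with a toy model. This is not the Kinesis API — just an in-memory "shard" illustrating where each iterator type starts reading:

```python
def get_iterator(stream, iterator_type):
    """Toy model of GetShardIterator semantics on an in-memory 'shard'."""
    if iterator_type == "TRIM_HORIZON":
        return 0                # start at the oldest retained record
    if iterator_type == "LATEST":
        return len(stream)      # only records that arrive afterwards
    raise ValueError(iterator_type)

stream = ["evt-1", "evt-2", "evt-3"]        # records already in the shard
trim = get_iterator(stream, "TRIM_HORIZON")
latest = get_iterator(stream, "LATEST")
stream.append("evt-4")                      # arrives after both iterators were taken

print(stream[trim:])    # → ['evt-1', 'evt-2', 'evt-3', 'evt-4']
print(stream[latest:])  # → ['evt-4']
```

In practice a consumer records its position (checkpointing, per the glossary) so that after a failure it resumes from a sequence number rather than re-reading from TRIM_HORIZON.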

Worked Examples

Example 1: Loading Kinesis into Redshift with "Least Operational Overhead"

Scenario: An e-commerce company wants to ingest clickstream data from Kinesis into Redshift for SQL analysis.

Solution:

  1. Create an External Schema in Redshift pointing to the Kinesis Stream.
  2. Create a Materialized View in Redshift that selects from the Kinesis stream.
  3. Redshift consumes the data directly via the Kinesis API.
  4. Why? No intermediate Lambda or Firehose is required, reducing latency and infrastructure management.
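The three steps above boil down to two DDL statements. A sketch of that SQL held in Python strings — the schema, view, and stream names and the IAM role ARN are placeholders, and the exact DDL should be checked against the Redshift streaming ingestion documentation:

```python
# Placeholder names throughout: role ARN, schema, stream, and view are illustrative.
external_schema_sql = """
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftStreamingRole';
"""

materialized_view_sql = """
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."clickstream";
"""
```

Querying clickstream_mv then returns near-real-time stream data with no Lambda, Firehose, or intermediate S3 staging.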

Example 2: Handling DynamoDB Changes

Scenario: A profile update in a DynamoDB table needs to trigger an email notification.

Solution:

  1. Enable DynamoDB Streams on the table (New Image or New and Old Images).
  2. Create an AWS Lambda function and map the DynamoDB Stream as the event source.
  3. The Lambda function receives a batch of records representing the changes and calls Amazon SES to send the email.
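The Lambda in steps 2–3 receives records in DynamoDB's attribute-value format. A runnable sketch of the handler: the send_email parameter stands in for an Amazon SES call (e.g., via boto3) and is injected so the example runs locally, and the event shape is trimmed to the fields used:

```python
def handler(event, send_email=print):
    """Sketch of the notification Lambda. `send_email` is a stand-in for
    an Amazon SES call, injected so this runs without AWS."""
    notified = []
    for record in event["Records"]:
        if record["eventName"] != "MODIFY":
            continue  # only react to profile updates, not inserts/deletes
        new_image = record["dynamodb"]["NewImage"]
        email = new_image["email"]["S"]   # DynamoDB attribute-value format
        send_email(email)
        notified.append(email)
    return notified

sample_event = {
    "Records": [
        {"eventName": "MODIFY",
         "dynamodb": {"NewImage": {"email": {"S": "user@example.com"}}}},
        {"eventName": "INSERT",
         "dynamodb": {"NewImage": {"email": {"S": "new@example.com"}}}},
    ]
}
print(handler(sample_event, send_email=lambda addr: None))
# → ['user@example.com']
```

Note the stream view type matters here: the handler reads NewImage, which is only present when the stream is configured with New Image or New and Old Images.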

Checkpoint Questions

  1. What is the default data retention period for a Kinesis Data Stream?
  2. Which service would you use to move data from an on-premises Oracle database to Kinesis in near-real-time?
  3. How many MB/sec of read throughput does a single Kinesis Shard provide to all shared consumers combined?
  4. When should you choose Amazon MSK over Kinesis Data Streams?
Answers
  1. 24 hours.
  2. AWS Database Migration Service (DMS) with Change Data Capture (CDC).
  3. 2 MB/sec.
  4. Choose MSK if you have an existing Kafka codebase, require open-source compatibility, or need exactly-once processing semantics.

Comparison Tables

Comparison: Kinesis Data Streams vs. Amazon Data Firehose

| Feature | Kinesis Data Streams | Amazon Data Firehose |
| --- | --- | --- |
| Purpose | Real-time stream storage | Fully managed delivery to sinks |
| Latency | < 1 second | 60 seconds (minimum) or 1 MB buffer |
| Coding | Requires custom consumer code | Zero-code (configuration only) |
| Scaling | Manual/auto shard management | Fully automated scaling |

Muddy Points & Cross-Refs

  • Shard Splitting/Merging: This is how you scale KDS. Splitting a shard increases capacity; merging shards decreases cost. Both require careful handling of partition keys.
  • Throttling vs. Lag: Throttling (ProvisionedThroughputExceeded) happens at the producer/stream level. Consumer Lag happens when the consumer cannot keep up with the incoming data velocity (often visible in CloudWatch metrics).
  • Cross-Reference: For more on how to transform this data after reading it, see the AWS Glue ETL & Spark study guide.

[!IMPORTANT] For the exam, always look for the phrase "least operational overhead." If the target is Redshift or S3, Amazon Data Firehose or Redshift Streaming Ingestion are usually the preferred answers over custom Lambda code.
