
Study Guide: Handling Streaming Data with AWS Services

This guide covers the essential components, architectural patterns, and services used to ingest, process, and analyze streaming data in real time on AWS, focusing on Amazon Kinesis, DynamoDB Streams, and AWS Lambda integration.

Learning Objectives

After studying this material, you should be able to:

  • Differentiate between Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics.
  • Explain how DynamoDB Streams capture item-level changes to trigger downstream processing.
  • Identify the role of AWS Lambda as a consumer for streaming data sources.
  • Calculate shard requirements for Kinesis Data Streams based on throughput needs.
  • Describe the Kinesis Client Library (KCL) and its role in distributed processing.

Key Terms & Glossary

  • Shard: The base throughput unit of a Kinesis Data Stream. It provides a fixed capacity (1 MB/sec in, 2 MB/sec out).
  • Partition Key: A value used by producers to group data into specific shards within a stream.
  • Sequence Number: A unique identifier assigned by Kinesis to each data record within a shard.
  • Fan-out: An architectural pattern where a single event is sent to multiple destinations or processed by multiple consumers simultaneously.
  • Checkpointing: The process in which a consumer records its progress (the last sequence number processed) to ensure it can resume after a failure.
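The Partition Key and Shard definitions above can be illustrated in a few lines of Python. Kinesis routes a record by taking the MD5 hash of its partition key and mapping the resulting 128-bit hash key to the shard owning that range; the sketch below assumes shards split the hash-key space evenly (real streams can have uneven ranges after resharding):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the shard index a partition key would route to, assuming the
    128-bit hash-key space is split evenly across shards."""
    # Kinesis applies MD5 to the partition key to derive a 128-bit hash key.
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2**128 // num_shards
    # Guard against the last partial range when num_shards doesn't divide 2^128.
    return min(hash_key // range_size, num_shards - 1)

# The same key always lands on the same shard, which is what preserves
# per-key ordering within a stream:
print(shard_for_key("sensor-42", 10) == shard_for_key("sensor-42", 10))  # True
```

This is also why a low-cardinality partition key is dangerous: with only two distinct keys, at most two shards ever receive data.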

The "Big Idea"

The shift from Batch Processing (handling data in large chunks at scheduled intervals) to Stream Processing (handling data record-by-record as it arrives) allows businesses to react to events in sub-second or sub-minute intervals. In the AWS ecosystem, this is achieved by decoupling producers and consumers using "buffer" services like Kinesis, which ensure data is durable and available for multiple downstream applications to consume at their own pace.

Formula / Concept Box

| Concept | Metric / Rule | Notes |
| --- | --- | --- |
| Kinesis Shard Capacity (Write) | 1 MB/sec or 1,000 records/sec | Whichever limit is hit first. |
| Kinesis Shard Capacity (Read) | 2 MB/sec | Shared across all non-enhanced consumers. |
| Data Retention | 24 hours to 365 days | Default is 24 hours. |
| Shard Count Formula | shards = max(incoming write BW ÷ 1 MB/s, outgoing read BW ÷ 2 MB/s) | Used to determine initial stream size. |

Hierarchical Outline

  • I. Amazon Kinesis Suite
    • A. Kinesis Data Streams (KDS): Low-latency ingestion for custom processing.
      • Capacity Modes: Provisioned (manual scaling) vs. On-Demand (auto-scaling).
    • B. Kinesis Data Firehose: Near real-time delivery to S3, Redshift, or OpenSearch.
      • Buffering: Delivers based on size (1–128 MB) or interval (60–900 sec).
    • C. Kinesis Data Analytics: SQL or Apache Flink processing on top of streams.
  • II. DynamoDB Streams
    • A. Functionality: Captures INSERT, MODIFY, and REMOVE events.
    • B. Integration: Seamlessly triggers Lambda for side effects (e.g., sending a welcome email).
  • III. AWS Lambda Integration
    • A. Event Source Mapping: Lambda polls the stream and invokes the function with batches of records.
    • B. Error Handling: Use of Dead Letter Queues (DLQ) or Lambda Destinations for failed stream records.
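The event source mapping in III.A can be sketched as a minimal handler. The event shape follows Lambda's documented Kinesis event format (base64-encoded payloads under `Records[].kinesis.data`); the business logic is a placeholder, and the `batchItemFailures` return value only takes effect if `ReportBatchItemFailures` is enabled on the mapping:

```python
import base64
import json

def handler(event, context):
    """Process a batch of Kinesis records delivered by an event source mapping."""
    failures = []
    for record in event["Records"]:
        try:
            # Kinesis payloads arrive base64-encoded in Records[].kinesis.data.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # ... business logic on `payload` goes here ...
        except Exception:
            # With ReportBatchItemFailures enabled, returning the failed
            # sequence numbers makes Lambda retry only those records instead
            # of the whole batch.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}
```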

Visual Anchors

Stream Processing Pipeline

[Diagram: stream processing pipeline]

Shard Throughput Visualization

```latex
\begin{tikzpicture}[scale=0.8]
  \draw[thick, ->] (0,0) -- (6,0) node[right] {Time};
  \draw[thick, ->] (0,0) -- (0,4) node[above] {Throughput};
  % Shard 1 box
  \draw[fill=blue!10] (0.5,0.5) rectangle (4.5,1.5);
  \node at (2.5,1) {Shard 1 (1MB/s)};
  % Shard 2 box
  \draw[fill=green!10] (0.5,1.7) rectangle (4.5,2.7);
  \node at (2.5,2.2) {Shard 2 (1MB/s)};
  % Total capacity line
  \draw[dashed, red] (0,3) -- (5,3) node[right] {Limit (2MB/s total)};
  % Text description
  \node[draw, text width=4cm, font=\small] at (8,2)
    {\textbf{Partition Key} determines which shard data flows into via a hash function.};
\end{tikzpicture}
```

Definition-Example Pairs

  • Consumer Lag: The delay between when a record is written to the stream and when it is processed by a consumer.
    • Example: A Lambda function taking 5 seconds to process a batch when new data arrives every 1 second causes the "IteratorAge" metric in CloudWatch to rise.
  • Record De-aggregation: The process of breaking down a large KPL (Kinesis Producer Library) batch into individual user records.
    • Example: A Lambda function using the Kinesis Client Library to extract 100 individual sensor readings from a single large Kinesis record.
  • Stream Windowing: Processing data in time-based blocks (sliding or tumbling).
    • Example: Calculating the average temperature from a sensor stream every 5 minutes using Kinesis Data Analytics.
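The tumbling-window example can be illustrated in plain Python. This is a sketch of the concept, not the Kinesis Data Analytics API; the 5-minute window size and the `(timestamp, value)` input format are assumptions:

```python
from collections import defaultdict

def tumbling_averages(readings, window_sec=300):
    """Average sensor values per fixed (tumbling) 300-second window.

    readings: iterable of (timestamp_seconds, value) pairs.
    Returns {window_start_seconds: average_value}.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        # Each timestamp belongs to exactly one non-overlapping window.
        buckets[(ts // window_sec) * window_sec].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
```

A sliding window would differ in that each reading could contribute to several overlapping windows rather than exactly one.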

Worked Examples

Example 1: Calculating Shard Count

Scenario: An application generates 5 MB/s of data with an average record size of 2 KB. How many shards are required for a Kinesis Data Stream?

Solution:

  1. Calculate by bandwidth: 5 MB/s ÷ 1 MB/s per shard = 5 shards.
  2. Calculate by record count: total records per second = 5,000 KB/s ÷ 2 KB/record = 2,500 records/s.
  3. Required shards (records): 2,500 records/s ÷ 1,000 records/s per shard = 2.5 shards.
  4. Final answer: take the higher of the two values, so you need 5 shards.
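The arithmetic above can be wrapped in a small helper for the write side (function and parameter names are illustrative):

```python
import math

def required_shards(write_mb_per_s: float, avg_record_kb: float) -> int:
    """Shards needed on the write side, limited by whichever per-shard cap
    is hit first: 1 MB/s of bandwidth or 1,000 records/s."""
    records_per_s = (write_mb_per_s * 1000) / avg_record_kb
    by_bandwidth = write_mb_per_s / 1.0   # 1 MB/s per shard
    by_records = records_per_s / 1000.0   # 1,000 records/s per shard
    return math.ceil(max(by_bandwidth, by_records))

print(required_shards(5, 2))  # 5 shards, matching the worked example
```

Note that small records make bandwidth the binding limit, while very small records (under 1 KB) can make the 1,000 records/s cap bind first.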

Example 2: DynamoDB Streams to Lambda

Goal: Implement a system that updates a "Total Sales" counter whenever a new order is added to the Orders table.

Workflow:

  1. Enable DynamoDB Streams on the Orders table (Select "New Image" to see the data after the change).
  2. Create an AWS Lambda function that parses the event record.
  3. The Lambda function uses UpdateItem on the Summary table to increment the TotalSales attribute.
  4. Configure the Trigger in the Lambda console, pointing to the DynamoDB Stream ARN.
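Steps 2–3 of the workflow above can be sketched as follows. The record shape follows the documented DynamoDB Streams event format; the `Amount` attribute, the `Summary` table's key schema, and the injected `summary_table` parameter are assumptions made for illustration (in Lambda you would pass `boto3.resource("dynamodb").Table("Summary")`):

```python
def handler(event, context, summary_table):
    """Increment TotalSales for each new order in a DynamoDB Streams batch.

    `summary_table` stands in for a boto3 DynamoDB Table object; injecting it
    keeps the function testable without AWS access.
    """
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # only new orders affect the counter
        # NewImage is present because the stream uses the "New Image" view.
        new_image = record["dynamodb"]["NewImage"]
        amount = float(new_image["Amount"]["N"])  # assumed attribute name
        summary_table.update_item(
            Key={"PK": "SALES_SUMMARY"},  # assumed key schema
            UpdateExpression="ADD TotalSales :amt",
            ExpressionAttributeValues={":amt": amount},
        )
```

Using `ADD` in the update expression makes the increment atomic on the DynamoDB side, so concurrent Lambda invocations do not lose updates.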

Checkpoint Questions

  1. What is the difference between Kinesis Data Streams and Kinesis Data Firehose regarding data retention?
  2. If a Lambda function is throttled while consuming a Kinesis stream, what happens to the data records?
  3. Which Kinesis service is best suited for loading data into an Amazon Redshift cluster with minimal coding?
  4. What metric should you monitor to identify if your consumers are falling behind the producers in a Kinesis stream?
  5. Explain why a partition key with low cardinality (e.g., only two possible values) is a poor choice for a stream with 10 shards.

[!TIP] Exam Hack: When the question mentions "Real-time" with "Sub-second latency," think Kinesis Data Streams. If it mentions "Minutes of latency" or "Delivery to S3/Redshift," think Firehose.

[!WARNING] Always ensure your Lambda execution role has kinesis:GetRecords, kinesis:GetShardIterator, and kinesis:DescribeStream permissions when using it as a consumer.
