
Study Guide: Handling Streaming Data with AWS Services

This guide covers the essential components, architectural patterns, and services used to ingest, process, and analyze streaming data in real time on AWS, focusing on Amazon Kinesis, DynamoDB Streams, and AWS Lambda integration.

Learning Objectives

After studying this material, you should be able to:

  • Differentiate between Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics.
  • Explain how DynamoDB Streams capture item-level changes to trigger downstream processing.
  • Identify the role of AWS Lambda as a consumer for streaming data sources.
  • Calculate shard requirements for Kinesis Data Streams based on throughput needs.
  • Describe the Kinesis Client Library (KCL) and its role in distributed processing.

Key Terms & Glossary

  • Shard: The base throughput unit of a Kinesis Data Stream. It provides a fixed capacity (1 MB/sec in, 2 MB/sec out).
  • Partition Key: A value used by producers to group data into specific shards within a stream.
  • Sequence Number: A unique identifier assigned by Kinesis to each data record within a shard.
  • Fan-out: An architectural pattern where a single event is sent to multiple destinations or processed by multiple consumers simultaneously.
  • Checkpointing: The process in which a consumer records its progress (the last sequence number processed) to ensure it can resume after a failure.
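The Partition Key and Shard definitions above can be illustrated in a few lines of Python. Kinesis routes a record by taking the MD5 hash of its partition key and mapping the resulting 128-bit hash key to the shard owning that range; the sketch below assumes shards split the hash-key space evenly (real streams can have uneven ranges after resharding):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the shard index a partition key would route to, assuming the
    128-bit hash-key space is split evenly across shards."""
    # Kinesis applies MD5 to the partition key to derive a 128-bit hash key.
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    range_size = 2**128 // num_shards
    # Guard against the last partial range when num_shards doesn't divide 2^128.
    return min(hash_key // range_size, num_shards - 1)

# The same key always lands on the same shard, which is what preserves
# per-key ordering within a stream:
print(shard_for_key("sensor-42", 10) == shard_for_key("sensor-42", 10))  # True
```

This is also why a low-cardinality partition key is dangerous: with only two distinct keys, at most two shards ever receive data.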

The "Big Idea"

The shift from Batch Processing (handling data in large chunks at scheduled intervals) to Stream Processing (handling data record-by-record as it arrives) allows businesses to react to events in sub-second or sub-minute intervals. In the AWS ecosystem, this is achieved by decoupling producers and consumers using "buffer" services like Kinesis, which ensure data is durable and available for multiple downstream applications to consume at their own pace.

Formula / Concept Box

| Concept | Metric / Rule | Notes |
| --- | --- | --- |
| Kinesis Shard Capacity (Write) | 1 MB/sec or 1,000 records/sec | Whichever limit is hit first. |
| Kinesis Shard Capacity (Read) | 2 MB/sec | Shared across all non-enhanced consumers. |
| Data Retention | 24 hours to 365 days | Default is 24 hours. |
| Shard Count Formula | shards = max(incoming write BW ÷ 1 MB/s, outgoing read BW ÷ 2 MB/s) | Used to determine initial stream size. |

Hierarchical Outline

  • I. Amazon Kinesis Suite
    • A. Kinesis Data Streams (KDS): Low-latency ingestion for custom processing.
      • Capacity Modes: Provisioned (manual scaling) vs. On-Demand (auto-scaling).
    • B. Kinesis Data Firehose: Near real-time delivery to S3, Redshift, or OpenSearch.
      • Buffering: Delivers based on size (1–128 MB) or interval (60–900 sec).
    • C. Kinesis Data Analytics: SQL or Apache Flink processing on top of streams.
  • II. DynamoDB Streams
    • A. Functionality: Captures INSERT, MODIFY, and REMOVE events.
    • B. Integration: Seamlessly triggers Lambda for side effects (e.g., sending a welcome email).
  • III. AWS Lambda Integration
    • A. Event Source Mapping: Lambda polls the stream and invokes the function with batches of records.
    • B. Error Handling: Use of Dead Letter Queues (DLQ) or Lambda Destinations for failed stream records.
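The event source mapping in III.A can be sketched as a minimal handler. The event shape follows Lambda's documented Kinesis event format (base64-encoded payloads under `Records[].kinesis.data`); the business logic is a placeholder, and the `batchItemFailures` return value only takes effect if `ReportBatchItemFailures` is enabled on the mapping:

```python
import base64
import json

def handler(event, context):
    """Process a batch of Kinesis records delivered by an event source mapping."""
    failures = []
    for record in event["Records"]:
        try:
            # Kinesis payloads arrive base64-encoded in Records[].kinesis.data.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # ... business logic on `payload` goes here ...
        except Exception:
            # With ReportBatchItemFailures enabled, returning the failed
            # sequence numbers makes Lambda retry only those records instead
            # of the whole batch.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
    return {"batchItemFailures": failures}
```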

Visual Anchors

Stream Processing Pipeline

[Diagram: stream processing pipeline]

Shard Throughput Visualization

```latex
\begin{tikzpicture}[scale=0.8]
  \draw[thick, ->] (0,0) -- (6,0) node[right] {Time};
  \draw[thick, ->] (0,0) -- (0,4) node[above] {Throughput};
  % Shard 1 box
  \draw[fill=blue!10] (0.5,0.5) rectangle (4.5,1.5);
  \node at (2.5,1) {Shard 1 (1MB/s)};
  % Shard 2 box
  \draw[fill=green!10] (0.5,1.7) rectangle (4.5,2.7);
  \node at (2.5,2.2) {Shard 2 (1MB/s)};
  % Total capacity line
  \draw[dashed, red] (0,3) -- (5,3) node[right] {Limit (2MB/s total)};
  % Text description
  \node[draw, text width=4cm, font=\small] at (8,2)
    {\textbf{Partition Key} determines which shard data flows into via a hash function.};
\end{tikzpicture}
```

Definition-Example Pairs

  • Consumer Lag: The delay between when a record is written to the stream and when it is processed by a consumer.
    • Example: A Lambda function taking 5 seconds to process a batch when new data arrives every 1 second causes the "IteratorAge" metric in CloudWatch to rise.
  • Record De-aggregation: The process of breaking down a large KPL (Kinesis Producer Library) batch into individual user records.
    • Example: A Lambda function using the Kinesis Client Library to extract 100 individual sensor readings from a single large Kinesis record.
  • Stream Windowing: Processing data in time-based blocks (sliding or tumbling).
    • Example: Calculating the average temperature from a sensor stream every 5 minutes using Kinesis Data Analytics.
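The tumbling-window example can be illustrated in plain Python. This is a sketch of the concept, not the Kinesis Data Analytics API; the 5-minute window size and the `(timestamp, value)` input format are assumptions:

```python
from collections import defaultdict

def tumbling_averages(readings, window_sec=300):
    """Average sensor values per fixed (tumbling) 300-second window.

    readings: iterable of (timestamp_seconds, value) pairs.
    Returns {window_start_seconds: average_value}.
    """
    buckets = defaultdict(list)
    for ts, value in readings:
        # Each timestamp belongs to exactly one non-overlapping window.
        buckets[(ts // window_sec) * window_sec].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
```

A sliding window would differ in that each reading could contribute to several overlapping windows rather than exactly one.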

Worked Examples

Example 1: Calculating Shard Count

Scenario: An application generates 5 MB/s of data with an average record size of 2 KB. How many shards are required for a Kinesis Data Stream?

Solution:

  1. Calculate by bandwidth: 5 MB/s ÷ 1 MB/s per shard = 5 shards.
  2. Calculate by record count: total records per second = 5,000 KB/s ÷ 2 KB/record = 2,500 records/s.
  3. Required shards (records): 2,500 records/s ÷ 1,000 records/s per shard = 2.5 shards.
  4. Final answer: take the higher of the two values, so you need 5 shards.
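The arithmetic above can be wrapped in a small helper for the write side (function and parameter names are illustrative):

```python
import math

def required_shards(write_mb_per_s: float, avg_record_kb: float) -> int:
    """Shards needed on the write side, limited by whichever per-shard cap
    is hit first: 1 MB/s of bandwidth or 1,000 records/s."""
    records_per_s = (write_mb_per_s * 1000) / avg_record_kb
    by_bandwidth = write_mb_per_s / 1.0   # 1 MB/s per shard
    by_records = records_per_s / 1000.0   # 1,000 records/s per shard
    return math.ceil(max(by_bandwidth, by_records))

print(required_shards(5, 2))  # 5 shards, matching the worked example
```

Note that small records make bandwidth the binding limit, while very small records (under 1 KB) can make the 1,000 records/s cap bind first.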

Example 2: DynamoDB Streams to Lambda

Goal: Implement a system that updates a "Total Sales" counter whenever a new order is added to the Orders table.

Workflow:

  1. Enable DynamoDB Streams on the Orders table (Select "New Image" to see the data after the change).
  2. Create an AWS Lambda function that parses the event record.
  3. The Lambda function uses UpdateItem on the Summary table to increment the TotalSales attribute.
  4. Configure the Trigger in the Lambda console, pointing to the DynamoDB Stream ARN.
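Steps 2–3 of the workflow above can be sketched as follows. The record shape follows the documented DynamoDB Streams event format; the `Amount` attribute, the `Summary` table's key schema, and the injected `summary_table` parameter are assumptions made for illustration (in Lambda you would pass `boto3.resource("dynamodb").Table("Summary")`):

```python
def handler(event, context, summary_table):
    """Increment TotalSales for each new order in a DynamoDB Streams batch.

    `summary_table` stands in for a boto3 DynamoDB Table object; injecting it
    keeps the function testable without AWS access.
    """
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue  # only new orders affect the counter
        # NewImage is present because the stream uses the "New Image" view.
        new_image = record["dynamodb"]["NewImage"]
        amount = float(new_image["Amount"]["N"])  # assumed attribute name
        summary_table.update_item(
            Key={"PK": "SALES_SUMMARY"},  # assumed key schema
            UpdateExpression="ADD TotalSales :amt",
            ExpressionAttributeValues={":amt": amount},
        )
```

Using `ADD` in the update expression makes the increment atomic on the DynamoDB side, so concurrent Lambda invocations do not lose updates.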

Checkpoint Questions

  1. What is the difference between Kinesis Data Streams and Kinesis Data Firehose regarding data retention?
  2. If a Lambda function is throttled while consuming a Kinesis stream, what happens to the data records?
  3. Which Kinesis service is best suited for loading data into an Amazon Redshift cluster with minimal coding?
  4. What metric should you monitor to identify if your consumers are falling behind the producers in a Kinesis stream?
  5. Explain why a partition key with low cardinality (e.g., only two possible values) is a poor choice for a stream with 10 shards.

[!TIP] Exam Hack: When the question mentions "Real-time" with "Sub-second latency," think Kinesis Data Streams. If it mentions "Minutes of latency" or "Delivery to S3/Redshift," think Firehose.

[!WARNING] Always ensure your Lambda execution role has kinesis:GetRecords, kinesis:GetShardIterator, and kinesis:DescribeStream permissions when using it as a consumer.
