Mastering AWS Data Streaming Architectures
Designing data streaming architectures
This guide covers the design and implementation of real-time data streaming solutions on AWS, focusing on the Amazon Kinesis ecosystem, Managed Streaming for Apache Kafka (MSK), and key architectural trade-offs.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between Kinesis Data Streams, Firehose, and Video Streams.
- Calculate the required number of shards for a Kinesis Data Stream based on throughput requirements.
- Select the appropriate ingestion and transformation tools (KPL, Kinesis Agent, AWS Glue, Lambda).
- Evaluate when to use Amazon SQS versus Kinesis Data Streams for decoupling.
- Design cost-optimized and resilient streaming pipelines.
Key Terms & Glossary
- Producer: The application or device that sends data into a stream (e.g., Kinesis Agent, SDK).
- Consumer: An application or service that reads and processes data from a stream.
- Shard: The base throughput unit of a Kinesis Data Stream. It provides a fixed capacity for ingestion and egress.
- Fan-out: The ability for multiple consumers to read from the same stream concurrently without interfering with each other.
- Partition Key: A value used to group data by shard within a stream; it ensures all records with the same key go to the same shard.
- Sequence Number: A unique identifier assigned by Kinesis to each data record within a shard to maintain order.
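The producer-side terms above map directly onto the Kinesis `PutRecord` API. Below is a minimal sketch of building such a request in Python; the stream and device names are hypothetical, and the boto3 call is shown commented out so the snippet stays self-contained:

```python
import json

def build_put_record(device_id: str, payload: dict) -> dict:
    """Build keyword arguments for the Kinesis PutRecord API.

    All records sharing a PartitionKey hash to the same shard, so Kinesis
    can assign them increasing sequence numbers and preserve their order.
    """
    return {
        "StreamName": "sensor-stream",                 # hypothetical stream name
        "Data": json.dumps(payload).encode("utf-8"),   # record blob (max 1 MB)
        "PartitionKey": device_id,                     # same device -> same shard
    }

# With boto3 installed and credentials configured, a producer would send it:
# import boto3
# kinesis = boto3.client("kinesis", region_name="us-east-1")
# response = kinesis.put_record(**build_put_record("device-42", {"temp_c": 21.5}))
# print(response["SequenceNumber"])
```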
The "Big Idea"
[!IMPORTANT] Data streaming is about decoupling the speed of ingestion from the speed of processing. In a modern architecture, producers generate data at varying rates (bursty traffic); streaming services act as a durable buffer that allows multiple specialized consumers to analyze that data in real-time or near-real-time without losing a single record.
Formula / Concept Box
Kinesis Shard Capacity Math
| Action | Limit per Shard |
|---|---|
| Write (Ingest) | 1 MB/sec or 1,000 records/sec, whichever limit is reached first |
| Read (Egress) | 2 MB/sec or 5 read transactions/sec, shared across all consumers |
| Retention | 24 hours by default; extendable to 7 days (extended retention) or up to 365 days (long-term retention) |
To calculate required shards, take the largest requirement across the per-shard limits:

Shards = max( ⌈write MB/s ÷ 1⌉, ⌈write records/s ÷ 1,000⌉, ⌈read MB/s ÷ 2⌉ )
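The per-shard limits above translate into a shard count with a small helper. A sketch, using the write and read limits from the table:

```python
import math

def required_shards(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Shard count is the largest requirement across the three per-shard
    limits: 1 MB/s write, 1,000 records/s write, 2 MB/s read."""
    return max(
        math.ceil(write_mb_s / 1.0),
        math.ceil(write_records_s / 1000.0),
        math.ceil(read_mb_s / 2.0),
    )

print(required_shards(1.5, 500, 2.5))  # -> 2
```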
Hierarchical Outline
- I. Amazon Kinesis Ecosystem
- Kinesis Data Streams (KDS): Real-time, durable, and order-preserving (within a shard). Requires manual scaling of shards.
- Kinesis Data Firehose (KDF): Fully managed delivery service. No manual scaling; loads data into S3, Redshift, or OpenSearch. Supports Lambda transformation.
- Kinesis Video Streams (KVS): Specialized for time-indexed binary data like video, audio, and RADAR imagery.
- II. Data Ingestion & Transformation
- Ingestion Tools: Kinesis Producer Library (KPL), Kinesis Agent (Linux-based logs), and Kinesis SDK.
- Transformation: AWS Glue (ETL) and Lambda (used by Firehose for inline record transformation like JSON to Parquet).
- III. Strategic Comparison
- Kinesis vs. SQS: SQS suits task-based decoupling, where each message is processed by a single consumer and deleted on completion. Kinesis retains records so multiple consumers can read and replay the same stream.
- Kinesis vs. MSK: Use MSK for managed Apache Kafka compatibility; use Kinesis for native AWS integration and simpler operational overhead.
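The inline Lambda transformation mentioned under section II follows a fixed contract: Firehose hands the function base64-encoded records, and each record must come back with its original recordId, a result status, and re-encoded data. A minimal sketch, where the enrichment step is purely illustrative:

```python
import base64
import json

def handler(event, context):
    """Firehose record-transformation Lambda: decode, transform, re-encode."""
    output = []
    for record in event["records"]:
        entry = json.loads(base64.b64decode(record["data"]))
        entry["processed"] = True        # illustrative enrichment step
        output.append({
            "recordId": record["recordId"],          # must echo the original id
            "result": "Ok",                          # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(entry) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```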
Visual Anchors
Kinesis Data Firehose Workflow
Shard Architecture Visualization
\begin{tikzpicture}[node distance=2cm]
\draw[thick] (0,0) rectangle (6,3);
\node at (3,3.3) {\textbf{Kinesis Data Stream}};
% Shard 1
\draw[fill=blue!10] (0.5,0.5) rectangle (2.5,2.5);
\node at (1.5,1.5) {Shard 1};
\node[scale=0.7] at (1.5,1) {1MB/s In | 2MB/s Out};
% Shard 2
\draw[fill=blue!10] (3.5,0.5) rectangle (5.5,2.5);
\node at (4.5,1.5) {Shard 2};
\node[scale=0.7] at (4.5,1) {1MB/s In | 2MB/s Out};
% Arrows
\draw[->, thick] (-1.5,1.5) -- (0.5,1.5) node[midway, above] {Ingest};
\draw[->, thick] (5.5,1.5) -- (7.5,1.5) node[midway, above] {Consume};
\end{tikzpicture}
Definition-Example Pairs
- Fan-out: The ability for one stream to support multiple independent applications.
- Example: A web server logs stream is read by a Security App (detecting SQL injection) and an Analytics App (tracking user clicks) simultaneously.
- Data Transformation: Converting data formats for better storage or analysis.
- Example: Using Kinesis Firehose + Lambda to convert high-volume JSON logs into Apache Parquet format to save 80% on storage costs in Amazon S3.
- Kinesis Agent: A standalone Java application that monitors files.
- Example: Installing the agent on an EC2 instance running NGINX to automatically push access.log updates to a Kinesis stream without writing code.
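The agent reads its flows from /etc/aws-kinesis/agent.json. A minimal configuration for the NGINX example above; the stream name and log path are illustrative:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/nginx/access.log*",
      "kinesisStream": "web-logs-stream"
    }
  ]
}
```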
Worked Examples
Example 1: Calculating Shards
Scenario: A fleet of IoT devices produces 1.5 MB/sec of data. You have one consumer application that requires 2.5 MB/sec of read throughput.
- Write Requirement: 1.5 MB/s. Since one shard supports 1 MB/s of ingest, you need ⌈1.5 ÷ 1⌉ = 2 shards.
- Read Requirement: 2.5 MB/s. Since one shard supports 2 MB/s of egress, you need ⌈2.5 ÷ 2⌉ = 2 shards.
- Result: You must provision 2 shards.
Example 2: SQS vs. Kinesis
Scenario: You need to process credit card transactions. Each transaction must be processed exactly once by a single billing worker. If a worker fails, the message should stay in the queue.
- Choice: Amazon SQS.
- Reason: SQS is designed for task-based decoupling where a single consumer processes and deletes a message. Kinesis is better for tracking a continuous stream for multiple analytics purposes.
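The single-consumer semantics behind this choice can be sketched as a receive-process-delete loop. The client is passed in, so any boto3 SQS client would work; the billing callback is hypothetical:

```python
def drain_queue(sqs, queue_url: str, handle_message) -> None:
    """Receive-process-delete loop: exactly one worker handles each message.

    delete_message runs only after handle_message succeeds; if the worker
    crashes first, the message reappears after the visibility timeout and
    another worker retries it.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,   # long polling reduces empty responses
    )
    for msg in response.get("Messages", []):
        handle_message(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

# Usage with a real client (requires boto3 and AWS credentials):
# import boto3
# sqs = boto3.client("sqs", region_name="us-east-1")
# drain_queue(sqs,
#             "https://sqs.us-east-1.amazonaws.com/123456789012/billing-queue",
#             handle_message=process_transaction)  # process_transaction is hypothetical
```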
Checkpoint Questions
- What is the maximum data rate per second for writing to a single Kinesis shard?
- Which service would you use to automatically deliver streaming data to Amazon Redshift with zero coding required for ingestion?
- True or False: Kinesis Data Streams are time-indexed, similar to Video Streams.
- How can you extend the retention period of Kinesis data beyond 7 days if required for compliance?
- What tool allows you to convert CSV data to Parquet within a Kinesis Firehose delivery stream?
Answers
- 1 MB per second.
- Kinesis Data Firehose.
- False (KDS is indexed by partition key and sequence number; KVS is time-indexed).
- Increase the stream's retention period (configurable up to 365 days), or use Kinesis Data Firehose to deliver the data to an S3 bucket for long-term archival.
- AWS Lambda (via Firehose transformation) or AWS Glue.