Mastering AWS Data Streaming Architectures
Designing data streaming architectures
This guide covers the design and implementation of real-time data streaming solutions on AWS, focusing on the Amazon Kinesis ecosystem, Managed Streaming for Apache Kafka (MSK), and key architectural trade-offs.
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between Kinesis Data Streams, Firehose, and Video Streams.
- Calculate the required number of shards for a Kinesis Data Stream based on throughput requirements.
- Select the appropriate ingestion and transformation tools (KPL, Kinesis Agent, AWS Glue, Lambda).
- Evaluate when to use Amazon SQS versus Kinesis Data Streams for decoupling.
- Design cost-optimized and resilient streaming pipelines.
Key Terms & Glossary
- Producer: The application or device that sends data into a stream (e.g., Kinesis Agent, SDK).
- Consumer: An application or service that reads and processes data from a stream.
- Shard: The base throughput unit of a Kinesis Data Stream. It provides a fixed capacity for ingestion and egress.
- Fan-out: The ability for multiple consumers to read from the same stream concurrently without interfering with each other.
- Partition Key: A value used to group data by shard within a stream; it ensures all records with the same key go to the same shard.
- Sequence Number: A unique identifier assigned by Kinesis to each data record within a shard to maintain order.
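The producer-side terms above map directly onto the Kinesis `PutRecord` API. Below is a minimal sketch of building such a request in Python; the stream and device names are hypothetical, and the boto3 call is shown commented out so the snippet stays self-contained:

```python
import json

def build_put_record(device_id: str, payload: dict) -> dict:
    """Build keyword arguments for the Kinesis PutRecord API.

    All records sharing a PartitionKey hash to the same shard, so Kinesis
    can assign them increasing sequence numbers and preserve their order.
    """
    return {
        "StreamName": "sensor-stream",                 # hypothetical stream name
        "Data": json.dumps(payload).encode("utf-8"),   # record blob (max 1 MB)
        "PartitionKey": device_id,                     # same device -> same shard
    }

# With boto3 installed and credentials configured, a producer would send it:
# import boto3
# kinesis = boto3.client("kinesis", region_name="us-east-1")
# response = kinesis.put_record(**build_put_record("device-42", {"temp_c": 21.5}))
# print(response["SequenceNumber"])
```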
The "Big Idea"
[!IMPORTANT] Data streaming is about decoupling the speed of ingestion from the speed of processing. In a modern architecture, producers generate data at varying rates (bursty traffic); streaming services act as a durable buffer that allows multiple specialized consumers to analyze that data in real-time or near-real-time without losing a single record.
Formula / Concept Box
Kinesis Shard Capacity Math
| Action | Limit per Shard |
|---|---|
| Write (Ingest) | 1 MB/sec or 1,000 records/sec, whichever limit is reached first |
| Read (Egress) | 2 MB/sec or 5 read transactions/sec, shared across all consumers |
| Retention | 24 hours by default; extendable to 7 days (extended retention) or up to 365 days (long-term retention) |
To calculate required shards, take the largest requirement across the per-shard limits:

Shards = max( ⌈write MB/s ÷ 1⌉, ⌈write records/s ÷ 1,000⌉, ⌈read MB/s ÷ 2⌉ )
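The per-shard limits above translate into a shard count with a small helper. A sketch, using the write and read limits from the table:

```python
import math

def required_shards(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Shard count is the largest requirement across the three per-shard
    limits: 1 MB/s write, 1,000 records/s write, 2 MB/s read."""
    return max(
        math.ceil(write_mb_s / 1.0),
        math.ceil(write_records_s / 1000.0),
        math.ceil(read_mb_s / 2.0),
    )

print(required_shards(1.5, 500, 2.5))  # -> 2
```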
Hierarchical Outline
- I. Amazon Kinesis Ecosystem
- Kinesis Data Streams (KDS): Real-time, durable, and order-preserving (within a shard). Requires manual scaling of shards.
- Kinesis Data Firehose (KDF): Fully managed delivery service. No manual scaling; loads data into S3, Redshift, or OpenSearch. Supports Lambda transformation.
- Kinesis Video Streams (KVS): Specialized for time-indexed binary data like video, audio, and RADAR imagery.
- II. Data Ingestion & Transformation
- Ingestion Tools: Kinesis Producer Library (KPL), Kinesis Agent (Linux-based logs), and Kinesis SDK.
- Transformation: AWS Glue (ETL) and Lambda (used by Firehose for inline record transformation like JSON to Parquet).
- III. Strategic Comparison
- Kinesis vs. SQS: SQS suits task-based decoupling, where each message is processed by a single consumer and deleted on completion. Kinesis retains records so multiple consumers can read and replay the same stream.
- Kinesis vs. MSK: Use MSK for managed Apache Kafka compatibility; use Kinesis for native AWS integration and simpler operational overhead.
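The inline Lambda transformation mentioned under section II follows a fixed contract: Firehose hands the function base64-encoded records, and each record must come back with its original recordId, a result status, and re-encoded data. A minimal sketch, where the enrichment step is purely illustrative:

```python
import base64
import json

def handler(event, context):
    """Firehose record-transformation Lambda: decode, transform, re-encode."""
    output = []
    for record in event["records"]:
        entry = json.loads(base64.b64decode(record["data"]))
        entry["processed"] = True        # illustrative enrichment step
        output.append({
            "recordId": record["recordId"],          # must echo the original id
            "result": "Ok",                          # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(entry) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```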
Visual Anchors
Kinesis Data Firehose Workflow
Shard Architecture Visualization
\begin{tikzpicture}[node distance=2cm]
\draw[thick] (0,0) rectangle (6,3);
\node at (3,3.3) {\textbf{Kinesis Data Stream}};
% Shard 1
\draw[fill=blue!10] (0.5,0.5) rectangle (2.5,2.5);
\node at (1.5,1.5) {Shard 1};
\node[scale=0.7] at (1.5,1) {1MB/s In | 2MB/s Out};
% Shard 2
\draw[fill=blue!10] (3.5,0.5) rectangle (5.5,2.5);
\node at (4.5,1.5) {Shard 2};
\node[scale=0.7] at (4.5,1) {1MB/s In | 2MB/s Out};
% Arrows
\draw[->, thick] (-1.5,1.5) -- (0.5,1.5) node[midway, above] {Ingest};
\draw[->, thick] (5.5,1.5) -- (7.5,1.5) node[midway, above] {Consume};
\end{tikzpicture}
Definition-Example Pairs
- Fan-out: The ability for one stream to support multiple independent applications.
- Example: A web server logs stream is read by a Security App (detecting SQL injection) and an Analytics App (tracking user clicks) simultaneously.
- Data Transformation: Converting data formats for better storage or analysis.
- Example: Using Kinesis Firehose + Lambda to convert high-volume JSON logs into Apache Parquet format to save 80% on storage costs in Amazon S3.
- Kinesis Agent: A standalone Java application that monitors files.
- Example: Installing the agent on an EC2 instance running NGINX to automatically push access.log updates to a Kinesis stream without writing code.
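The agent reads its flows from /etc/aws-kinesis/agent.json. A minimal configuration for the NGINX example above; the stream name and log path are illustrative:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/nginx/access.log*",
      "kinesisStream": "web-logs-stream"
    }
  ]
}
```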
Worked Examples
Example 1: Calculating Shards
Scenario: A fleet of IoT devices produces 1.5 MB/sec of data. You have one consumer application that requires 2.5 MB/sec of read throughput.
- Write Requirement: 1.5 MB/s. Since one shard supports 1 MB/s of ingest, you need ⌈1.5 ÷ 1⌉ = 2 shards.
- Read Requirement: 2.5 MB/s. Since one shard supports 2 MB/s of egress, you need ⌈2.5 ÷ 2⌉ = 2 shards.
- Result: You must provision 2 shards.
Example 2: SQS vs. Kinesis
Scenario: You need to process credit card transactions. Each transaction must be processed exactly once by a single billing worker. If a worker fails, the message should stay in the queue.
- Choice: Amazon SQS.
- Reason: SQS is designed for task-based decoupling where a single consumer processes and deletes a message. Kinesis is better for tracking a continuous stream for multiple analytics purposes.
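The single-consumer semantics behind this choice can be sketched as a receive-process-delete loop. The client is passed in, so any boto3 SQS client would work; the billing callback is hypothetical:

```python
def drain_queue(sqs, queue_url: str, handle_message) -> None:
    """Receive-process-delete loop: exactly one worker handles each message.

    delete_message runs only after handle_message succeeds; if the worker
    crashes first, the message reappears after the visibility timeout and
    another worker retries it.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,   # long polling reduces empty responses
    )
    for msg in response.get("Messages", []):
        handle_message(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])

# Usage with a real client (requires boto3 and AWS credentials):
# import boto3
# sqs = boto3.client("sqs", region_name="us-east-1")
# drain_queue(sqs,
#             "https://sqs.us-east-1.amazonaws.com/123456789012/billing-queue",
#             handle_message=process_transaction)  # process_transaction is hypothetical
```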
Checkpoint Questions
- What is the maximum data rate per second for writing to a single Kinesis shard?
- Which service would you use to automatically deliver streaming data to Amazon Redshift with zero coding required for ingestion?
- True or False: Kinesis Data Streams are time-indexed, similar to Video Streams.
- How can you extend the retention period of Kinesis data beyond 7 days if required for compliance?
- What tool allows you to convert CSV data to Parquet within a Kinesis Firehose delivery stream?
Answers
- 1 MB per second.
- Kinesis Data Firehose.
- False (KDS is indexed by partition key and sequence number; KVS is time-indexed).
- Increase the stream's retention period (configurable up to 365 days), or use Kinesis Data Firehose to deliver the data to an S3 bucket for long-term archival.
- AWS Lambda (via Firehose transformation) or AWS Glue.