
Mastering AWS Data Streaming Architectures

Designing data streaming architectures

This guide covers the design and implementation of real-time data streaming solutions on AWS, focusing on the Amazon Kinesis ecosystem, Managed Streaming for Apache Kafka (MSK), and key architectural trade-offs.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between Kinesis Data Streams, Firehose, and Video Streams.
  • Calculate the required number of shards for a Kinesis Data Stream based on throughput requirements.
  • Select the appropriate ingestion and transformation tools (KPL, Kinesis Agent, AWS Glue, Lambda).
  • Evaluate when to use Amazon SQS versus Kinesis Data Streams for decoupling.
  • Design cost-optimized and resilient streaming pipelines.

Key Terms & Glossary

  • Producer: The application or device that sends data into a stream (e.g., Kinesis Agent, SDK).
  • Consumer: An application or service that reads and processes data from a stream.
  • Shard: The base throughput unit of a Kinesis Data Stream. It provides a fixed capacity for ingestion and egress.
  • Fan-out: The ability for multiple consumers to read from the same stream concurrently without interfering with each other.
  • Partition Key: A value used to group data by shard within a stream; it ensures all records with the same key go to the same shard.
  • Sequence Number: A unique identifier assigned by Kinesis to each data record within a shard to maintain order.
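The partition-key routing described above can be sketched in a few lines of Python. Kinesis MD5-hashes the partition key into a 128-bit integer and routes the record to the shard whose hash-key range contains it; the equal-size ranges below are an assumption that holds for a freshly created stream (resharding can produce uneven ranges), and `shard_for_key` is a hypothetical helper, not an AWS API:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index the way Kinesis does:
    MD5-hash the key to a 128-bit integer, then locate the shard
    whose hash-key range contains it (assuming equal ranges)."""
    hash_int = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    range_size = 2 ** 128 // num_shards
    return min(hash_int // range_size, num_shards - 1)

# The mapping is deterministic: every record with the same key lands
# on the same shard, which is what preserves per-key ordering.
```

Because the hash is deterministic, choosing a low-cardinality partition key (e.g., one device ID for all traffic) concentrates load on a single shard; a high-cardinality key spreads load evenly.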

The "Big Idea"

[!IMPORTANT] Data streaming is about decoupling the speed of ingestion from the speed of processing. In a modern architecture, producers generate data at varying rates (bursty traffic); streaming services act as a durable buffer that allows multiple specialized consumers to analyze that data in real-time or near-real-time without losing a single record.

Formula / Concept Box

Kinesis Shard Capacity Math

| Action | Limit per Shard |
| --- | --- |
| Write (Ingest) | 1 MB/sec or 1,000 records/sec |
| Read (Egress) | 2 MB/sec or 5 read transactions/sec |
| Retention | Default 24 hours; extendable up to 365 days |

To calculate required shards:

$$\text{Number of Shards} = \max\left( \left\lceil \frac{\text{Incoming Write Bandwidth (MB/s)}}{1} \right\rceil,\ \left\lceil \frac{\text{Outgoing Read Bandwidth (MB/s)}}{2} \right\rceil \right)$$
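As a sketch, the shard formula translates directly to Python. Note it considers only bandwidth and ignores the 1,000 records/sec write limit, which can also force extra shards when records are small; `required_shards` is a hypothetical helper name:

```python
import math

def required_shards(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Minimum shard count for a Kinesis Data Stream, given the
    per-shard limits of 1 MB/s write and 2 MB/s read."""
    write_shards = math.ceil(write_mb_per_s / 1)  # 1 MB/s ingest per shard
    read_shards = math.ceil(read_mb_per_s / 2)    # 2 MB/s egress per shard
    return max(write_shards, read_shards)
```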

Hierarchical Outline

  • I. Amazon Kinesis Ecosystem
    • Kinesis Data Streams (KDS): Real-time, durable, and order-preserving (within a shard). Requires manual scaling of shards.
    • Kinesis Data Firehose (KDF): Fully managed delivery service. No manual scaling; loads data into S3, Redshift, or OpenSearch. Supports Lambda transformation.
    • Kinesis Video Streams (KVS): Specialized for time-indexed binary data like video, audio, and RADAR imagery.
  • II. Data Ingestion & Transformation
    • Ingestion Tools: Kinesis Producer Library (KPL), Kinesis Agent (Linux-based logs), and Kinesis SDK.
    • Transformation: AWS Glue (ETL) and Lambda (used by Firehose for inline record transformation like JSON to Parquet).
  • III. Strategic Comparison
    • Kinesis vs. SQS: SQS is for small, short-lived messages that a single consumer processes and then deletes. Kinesis is for high-throughput streams that multiple consumers can read and replay.
    • Kinesis vs. MSK: Use MSK for managed Apache Kafka compatibility; use Kinesis for native AWS integration and simpler operational overhead.

Visual Anchors

Kinesis Data Firehose Workflow

Producers → Kinesis Data Firehose → (optional) Lambda transformation → destination (S3, Redshift, or OpenSearch).

Shard Architecture Visualization

```latex
\begin{tikzpicture}[node distance=2cm]
  \draw[thick] (0,0) rectangle (6,3);
  \node at (3,3.3) {\textbf{Kinesis Data Stream}};
  % Shard 1
  \draw[fill=blue!10] (0.5,0.5) rectangle (2.5,2.5);
  \node at (1.5,1.5) {Shard 1};
  \node[scale=0.7] at (1.5,1) {1MB/s In | 2MB/s Out};
  % Shard 2
  \draw[fill=blue!10] (3.5,0.5) rectangle (5.5,2.5);
  \node at (4.5,1.5) {Shard 2};
  \node[scale=0.7] at (4.5,1) {1MB/s In | 2MB/s Out};
  % Arrows
  \draw[->, thick] (-1.5,1.5) -- (0.5,1.5) node[midway, above] {Ingest};
  \draw[->, thick] (5.5,1.5) -- (7.5,1.5) node[midway, above] {Consume};
\end{tikzpicture}
```

Definition-Example Pairs

  • Fan-out: The ability for one stream to support multiple independent applications.
    • Example: A web server logs stream is read by a Security App (detecting SQL injection) and an Analytics App (tracking user clicks) simultaneously.
  • Data Transformation: Converting data formats for better storage or analysis.
    • Example: Using Kinesis Firehose + Lambda to convert high-volume JSON logs into Apache Parquet format to save 80% on storage costs in Amazon S3.
  • Kinesis Agent: A standalone Java application that monitors files.
    • Example: Installing the agent on an EC2 instance running NGINX to automatically push access.log updates to a Kinesis Stream without writing code.
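The data-transformation pattern above can be sketched as a Firehose transformation Lambda. The event and response shapes (a batch of records with `recordId` and base64 `data`, each returned with a `result` status) follow the documented Firehose-Lambda contract; the specific field trimming is a hypothetical example, not a prescribed transformation:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose invokes this with a batch of records; each record must be
    returned with its original recordId, a result status, and base64 data."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Hypothetical transformation: keep only the fields we analyze and
        # append a newline so records stay line-delimited in S3.
        slim = {"ts": payload.get("timestamp"), "path": payload.get("request_path")}
        data = (json.dumps(slim) + "\n").encode()
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(data).decode(),
        })
    return {"records": output}
```

The handler can be exercised locally with a synthetic event before wiring it to a delivery stream, since it has no AWS dependencies.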

Worked Examples

Example 1: Calculating Shards

Scenario: A fleet of IoT devices produces 1.5 MB/sec of data. You have one consumer application that requires 2.5 MB/sec of read throughput.

  1. Write Requirement: 1.5 MB/s. Since one shard supports 1 MB/s of writes, you need $\lceil 1.5 / 1 \rceil = 2$ shards.
  2. Read Requirement: 2.5 MB/s. Since one shard supports 2 MB/s of reads, you need $\lceil 2.5 / 2 \rceil = 2$ shards.
  3. Result: You must provision 2 shards.

Example 2: SQS vs. Kinesis

Scenario: You need to process credit card transactions. Each transaction must be processed exactly once by a single billing worker. If a worker fails, the message should stay in the queue.

  • Choice: Amazon SQS.
  • Reason: SQS is designed for task-based decoupling where a single consumer processes and deletes a message. Kinesis is better for tracking a continuous stream for multiple analytics purposes.

Checkpoint Questions

  1. What is the maximum data rate per second for writing to a single Kinesis shard?
  2. Which service would you use to automatically deliver streaming data to Amazon Redshift with zero coding required for ingestion?
  3. True or False: Kinesis Data Streams are time-indexed, similar to Video Streams.
  4. How can you retain Kinesis data beyond the stream's maximum retention period if required for compliance?
  5. What tool allows you to convert CSV data to Parquet within a Kinesis Firehose delivery stream?
Click to see answers
  1. 1 MB per second.
  2. Kinesis Data Firehose.
  3. False (KDS is indexed by partition key and sequence number; KVS is time-indexed).
  4. Use Kinesis Data Firehose to dump the data into an S3 bucket for long-term archival.
  5. AWS Lambda (via Firehose transformation) or AWS Glue.
