Study Guide985 words

Managing Fan-In and Fan-Out for Streaming Data Distribution

Manage fan-in and fan-out for streaming data distribution

Managing Fan-In and Fan-Out for Streaming Data Distribution

This guide covers the architectural patterns and AWS service configurations required to handle high-velocity streaming data, focusing on aggregating data (Fan-In) and distributing it efficiently (Fan-Out).

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between Shared Throughput and Enhanced Fan-Out in Kinesis Data Streams.
  • Architect solutions for high-velocity Fan-In from thousands of producers.
  • Configure Amazon Data Firehose and AWS Lambda for multi-destination distribution.
  • Identify throughput limits and performance bottlenecks in streaming pipelines.

Key Terms & Glossary

  • Fan-In: The process of aggregating data from multiple producers (e.g., IoT devices, web servers) into a single stream or entry point.
  • Fan-Out: The process of distributing data from a single stream to multiple independent consumers or downstream destinations.
  • Shard: The base throughput unit of an Amazon Kinesis data stream. One shard provides a fixed capacity (1 MB/s ingest, 2 MB/s egress).
  • KCL (Kinesis Client Library): A Java library that helps you build consumer applications for Kinesis Data Streams.
  • Provisioned Throughput: A mode where you specify the number of shards, defining the stream's total capacity.

The "Big Idea"

In modern data engineering, data is rarely linear. Fan-In solves the "bottleneck" problem by providing a durable entry point for thousands of simultaneous writers. Fan-Out solves the "contention" problem by ensuring that one slow consumer (like a batch process) doesn't prevent a fast consumer (like a real-time dashboard) from receiving data. Success depends on balancing shard density against latency requirements.

Formula / Concept Box

Limit CategoryMetricLimit Value
Shard IngestMax Throughput$1 MB/sec$
Shard IngestMax Records$1,000 \text{ records/sec}$
Standard EgressMax Throughput$2 MB/sec$ (Shared)
Enhanced EgressMax Throughput$2 \text{ MB/sec}$ (Dedicated per consumer)
Record SizeMax Payload$1 MB$

Hierarchical Outline

  • I. Fan-In Strategies (Producers)
    • Kinesis Producer Library (KPL): Uses Aggregation and Batching to increase throughput.
    • AWS SDK: Simple PutRecord or PutRecords calls for low-volume producers.
    • Agent-based: Kinesis Agent for Linux/Windows to stream log files automatically.
  • II. Fan-Out Strategies (Consumers)
    • Shared Throughput (Standard): Multiple consumers share the $2 \text{ MB/s}$ limit; use for low-latency non-critical apps.
    • Enhanced Fan-Out (EFO): Dedicated $2 MB/s$ pipe per consumer using HTTP/2 Push.
    • Lambda Consumer: Automatically scales based on the number of shards and batch settings.
  • III. Delivery and Persistence
    • Amazon Data Firehose: Zero-code fan-out to S3, Redshift, and OpenSearch.
    • Dynamic Partitioning: Firehose capability to route data to different S3 prefixes based on content.

Visual Anchors

Data Distribution Flow

Loading Diagram...

Enhanced Fan-Out Architecture

\begin{tikzpicture}[node distance=2cm, every node/.style={draw, rectangle, rounded corners, minimum height=1cm, align=center}] \node (shard) [fill=orange!20] {\textbf{Kinesis Shard}\2 MB/s Total Read}; \node (efo1) [right=of shard, xshift=1cm, fill=blue!10] {EFO Consumer 1$Dedicated 2 MB/s)}; \node (efo2) [below=of efo1, fill=blue!10] {EFO Consumer 2$Dedicated 2 MB/s)}; \node (std) [above=of efo1, fill=gray!10] {Standard Consumers$Sharing 2 MB/s)};

code
\draw[->, thick] (shard) -- (std) node[midway, above, sloped] {Pull}; \draw[->, thick, blue] (shard) -- (efo1) node[midway, above, sloped] {HTTP/2 Push}; \draw[->, thick, blue] (shard) -- (efo2) node[midway, below, sloped] {HTTP/2 Push};

\end{tikzpicture}

Definition-Example Pairs

  • Enhanced Fan-Out (EFO)
    • Definition: A Kinesis feature providing a dedicated $2 \text{ MB/s}$ throughput pipe for a registered consumer application.
    • Example: A financial trading app needs sub-100ms latency to update prices, while an archival process runs in the background. EFO ensures the archival process doesn't slow down the price updates.
  • Dynamic Partitioning
    • Definition: The ability for Amazon Data Firehose to parse incoming records and route them to specific destinations or S3 folders based on keys within the data.
    • Example: Streaming IoT data from multiple regions into one Firehose stream, which then automatically saves data into s3://bucket/region=US/ or s3://bucket/region=EU/ based on a "region" field in the JSON.

Comparison Tables

FeatureStandard ConsumerEnhanced Fan-Out (EFO)
Throughput$2 MB/s$ Shared across all
Latency~200ms - 1s (Pull model)~70ms (Push model)
ProtocolGetRecords (HTTP)SubscribeToShard (HTTP/2)
CostIncluded in Shard priceAdditional cost per GB and per Consumer-hour
LimitMax 5 GetRecords calls/secMax 20 consumers per stream (soft)

Worked Examples

Problem: Shard Calculation for Fan-Out

A data engineer has a Kinesis Data Stream with 10 shards. There are currently 3 standard consumers. The business wants to add 2 more standard consumers. Will this architecture work if each consumer requires $0.5 \text{ MB/s}$ per shard?

Step 1: Calculate total required egress per shard. 3 existing+2 new=5 total consumers3 \text{ existing} + 2 \text{ new} = 5 \text{ total consumers} 5 consumers×0.5 MB/s=2.5 MB/s required per shard5 \text{ consumers} \times 0.5 \text{ MB/s} = 2.5 \text{ MB/s required per shard}

Step 2: Compare against shard limits. A standard shard supports $2 MB/s$ of total read throughput. 2.5 MB/s (Required)>2.0 MB/s (Limit)2.5 \text{ MB/s (Required)} > 2.0 \text{ MB/s (Limit)}

Solution: This will cause ProvisionedThroughputExceeded errors. The engineer must either:

  1. Increase Shard Count: Double shards to 20, reducing per-shard consumer load.
  2. Use EFO: Convert the consumers to Enhanced Fan-Out to provide each with a dedicated $2 \text{ MB/s}$.

Checkpoint Questions

  1. What is the maximum number of records per second a single Kinesis shard can ingest?
  2. Which protocol does Enhanced Fan-out use to achieve lower latency than standard consumers?
  3. True/False: Amazon Data Firehose can deliver data to both S3 and a custom HTTP endpoint simultaneously from the same delivery stream.
  4. If you have 3 consumers on a standard Kinesis stream, how many GetRecords calls per second can each consumer make per shard without being throttled?
Click to see answers
  1. 1,000 records/second.
  2. HTTP/2 (using the SubscribeToShard API).
  3. True (Firehose supports multiple destinations and fanning out transformed/raw data).
  4. 1.6 calls/sec (5 total calls per shard / 3 consumers).

Muddy Points & Cross-Refs

  • KCL 1.x vs 2.x: Only KCL 2.x supports Enhanced Fan-out. If you are troubleshooting a legacy app that can't use EFO, check the KCL version.
  • Lambda Concurrency: Even with EFO, Lambda is subject to its own concurrency limits. If Lambda is fanning out to RDS, the database connection pool is usually the bottleneck, not the stream.
  • Replayability: Unlike SQS, Kinesis allows "replaying" data because records stay in the stream for the retention period (24 hours to 365 days).

Ready to study AWS Certified Data Engineer - Associate (DEA-C01)?

Practice tests, flashcards, and all study notes — free, no sign-up needed.

Start Studying — Free