Selecting Appropriate Ingestion Configurations

This guide covers the critical decision-making process for ingesting data into AWS. It focuses on selecting the right service based on data type (streaming vs. batch), scale, and processing requirements, aligned with the AWS Certified Solutions Architect - Associate (SAA-C03) exam.

Learning Objectives

Distinguish between Kinesis Data Streams and Kinesis Data Firehose for various use cases.
Evaluate the trade-offs between Amazon SQS and Amazon Kinesis for data ingestion.
Identify appropriate tools for on-premises to cloud data migration (DataSync vs. Storage Gateway).
Understand the role of AWS Glue and Lake Formation in creating structured data lakes.

Key Terms & Glossary

Ingestion: The process of collecting and moving data from various sources into a storage or processing system (e.g., S3 or Redshift).
Shard: The base unit of throughput capacity in Kinesis Data Streams. Each shard provides a fixed amount of capacity: writes of up to 1 MB/sec or 1,000 records/sec, and reads of up to 2 MB/sec.
Fan-out: The ability for multiple consumers to read from the same data stream concurrently.
Producer: An application or device that sends data to an ingestion service.
Consumer: A service or application that processes data delivered by an ingestion service.
ETL (Extract, Transform, Load): A three-step process where data is taken from a source, changed into a suitable format, and placed in a destination.

The "Big Idea"

Data ingestion is the "front door" of any data architecture. Choosing the wrong configuration leads to bottlenecks, data loss, or excessive costs. The fundamental choice is a trade-off between latency and management overhead: do you need sub-second, real-time processing with your own consumers (Kinesis Data Streams), or a fully managed delivery service that can transform data before it lands in its destination (Kinesis Data Firehose)?

Formula / Concept Box

Feature	Kinesis Data Streams (KDS)	Kinesis Data Firehose (KDF)
Management	Provisioned (manual sharding) or on-demand	Fully managed (automatic scaling)
Latency	Real-time (< 200ms)	Near real-time (60s buffer minimum)
Data Retention	24 hours to 365 days	No retention (transient)
Consumers	Multiple (Fan-out)	Single Destination
Transformation	Requires custom code	Integrated via AWS Lambda
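
The "Integrated via AWS Lambda" entry can be made concrete with a minimal sketch of a Firehose transformation function. Firehose passes the function a batch of base64-encoded records and expects each one back with the same `recordId` and a `result` status. The JSON payload shape and the `level` and `processed` fields are assumptions made for illustration only.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and hand it back for delivery."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical transformation: normalise one field and tag the record.
            payload["level"] = str(payload.get("level", "info")).upper()
            payload["processed"] = True
            data = json.dumps(payload) + "\n"  # newline-delimited JSON
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, AttributeError, TypeError):
            # Unparseable input: return the original data marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records returned as `ProcessingFailed` are treated by Firehose as unsuccessfully processed, so one malformed record does not block the rest of the batch.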

[!IMPORTANT] Shard Limits (per shard):

Write: up to 1,000 records/sec or 1 MB/sec, whichever limit is reached first.

Read: up to 5 read transactions/sec and 2 MB/sec, shared by all standard consumers of the shard.
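
As a rough sizing aid, the per-shard limits can be turned into a shard-count estimate. The sketch below is a back-of-the-envelope calculation, not an official sizing tool; the function name and parameters are hypothetical, and it assumes standard consumers that share each shard's 2 MB/sec read budget.

```python
import math

# Per-shard limits as listed in the box above.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(records_per_sec, avg_record_kb, consumers=1):
    """Estimate the minimum shard count for a provisioned stream."""
    ingress_mb = records_per_sec * avg_record_kb / 1024
    by_write_bytes = math.ceil(ingress_mb / WRITE_MB_PER_SEC)
    by_write_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    # Assumption: standard consumers split the per-shard read budget,
    # so every extra consumer multiplies the read volume.
    by_read_bytes = math.ceil(ingress_mb * consumers / READ_MB_PER_SEC)
    return max(1, by_write_bytes, by_write_count, by_read_bytes)
```

For example, `shards_needed(5000, 1)` returns 5 because the write limits dominate, while `shards_needed(500, 4, consumers=3)` returns 3 because three consumers reading the same data exhaust the read budget first.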

Hierarchical Outline

Streaming Data Ingestion
- Amazon Kinesis Data Streams: Used for custom real-time applications; requires shard management when run in provisioned mode.
- Amazon Kinesis Data Firehose: Simple loading of streaming data into S3, Redshift, OpenSearch, or Splunk.
- Amazon Kinesis Video Streams: Specifically for binary/video data ingestion for ML/analytics.
Hybrid and Bulk Ingestion
- AWS DataSync: Fast, scheduled data transfer for large-scale migrations from on-premises storage to Amazon S3, EFS, or FSx.
- AWS Storage Gateway: Hybrid storage that lets on-premises applications keep using AWS storage through standard protocols (iSCSI, NFS, SMB).
Managed Data Lakes
- AWS Lake Formation: Simplifies the setup of a secure data lake; orchestrates AWS Glue for ingestion and ETL.
- AWS Glue: Serverless ETL service; crawlers discover and classify data into the Glue Data Catalog, and ETL jobs clean and transform it.
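
The outline above can be condensed into a small decision helper for revision purposes. This is a study-aid sketch only: the function name, the `source` labels, and the two flags are hypothetical, and real designs weigh more factors (cost, retention, ordering) than these few branches.

```python
def pick_ingestion_service(source, needs_custom_realtime=False,
                           ongoing_hybrid_access=False):
    """Map a workload description to a service from the outline above.

    `source` is one of "stream", "video", or "on_prem_files"
    (hypothetical labels chosen for this sketch).
    """
    if source == "video":
        return "Kinesis Video Streams"
    if source == "stream":
        # Custom real-time consumers point to Data Streams;
        # simple loading into a destination points to Firehose.
        if needs_custom_realtime:
            return "Kinesis Data Streams"
        return "Kinesis Data Firehose"
    if source == "on_prem_files":
        # Ongoing on-premises access to AWS storage vs. bulk migration.
        if ongoing_hybrid_access:
            return "Storage Gateway"
        return "DataSync"
    raise ValueError(f"unknown source type: {source!r}")
```

Calling `pick_ingestion_service("stream", needs_custom_realtime=True)` returns "Kinesis Data Streams", while the same call without the flag returns "Kinesis Data Firehose".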

Visual Anchors

Figure 1: Ingestion Decision Flow (Mermaid diagram, not rendered)

Figure 2: Shard Architecture Visualization (TikZ diagram, not rendered)
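
To make the shard concept concrete: Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains that value. The sketch below illustrates the idea under one simplifying assumption, namely that the shards split the hash space into equal contiguous ranges; after resharding, real ranges can be uneven. The function name is hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Return the index of the shard whose hash range contains the key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    # Assumption: shard i owns the i-th equal slice of the hash space.
    return hash_value * shard_count // HASH_SPACE
```

Because the mapping is deterministic, all records with the same partition key land on the same shard, which is why a single "hot" key can saturate one shard while the others sit idle.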

Definition-Example Pairs

Deduplication (FindMatches ML): Using Machine Learning to identify duplicate records in a data lake that lack a common unique key.
- Example: A company merges customer databases where
