Selecting Appropriate Ingestion Configurations
This guide covers the critical decision-making process for ingesting data into AWS. It focuses on selecting the right service based on data type (streaming vs. batch), scale, and processing requirements, aligned with the AWS Certified Solutions Architect - Associate (SAA-C03) exam.
Learning Objectives
- Distinguish between Kinesis Data Streams and Kinesis Data Firehose for various use cases.
- Evaluate the trade-offs between Amazon SQS and Amazon Kinesis for data ingestion.
- Identify appropriate tools for on-premises to cloud data migration (DataSync vs. Storage Gateway).
- Understand the role of AWS Glue and Lake Formation in creating structured data lakes.
Key Terms & Glossary
- Ingestion: The process of collecting and moving data from various sources into a storage or processing system (e.g., S3 or Redshift).
- Shard: A unit of throughput capacity in Kinesis Data Streams. Each shard provides a fixed write capacity (1 MB/sec or 1,000 records/sec) and a fixed read capacity (2 MB/sec).
- Fan-out: The ability for multiple consumers to read from the same data stream concurrently.
- Producer: An application or device that sends data to an ingestion service.
- Consumer: A service or application that processes data delivered by an ingestion service.
- ETL (Extract, Transform, Load): A three-step process where data is taken from a source, changed into a suitable format, and placed in a destination.
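The three ETL steps can be illustrated with a minimal, standard-library-only sketch. The sample data, the function names, and the in-memory list standing in for a destination such as S3 or Redshift are all assumptions made for illustration; a real pipeline would typically run on a managed service.

```python
import csv
import io
import json

# Hypothetical raw input: one malformed amount ("n/a") and stray whitespace.
RAW = "id,amount\n1, 10.50 \n2,n/a\n3,7\n"

def extract(text):
    """Extract: read rows from the source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize types and drop rows that cannot be parsed."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # skip records with an unparseable amount
        cleaned.append({"id": int(row["id"]), "amount": amount})
    return cleaned

def load(rows, sink):
    """Load: write each cleaned row to the destination as a JSON line."""
    for row in rows:
        sink.append(json.dumps(row))

destination = []  # stand-in for a real target such as an S3 object
load(transform(extract(RAW)), destination)
```

After this runs, `destination` holds two JSON lines (ids 1 and 3); the row with `n/a` is dropped during the transform step.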
The "Big Idea"
Data ingestion is the "front door" of any data architecture. Choosing the wrong configuration leads to bottlenecks, data loss, or excessive costs. The fundamental choice revolves around Latency vs. Management: Do you need sub-second real-time processing (Kinesis Data Streams), or do you want a fully managed delivery service that can transform data before it hits the disk (Kinesis Data Firehose)?
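The latency-versus-management choice can be sketched as a small helper. This is a simplified study aid, not an official decision procedure: the `StreamingNeeds` fields and the function name are hypothetical, and the rules only encode the distinctions this guide draws (sub-second latency, multiple consumers, and data retention point to Data Streams; otherwise Firehose is the lower-management option).

```python
from dataclasses import dataclass

@dataclass
class StreamingNeeds:
    sub_second_latency: bool   # consumers must react in well under a second
    multiple_consumers: bool   # several applications read the same records
    replay_required: bool      # records must stay readable after first delivery

def choose_streaming_service(needs: StreamingNeeds) -> str:
    """Pick a streaming ingestion service from the guide's KDS/KDF trade-offs."""
    if needs.sub_second_latency or needs.multiple_consumers or needs.replay_required:
        # Any of these needs calls for a retained, multi-reader stream.
        return "Kinesis Data Streams"
    # Otherwise prefer the fully managed delivery service.
    return "Kinesis Data Firehose"

# A log pipeline that only needs to land data in S3 within a minute or so:
print(choose_streaming_service(StreamingNeeds(False, False, False)))
```

Real workloads often combine the two, for example a data stream feeding a Firehose delivery stream, which this sketch deliberately ignores.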
Formula / Concept Box
| Feature | Kinesis Data Streams (KDS) | Kinesis Data Firehose (KDF) |
|---|---|---|
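| Management | You manage shards in provisioned mode (on-demand mode scales automatically) | Fully Managed (Automatic) |
| Latency | Real-time (~200 ms; ~70 ms with enhanced fan-out) | Near real-time (buffered by size or interval; 60 s is the classic minimum interval) |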
| Data Retention | 24 hours to 365 days | No retention (transient) |
| Consumers | Multiple (Fan-out) | Single Destination |
| Transformation | Requires custom code | Integrated via AWS Lambda |
[!IMPORTANT] Shard Limits:
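- Write: up to 1,000 records/sec and up to 1 MB/sec per shard (whichever limit is reached first).
- Read: up to 5 read transactions/sec and up to 2 MB/sec per shard, shared by all standard consumers (enhanced fan-out gives each consumer its own 2 MB/sec).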
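The shard limits above translate directly into a sizing calculation for a provisioned stream: compute the shards needed for each limit and take the largest. The sketch below assumes standard (shared-throughput) consumers, so `read_mb_per_sec` is the total read rate across all consumers; the function name and the example workload are hypothetical.

```python
import math

# Per-shard limits as stated in this guide.
WRITE_RECORDS_PER_SEC = 1_000
WRITE_MB_PER_SEC = 1.0
READ_MB_PER_SEC = 2.0

def required_shards(records_per_sec: float,
                    write_mb_per_sec: float,
                    read_mb_per_sec: float) -> int:
    """Return the smallest shard count that satisfies every per-shard limit."""
    by_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_write = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_read = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_records, by_write, by_read)

# 4,500 records/sec at 3 MB/sec in, read in full by two standard consumers
# (2 x 3 MB/sec = 6 MB/sec out): the record rate is the binding limit.
print(required_shards(4_500, 3.0, 6.0))  # 5
```

Here the record count, not the byte rate, drives the answer: 4,500 records/sec needs 5 shards, while the write and read byte rates would each need only 3.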
Hierarchical Outline
- Streaming Data Ingestion
  - Amazon Kinesis Data Streams: Used for custom real-time applications; in provisioned mode you manage the shard count yourself.
  - Amazon Kinesis Data Firehose (renamed Amazon Data Firehose in 2024): Simple loading of streaming data into S3, Redshift, OpenSearch, or Splunk.
  - Amazon Kinesis Video Streams: Specifically for binary/video data ingestion for ML/analytics.
- Hybrid and Bulk Ingestion
  - AWS DataSync: Fast data transfer for large-scale migrations from on-premises to S3 or EFS.
  - AWS Storage Gateway: Hybrid storage that allows on-premises apps to use AWS storage via standard protocols (iSCSI, NFS, SMB).
- Managed Data Lakes
  - AWS Lake Formation: Simplifies the setup of a secure data lake; orchestrates AWS Glue for ingestion and ETL.
  - AWS Glue: Serverless ETL service; crawlers discover and categorize data into the Glue Data Catalog, and ETL jobs clean and transform it.
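For the hybrid and bulk branch of the outline, the DataSync versus Storage Gateway distinction comes down to whether on-premises applications keep using the data after it moves. A minimal sketch of that rule follows; the function and parameter names are hypothetical, and real designs weigh more factors (bandwidth, caching, protocol support).

```python
def choose_hybrid_service(ongoing_on_prem_access: bool) -> str:
    """Pick a hybrid/bulk ingestion service using the outline's distinction."""
    if ongoing_on_prem_access:
        # On-premises apps keep reading and writing through iSCSI or NFS.
        return "AWS Storage Gateway"
    # Bulk or migration-style transfer into S3 or EFS.
    return "AWS DataSync"

# A one-off migration of a file server archive into S3:
print(choose_hybrid_service(ongoing_on_prem_access=False))
```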
Visual Anchors
Ingestion Decision Flow
Shard Architecture Visualization
Definition-Example Pairs
- Deduplication (FindMatches ML): Using Machine Learning to identify duplicate records in a data lake that lack a common unique key.
- Example: A company merges customer databases where