AWS Study Guide: Data Ingestion Patterns and Frequency
Data ingestion patterns (for example, frequency)
This guide covers the critical patterns for moving data from sources into the AWS cloud, focusing on the trade-offs between real-time streaming and batch processing as required for the SAA-C03 exam.
Learning Objectives
By the end of this module, you should be able to:
- Differentiate between batch ingestion (Glue/Lake Formation) and stream ingestion (Kinesis).
- Determine the appropriate AWS service based on data frequency and consumer needs.
- Calculate throughput requirements for Kinesis Data Streams using shards.
- Contrast SQS and Kinesis Data Streams for decoupling and ingestion use cases.
Key Terms & Glossary
- Ingestion: The process of collecting data from various sources (S3, RDS, on-premises) and pulling it into a data lake or warehouse.
- Shard: The base throughput unit of a Kinesis Data Stream. It provides a fixed capacity for reads and writes.
- Fan-out: A pattern where multiple consumers read from the same data stream concurrently.
- Producer: The application or service that sends data into a stream (e.g., Kinesis Agent, CloudWatch Logs).
- Consumer: The application or service that processes data from a stream (e.g., Lambda, EC2 instances).
- Put-to-Get Delay: The time elapsed between data being added to a stream and being available for a consumer; typically < 1 second in Kinesis.
The "Big Idea"
Data ingestion is the "front door" of any data architecture. The core architectural decision is Frequency. If the business needs insights in seconds (e.g., fraud detection, stock price monitoring), you must use Streaming. If the business needs insights daily or hourly (e.g., financial reporting, historical archiving), Batch is more cost-effective and simpler to manage.
Formula / Concept Box
| Kinesis Shard Metric | Limit Per Shard |
|---|---|
| Write Capacity | 1,000 records/sec or 1 MB/sec, whichever is reached first |
| Read Capacity | 5 read transactions/sec or 2 MB/sec, shared across standard consumers |
| Data Retention | Default 24 hours; Max 365 days |
[!IMPORTANT] If your ingestion rate exceeds these limits, you will receive a ProvisionedThroughputExceededException. You must increase the number of shards to scale.
Hierarchical Outline
- Stream Ingestion (Real-Time)
- Kinesis Data Streams: Custom code for producers/consumers; sub-second latency; supports replayability.
- Kinesis Data Firehose: Zero-code delivery to S3, Redshift, OpenSearch; supports Lambda transformations.
- Kinesis Video Streams: Specifically for binary encoded video/audio data.
- Batch Ingestion (Scheduled/Periodic)
- AWS Glue: ETL service that uses Python/Spark to move data in bulk.
- AWS Lake Formation: Simplifies setting up a data lake by managing Glue crawlers and permissions.
- JDBC Connectors: Used to pull data from on-premises databases into AWS.
- Comparison: SQS vs. Kinesis
- SQS: One consumer per message; message deleted after processing; simple decoupling.
- Kinesis: Multiple consumers (fan-out); data persists in the stream (replayable); ordered processing per shard.
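The ordering guarantee in the last bullet comes from the partition key: every record with the same key lands on the same shard, and each shard preserves insertion order. A minimal producer sketch, where the stream name, field names, and `driver_id` key are illustrative assumptions rather than a fixed API contract:

```python
import json

def make_gps_record(driver_id: str, lat: float, lon: float) -> dict:
    """Build the per-record arguments for a Kinesis put_record call.
    Using driver_id as the partition key sends all of one driver's
    points to the same shard, so they stay in order."""
    return {
        "Data": json.dumps({"driver_id": driver_id, "lat": lat, "lon": lon}).encode(),
        "PartitionKey": driver_id,
    }

if __name__ == "__main__":
    import boto3  # AWS SDK for Python; needs credentials and a region configured
    kinesis = boto3.client("kinesis")
    args = make_gps_record("driver-42", 47.61, -122.33)
    # Stream name is illustrative; the stream must already exist.
    kinesis.put_record(StreamName="driver-locations", **args)
```

Contrast this with SQS, where there is no partition key and a message is consumed once and deleted; Kinesis keeps the record in the stream for the retention period so multiple consumers can replay it.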
Visual Anchors
Ingestion Flow Patterns
Visualizing Throughput Sharding
Definition-Example Pairs
- Batch Processing: Processing data in large, discrete groups at scheduled intervals.
- Example: An e-commerce company exports all sales records from their RDS production database to an S3 Data Lake every night at 2:00 AM using AWS Glue.
- Stream Processing: Processing data continuously as it is generated.
- Example: A ride-sharing app ingests GPS coordinates from thousands of drivers every second into Kinesis Data Streams to update the "nearby drivers" map in real-time.
- Data Transformation: Changing the format or structure of data during the ingestion process.
- Example: Using Kinesis Data Firehose to invoke a Lambda function that converts raw JSON logs into Parquet format before saving them to S3 for more efficient querying in Athena.
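A Firehose transformation Lambda receives base64-encoded records and must return each one with its `recordId`, a `result`, and re-encoded `data`. The sketch below shows that contract with a simple JSON field-trimming step; the field names are assumed for illustration, and in practice the JSON-to-Parquet conversion itself is usually handled by Firehose's built-in record format conversion rather than inside the Lambda:

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation Lambda: decode each incoming record,
    reshape the payload, and return it under the required response contract."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Illustrative cleanup (assumed fields): keep only what Athena will query.
        slim = {"user_id": payload.get("user_id"), "event": payload.get("event")}
        output.append({
            "recordId": record["recordId"],   # must echo the incoming recordId
            "result": "Ok",                   # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(slim).encode()).decode(),
        })
    return {"records": output}
```

Records marked "Dropped" are discarded silently, while "ProcessingFailed" records are delivered to the configured error prefix in S3.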
Worked Examples
Problem: Scaling for High Volume
Scenario: A company is ingesting 5,000 records per second. Each record is 500 KB in size. How many shards are required for Kinesis Data Streams?
Step-by-Step Breakdown:
- Check Record Count Constraint:
- Total Records: 5,000 records/sec.
- Limit per Shard: 1,000 records/sec.
- Required Shards (Count): $5,000 / 1,000 = 5$ shards.
- Check Throughput Constraint:
- Total Throughput: $5,000 \text{ records} \times 500 \text{ KB} = 2,500,000 \text{ KB/sec} \approx 2,500 \text{ MB/sec}$.
- Limit per Shard: 1 MB/sec.
- Required Shards (Throughput): $2,500 / 1 = 2,500$ shards.
- Conclusion: You must take the higher of the two numbers. Therefore, you need 2,500 shards to handle the data volume because the 1 MB/sec throughput limit is the bottleneck, not the record count.
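The arithmetic above can be captured in a short helper that checks both per-shard write limits and takes the stricter one, a sketch of the exam math rather than an official sizing tool:

```python
import math

def required_shards(records_per_sec: float, record_size_kb: float) -> int:
    """Estimate Kinesis Data Streams shards needed for a write workload,
    taking the stricter of the two per-shard limits:
    1,000 records/sec and 1 MB/sec."""
    by_count = math.ceil(records_per_sec / 1_000)
    by_throughput = math.ceil(records_per_sec * record_size_kb / 1_000)  # KB -> MB
    return max(by_count, by_throughput)

print(required_shards(5_000, 500))  # 2500 -- throughput-bound, as in the worked example
print(required_shards(5_000, 0.1))  # 5    -- record-count-bound for tiny records
```

The second call shows why both checks matter: with 100-byte records the same 5,000 records/sec is limited by record count, not bandwidth.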
Checkpoint Questions
- Which Kinesis service is best suited for loading data directly into Amazon Redshift with minimal management overhead?
- Answer: Kinesis Data Firehose.
- You need to allow three different applications to process the same log data simultaneously. Should you use SQS or Kinesis?
- Answer: Kinesis (supports fan-out to multiple consumers).
- What is the maximum retention period for data in Kinesis Data Streams?
- Answer: 365 days.
- True or False: AWS Glue is a real-time streaming ingestion service.
- Answer: False. It is primarily a batch-oriented ETL service.
- What happens if a producer attempts to write 2 MB/sec to a single Kinesis shard?
- Answer: The request is throttled with a ProvisionedThroughputExceededException.
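When a producer does hit that throttle, the standard response is to retry with exponential backoff. A minimal sketch of the delay schedule (production code would also add random jitter and eventually reshard the stream):

```python
def backoff_delays(max_retries: int = 5, base: float = 0.1, cap: float = 5.0) -> list:
    """Delay (in seconds) before each retry of a throttled put_record call:
    doubles per attempt, capped so waits never grow unbounded."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]

print(backoff_delays(5, 1, 5))  # [1, 2, 4, 5, 5]
```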