AWS Study Guide: Data Ingestion Patterns and Frequency

Data ingestion patterns (for example, frequency)

This guide covers the critical patterns for moving data from sources into the AWS cloud, focusing on the trade-offs between real-time streaming and batch processing as required for the SAA-C03 exam.

Learning Objectives

By the end of this module, you should be able to:

  • Differentiate between batch ingestion (Glue/Lake Formation) and stream ingestion (Kinesis).
  • Determine the appropriate AWS service based on data frequency and consumer needs.
  • Calculate throughput requirements for Kinesis Data Streams using shards.
  • Contrast SQS and Kinesis Data Streams for decoupling and ingestion use cases.

Key Terms & Glossary

  • Ingestion: The process of collecting data from various sources (S3, RDS, on-premises systems) and loading it into a data lake or warehouse.
  • Shard: The base throughput unit of a Kinesis Data Stream. It provides a fixed capacity for reads and writes.
  • Fan-out: A pattern where multiple consumers read from the same data stream concurrently.
  • Producer: The application or service that sends data into a stream (e.g., Kinesis Agent, CloudWatch Logs).
  • Consumer: The application or service that processes data from a stream (e.g., Lambda, EC2 instances).
  • Put-to-Get Delay: The time elapsed between data being added to a stream and being available for a consumer; typically < 1 second in Kinesis.

The "Big Idea"

Data ingestion is the "front door" of any data architecture. The core architectural decision is Frequency. If the business needs insights in seconds (e.g., fraud detection, stock price monitoring), you must use Streaming. If the business needs insights daily or hourly (e.g., financial reporting, historical archiving), Batch is more cost-effective and simpler to manage.

Formula / Concept Box

| Kinesis Shard Metric | Limit Per Shard                |
| -------------------- | ------------------------------ |
| Write Capacity       | 1,000 records/sec OR 1 MB/sec  |
| Read Capacity        | 5 transactions/sec OR 2 MB/sec |
| Data Retention       | Default 24 hours; Max 365 days |

[!IMPORTANT] If your ingestion rate exceeds these limits, you will receive a ProvisionedThroughputExceededException. You must increase the number of shards to scale.
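A producer can handle this throttling client-side with exponential backoff. A minimal sketch, where the exception class and the `put_record` callable are illustrative stand-ins rather than the boto3 API:

```python
import random
import time

class ProvisionedThroughputExceededException(Exception):
    """Stand-in for the Kinesis throttling error (illustrative only)."""

def put_with_backoff(put_record, record, max_retries=5):
    """Retry a throttled write with exponential backoff plus jitter.

    `put_record` is any callable that raises
    ProvisionedThroughputExceededException when the shard limit is hit.
    """
    for attempt in range(max_retries):
        try:
            return put_record(record)
        except ProvisionedThroughputExceededException:
            # Wait roughly 2^attempt * 100 ms (with jitter) before retrying.
            time.sleep((2 ** attempt) * 0.1 * random.uniform(0.5, 1.0))
    raise RuntimeError(f"record dropped after {max_retries} throttled attempts")
```

In practice the AWS SDKs and the Kinesis Producer Library typically retry throttled writes for you; the sketch just makes the backoff pattern explicit.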

Hierarchical Outline

  1. Stream Ingestion (Real-Time)
    • Kinesis Data Streams: Custom code for producers/consumers; sub-second latency; supports replayability.
    • Kinesis Data Firehose: Zero-code delivery to S3, Redshift, OpenSearch; supports Lambda transformations.
    • Kinesis Video Streams: Specifically for binary encoded video/audio data.
  2. Batch Ingestion (Scheduled/Periodic)
    • AWS Glue: ETL service that uses Python/Spark to move data in bulk.
    • AWS Lake Formation: Simplifies setting up a data lake by managing Glue crawlers and permissions.
    • JDBC Connectors: Used to pull data from on-premises databases into AWS.
  3. Comparison: SQS vs. Kinesis
    • SQS: One consumer per message; message deleted after processing; simple decoupling.
    • Kinesis: Multiple consumers (fan-out); data persists in the stream (replayable); ordered processing per shard.
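The SQS/Kinesis contrast in item 3 can be made concrete with a toy in-memory model (plain Python, not the boto3 API): a queue message is gone once a consumer processes it, while stream records persist and each consumer tracks its own read position.

```python
from collections import deque

class SimpleQueue:
    """SQS-style: each message is delivered to one consumer, then deleted."""
    def __init__(self):
        self._messages = deque()

    def send(self, msg):
        self._messages.append(msg)

    def receive(self):
        # Receiving removes the message; no other consumer will see it.
        return self._messages.popleft() if self._messages else None

class SimpleStream:
    """Kinesis-style: records persist; every consumer reads independently."""
    def __init__(self):
        self._records = []

    def put(self, record):
        self._records.append(record)

    def read_from(self, checkpoint):
        """Return all records at or after `checkpoint` (a list index)."""
        return self._records[checkpoint:]
```

With the stream, any number of consumers can call `read_from(0)` and each sees the full record history (fan-out and replay); with the queue, the first `receive` wins.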

Visual Anchors

Ingestion Flow Patterns

(Diagram unavailable in this version.)

Visualizing Throughput Sharding

(Diagram unavailable in this version.)

Definition-Example Pairs

  • Batch Processing: Processing data in large, discrete groups at scheduled intervals.
    • Example: An e-commerce company exports all sales records from their RDS production database to an S3 Data Lake every night at 2:00 AM using AWS Glue.
  • Stream Processing: Processing data continuously as it is generated.
    • Example: A ride-sharing app ingests GPS coordinates from thousands of drivers every second into Kinesis Data Streams to update the "nearby drivers" map in real-time.
  • Data Transformation: Changing the format or structure of data during the ingestion process.
    • Example: Using Kinesis Data Firehose to invoke a Lambda function that converts raw JSON logs into Parquet format before saving them to S3 for more efficient querying in Athena.
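The transformation example above can be sketched as a Firehose transformation Lambda. The event/response shape (base64-encoded `data`, echoed `recordId`, a `result` status) follows the documented Firehose contract; the key-lowercasing payload logic is purely illustrative, and real Parquet conversion is usually delegated to Firehose's built-in record format conversion rather than done inside the function:

```python
import base64
import json

def handler(event, context):
    """Sketch of a Kinesis Data Firehose transformation Lambda.

    Firehose delivers base64-encoded records; each response record must
    echo the recordId, set result to "Ok"/"Dropped"/"ProcessingFailed",
    and return the (re-encoded) transformed data.
    """
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"])
        try:
            payload = json.loads(raw)
            # Illustrative cleanup: lowercase keys, newline-delimit for S3.
            cleaned = json.dumps({k.lower(): v for k, v in payload.items()}) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(cleaned.encode()).decode(),
            })
        except json.JSONDecodeError:
            # Pass malformed records through so Firehose can route them
            # to the error output prefix instead of silently dropping them.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```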

Worked Examples

Problem: Scaling for High Volume

Scenario: A company is ingesting 5,000 records per second. Each record is 500 KB in size. How many shards are required for Kinesis Data Streams?

Step-by-Step Breakdown:

  1. Check Record Count Constraint:
    • Total Records: 5,000 records/sec.
    • Limit per Shard: 1,000 records/sec.
    • Required Shards (Count): $5,000 / 1,000 = 5$ shards.
  2. Check Throughput Constraint:
    • Total Throughput: $5,000 \text{ records/sec} \times 500 \text{ KB} = 2,500,000 \text{ KB/sec} = 2,500 \text{ MB/sec}$ (using 1 MB = 1,000 KB).
    • Limit per Shard: 1 MB/sec.
    • Required Shards (Throughput): $2,500 / 1 = 2,500$ shards.
  3. Conclusion: You must take the higher of the two numbers. Therefore, you need 2,500 shards to handle the data volume because the 1 MB/sec throughput limit is the bottleneck, not the record count.
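The two-constraint check above generalizes to a small helper. The limits mirror the shard table earlier in this guide, and the KB-to-MB conversion uses the same 1,000 KB = 1 MB approximation as the worked example:

```python
import math

SHARD_RECORDS_PER_SEC = 1_000  # write limit per shard (record count)
SHARD_MB_PER_SEC = 1.0         # write limit per shard (data volume)

def required_shards(records_per_sec, record_size_kb):
    """Shards needed for a write workload: the stricter limit wins."""
    by_count = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    mb_per_sec = records_per_sec * record_size_kb / 1000
    by_throughput = math.ceil(mb_per_sec / SHARD_MB_PER_SEC)
    return max(by_count, by_throughput)

print(required_shards(5_000, 500))  # 2500 -- throughput is the bottleneck
print(required_shards(5_000, 0.1))  # 5    -- record count is the bottleneck
```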

Checkpoint Questions

  1. Which Kinesis service is best suited for loading data directly into Amazon Redshift with minimal management?
    • Answer: Kinesis Data Firehose.
  2. You need to allow three different applications to process the same log data simultaneously. Should you use SQS or Kinesis?
    • Answer: Kinesis (supports fan-out to multiple consumers).
  3. What is the maximum retention period for data in Kinesis Data Streams?
    • Answer: 365 days.
  4. True or False: AWS Glue is a real-time streaming ingestion service.
    • Answer: False. It is primarily a batch-oriented ETL service.
  5. What happens if a producer attempts to write 2 MB/sec to a single Kinesis Shard?
    • Answer: The write is throttled and the producer receives a ProvisionedThroughputExceededException.
