Data Engineering Study Guide: Integrating AWS Lambda with Amazon Kinesis

Call a Lambda function from Kinesis

This guide covers the mechanics of using AWS Lambda to process streaming data from Amazon Kinesis Data Streams (KDS), a core skill for the AWS Certified Data Engineer – Associate exam (DEA-C01).

Learning Objectives

By the end of this guide, you should be able to:

  • Explain how Lambda polls Kinesis shards and invokes the function synchronously.
  • Configure Batching Windows and Payload Limits to optimize processing efficiency.
  • Scale throughput using the ParallelizationFactor.
  • Implement a basic Lambda consumer in Node.js.
  • Identify correct Starting Positions for event source mappings to prevent data loss.

Key Terms & Glossary

  • Event Source Mapping (ESM): A Lambda resource that reads from an event source (like Kinesis) and invokes a Lambda function.
  • Shard: A uniquely identified sequence of data records in a Kinesis stream; the base unit of throughput.
  • Batching Window: The maximum amount of time (up to 300 seconds) Lambda spends gathering records before invoking the function.
  • Parallelization Factor: A setting that allows Lambda to process more than one batch from a single shard concurrently.
  • Trim Horizon: An iterator setting that starts reading from the oldest data record in the shard.

The "Big Idea"

[!IMPORTANT] The Decoupling Principle: Integrating Lambda with Kinesis enables near-real-time stream processing in which the producer (ingestion) and the consumer (processing) are decoupled. Processing logic, such as PII removal or data enrichment, can then scale independently of the data ingestion rate without impacting the main application's performance.

Formula / Concept Box

| Configuration Item | Default / Limit | Impact |
| --- | --- | --- |
| Max Payload Size | 6 MB | Lambda invokes once this limit is reached in a batch. |
| Max Batching Window | 300 seconds (5 mins) | Lambda invokes once this time expires if the payload limit isn't met. |
| Polling Frequency | ~1 second | The frequency at which Lambda checks each Kinesis shard for new data. |
| Parallelization Factor | 1 to 10 | Determines how many concurrent Lambda invocations happen per shard. |
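
To make the settings in this box concrete, here is a minimal sketch of how they might map onto the input of the Lambda CreateEventSourceMapping API. The helper `buildKinesisMappingParams`, its default values, the stream ARN, and the function name are all hypothetical and chosen for illustration; the helper only validates the two ranges listed above and returns a plain object, without calling AWS.

```javascript
// Hypothetical helper: builds an object shaped like the input to the Lambda
// CreateEventSourceMapping API. It validates and returns data only.
function buildKinesisMappingParams({
  streamArn,
  functionName,
  startingPosition = 'LATEST', // helper's own default, chosen for illustration
  batchSize = 100,             // helper's own default, chosen for illustration
  batchingWindowSeconds = 0,
  parallelizationFactor = 1,
}) {
  if (batchingWindowSeconds < 0 || batchingWindowSeconds > 300) {
    throw new RangeError('Batching window must be between 0 and 300 seconds');
  }
  if (parallelizationFactor < 1 || parallelizationFactor > 10) {
    throw new RangeError('Parallelization factor must be between 1 and 10');
  }
  return {
    EventSourceArn: streamArn,
    FunctionName: functionName,
    StartingPosition: startingPosition,
    BatchSize: batchSize,
    MaximumBatchingWindowInSeconds: batchingWindowSeconds,
    ParallelizationFactor: parallelizationFactor,
  };
}

const params = buildKinesisMappingParams({
  streamArn: 'arn:aws:kinesis:us-east-1:123456789012:stream/example-stream', // placeholder ARN
  functionName: 'example-consumer', // placeholder function name
  startingPosition: 'TRIM_HORIZON',
  batchingWindowSeconds: 60,
  parallelizationFactor: 2,
});
console.log(params);
```

An object like this could then be passed to the event source mapping call of whichever AWS SDK or infrastructure tool you use; check the current API reference for the full list of fields before relying on it.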

Hierarchical Outline

  1. Integration Mechanism
    • Polling: The event source mapping polls each shard for new records (about once per second).
    • Execution: Synchronous invocation; Lambda waits for the function to finish before fetching the next batch.
  2. Configuration & Tuning
    • Batching: Controlling costs and throughput via batch size and windows.
    • Scaling: Using ParallelizationFactor for high-volume shards.
  3. Reliability & Consistency
    • Starting Positions: TRIM_HORIZON, LATEST, or AT_TIMESTAMP.
    • Consistency: ESM updates are eventually consistent (may take a few minutes).
  4. Implementation
    • Transformation: Real-time PII removal or log analysis.
    • Downstream: Sending processed data to S3, Redshift, or SNS.

Visual Anchors

Data Flow Architecture

[Figure: data flow architecture diagram]

Parallelization Factor Visualization

This diagram represents how a single shard is split into multiple concurrent processing paths when the Parallelization Factor is set to 2.

\begin{tikzpicture}[node distance=2cm]
  \draw[thick, fill=blue!10] (0,0) rectangle (2,3) node[midway, align=center] {Kinesis \\ Shard 1};
  \draw[->, thick] (2,2.2) -- (4,2.5) node[right] {Batch A1 (Lambda Invocation 1)};
  \draw[->, thick] (2,0.8) -- (4,0.5) node[right] {Batch A2 (Lambda Invocation 2)};
  \node[draw, dashed, inner sep=10pt] at (7,1.5) {Parallelization Factor = 2};
\end{tikzpicture}

Definition-Example Pairs

  • Parallelization Factor

    • Definition: Specifies the number of concurrent batches that Lambda can process from a single shard.
    • Example: If a shard is overwhelmed by high-velocity IoT data, increasing this factor from 1 to 5 allows up to 5 concurrent invocations of the function to process that shard's data in parallel, reducing latency.
  • TRIM_HORIZON

    • Definition: A starting position that instructs Lambda to process all existing data in the stream, starting from the oldest available record.
    • Example: When deploying a new analytics function that needs to process the last 24 hours of data still stored in the stream, you set the starting position to TRIM_HORIZON.
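
The Parallelization Factor example above comes down to simple arithmetic: the upper bound on concurrent invocations for one event source mapping is the shard count multiplied by the factor. The sketch below illustrates that calculation; `maxConcurrentInvocations` is a hypothetical helper name, not part of any AWS SDK.

```javascript
// Upper bound on concurrent invocations for one Kinesis event source mapping:
// one batch per shard, multiplied by the parallelization factor.
function maxConcurrentInvocations(shardCount, parallelizationFactor = 1) {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError('shardCount must be a positive integer');
  }
  if (
    !Number.isInteger(parallelizationFactor) ||
    parallelizationFactor < 1 ||
    parallelizationFactor > 10
  ) {
    throw new RangeError('parallelizationFactor must be an integer from 1 to 10');
  }
  return shardCount * parallelizationFactor;
}

console.log(maxConcurrentInvocations(4));    // 4
console.log(maxConcurrentInvocations(4, 5)); // 20
```

This is only a ceiling. Actual concurrency also depends on how much data each shard holds and on the function's own concurrency limits.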

Worked Examples

Node.js Record Processor

When Lambda receives a Kinesis batch, the data field of each record is Base64 encoded. You must decode it before processing.

```javascript
exports.handler = async (event, context) => {
  // Iterate through the batch of records
  for (const record of event.Records) {
    // 1. Extract the Base64-encoded data field
    // 2. Decode from Base64 to a UTF-8 string ('ascii' would mangle non-ASCII characters)
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    console.log('Decoded payload:', payload);
    // Process logic (e.g., PII redaction) would go here
  }
  return `Successfully processed ${event.Records.length} records.`;
};
```

[!TIP] Always set your Lambda timeout longer than the time it takes to process a full 6 MB batch; a timeout mid-batch leads to partial failures and duplicate processing when the batch is retried.
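
As an illustration of the "process logic" placeholder in the handler above, the following sketch decodes each record and masks anything that looks like an email address. The redaction rule, the names `redactEmails` and `redactingHandler`, and the sample event are assumptions made for this example; real PII redaction usually needs more than one regular expression.

```javascript
// Hypothetical redaction rule: mask anything that looks like an email address.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function redactEmails(text) {
  return text.replace(EMAIL_PATTERN, '[REDACTED]');
}

// Decodes each record's Base64 data field and returns the redacted payloads.
const redactingHandler = async (event) => {
  return event.Records.map((record) => {
    const payload = Buffer.from(record.kinesis.data, 'base64').toString('utf8');
    return redactEmails(payload);
  });
};

// Synthetic event containing only the fields this sketch reads.
const sampleEvent = {
  Records: [
    {
      kinesis: {
        data: Buffer.from('user=jane@example.com action=login').toString('base64'),
      },
    },
  ],
};

redactingHandler(sampleEvent).then((result) => console.log(result));
// [ 'user=[REDACTED] action=login' ]
```

Keeping the redaction in a pure function such as `redactEmails` makes it easy to unit test without building a full Kinesis event.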

Checkpoint Questions

  1. What is the default behavior if a Lambda function fails while processing a Kinesis batch?
  2. How does Lambda determine when to invoke a function if a Batching Window is set to 60 seconds and the payload reaches 6MB in 10 seconds?
  3. Which starting position would you use to ignore all existing data in a stream and only process new incoming records?
Answers
  1. By default, Lambda retries the entire batch until it succeeds or the data expires from Kinesis (retention period); the affected shard is blocked in the meantime. (Note: Retry limits and on-failure destinations can be configured to handle this).
  2. It will invoke immediately at 10 seconds because the 6 MB payload limit was reached first.
  3. LATEST.
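
The reasoning behind answer 2 can be sketched as a small decision function. This is a simplified model for study purposes, not Lambda's internal logic: the name `invocationTrigger`, the order of the checks, and treating 6 MB as 6 × 1024 × 1024 bytes are all assumptions.

```javascript
const MAX_PAYLOAD_BYTES = 6 * 1024 * 1024; // 6 MB, assumed to mean mebibytes

// Simplified model: returns why a batch would be dispatched, or null if
// Lambda would keep gathering records.
function invocationTrigger({ elapsedSeconds, windowSeconds, payloadBytes, recordCount, batchSize }) {
  if (payloadBytes >= MAX_PAYLOAD_BYTES) return 'payload-limit';
  if (recordCount >= batchSize) return 'batch-size';
  if (elapsedSeconds >= windowSeconds) return 'window-expired';
  return null;
}

// Question 2: 60-second window, payload reaches 6 MB after 10 seconds.
console.log(invocationTrigger({
  elapsedSeconds: 10,
  windowSeconds: 60,
  payloadBytes: MAX_PAYLOAD_BYTES,
  recordCount: 500,
  batchSize: 10000,
})); // payload-limit
```

Whichever condition is met first ends the gathering phase, which is why a long batching window does not delay a batch that is already full.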

Comparison Tables

| Feature | Lambda via Kinesis ESM | Amazon Data Firehose |
| --- | --- | --- |
| Complexity | Higher (Custom Code) | Lower (Configuration-based) |
| Processing | Real-time (Sync Polling) | Near real-time (Buffer/Batch) |
| Scaling | Parallelization Factor | Automatic scaling |
| Transformation | Complex logic/External APIs | Simple (format conversion/Lambda) |

Muddy Points & Cross-Refs

  • Duplicate Processing: Lambda guarantees at-least-once delivery. If a function times out after processing 5 out of 10 records, the entire batch of 10 may be retried, so the first 5 records can be processed twice. Your code should be idempotent.
  • Eventual Consistency: When you create or update an Event Source Mapping, it may take several minutes to take effect. Do not expect immediate processing upon script deployment.
  • Kinesis vs. SQS: Kinesis is for ordered stream processing; SQS is for discrete, independent message tasks.
  • Deep Dive: For advanced error handling (BisectBatchOnFunctionError), refer to the AWS Lambda Documentation.
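
The idempotency advice under Duplicate Processing can be sketched as follows. This is a minimal illustration, assuming each record's `eventID` (which in the Kinesis event format combines the shard ID and sequence number) is a suitable deduplication key. The in-memory `Set` and the name `processBatchIdempotently` are hypothetical; a real consumer would need a durable store, because memory is lost when the execution environment is recycled.

```javascript
// Stand-in for a durable store (for example a database table); an in-memory
// Set only survives while the execution environment stays warm.
const processedKeys = new Set();

// Runs processRecord once per unseen record and returns how many were new.
function processBatchIdempotently(event, processRecord) {
  let newlyProcessed = 0;
  for (const record of event.Records) {
    const key = record.eventID; // assumed unique per record
    if (processedKeys.has(key)) continue; // handled in an earlier attempt
    processRecord(record);
    processedKeys.add(key); // mark as done only after processing succeeds
    newlyProcessed += 1;
  }
  return newlyProcessed;
}

// Synthetic batch delivered twice, as happens when Lambda retries a batch.
const retriedEvent = {
  Records: [
    { eventID: 'shardId-000000000000:101' },
    { eventID: 'shardId-000000000000:102' },
  ],
};
const handled = [];
console.log(processBatchIdempotently(retriedEvent, (r) => handled.push(r.eventID))); // 2
console.log(processBatchIdempotently(retriedEvent, (r) => handled.push(r.eventID))); // 0
```

Marking a key only after `processRecord` returns means a record that throws is retried rather than skipped, which matches the at-least-once model described above.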
