Data Engineering Study Guide: Integrating AWS Lambda with Amazon Kinesis

This guide covers the mechanics of using AWS Lambda to process streaming data from Amazon Kinesis Data Streams (KDS), a core skill for the AWS Certified Data Engineer – Associate exam (DEA-C01).

Learning Objectives

By the end of this guide, you should be able to:

Explain the synchronous polling mechanism used by Lambda for Kinesis.
Configure Batching Windows and Payload Limits to optimize processing efficiency.
Scale throughput using the ParallelizationFactor.
Implement a basic Lambda consumer in Node.js.
Identify correct Starting Positions for event source mappings to prevent data loss.

Key Terms & Glossary

Event Source Mapping (ESM): A Lambda resource that reads from an event source (like Kinesis) and invokes a Lambda function.
Shard: A uniquely identified sequence of data records in a Kinesis stream; the base unit of throughput.
Batching Window: The maximum amount of time (up to 300 seconds) Lambda spends gathering records before invoking the function.
Parallelization Factor: A setting that allows Lambda to process more than one batch from a single shard concurrently.
Trim Horizon: An iterator setting that starts reading from the oldest data record in the shard.

The "Big Idea"

[!IMPORTANT] The Decoupling Principle: Integrating Lambda with Kinesis allows for near-real-time stream processing where the producer (ingestion) and consumer (processing) are completely decoupled. This ensures that processing logic—such as PII removal or data enrichment—can scale independently of the data ingestion rate without impacting the main application's performance.

Formula / Concept Box

Configuration Item	Default / Limit	Impact
Max Payload Size	6 MB	Lambda invokes once this limit is reached in a batch.
Max Batching Window	300 seconds (5 mins)	Lambda invokes once this time expires if the payload limit isn't met.
Polling Frequency	~1 second	The frequency at which Lambda checks each Kinesis shard for new data.
Parallelization Factor	1 to 10	Determines how many concurrent Lambda invocations happen per shard.

Hierarchical Outline

Integration Mechanism
- Polling: Lambda polls shards synchronously.
- Execution: Synchronous invocation; Lambda waits for completion before fetching the next batch.
Configuration & Tuning
- Batching: Controlling costs and throughput via batch size and windows.
- Scaling: Using ParallelizationFactor for high-volume shards.
Reliability & Consistency
- Starting Positions: TRIM_HORIZON, LATEST, or AT_TIMESTAMP.
- Consistency: ESM updates are eventually consistent (may take a few minutes).
Implementation
- Transformation: Real-time PII removal or log analysis.
- Downstream: Sending processed data to S3, Redshift, or SNS.

Visual Anchors

Data Flow Architecture

Loading Diagram...

Parallelization Factor Visualization

This diagram represents how a single shard is split into multiple concurrent processing paths when the Parallelization Factor is set to 2.

Compiling TikZ diagram…

⏳

Running TeX engine…

This may take a few seconds

Definition-Example Pairs

Parallelization Factor
- Definition: Specifies the number of concurrent batches that Lambda can process from a single shard.
- Example: If a shard is overwhelmed by high-velocity IoT data, increasing this factor from 1 to 5 allows 5 Lambda functions to process that shard's data in parallel, reducing latency.
TRIM_HORIZON
- Definition: A starting position that instructs Lambda to process all existing data in the stream, starting from the oldest available record.
- Example: When deploying a new analytics function that needs to process the last 24 hours of data still stored in the stream, you set the starting position to TRIM_HORIZON.

Worked Examples

Node.js Record Processor

When Lambda receives Kinesis data, it is Base64 encoded. You must decode it before processing.

javascript

exports.handler = async (event, context) => {
    // Iterating through the batch of records
    event.Records.forEach(record => {
        // 1. Extract the data
        // 2. Decode from Base64 to String
        const payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
        
        console.log('Decoded payload:', payload);
        
        // Process logic (e.g., PII Redaction) would go here
    });
    
    return `Successfully processed ${event.Records.length} records.`;
};

[!TIP] Always ensure your Lambda Timeout is longer than the time it takes to process a full 6MB batch to avoid partial failures and duplicate processing.

Checkpoint Questions

What is the default behavior if a Lambda function fails while processing a Kinesis batch?
How does Lambda determine when to invoke a function if a Batching Window is set to 60 seconds and the payload reaches 6MB in 10 seconds?
Which starting position would you use to ignore all existing data in a stream and only process new incoming records?

▶Click to see answers

Lambda will retry the entire batch until the data expires from Kinesis (retention period) or succeeds. (Note: On-failure destinations can be configured to handle this).
It will invoke immediately at 10 seconds because the 6MB payload limit was reached first.
LATEST.

Comparison Tables

Feature	Lambda via Kinesis ESM	Amazon Data Firehose
Complexity	Higher (Custom Code)	Lower (Configuration-based)
Processing	Real-time (Sync Polling)	Near real-time (Buffer/Batch)
Scaling	Parallelization Factor	Automatic scaling
Transformation	Complex logic/External APIs	Simple (format conversion/Lambda)

Muddy Points & Cross-Refs

Duplicate Processing: Lambda guarantees at least once delivery. If a function times out after processing 5 out of 10 records, the entire batch of 10 may be retried. Your code should be idempotent.
Eventual Consistency: When you create or update an Event Source Mapping, it may take several minutes to take effect. Do not expect immediate processing upon script deployment.
Kinesis vs. SQS: Kinesis is for ordered stream processing; SQS is for discrete, independent message tasks.
Deep Dive: For advanced error handling (BisectBatchOnFunctionError), refer to the AWS Lambda Documentation.