Data Engineering Study Guide: Integrating AWS Lambda with Amazon Kinesis
Call a Lambda function from Kinesis
Data Engineering Study Guide: Integrating AWS Lambda with Amazon Kinesis
This guide covers the mechanics of using AWS Lambda to process streaming data from Amazon Kinesis Data Streams (KDS), a core skill for the AWS Certified Data Engineer – Associate exam (DEA-C01).
Learning Objectives
By the end of this guide, you should be able to:
- Explain the synchronous polling mechanism used by Lambda for Kinesis.
- Configure Batching Windows and Payload Limits to optimize processing efficiency.
- Scale throughput using the ParallelizationFactor.
- Implement a basic Lambda consumer in Node.js.
- Identify correct Starting Positions for event source mappings to prevent data loss.
Key Terms & Glossary
- Event Source Mapping (ESM): A Lambda resource that reads from an event source (like Kinesis) and invokes a Lambda function.
- Shard: A uniquely identified sequence of data records in a Kinesis stream; the base unit of throughput.
- Batching Window: The maximum amount of time (up to 300 seconds) Lambda spends gathering records before invoking the function.
- Parallelization Factor: A setting that allows Lambda to process more than one batch from a single shard concurrently.
- Trim Horizon: An iterator setting that starts reading from the oldest data record in the shard.
The "Big Idea"
[!IMPORTANT] The Decoupling Principle: Integrating Lambda with Kinesis allows for near-real-time stream processing where the producer (ingestion) and consumer (processing) are completely decoupled. This ensures that processing logic—such as PII removal or data enrichment—can scale independently of the data ingestion rate without impacting the main application's performance.
Formula / Concept Box
| Configuration Item | Default / Limit | Impact |
|---|---|---|
| Max Payload Size | 6 MB | Lambda invokes once this limit is reached in a batch. |
| Max Batching Window | 300 seconds (5 mins) | Lambda invokes once this time expires if the payload limit isn't met. |
| Polling Frequency | ~1 second | The frequency at which Lambda checks each Kinesis shard for new data. |
| Parallelization Factor | 1 to 10 | Determines how many concurrent Lambda invocations happen per shard. |
Hierarchical Outline
- Integration Mechanism
- Polling: Lambda polls shards synchronously.
- Execution: Synchronous invocation; Lambda waits for completion before fetching the next batch.
- Configuration & Tuning
- Batching: Controlling costs and throughput via batch size and windows.
- Scaling: Using
ParallelizationFactorfor high-volume shards.
- Reliability & Consistency
- Starting Positions:
TRIM_HORIZON,LATEST, orAT_TIMESTAMP. - Consistency: ESM updates are eventually consistent (may take a few minutes).
- Starting Positions:
- Implementation
- Transformation: Real-time PII removal or log analysis.
- Downstream: Sending processed data to S3, Redshift, or SNS.
Visual Anchors
Data Flow Architecture
Parallelization Factor Visualization
This diagram represents how a single shard is split into multiple concurrent processing paths when the Parallelization Factor is set to 2.
\begin{tikzpicture}[node distance=2cm] \draw[thick, fill=blue!10] (0,0) rectangle (2,3) node[midway, align=center] {Kinesis \ Shard 1}; \draw[->, thick] (2,2.2) -- (4,2.5) node[right] {Batch A1 (Lambda Invocation 1)}; \draw[->, thick] (2,0.8) -- (4,0.5) node[right] {Batch A2 (Lambda Invocation 2)}; \node[draw, dashed, inner sep=10pt] at (7,1.5) {Parallelization Factor = 2}; \end{tikzpicture}
Definition-Example Pairs
-
Parallelization Factor
- Definition: Specifies the number of concurrent batches that Lambda can process from a single shard.
- Example: If a shard is overwhelmed by high-velocity IoT data, increasing this factor from 1 to 5 allows 5 Lambda functions to process that shard's data in parallel, reducing latency.
-
TRIM_HORIZON
- Definition: A starting position that instructs Lambda to process all existing data in the stream, starting from the oldest available record.
- Example: When deploying a new analytics function that needs to process the last 24 hours of data still stored in the stream, you set the starting position to
TRIM_HORIZON.
Worked Examples
Node.js Record Processor
When Lambda receives Kinesis data, it is Base64 encoded. You must decode it before processing.
exports.handler = async (event, context) => {
// Iterating through the batch of records
event.Records.forEach(record => {
// 1. Extract the data
// 2. Decode from Base64 to String
const payload = Buffer.from(record.kinesis.data, 'base64').toString('ascii');
console.log('Decoded payload:', payload);
// Process logic (e.g., PII Redaction) would go here
});
return `Successfully processed ${event.Records.length} records.`;
};[!TIP] Always ensure your Lambda Timeout is longer than the time it takes to process a full 6MB batch to avoid partial failures and duplicate processing.
Checkpoint Questions
- What is the default behavior if a Lambda function fails while processing a Kinesis batch?
- How does Lambda determine when to invoke a function if a Batching Window is set to 60 seconds and the payload reaches 6MB in 10 seconds?
- Which starting position would you use to ignore all existing data in a stream and only process new incoming records?
▶Click to see answers
- Lambda will retry the entire batch until the data expires from Kinesis (retention period) or succeeds. (Note: On-failure destinations can be configured to handle this).
- It will invoke immediately at 10 seconds because the 6MB payload limit was reached first.
- LATEST.
Comparison Tables
| Feature | Lambda via Kinesis ESM | Amazon Data Firehose |
|---|---|---|
| Complexity | Higher (Custom Code) | Lower (Configuration-based) |
| Processing | Real-time (Sync Polling) | Near real-time (Buffer/Batch) |
| Scaling | Parallelization Factor | Automatic scaling |
| Transformation | Complex logic/External APIs | Simple (format conversion/Lambda) |
Muddy Points & Cross-Refs
- Duplicate Processing: Lambda guarantees at least once delivery. If a function times out after processing 5 out of 10 records, the entire batch of 10 may be retried. Your code should be idempotent.
- Eventual Consistency: When you create or update an Event Source Mapping, it may take several minutes to take effect. Do not expect immediate processing upon script deployment.
- Kinesis vs. SQS: Kinesis is for ordered stream processing; SQS is for discrete, independent message tasks.
- Deep Dive: For advanced error handling (BisectBatchOnFunctionError), refer to the AWS Lambda Documentation.