# Lab: Building a Serverless Data Processor with AWS Lambda and Python
This lab focuses on Domain 1.4 (Apply Programming Concepts) of the AWS Certified Data Engineer - Associate (DEA-C01) exam. You will build a serverless data pipeline that automates the transformation of raw CSV data into a structured JSON format using Python, S3 event triggers, and CloudWatch monitoring.
## Prerequisites

To complete this lab, you need:

- An AWS Account with administrative privileges.
- AWS CLI installed and configured (`aws configure`).
- A text editor (VS Code, Sublime, or similar).
- Python 3.9+ installed locally for packaging.
- Basic familiarity with Python and the `boto3` library.
## Learning Objectives
- Deploy a Lambda Function using the AWS CLI and verify its performance configuration.
- Implement Data Transformation logic to convert CSV to JSON within a memory-constrained environment.
- Configure Event Triggers to automate processing upon S3 object uploads.
- Apply Best Practices for logging and monitoring using Amazon CloudWatch Logs.
## Architecture Overview

The following diagram illustrates the flow of data from ingestion to monitoring.

## Step-by-Step Instructions

### Step 1: Create the Data Lake Storage
We need a bucket to host our incoming data and the processed results.
```shell
# Replace <YOUR_UNIQUE_SUFFIX> with your name or a random number
aws s3 mb s3://brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>
```

Console alternative:

- Log in to the S3 Console.
- Click Create bucket.
- Name: `brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>`.
- Keep default settings and click Create bucket.
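Bucket names must be globally unique across all of S3. If you would rather generate a random suffix than use your name, one quick way (the `uuid` approach here is just a convenience, not part of the lab requirements):

```python
# Generate a globally unique, S3-safe bucket name.
# S3 bucket names must be lowercase; uuid4().hex already is.
import uuid

suffix = uuid.uuid4().hex[:8]          # e.g. "3f9a1c2b"
bucket_name = f"brainybee-lab-data-{suffix}"

print(bucket_name)
```

Use the printed value wherever the lab says `brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>`.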
### Step 2: Create the IAM Execution Role
Lambda requires permission to read from S3 and write logs to CloudWatch.
- Save the following as `trust-policy.json`:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

- Create the role and attach the managed policies:

```shell
aws iam create-role --role-name BrainyBeeLambdaRole --assume-role-policy-document file://trust-policy.json

# Attach Managed Policies for S3 access and Logs
aws iam attach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```

`AmazonS3FullAccess` is convenient for a lab; in production, scope the policy down to the specific bucket.

### Step 3: Write and Package the Lambda Code
Create a file named `lambda_function.py`. This script applies programming concepts like optimizing runtime and logging.
```python
import json
import boto3
import csv
import io
import time
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    start_time = time.time()

    # Get bucket and file name from the event.
    # S3 URL-encodes object keys in event payloads, so decode them.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    print(f"Processing file: {key} from bucket: {bucket}")

    # Read CSV
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')

    # Transform CSV to JSON
    reader = csv.DictReader(io.StringIO(content))
    json_data = json.dumps([row for row in reader])

    # Save to S3 (in the processed/ folder)
    output_key = f"processed/{key.replace('.csv', '.json')}"
    s3.put_object(Bucket=bucket, Key=output_key, Body=json_data)

    duration = time.time() - start_time
    print(f"Transformation complete in {duration:.2f} seconds.")
    return {
        'statusCode': 200,
        'body': json.dumps('Success!')
    }
```

Package the function:

```shell
zip function.zip lambda_function.py
```

### Step 4: Deploy the Lambda Function
We will configure the function with 128MB of memory, Lambda's minimum allocation and sufficient for lightweight ETL scripts.
```shell
aws lambda create-function --function-name DataProcessor \
  --zip-file fileb://function.zip --handler lambda_function.lambda_handler --runtime python3.9 \
  --role arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBeeLambdaRole
```

> [!TIP]
> Use `aws sts get-caller-identity --query Account --output text` to find your Account ID.
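Before wiring up the trigger, you can sanity-check the CSV-to-JSON transformation locally. This sketch applies the same stdlib calls as the handler (`csv.DictReader` plus `json.dumps`) to an in-memory sample, with no AWS calls:

```python
import csv
import io
import json

# The transformation the Lambda handler performs, minus the S3 I/O.
sample_csv = "id,name\n1,alpha\n2,beta\n"

reader = csv.DictReader(io.StringIO(sample_csv))
rows = [row for row in reader]
json_data = json.dumps(rows)

print(json_data)  # [{"id": "1", "name": "alpha"}, {"id": "2", "name": "beta"}]
```

Note that all values remain strings: `DictReader` does not infer numeric types, so `"id": "1"` is expected.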
### Step 5: Configure S3 Trigger
- Grant Permission to S3 to invoke the Lambda:

```shell
aws lambda add-permission --function-name DataProcessor --statement-id s3-trigger \
  --action "lambda:InvokeFunction" --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>
```

- Configure the Bucket Notification. Save the following as `notification.json`:
```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:function:DataProcessor",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": { "Key": { "FilterRules": [{ "Name": "suffix", "Value": ".csv" }] } }
    }
  ]
}
```

The `.csv` suffix filter also prevents the function from re-triggering on its own `.json` output. Apply the configuration:

```shell
aws s3api put-bucket-notification-configuration --bucket brainybee-lab-data-<YOUR_UNIQUE_SUFFIX> \
  --notification-configuration file://notification.json
```

## Checkpoints
| Verification Step | Command / Action | Expected Result |
|---|---|---|
| Function Status | `aws lambda get-function --function-name DataProcessor` | `State` should be `Active`. |
| Data Ingestion | Upload a sample CSV to S3. | File appears in bucket root. |
| Execution | Check the `processed/` folder in S3. | A `.json` version of your file exists. |
| Logging | View CloudWatch Log Groups for `/aws/lambda/DataProcessor`. | Logs show "Transformation complete..." with timing. |
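You can also drive the Execution checkpoint manually, without uploading a file: `aws lambda invoke` accepts a hand-built payload. The handler only reads the bucket name and object key, so a minimal S3 event looks like the dictionary below (field values are placeholders; real S3 events carry many more fields, and object keys arrive URL-encoded):

```python
import json
from urllib.parse import unquote_plus

# Minimal shape of the S3 "ObjectCreated" event the handler parses.
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "brainybee-lab-data-example"},
                "object": {"key": "sample+data.csv"},  # S3 URL-encodes keys
            }
        }
    ]
}

bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
print(bucket, key)  # brainybee-lab-data-example sample data.csv
```

Save the `event` dict as `payload.json` and pass it with `aws lambda invoke --function-name DataProcessor --payload file://payload.json out.json` (with AWS CLI v2 you may also need `--cli-binary-format raw-in-base64-out`).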
## Teardown

> [!WARNING]
> Remember to run these commands to avoid ongoing charges for storage and logging.
```shell
# 1. Delete Lambda
aws lambda delete-function --function-name DataProcessor

# 2. Empty and Delete S3 Bucket
aws s3 rm s3://brainybee-lab-data-<YOUR_UNIQUE_SUFFIX> --recursive
aws s3 rb s3://brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>

# 3. Delete IAM Role
aws iam detach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam delete-role --role-name BrainyBeeLambdaRole
```

## Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `AccessDenied` | IAM role lacks S3 permissions. | Re-run the `attach-role-policy` commands in Step 2. |
| `ModuleNotFoundError` | Code structure in the ZIP is incorrect. | Ensure `lambda_function.py` is at the root of the ZIP. |
| Out-of-memory failure | The CSV file is too large for 128MB. | Use `aws lambda update-function-configuration` to increase memory. |
## Cost Estimate
- AWS Lambda: First 1 million requests per month are free (Free Tier).
- Amazon S3: ~$0.023 per GB (Standard). In this lab, cost will be < $0.01.
- CloudWatch: Free up to 5GB of log data.
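As a rough sanity check on these estimates, Lambda compute is billed in GB-seconds. Assuming roughly $0.0000166667 per GB-second (an assumed x86 rate; verify against current AWS pricing) and deliberately generous lab usage:

```python
# Back-of-envelope Lambda compute cost for this lab (prices are assumptions).
memory_gb = 128 / 1024          # 0.125 GB
avg_duration_s = 2.0            # pessimistic for a tiny CSV
invocations = 100               # far more than the lab needs

gb_seconds = memory_gb * avg_duration_s * invocations   # 25.0
price_per_gb_s = 0.0000166667                           # assumed rate

cost = gb_seconds * price_per_gb_s
print(f"${cost:.6f}")  # well under a cent, and inside the Free Tier anyway
```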
## Stretch Challenge

Task: Configure the function for controlled concurrency. Set a Reserved Concurrency of 5 on the Lambda function. This caps how many instances of DataProcessor can run at once, so a sudden burst of S3 uploads cannot consume your account's shared concurrency pool, which is a key DEA-C01 skill (Skill 1.4.2).
Show CLI Command:

```shell
aws lambda put-function-concurrency --function-name DataProcessor --reserved-concurrent-executions 5
```

## Concept Review
In this lab, we applied several core programming concepts required for the DEA-C01 exam:
### Lambda Execution Lifecycle
This TikZ diagram visualizes the phases of a Lambda execution environment.
```latex
\begin{tikzpicture}[node distance=1.5cm,
    every node/.style={draw, rectangle, rounded corners, fill=blue!10, text width=2.5cm, align=center}]
  \node (init) {\textbf{Init Phase}\\ Extensions, Runtime, Function code};
  \node (invoke) [right=of init] {\textbf{Invoke Phase}\\ Handler execution, Event processing};
  \node (shutdown) [right=of invoke] {\textbf{Shutdown}\\ Runtime cleanup};
  \draw[->, thick] (init) -- (invoke);
  \draw[->, thick] (invoke) -- (shutdown);
  \node[draw=none, fill=none, below=0.1cm of init] (cold) {\textit{Cold Start}};
  \node[draw=none, fill=none, below=0.1cm of invoke] (warm) {\textit{Warm Start}};
\end{tikzpicture}
```
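This lifecycle is why the lab's handler creates the `boto3` client at module scope: work done during the Init phase survives across warm invocations of the same execution environment. A stdlib-only sketch of the pattern (the counters and `expensive_init` are illustrative stand-ins, not Lambda APIs):

```python
# Module scope == Lambda "Init" phase: runs once per execution environment.
init_count = 0
invoke_count = 0

def expensive_init():
    # Stand-in for creating boto3 clients, loading config, etc.
    global init_count
    init_count += 1
    return {"client": "ready"}

CLIENT = expensive_init()  # paid once, during the cold start

def handler(event, context):
    # "Invoke" phase: runs on every request, reusing CLIENT.
    global invoke_count
    invoke_count += 1
    return CLIENT["client"]

# Simulate three warm invocations after one cold start:
for _ in range(3):
    handler({}, None)

print(init_count, invoke_count)  # 1 3
```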
### Key Terms
- Distributed Computing: The use of multiple compute resources (like Lambda instances) to process data in parallel.
- Event-Driven Architecture: A system where actions (Lambda) are triggered by events (S3 upload) rather than polling.
- IaC (Infrastructure as Code): While we used the AWS CLI here, these steps are typically automated with AWS SAM or the CDK in production to ensure repeatability.