# Lab: Building a Serverless Data Processor with AWS Lambda and Python
This lab focuses on Domain 1.4 (Apply Programming Concepts) of the AWS Certified Data Engineer - Associate (DEA-C01) exam. You will build a serverless data pipeline that automates the transformation of raw CSV data into a structured JSON format using Python, S3 event triggers, and CloudWatch monitoring.
## Prerequisites

To complete this lab, you need:

- An AWS Account with administrative privileges.
- AWS CLI installed and configured (`aws configure`).
- A text editor (VS Code, Sublime, or similar).
- Python 3.9+ installed locally for packaging.
- Basic familiarity with Python and the `boto3` library.
## Learning Objectives
- Deploy a Lambda Function using the AWS CLI and verify its performance configuration.
- Implement Data Transformation logic to convert CSV to JSON within a memory-constrained environment.
- Configure Event Triggers to automate processing upon S3 object uploads.
- Apply Best Practices for logging and monitoring using Amazon CloudWatch Logs.
## Architecture Overview

The following diagram illustrates the flow of data from ingestion to monitoring.

## Step-by-Step Instructions

### Step 1: Create the Data Lake Storage
We need a bucket to host our incoming data and the processed results.
```shell
# Replace <YOUR_UNIQUE_SUFFIX> with your name or a random number
aws s3 mb s3://brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>
```

Console alternative:

- Log in to the S3 Console.
- Click Create bucket.
- Name: `brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>`.
- Keep default settings and click Create bucket.
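Bucket names must be globally unique across all of S3. If you would rather generate a random suffix than use your name, one quick way (the `uuid` approach here is just a convenience, not part of the lab requirements):

```python
# Generate a globally unique, S3-safe bucket name.
# S3 bucket names must be lowercase; uuid4().hex already is.
import uuid

suffix = uuid.uuid4().hex[:8]          # e.g. "3f9a1c2b"
bucket_name = f"brainybee-lab-data-{suffix}"

print(bucket_name)
```

Use the printed value wherever the lab says `brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>`.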
### Step 2: Create the IAM Execution Role
Lambda requires permission to read from S3 and write logs to CloudWatch.
- Save the following as `trust-policy.json`:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

- Create the role and attach the managed policies:

```shell
aws iam create-role --role-name BrainyBeeLambdaRole --assume-role-policy-document file://trust-policy.json

# Attach Managed Policies for S3 access and Logs
aws iam attach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```

`AmazonS3FullAccess` is convenient for a lab; in production, scope the policy down to the specific bucket.

### Step 3: Write and Package the Lambda Code
Create a file named `lambda_function.py`. This script applies programming concepts like optimizing runtime and logging.
```python
import json
import boto3
import csv
import io
import time
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    start_time = time.time()

    # Get bucket and file name from the event.
    # S3 URL-encodes object keys in event payloads, so decode them.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    print(f"Processing file: {key} from bucket: {bucket}")

    # Read CSV
    response = s3.get_object(Bucket=bucket, Key=key)
    content = response['Body'].read().decode('utf-8')

    # Transform CSV to JSON
    reader = csv.DictReader(io.StringIO(content))
    json_data = json.dumps([row for row in reader])

    # Save to S3 (in the processed/ folder)
    output_key = f"processed/{key.replace('.csv', '.json')}"
    s3.put_object(Bucket=bucket, Key=output_key, Body=json_data)

    duration = time.time() - start_time
    print(f"Transformation complete in {duration:.2f} seconds.")
    return {
        'statusCode': 200,
        'body': json.dumps('Success!')
    }
```

Package the function:

```shell
zip function.zip lambda_function.py
```

### Step 4: Deploy the Lambda Function
We will configure the function with 128MB of memory, Lambda's minimum allocation and sufficient for lightweight ETL scripts.
```shell
aws lambda create-function --function-name DataProcessor \
  --zip-file fileb://function.zip --handler lambda_function.lambda_handler --runtime python3.9 \
  --role arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBeeLambdaRole
```

> [!TIP]
> Use `aws sts get-caller-identity --query Account --output text` to find your Account ID.
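Before wiring up the trigger, you can sanity-check the CSV-to-JSON transformation locally. This sketch applies the same stdlib calls as the handler (`csv.DictReader` plus `json.dumps`) to an in-memory sample, with no AWS calls:

```python
import csv
import io
import json

# The transformation the Lambda handler performs, minus the S3 I/O.
sample_csv = "id,name\n1,alpha\n2,beta\n"

reader = csv.DictReader(io.StringIO(sample_csv))
rows = [row for row in reader]
json_data = json.dumps(rows)

print(json_data)  # [{"id": "1", "name": "alpha"}, {"id": "2", "name": "beta"}]
```

Note that all values remain strings: `DictReader` does not infer numeric types, so `"id": "1"` is expected.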
### Step 5: Configure S3 Trigger
- Grant Permission to S3 to invoke the Lambda:

```shell
aws lambda add-permission --function-name DataProcessor --statement-id s3-trigger \
  --action "lambda:InvokeFunction" --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>
```

- Configure the Bucket Notification. Save the following as `notification.json`:
```json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:function:DataProcessor",
      "Events": ["s3:ObjectCreated:*"],
      "Filter": { "Key": { "FilterRules": [{ "Name": "suffix", "Value": ".csv" }] } }
    }
  ]
}
```

The `.csv` suffix filter also prevents the function from re-triggering on its own `.json` output. Apply the configuration:

```shell
aws s3api put-bucket-notification-configuration --bucket brainybee-lab-data-<YOUR_UNIQUE_SUFFIX> \
  --notification-configuration file://notification.json
```

## Checkpoints
| Verification Step | Command / Action | Expected Result |
|---|---|---|
| Function Status | `aws lambda get-function --function-name DataProcessor` | `State` should be `Active`. |
| Data Ingestion | Upload a sample CSV to S3. | File appears in bucket root. |
| Execution | Check the `processed/` folder in S3. | A `.json` version of your file exists. |
| Logging | View CloudWatch Log Groups for `/aws/lambda/DataProcessor`. | Logs show "Transformation complete..." with timing. |
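You can also drive the Execution checkpoint manually, without uploading a file: `aws lambda invoke` accepts a hand-built payload. The handler only reads the bucket name and object key, so a minimal S3 event looks like the dictionary below (field values are placeholders; real S3 events carry many more fields, and object keys arrive URL-encoded):

```python
import json
from urllib.parse import unquote_plus

# Minimal shape of the S3 "ObjectCreated" event the handler parses.
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "brainybee-lab-data-example"},
                "object": {"key": "sample+data.csv"},  # S3 URL-encodes keys
            }
        }
    ]
}

bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = unquote_plus(event["Records"][0]["s3"]["object"]["key"])
print(bucket, key)  # brainybee-lab-data-example sample data.csv
```

Save the `event` dict as `payload.json` and pass it with `aws lambda invoke --function-name DataProcessor --payload file://payload.json out.json` (with AWS CLI v2 you may also need `--cli-binary-format raw-in-base64-out`).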
## Teardown

> [!WARNING]
> Remember to run these commands to avoid ongoing charges for storage and logging.
```shell
# 1. Delete Lambda
aws lambda delete-function --function-name DataProcessor

# 2. Empty and Delete S3 Bucket
aws s3 rm s3://brainybee-lab-data-<YOUR_UNIQUE_SUFFIX> --recursive
aws s3 rb s3://brainybee-lab-data-<YOUR_UNIQUE_SUFFIX>

# 3. Delete IAM Role
aws iam detach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy --role-name BrainyBeeLambdaRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam delete-role --role-name BrainyBeeLambdaRole
```

## Troubleshooting
| Error | Cause | Fix |
|---|---|---|
| `AccessDenied` | IAM role lacks S3 permissions. | Re-run the `attach-role-policy` commands in Step 2. |
| `ModuleNotFoundError` | Code structure in the ZIP is incorrect. | Ensure `lambda_function.py` is at the root of the ZIP. |
| Out-of-memory failure | The CSV file is too large for 128MB. | Use `aws lambda update-function-configuration` to increase memory. |
## Cost Estimate
- AWS Lambda: First 1 million requests per month are free (Free Tier).
- Amazon S3: ~$0.023 per GB (Standard). In this lab, cost will be < $0.01.
- CloudWatch: Free up to 5GB of log data.
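As a rough sanity check on these estimates, Lambda compute is billed in GB-seconds. Assuming roughly $0.0000166667 per GB-second (an assumed x86 rate; verify against current AWS pricing) and deliberately generous lab usage:

```python
# Back-of-envelope Lambda compute cost for this lab (prices are assumptions).
memory_gb = 128 / 1024          # 0.125 GB
avg_duration_s = 2.0            # pessimistic for a tiny CSV
invocations = 100               # far more than the lab needs

gb_seconds = memory_gb * avg_duration_s * invocations   # 25.0
price_per_gb_s = 0.0000166667                           # assumed rate

cost = gb_seconds * price_per_gb_s
print(f"${cost:.6f}")  # well under a cent, and inside the Free Tier anyway
```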
## Stretch Challenge

Task: Configure the function for controlled concurrency. Set a Reserved Concurrency of 5 on the Lambda function. This caps how many instances of DataProcessor can run at once, so a sudden burst of S3 uploads cannot consume your account's shared concurrency pool, which is a key DEA-C01 skill (Skill 1.4.2).
Show CLI Command:

```shell
aws lambda put-function-concurrency --function-name DataProcessor --reserved-concurrent-executions 5
```

## Concept Review
In this lab, we applied several core programming concepts required for the DEA-C01 exam:
### Lambda Execution Lifecycle
This TikZ diagram visualizes the phases of a Lambda execution environment.
```latex
\begin{tikzpicture}[node distance=1.5cm,
    every node/.style={draw, rectangle, rounded corners, fill=blue!10, text width=2.5cm, align=center}]
  \node (init) {\textbf{Init Phase}\\ Extensions, Runtime, Function code};
  \node (invoke) [right=of init] {\textbf{Invoke Phase}\\ Handler execution, Event processing};
  \node (shutdown) [right=of invoke] {\textbf{Shutdown}\\ Runtime cleanup};
  \draw[->, thick] (init) -- (invoke);
  \draw[->, thick] (invoke) -- (shutdown);
  \node[draw=none, fill=none, below=0.1cm of init] (cold) {\textit{Cold Start}};
  \node[draw=none, fill=none, below=0.1cm of invoke] (warm) {\textit{Warm Start}};
\end{tikzpicture}
```
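This lifecycle is why the lab's handler creates the `boto3` client at module scope: work done during the Init phase survives across warm invocations of the same execution environment. A stdlib-only sketch of the pattern (the counters and `expensive_init` are illustrative stand-ins, not Lambda APIs):

```python
# Module scope == Lambda "Init" phase: runs once per execution environment.
init_count = 0
invoke_count = 0

def expensive_init():
    # Stand-in for creating boto3 clients, loading config, etc.
    global init_count
    init_count += 1
    return {"client": "ready"}

CLIENT = expensive_init()  # paid once, during the cold start

def handler(event, context):
    # "Invoke" phase: runs on every request, reusing CLIENT.
    global invoke_count
    invoke_count += 1
    return CLIENT["client"]

# Simulate three warm invocations after one cold start:
for _ in range(3):
    handler({}, None)

print(init_count, invoke_count)  # 1 3
```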
### Key Terms
- Distributed Computing: The use of multiple compute resources (like Lambda instances) to process data in parallel.
- Event-Driven Architecture: A system where actions (Lambda) are triggered by events (S3 upload) rather than polling.
- IaC (Infrastructure as Code): While we used the AWS CLI here, these steps are typically automated with AWS SAM or the CDK in production to ensure repeatability.