AWS Data Ingestion: Building an Automated Batch Pipeline with S3, Lambda, and Glue
This lab provides hands-on experience in performing batch data ingestion, a core task for the AWS Certified Data Engineer – Associate (DEA-C01) exam. You will build a serverless pipeline that automatically reacts to new data uploads, processes them using AWS Lambda, and catalogs the results using AWS Glue.
Prerequisites
- AWS Account: Access to an AWS account with `AdministratorAccess` or equivalent permissions.
- CLI Tools: AWS CLI installed and configured with your credentials (`aws configure`).
- Region: We will use `us-east-1` for this lab.
- IAM Permissions: Ability to create IAM Roles, S3 Buckets, Lambda Functions, and Glue Crawlers.
Learning Objectives
- Create and configure Amazon S3 buckets for data landing and processing.
- Implement Amazon S3 Event Notifications to trigger downstream workflows.
- Develop an AWS Lambda function to handle file ingestion logic.
- Configure an AWS Glue Crawler to automatically discover schemas and update the Data Catalog.
- Understand the difference between batch and near real-time ingestion patterns.
Architecture Overview
New objects land in the landing bucket, an S3 Event Notification invokes the `DataIngestor` Lambda, the function copies each object to the processed bucket and deletes the original, and a Glue Crawler catalogs the processed data:

S3 (landing) → S3 Event Notification → AWS Lambda (`DataIngestor`) → S3 (processed) → AWS Glue Crawler → Glue Data Catalog
Step-by-Step Instructions
Step 1: Create S3 Buckets
You need two buckets: one for incoming raw data and one for the processed data.
```bash
# Generate a unique suffix to avoid naming conflicts
export ID=$RANDOM
aws s3 mb s3://brainybee-lab-landing-$ID
aws s3 mb s3://brainybee-lab-processed-$ID
```

Console alternative:
- Navigate to S3 > Create bucket.
- Name: `brainybee-lab-landing-<unique-id>`. Keep defaults and click Create bucket.
- Repeat for `brainybee-lab-processed-<unique-id>`.
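S3 bucket names are globally unique across all AWS accounts, which is why the suffix matters; `$RANDOM` only spans 0-32767, so collisions with other students are possible. A sketch of generating a more collision-resistant suffix in Python (the prefix follows this lab's naming scheme, everything else is illustrative):

```python
import uuid

def bucket_name(prefix: str) -> str:
    """Append a short random hex suffix; S3 names must be 3-63 lowercase chars."""
    name = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if not (3 <= len(name) <= 63 and name == name.lower()):
        raise ValueError(f"invalid S3 bucket name: {name}")
    return name

print(bucket_name("brainybee-lab-landing"))
```

The validation check mirrors the S3 naming rules that most often trip up scripted bucket creation (length and case); the full rules also forbid characters this prefix never produces.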
Step 2: Create IAM Role for Lambda
The Lambda function needs permission to read from the landing bucket, write to the processed bucket, and write logs to CloudWatch. (`AmazonS3FullAccess` is broader than strictly necessary; in production, scope the policy to the two lab buckets.)

```bash
# Create the trust policy file
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "lambda.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > trust-policy.json

# Create the role
aws iam create-role --role-name BrainyBeeIngestionRole --assume-role-policy-document file://trust-policy.json

# Attach S3 and logging permissions
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```

Step 3: Create and Deploy the Lambda Function
This function will be triggered by S3. It logs the event and moves the object to the processed bucket.
```python
# save as lambda_function.py
import urllib.parse

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    # S3 URL-encodes object keys in event payloads
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    target_bucket = source_bucket.replace('landing', 'processed')
    print(f"Ingesting {key} from {source_bucket} to {target_bucket}")
    # Copy the object to the processed bucket
    s3.copy_object(Bucket=target_bucket, Key=key, CopySource={'Bucket': source_bucket, 'Key': key})
    # Delete from the landing bucket (ingestion complete)
    s3.delete_object(Bucket=source_bucket, Key=key)
    return {'statusCode': 200, 'body': 'Data Ingested Successfully'}
```

```bash
# Package and deploy
zip function.zip lambda_function.py
aws lambda create-function --function-name DataIngestor \
  --zip-file fileb://function.zip --handler lambda_function.lambda_handler --runtime python3.12 \
  --role arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBeeIngestionRole
```

Step 4: Configure S3 Event Notification
Connect the landing bucket to the Lambda function.
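The notification document passed to `put-bucket-notification-configuration` below is easy to mangle inside shell quotes; as a sketch, the same structure can be generated with `json.dumps` (the ARN is a placeholder, as in the CLI version):

```python
import json

def s3_lambda_notification(function_arn: str) -> dict:
    """Notification config that invokes a Lambda on every object-created event."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:*"],
        }],
    }

config = s3_lambda_notification(
    "arn:aws:lambda:us-east-1:<YOUR_ACCOUNT_ID>:function:DataIngestor")
with open("notification.json", "w") as f:
    json.dump(config, f)
```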
```bash
# Grant S3 permission to invoke Lambda
aws lambda add-permission --function-name DataIngestor --statement-id s3-invoke \
  --action "lambda:InvokeFunction" --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::brainybee-lab-landing-$ID

# Configure the notification
echo '{"LambdaFunctionConfigurations": [{"LambdaFunctionArn": "arn:aws:lambda:us-east-1:<YOUR_ACCOUNT_ID>:function:DataIngestor", "Events": ["s3:ObjectCreated:*"]}]}' > notification.json
aws s3api put-bucket-notification-configuration --bucket brainybee-lab-landing-$ID --notification-configuration file://notification.json
```

Checkpoints
- Verify S3 Trigger: In the S3 Console for your landing bucket, open the Properties tab and scroll to Event notifications. You should see the `DataIngestor` Lambda listed.
- Test Upload: Create a simple CSV file `test.csv` and upload it to the landing bucket:

  ```bash
  printf "id,name,value\n1,test,100\n" > test.csv
  aws s3 cp test.csv s3://brainybee-lab-landing-$ID/
  ```

- Validate Movement: Run `aws s3 ls s3://brainybee-lab-processed-$ID/`. The file should appear here, and the landing bucket should be empty.
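Before relying on the live trigger, the handler's event parsing can be exercised locally with a mock S3 event. The sketch below re-implements just the bucket/key extraction from `lambda_function.py` (no AWS calls), which is useful for confirming that URL-encoded keys decode correctly:

```python
import urllib.parse

def extract_ingest_targets(event):
    """Mirror the handler's parsing: (source_bucket, key, target_bucket)."""
    record = event["Records"][0]["s3"]
    source_bucket = record["bucket"]["name"]
    # S3 URL-encodes keys in event payloads; spaces arrive as '+'
    key = urllib.parse.unquote_plus(record["object"]["key"], encoding="utf-8")
    return source_bucket, key, source_bucket.replace("landing", "processed")

mock_event = {"Records": [{"s3": {
    "bucket": {"name": "brainybee-lab-landing-123"},
    "object": {"key": "my+file.csv"},
}}]}
print(extract_ingest_targets(mock_event))
# -> ('brainybee-lab-landing-123', 'my file.csv', 'brainybee-lab-processed-123')
```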
Step 5: Run AWS Glue Crawler
Now catalog the processed data to make it queryable via SQL. The crawler reuses `BrainyBeeIngestionRole`, so first allow Glue to assume the role and grant it the Glue service permissions:

```bash
# Allow Glue (in addition to Lambda) to assume the role
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": ["lambda.amazonaws.com", "glue.amazonaws.com"]},"Action": "sts:AssumeRole"}]}' > trust-policy.json
aws iam update-assume-role-policy --role-name BrainyBeeIngestionRole --policy-document file://trust-policy.json
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Create a Glue Database
aws glue create-database --database-input '{"Name": "ingested_data_db"}'

# Create the Crawler
aws glue create-crawler --name DataIngestCrawler --role BrainyBeeIngestionRole \
  --database-name ingested_data_db --targets '{"S3Targets": [{"Path": "s3://brainybee-lab-processed-'$ID'/"}]}'

# Start the Crawler
aws glue start-crawler --name DataIngestCrawler
```

Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Lambda not triggering | Missing S3 permission to invoke Lambda | Check the `aws lambda add-permission` output. |
| Copy failed | IAM Role lacks permissions for processed bucket | Ensure `BrainyBeeIngestionRole` has `s3:PutObject` for the target bucket. |
| Crawler fails with "unable to assume role" | Role trust policy does not allow Glue | Update the role's trust policy to include the `glue.amazonaws.com` service principal. |
| Glue Crawler stays "Starting" | Crawler is provisioning resources | Wait 1-2 minutes; this is normal for the first run. |
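Rather than guessing at the "Starting" state in the table above, the crawler can be polled until it returns to `READY`. A sketch assuming boto3 is installed and credentials/region are configured (the crawler name matches this lab; the helper itself is illustrative):

```python
import time

def wait_for_crawler(name, poll_seconds=15, timeout_seconds=600, glue=None):
    """Poll until the crawler returns to READY, then report the last run's status."""
    if glue is None:
        import boto3  # imported lazily so the helper is testable without AWS
        glue = boto3.client("glue")
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl is absent before the first run completes
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name} did not finish within {timeout_seconds}s")
```

Passing `glue=` explicitly lets you inject a stub client in tests; in real use, `wait_for_crawler("DataIngestCrawler")` should return `"SUCCEEDED"` once the run completes.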
Cost Estimate
- S3: $0.023 per GB (Free Tier covers first 5GB).
- Lambda: First 1 million requests per month are free.
- Glue: Crawlers are charged $0.44 per DPU-Hour (Minimum 10-minute billable duration). Estimated cost for this lab: <$0.15.
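The Glue figure above works out as follows, assuming the crawler consumes roughly 2 DPUs and only the 10-minute minimum is billed (the DPU count for crawlers is managed by AWS, so this is an estimate, not a quote):

```python
DPU_HOUR_RATE = 0.44        # USD per DPU-hour
DPUS = 2                    # assumed crawler capacity
MIN_BILLED_HOURS = 10 / 60  # 10-minute minimum billable duration

crawl_cost = DPU_HOUR_RATE * DPUS * MIN_BILLED_HOURS
print(f"~${crawl_cost:.2f} per crawler run")  # ~$0.15 per crawler run
```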
Clean-Up / Teardown
[!WARNING] Remember to run these commands to avoid ongoing charges from S3 storage or IAM roles.

```bash
# Delete S3 objects and buckets
aws s3 rb s3://brainybee-lab-landing-$ID --force
aws s3 rb s3://brainybee-lab-processed-$ID --force

# Delete Lambda
aws lambda delete-function --function-name DataIngestor

# Delete Glue components
aws glue delete-crawler --name DataIngestCrawler
aws glue delete-database --name ingested_data_db

# Delete IAM role (detach all attached policies first)
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name BrainyBeeIngestionRole
```

Stretch Challenge
Convert to Parquet: Modify the Lambda function to use pandas or awswrangler (via a Lambda Layer) to convert the incoming CSV file to Parquet format before saving it to the processed bucket. This is a best practice for optimizing performance in Athena and Redshift Spectrum.
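A starting point for the challenge: the key-mapping part is pure and testable locally, while the conversion itself leans on `awswrangler` supplied by a Lambda Layer (the function names here are illustrative, not part of the lab):

```python
import os

def parquet_key(csv_key: str) -> str:
    """Map an incoming CSV object key to its Parquet counterpart."""
    root, _ = os.path.splitext(csv_key)
    return root + ".parquet"

def convert_and_upload(source_bucket: str, key: str, target_bucket: str) -> None:
    # Imported lazily: awswrangler comes from a Lambda Layer, not the base runtime
    import awswrangler as wr
    df = wr.s3.read_csv(f"s3://{source_bucket}/{key}")
    wr.s3.to_parquet(df=df, path=f"s3://{target_bucket}/{parquet_key(key)}")
```

The handler from Step 3 would call `convert_and_upload(...)` in place of `copy_object` and still delete the landing object afterwards.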
Concept Review
Ingestion Comparison Table
| Feature | Batch (S3/AppFlow) | Streaming (Kinesis/MSK) |
|---|---|---|
| Latency | Minutes to Hours | Milliseconds to Seconds |
| Data Size | Large files/blocks | Continuous small records |
| Use Case | Daily reports, History sync | Fraud detection, IoT monitoring |
| Trigger | Scheduled / Event-based | Continuous polling |
[!TIP] For the DEA-C01 exam, remember that AWS DataSync is the preferred tool for large-scale on-premises file migrations, while AWS Data Exchange is used for third-party marketplace data.