AWS Data Ingestion: Building an Automated Batch Pipeline with S3, Lambda, and Glue
This lab provides hands-on experience in performing batch data ingestion, a core task for the AWS Certified Data Engineer – Associate (DEA-C01) exam. You will build a serverless pipeline that automatically reacts to new data uploads, processes them using AWS Lambda, and catalogs the results using AWS Glue.
Prerequisites
- AWS Account: Access to an AWS account with `AdministratorAccess` or equivalent permissions.
- CLI Tools: AWS CLI installed and configured with your credentials (`aws configure`).
- Region: We will use `us-east-1` for this lab.
- IAM Permissions: Ability to create IAM Roles, S3 Buckets, Lambda Functions, and Glue Crawlers.
Learning Objectives
- Create and configure Amazon S3 buckets for data landing and processing.
- Implement Amazon S3 Event Notifications to trigger downstream workflows.
- Develop an AWS Lambda function to handle file ingestion logic.
- Configure an AWS Glue Crawler to automatically discover schemas and update the Data Catalog.
- Understand the difference between batch and near real-time ingestion patterns.
Architecture Overview
New objects land in the landing bucket, an S3 Event Notification invokes the `DataIngestor` Lambda, the function copies each object to the processed bucket and deletes the original, and a Glue Crawler catalogs the processed data:

S3 (landing) → S3 Event Notification → AWS Lambda (`DataIngestor`) → S3 (processed) → AWS Glue Crawler → Glue Data Catalog
Step-by-Step Instructions
Step 1: Create S3 Buckets
You need two buckets: one for incoming raw data and one for the processed data.
```bash
# Generate a unique suffix to avoid naming conflicts
export ID=$RANDOM
aws s3 mb s3://brainybee-lab-landing-$ID
aws s3 mb s3://brainybee-lab-processed-$ID
```

Console alternative:
- Navigate to S3 > Create bucket.
- Name: `brainybee-lab-landing-<unique-id>`. Keep defaults and click Create bucket.
- Repeat for `brainybee-lab-processed-<unique-id>`.
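S3 bucket names are globally unique across all AWS accounts, which is why the suffix matters; `$RANDOM` only spans 0-32767, so collisions with other students are possible. A sketch of generating a more collision-resistant suffix in Python (the prefix follows this lab's naming scheme, everything else is illustrative):

```python
import uuid

def bucket_name(prefix: str) -> str:
    """Append a short random hex suffix; S3 names must be 3-63 lowercase chars."""
    name = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if not (3 <= len(name) <= 63 and name == name.lower()):
        raise ValueError(f"invalid S3 bucket name: {name}")
    return name

print(bucket_name("brainybee-lab-landing"))
```

The validation check mirrors the S3 naming rules that most often trip up scripted bucket creation (length and case); the full rules also forbid characters this prefix never produces.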
Step 2: Create IAM Role for Lambda
The Lambda function needs permission to read from the landing bucket, write to the processed bucket, and write logs to CloudWatch. (`AmazonS3FullAccess` is broader than strictly necessary; in production, scope the policy to the two lab buckets.)

```bash
# Create the trust policy file
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "lambda.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > trust-policy.json

# Create the role
aws iam create-role --role-name BrainyBeeIngestionRole --assume-role-policy-document file://trust-policy.json

# Attach S3 and logging permissions
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```

Step 3: Create and Deploy the Lambda Function
This function will be triggered by S3. It logs the event and moves the object to the processed bucket.
```python
# save as lambda_function.py
import urllib.parse

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    # S3 URL-encodes object keys in event payloads
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    target_bucket = source_bucket.replace('landing', 'processed')
    print(f"Ingesting {key} from {source_bucket} to {target_bucket}")
    # Copy the object to the processed bucket
    s3.copy_object(Bucket=target_bucket, Key=key, CopySource={'Bucket': source_bucket, 'Key': key})
    # Delete from the landing bucket (ingestion complete)
    s3.delete_object(Bucket=source_bucket, Key=key)
    return {'statusCode': 200, 'body': 'Data Ingested Successfully'}
```

```bash
# Package and deploy
zip function.zip lambda_function.py
aws lambda create-function --function-name DataIngestor \
  --zip-file fileb://function.zip --handler lambda_function.lambda_handler --runtime python3.12 \
  --role arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBeeIngestionRole
```

Step 4: Configure S3 Event Notification
Connect the landing bucket to the Lambda function.
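The notification document passed to `put-bucket-notification-configuration` below is easy to mangle inside shell quotes; as a sketch, the same structure can be generated with `json.dumps` (the ARN is a placeholder, as in the CLI version):

```python
import json

def s3_lambda_notification(function_arn: str) -> dict:
    """Notification config that invokes a Lambda on every object-created event."""
    return {
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:*"],
        }],
    }

config = s3_lambda_notification(
    "arn:aws:lambda:us-east-1:<YOUR_ACCOUNT_ID>:function:DataIngestor")
with open("notification.json", "w") as f:
    json.dump(config, f)
```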
```bash
# Grant S3 permission to invoke Lambda
aws lambda add-permission --function-name DataIngestor --statement-id s3-invoke \
  --action "lambda:InvokeFunction" --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::brainybee-lab-landing-$ID

# Configure the notification
echo '{"LambdaFunctionConfigurations": [{"LambdaFunctionArn": "arn:aws:lambda:us-east-1:<YOUR_ACCOUNT_ID>:function:DataIngestor", "Events": ["s3:ObjectCreated:*"]}]}' > notification.json
aws s3api put-bucket-notification-configuration --bucket brainybee-lab-landing-$ID --notification-configuration file://notification.json
```

Checkpoints
- Verify S3 Trigger: In the S3 Console for your landing bucket, open the Properties tab and scroll to Event notifications. You should see the `DataIngestor` Lambda listed.
- Test Upload: Create a simple CSV file `test.csv` and upload it to the landing bucket:

  ```bash
  printf "id,name,value\n1,test,100\n" > test.csv
  aws s3 cp test.csv s3://brainybee-lab-landing-$ID/
  ```

- Validate Movement: Run `aws s3 ls s3://brainybee-lab-processed-$ID/`. The file should appear here, and the landing bucket should be empty.
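Before relying on the live trigger, the handler's event parsing can be exercised locally with a mock S3 event. The sketch below re-implements just the bucket/key extraction from `lambda_function.py` (no AWS calls), which is useful for confirming that URL-encoded keys decode correctly:

```python
import urllib.parse

def extract_ingest_targets(event):
    """Mirror the handler's parsing: (source_bucket, key, target_bucket)."""
    record = event["Records"][0]["s3"]
    source_bucket = record["bucket"]["name"]
    # S3 URL-encodes keys in event payloads; spaces arrive as '+'
    key = urllib.parse.unquote_plus(record["object"]["key"], encoding="utf-8")
    return source_bucket, key, source_bucket.replace("landing", "processed")

mock_event = {"Records": [{"s3": {
    "bucket": {"name": "brainybee-lab-landing-123"},
    "object": {"key": "my+file.csv"},
}}]}
print(extract_ingest_targets(mock_event))
# -> ('brainybee-lab-landing-123', 'my file.csv', 'brainybee-lab-processed-123')
```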
Step 5: Run AWS Glue Crawler
Now catalog the processed data to make it queryable via SQL. The crawler reuses `BrainyBeeIngestionRole`, so first allow Glue to assume the role and grant it the Glue service permissions:

```bash
# Allow Glue (in addition to Lambda) to assume the role
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": ["lambda.amazonaws.com", "glue.amazonaws.com"]},"Action": "sts:AssumeRole"}]}' > trust-policy.json
aws iam update-assume-role-policy --role-name BrainyBeeIngestionRole --policy-document file://trust-policy.json
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Create a Glue Database
aws glue create-database --database-input '{"Name": "ingested_data_db"}'

# Create the Crawler
aws glue create-crawler --name DataIngestCrawler --role BrainyBeeIngestionRole \
  --database-name ingested_data_db --targets '{"S3Targets": [{"Path": "s3://brainybee-lab-processed-'$ID'/"}]}'

# Start the Crawler
aws glue start-crawler --name DataIngestCrawler
```

Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Lambda not triggering | Missing S3 permission to invoke Lambda | Check the `aws lambda add-permission` output. |
| Copy failed | IAM Role lacks permissions for processed bucket | Ensure `BrainyBeeIngestionRole` has `s3:PutObject` for the target bucket. |
| Crawler fails with "unable to assume role" | Role trust policy does not allow Glue | Update the role's trust policy to include the `glue.amazonaws.com` service principal. |
| Glue Crawler stays "Starting" | Crawler is provisioning resources | Wait 1-2 minutes; this is normal for the first run. |
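Rather than guessing at the "Starting" state in the table above, the crawler can be polled until it returns to `READY`. A sketch assuming boto3 is installed and credentials/region are configured (the crawler name matches this lab; the helper itself is illustrative):

```python
import time

def wait_for_crawler(name, poll_seconds=15, timeout_seconds=600, glue=None):
    """Poll until the crawler returns to READY, then report the last run's status."""
    if glue is None:
        import boto3  # imported lazily so the helper is testable without AWS
        glue = boto3.client("glue")
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":
            # LastCrawl is absent before the first run completes
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name} did not finish within {timeout_seconds}s")
```

Passing `glue=` explicitly lets you inject a stub client in tests; in real use, `wait_for_crawler("DataIngestCrawler")` should return `"SUCCEEDED"` once the run completes.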
Cost Estimate
- S3: $0.023 per GB (Free Tier covers first 5GB).
- Lambda: First 1 million requests per month are free.
- Glue: Crawlers are charged $0.44 per DPU-Hour (Minimum 10-minute billable duration). Estimated cost for this lab: <$0.15.
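The Glue figure above works out as follows, assuming the crawler consumes roughly 2 DPUs and only the 10-minute minimum is billed (the DPU count for crawlers is managed by AWS, so this is an estimate, not a quote):

```python
DPU_HOUR_RATE = 0.44        # USD per DPU-hour
DPUS = 2                    # assumed crawler capacity
MIN_BILLED_HOURS = 10 / 60  # 10-minute minimum billable duration

crawl_cost = DPU_HOUR_RATE * DPUS * MIN_BILLED_HOURS
print(f"~${crawl_cost:.2f} per crawler run")  # ~$0.15 per crawler run
```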
Clean-Up / Teardown
[!WARNING] Remember to run these commands to avoid ongoing charges from S3 storage or IAM roles.

```bash
# Delete S3 objects and buckets
aws s3 rb s3://brainybee-lab-landing-$ID --force
aws s3 rb s3://brainybee-lab-processed-$ID --force

# Delete Lambda
aws lambda delete-function --function-name DataIngestor

# Delete Glue components
aws glue delete-crawler --name DataIngestCrawler
aws glue delete-database --name ingested_data_db

# Delete IAM role (detach all attached policies first)
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name BrainyBeeIngestionRole
```

Stretch Challenge
Convert to Parquet: Modify the Lambda function to use pandas or awswrangler (via a Lambda Layer) to convert the incoming CSV file to Parquet format before saving it to the processed bucket. This is a best practice for optimizing performance in Athena and Redshift Spectrum.
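A starting point for the challenge: the key-mapping part is pure and testable locally, while the conversion itself leans on `awswrangler` supplied by a Lambda Layer (the function names here are illustrative, not part of the lab):

```python
import os

def parquet_key(csv_key: str) -> str:
    """Map an incoming CSV object key to its Parquet counterpart."""
    root, _ = os.path.splitext(csv_key)
    return root + ".parquet"

def convert_and_upload(source_bucket: str, key: str, target_bucket: str) -> None:
    # Imported lazily: awswrangler comes from a Lambda Layer, not the base runtime
    import awswrangler as wr
    df = wr.s3.read_csv(f"s3://{source_bucket}/{key}")
    wr.s3.to_parquet(df=df, path=f"s3://{target_bucket}/{parquet_key(key)}")
```

The handler from Step 3 would call `convert_and_upload(...)` in place of `copy_object` and still delete the landing object afterwards.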
Concept Review
Ingestion Comparison Table
| Feature | Batch (S3/AppFlow) | Streaming (Kinesis/MSK) |
|---|---|---|
| Latency | Minutes to Hours | Milliseconds to Seconds |
| Data Size | Large files/blocks | Continuous small records |
| Use Case | Daily reports, History sync | Fraud detection, IoT monitoring |
| Trigger | Scheduled / Event-based | Continuous polling |
[!TIP] For the DEA-C01 exam, remember that AWS DataSync is the preferred tool for large-scale on-premises file migrations, while AWS Data Exchange is used for third-party marketplace data.