Hands-On Lab

AWS Data Ingestion: Building an Automated Batch Pipeline with S3, Lambda, and Glue

This lab provides hands-on experience in performing batch data ingestion, a core task for the AWS Certified Data Engineer – Associate (DEA-C01) exam. You will build a serverless pipeline that automatically reacts to new data uploads, processes them using AWS Lambda, and catalogs the results using AWS Glue.

Prerequisites

  • AWS Account: Access to an AWS account with AdministratorAccess or equivalent permissions.
  • CLI Tools: AWS CLI installed and configured with your credentials (aws configure).
  • Region: We will use us-east-1 for this lab.
  • IAM Permissions: Ability to create IAM Roles, S3 Buckets, Lambda Functions, and Glue Crawlers.

Learning Objectives

  • Create and configure Amazon S3 buckets for data landing and processing.
  • Implement Amazon S3 Event Notifications to trigger downstream workflows.
  • Develop an AWS Lambda function to handle file ingestion logic.
  • Configure an AWS Glue Crawler to automatically discover schemas and update the Data Catalog.
  • Understand the difference between batch and near real-time ingestion patterns.

Architecture Overview

Files uploaded to the landing bucket fire an S3 event notification, which invokes the DataIngestor Lambda function. The function copies each object to the processed bucket and deletes the original. An AWS Glue Crawler then scans the processed bucket, infers the schema, and registers a table in the Glue Data Catalog.

Step-by-Step Instructions

Step 1: Create S3 Buckets

You need two buckets: one for incoming raw data and one for the processed data.

```bash
# Generate a unique suffix to avoid naming conflicts
export ID=$RANDOM

aws s3 mb s3://brainybee-lab-landing-$ID
aws s3 mb s3://brainybee-lab-processed-$ID
```
Console alternative
  1. Navigate to S3 > Create bucket.
  2. Name: brainybee-lab-landing-<unique-id>.
  3. Keep defaults and click Create bucket.
  4. Repeat for brainybee-lab-processed-<unique-id>.

Step 2: Create IAM Role for Lambda

The Lambda function needs permission to read from the landing bucket, write to the processed bucket, and write logs to CloudWatch.

```bash
# Create the trust policy file
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "lambda.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > trust-policy.json

# Create the role
aws iam create-role --role-name BrainyBeeIngestionRole --assume-role-policy-document file://trust-policy.json

# Attach S3 and logging permissions
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
```
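The managed AmazonS3FullAccess policy above keeps the lab short, but it grants far more than the function needs. A least-privilege alternative would scope the role to exactly the two buckets. The sketch below builds such a policy document in Python; the suffix value, statement Sids, and output filename are illustrative, and the resulting file could be attached with `aws iam put-role-policy` instead of the managed policy:

```python
import json

ID = "12345"  # stand-in for the $RANDOM suffix generated in Step 1

# copy_object needs GetObject on the source and PutObject on the target;
# delete_object needs DeleteObject on the source.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LandingReadDelete",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::brainybee-lab-landing-{ID}/*",
        },
        {
            "Sid": "ProcessedWrite",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::brainybee-lab-processed-{ID}/*",
        },
    ],
}

# Usable with: aws iam put-role-policy --role-name BrainyBeeIngestionRole \
#   --policy-name S3Scoped --policy-document file://s3-scoped.json
with open("s3-scoped.json", "w") as f:
    json.dump(policy, f, indent=2)
```

For the lab itself the broad managed policy is fine; the point is that the role, not the bucket policy, is where the ingestion pipeline's blast radius is controlled.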

Step 3: Create and Deploy the Lambda Function

This function will be triggered by S3. It logs the event and moves the object to the processed bucket.

```python
# save as lambda_function.py
import urllib.parse

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    source_bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    target_bucket = source_bucket.replace('landing', 'processed')

    print(f"Ingesting {key} from {source_bucket} to {target_bucket}")

    # Copy object to processed bucket
    s3.copy_object(Bucket=target_bucket, Key=key,
                   CopySource={'Bucket': source_bucket, 'Key': key})

    # Delete from landing bucket (ingestion complete)
    s3.delete_object(Bucket=source_bucket, Key=key)

    return {'statusCode': 200, 'body': 'Data Ingested Successfully'}
```
```bash
# Package and deploy
zip function.zip lambda_function.py

aws lambda create-function --function-name DataIngestor \
  --zip-file fileb://function.zip --handler lambda_function.lambda_handler --runtime python3.9 \
  --role arn:aws:iam::<YOUR_ACCOUNT_ID>:role/BrainyBeeIngestionRole
```
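Before (or instead of) deploying, the move logic can be sanity-checked locally. One way is to factor the core into a function that takes the S3 client as an argument and drive it with a fake. This is a sketch, not part of the lab code: the `ingest` signature and the `FakeS3` class are hypothetical refactorings for testability.

```python
import urllib.parse

def ingest(event, s3):
    """Same logic as lambda_handler, with the client injected for testing."""
    source_bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(event["Records"][0]["s3"]["object"]["key"])
    target_bucket = source_bucket.replace("landing", "processed")
    s3.copy_object(Bucket=target_bucket, Key=key,
                   CopySource={"Bucket": source_bucket, "Key": key})
    s3.delete_object(Bucket=source_bucket, Key=key)
    return target_bucket, key

class FakeS3:
    """Records calls instead of talking to AWS."""
    def __init__(self):
        self.copied, self.deleted = [], []
    def copy_object(self, **kwargs):
        self.copied.append(kwargs)
    def delete_object(self, **kwargs):
        self.deleted.append(kwargs)

# Minimal S3 event shape, as delivered for s3:ObjectCreated:* notifications
event = {"Records": [{"s3": {"bucket": {"name": "brainybee-lab-landing-123"},
                             "object": {"key": "test.csv"}}}]}

fake = FakeS3()
print(ingest(event, fake))  # → ('brainybee-lab-processed-123', 'test.csv')
```

In the deployed function, `boto3.client('s3')` simply takes the place of `FakeS3` when you call `ingest`.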

Step 4: Configure S3 Event Notification

Connect the landing bucket to the Lambda function.

```bash
# Grant S3 permission to invoke Lambda
aws lambda add-permission --function-name DataIngestor --statement-id s3-invoke \
  --action "lambda:InvokeFunction" --principal s3.amazonaws.com \
  --source-arn arn:aws:s3:::brainybee-lab-landing-$ID

# Configure the notification
echo '{"LambdaFunctionConfigurations": [{"LambdaFunctionArn": "arn:aws:lambda:us-east-1:<YOUR_ACCOUNT_ID>:function:DataIngestor", "Events": ["s3:ObjectCreated:*"]}]}' > notification.json

aws s3api put-bucket-notification-configuration --bucket brainybee-lab-landing-$ID --notification-configuration file://notification.json
```
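One detail of these event payloads matters for the handler in Step 3: S3 URL-encodes object keys, with `+` standing in for spaces, which is why the code calls `urllib.parse.unquote_plus`. A small illustration with an abridged payload (the key value is made up):

```python
import urllib.parse

# Abridged s3:ObjectCreated:* payload; real events carry many more fields
event = {"Records": [{"s3": {"bucket": {"name": "brainybee-lab-landing-12345"},
                             "object": {"key": "daily%2Freport+2024.csv"}}}]}

record = event["Records"][0]["s3"]
raw_key = record["object"]["key"]
key = urllib.parse.unquote_plus(raw_key)

print(key)  # → daily/report 2024.csv
```

Skipping the decode works for simple names like `test.csv`, but fails as soon as a key contains spaces or slashes.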

Checkpoints

  1. Verify S3 Trigger: Go to the S3 Console for your landing bucket. Under the Properties tab, scroll to Event notifications. You should see the DataIngestor Lambda listed.
  2. Test Upload: Create a simple CSV file test.csv and upload it to the landing bucket.
    ```bash
    printf "id,name,value\n1,test,100\n" > test.csv
    aws s3 cp test.csv s3://brainybee-lab-landing-$ID/
    ```
  3. Validate Movement: Run aws s3 ls s3://brainybee-lab-processed-$ID/. The file should appear here, and the landing bucket should be empty.

Step 5: Run AWS Glue Crawler

Now catalog the processed data to make it queryable via SQL.

```bash
# The role's trust policy currently only trusts Lambda; let Glue assume it too,
# and give it the Glue service permissions the crawler needs
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": ["lambda.amazonaws.com", "glue.amazonaws.com"]},"Action": "sts:AssumeRole"}]}' > trust-policy.json
aws iam update-assume-role-policy --role-name BrainyBeeIngestionRole --policy-document file://trust-policy.json
aws iam attach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# Create a Glue Database
aws glue create-database --database-input '{"Name": "ingested_data_db"}'

# Create the Crawler
aws glue create-crawler --name DataIngestCrawler --role BrainyBeeIngestionRole \
  --database-name ingested_data_db --targets '{"S3Targets": [{"Path": "s3://brainybee-lab-processed-'$ID'/"}]}'

# Start the Crawler
aws glue start-crawler --name DataIngestCrawler
```
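The crawler runs asynchronously; in a script you would poll its state until it returns to READY. A rough polling sketch is below. `get_crawler` is the real boto3 Glue call, but the `FakeGlue` class is a hypothetical stand-in so the sketch runs offline; in practice you would pass `boto3.client('glue')`.

```python
import time

def wait_for_crawler(glue, name, poll_seconds=15):
    """Poll until the crawler leaves RUNNING/STOPPING, then return its last run status."""
    while True:
        state = glue.get_crawler(Name=name)["Crawler"]["State"]
        if state == "READY":
            return glue.get_crawler(Name=name)["Crawler"]["LastCrawl"]["Status"]
        time.sleep(poll_seconds)

class FakeGlue:
    """Offline stand-in for boto3.client('glue'): RUNNING twice, then READY."""
    def __init__(self):
        self.calls = 0
    def get_crawler(self, Name):
        self.calls += 1
        state = "RUNNING" if self.calls < 3 else "READY"
        return {"Crawler": {"State": state,
                            "LastCrawl": {"Status": "SUCCEEDED"}}}

print(wait_for_crawler(FakeGlue(), "DataIngestCrawler", poll_seconds=0))  # → SUCCEEDED
```

Once the run reports SUCCEEDED, the table appears in `ingested_data_db` and is immediately queryable from Athena.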

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Lambda not triggering | Missing S3 permission to invoke Lambda | Check the `aws lambda add-permission` output. |
| Copy failed | IAM role lacks permissions for the processed bucket | Ensure BrainyBeeIngestionRole has `s3:PutObject` on the target bucket. |
| Glue Crawler stays "Starting" | Crawler is provisioning resources | Wait 1-2 minutes; this is normal for the first run. |

Cost Estimate

  • S3: $0.023 per GB (Free Tier covers first 5GB).
  • Lambda: First 1 million requests per month are free.
  • Glue: Crawlers are charged $0.44 per DPU-Hour (Minimum 10-minute billable duration). Estimated cost for this lab: <$0.15.
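The Glue line item can be checked with quick arithmetic. The sketch below assumes the crawler's footprint is 2 DPUs (an assumption for illustration; AWS bills whatever DPU capacity the crawler actually consumes):

```python
rate_per_dpu_hour = 0.44   # USD, Glue crawler pricing
dpus = 2                   # assumed crawler footprint
billable_minutes = 10      # minimum billable duration

cost = rate_per_dpu_hour * dpus * billable_minutes / 60
print(f"${cost:.3f}")  # → $0.147, consistent with the <$0.15 estimate above
```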

Clean-Up / Teardown

[!WARNING] Remember to run these commands to avoid ongoing charges from S3 storage and to remove resources you no longer need (the IAM role itself incurs no cost, but should not be left around).

```bash
# Delete S3 objects and buckets
aws s3 rb s3://brainybee-lab-landing-$ID --force
aws s3 rb s3://brainybee-lab-processed-$ID --force

# Delete Lambda
aws lambda delete-function --function-name DataIngestor

# Delete Glue components
aws glue delete-crawler --name DataIngestCrawler
aws glue delete-database --name ingested_data_db

# Detach all managed policies, then delete the IAM role
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam detach-role-policy --role-name BrainyBeeIngestionRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name BrainyBeeIngestionRole
```

Stretch Challenge

Convert to Parquet: Modify the Lambda function to use pandas or awswrangler (via a Lambda Layer) to convert the incoming CSV file to Parquet format before saving it to the processed bucket. This is a best practice for optimizing performance in Athena and Redshift Spectrum.

Concept Review

Ingestion Comparison Table

| Feature | Batch (S3/AppFlow) | Streaming (Kinesis/MSK) |
| --- | --- | --- |
| Latency | Minutes to hours | Milliseconds to seconds |
| Data size | Large files/blocks | Continuous small records |
| Use case | Daily reports, history sync | Fraud detection, IoT monitoring |
| Trigger | Scheduled / event-based | Continuous polling |

[!TIP] For the DEA-C01 exam, remember that AWS DataSync is the preferred tool for large-scale on-premises file migrations, while AWS Data Exchange is used for third-party marketplace data.
