# Lab: Building a Scalable Data Ingestion Pipeline on AWS

Ingest and store data
This hands-on lab guides you through setting up a data ingestion architecture covering both batch and real-time streaming methods. You will learn to store data in Amazon S3, the foundation for most AWS Machine Learning workflows.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges to your AWS account.
## Prerequisites

- **AWS Account:** Access to an AWS account with `AdministratorAccess` or equivalent permissions.
- **AWS CLI:** Installed and configured on your local machine (`aws configure`).
- **IAM Knowledge:** Basic understanding of IAM roles and S3 bucket policies.
- **Region:** We will use `us-east-1` for this lab.
## Learning Objectives
- Provision an Amazon S3 bucket for raw object storage.
- Implement Batch Ingestion by uploading structured data (CSV) via CLI.
- Configure Real-Time Ingestion using Amazon Kinesis Data Firehose.
- Analyze the tradeoffs between object, block, and file storage.
## Architecture Overview
This lab implements a hybrid ingestion pattern where batch data is uploaded directly to S3, while streaming data is buffered and delivered via Kinesis Firehose.
## Step-by-Step Instructions

### Step 1: Create the S3 Data Lake Foundation
First, we need a destination for our data. S3 provides high durability and scalability for raw ML datasets.
**CLI Method:**

```bash
# Create a unique bucket name
BUCKET_NAME=brainybee-lab-$(date +%s)

# Create the bucket (no LocationConstraint is needed in us-east-1)
aws s3api create-bucket --bucket $BUCKET_NAME --region us-east-1
```

**Console alternative:**

1. Navigate to **S3** in the AWS Console.
2. Click **Create bucket**.
3. Enter a unique name like `brainybee-lab-data-123`.
4. Keep default settings and click **Create bucket**.
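The `$(date +%s)` suffix appends the current Unix timestamp, which makes collisions in S3's global namespace unlikely. You can sanity-check the generated name locally before calling AWS (the length and character rules below are S3's bucket naming constraints):

```shell
# Generate a candidate bucket name with a Unix-timestamp suffix
BUCKET_NAME=brainybee-lab-$(date +%s)
echo "$BUCKET_NAME"

# S3 bucket names must be 3-63 characters: lowercase letters, digits,
# dots, and hyphens only
echo "${#BUCKET_NAME}"
```

If the name ever collides (for example, when two people run the lab in the same second), simply re-run the snippet to get a fresh suffix.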
### Step 2: Batch Ingestion of CSV Data
Simulate batch ingestion by moving a local dataset into your storage layer.
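If you would like more than one row to work with, a small loop can generate a larger synthetic file first. The file name `sample.csv` and its values are illustrative inventions; the lab's own commands use `data.csv`:

```shell
# Generate a synthetic dataset: 1 header line + 100 data rows
echo "id,timestamp,value" > sample.csv
i=1
while [ "$i" -le 100 ]; do
  echo "$i,2023-10-01,$((i % 50)).5" >> sample.csv
  i=$((i + 1))
done

# Count the lines we just wrote
wc -l < sample.csv
```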
**CLI Method:**

```bash
# Create a sample CSV file
echo "id,timestamp,value" > data.csv
echo "1,2023-10-01,10.5" >> data.csv

# Upload to S3
aws s3 cp data.csv s3://$BUCKET_NAME/batch/data.csv
```

### Step 3: Set Up Kinesis Data Firehose for Streaming
Kinesis Firehose allows us to ingest data in near real-time and deliver it to S3 automatically.
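Conceptually, Firehose buffers incoming records and flushes them to S3 when either a size or a time threshold is reached, whichever comes first. A minimal local sketch of that decision, using the 1 MB / 60 s figures this lab works with (these mirror the lab's buffer settings, not universal service defaults):

```shell
# Flush when EITHER threshold is met: >= 1 MiB buffered or >= 60 s elapsed
should_flush() {
  bytes=$1
  elapsed=$2
  [ "$bytes" -ge 1048576 ] || [ "$elapsed" -ge 60 ]
}

should_flush 2048 75 && echo "flush"           # tiny payload, but time is up
should_flush 2048 10 || echo "keep buffering"  # neither threshold met yet
```

This is why a single small record does not appear in S3 immediately: Firehose waits out the time threshold before writing.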
**Console Method:**

1. Navigate to **Kinesis > Delivery streams > Create delivery stream**.
2. Source: `Direct PUT`.
3. Destination: `Amazon S3`.
4. Delivery stream name: `lab-stream`.
5. S3 bucket: Browse and select the bucket you created in Step 1.
6. S3 bucket prefix: `streaming/`.
7. Click **Create delivery stream**.
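The same stream can also be created from the CLI. A hedged sketch: the JSON destination configuration is assembled locally below (the account ID and role name in `ROLE_ARN` are placeholders; you would need a real IAM role that allows Firehose to write to your bucket), and the commented `create-delivery-stream` call shows where it would be used:

```shell
# Assemble the S3 destination config locally. ROLE_ARN is a placeholder.
BUCKET_NAME=${BUCKET_NAME:-brainybee-lab-demo}
ROLE_ARN="arn:aws:iam::123456789012:role/firehose-lab-role"
CONFIG="{\"RoleARN\": \"$ROLE_ARN\", \"BucketARN\": \"arn:aws:s3:::$BUCKET_NAME\", \"Prefix\": \"streaming/\"}"
echo "$CONFIG"

# aws firehose create-delivery-stream --delivery-stream-name lab-stream \
#   --delivery-stream-type DirectPut \
#   --extended-s3-destination-configuration "$CONFIG"
```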
> [!NOTE]
> It may take 1-2 minutes for the stream to reach the `ACTIVE` state.
### Step 4: Simulate Streaming Data Ingestion
Now we will push individual records into the stream to simulate real-time event generation.
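Before sending, note that AWS CLI v2 expects blob parameters such as the record's `Data` field to be base64-encoded by default. A local round-trip shows the encoding involved (no AWS call is made here):

```shell
# The record we want to stream
PAYLOAD='{"id": 2, "timestamp": "2023-10-01T12:00:00Z", "value": 22.1}'

# Encode as the CLI would transmit it
ENCODED=$(printf '%s' "$PAYLOAD" | base64 | tr -d '\n')
echo "$ENCODED"

# Round-trip to confirm the payload survives encoding intact
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
[ "$DECODED" = "$PAYLOAD" ] && echo "round-trip OK"
```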
CLI Method:
aws firehose put-record \
--delivery-stream-name lab-stream \
--record '{"Data":"{\"id\": 2, \"timestamp\": \"2023-10-01T12:00:00Z\", \"value\": 22.1}"}'Checkpoints
### Checkpoint 1: Verify Batch Upload
Run the following to ensure your batch file is present:
```bash
aws s3 ls s3://$BUCKET_NAME/batch/
```

**Expected Result:** You should see `data.csv` listed.
### Checkpoint 2: Verify Streaming Delivery
Firehose buffers data for at least 60 seconds (or 1MB). Wait 2 minutes, then run:
```bash
aws s3 ls s3://$BUCKET_NAME/streaming/ --recursive
```

**Expected Result:** You should see a folder structure based on year/month/day containing the delivered streaming record.
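The year/month/day layout comes from Firehose's default S3 key naming, which partitions objects by UTC delivery time under your prefix. You can compute the prefix to expect for the current UTC hour locally (the commented `aws s3 ls` shows how you might use it):

```shell
# Firehose's default key layout is <prefix>YYYY/MM/DD/HH/ in UTC
EXPECTED_PREFIX="streaming/$(date -u +%Y/%m/%d/%H)/"
echo "$EXPECTED_PREFIX"

# aws s3 ls "s3://$BUCKET_NAME/$EXPECTED_PREFIX"   # where the record should land
```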
## Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| `AccessDenied` on S3 | Missing IAM permissions | Ensure your CLI user has `s3:CreateBucket` and `s3:PutObject`. |
| Firehose not delivering | Buffer threshold not met | Firehose waits for a time or size threshold before writing to S3. Wait 2-3 minutes. |
| `BucketAlreadyExists` | Global namespace conflict | S3 bucket names must be unique globally. Add a random suffix to your name. |
## Concept Review
AWS provides different storage types depending on the ML access pattern:
```latex
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, fill=blue!10, text centered, minimum width=3cm, minimum height=1cm}]
  \node (S3)  {Object (Amazon S3)};
  \node (EBS) [right of=S3, xshift=2.5cm]  {Block (Amazon EBS)};
  \node (EFS) [right of=EBS, xshift=2.5cm] {File (Amazon EFS)};
  \node (S3Desc)  [below of=S3]  {\small Scalable, Cheap, Latency High};
  \node (EBSDesc) [below of=EBS] {\small EC2 Instance, Latency Low};
  \node (EFSDesc) [below of=EFS] {\small Shared, NFS, Linux Support};
\end{tikzpicture}
```
| Storage Type | Ideal For | Scalability |
|---|---|---|
| S3 (Object) | Raw ML Data, Model Artifacts | Virtually Infinite |
| EBS (Block) | Training Instance OS, Databases | Fixed Volume Size |
| EFS (File) | Shared Training Code, Jupyter Notebooks | Elastic Expansion |
## Stretch Challenge
**Convert Data on Ingestion:** Use AWS Glue DataBrew to create a recipe that converts your CSV batch data into Apache Parquet format. Parquet is a columnar format that significantly speeds up ML training and reduces costs for S3 queries (for example, with Athena).
## Cost Estimate

- Amazon S3: $0.023 per GB-month (Free Tier includes 5 GB/month).
- Kinesis Firehose: $0.029 per GB ingested. For this lab (~1KB), the cost is <$0.01.
- Total Lab Cost: Within Free Tier or approximately $0.01 USD.
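The Firehose figure can be sanity-checked with quick arithmetic (the ~1 KB payload size is this lab's own estimate):

```shell
# Back-of-envelope: $0.029 per GB of Firehose ingest, applied to ~1 KB
COST=$(awk 'BEGIN { printf "%.10f", 0.029 * 1024 / (1024 * 1024 * 1024) }')
echo "$COST"
```

The result is a few hundredths of a microdollar, so the S3 storage and request charges dominate, and even those stay within the Free Tier.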
## Clean-Up / Teardown
To prevent ongoing charges, delete all resources created in this lab.
1. Delete the Firehose stream:

```bash
aws firehose delete-delivery-stream --delivery-stream-name lab-stream
```

2. Empty and delete the S3 bucket:

```bash
# Empty all objects first
aws s3 rm s3://$BUCKET_NAME --recursive

# Delete the bucket
aws s3 rb s3://$BUCKET_NAME
```