# Lab: Building a Scalable Data Ingestion Pipeline on AWS

Ingest and store data
This hands-on lab guides you through setting up a data ingestion architecture covering both batch and real-time streaming methods. You will learn to store data in Amazon S3, the foundation for most AWS Machine Learning workflows.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges to your AWS account.
## Prerequisites

- **AWS Account:** Access to an AWS account with `AdministratorAccess` or equivalent permissions.
- **AWS CLI:** Installed and configured on your local machine (`aws configure`).
- **IAM Knowledge:** Basic understanding of IAM roles and S3 bucket policies.
- **Region:** We will use `us-east-1` for this lab.
## Learning Objectives
- Provision an Amazon S3 bucket for raw object storage.
- Implement Batch Ingestion by uploading structured data (CSV) via CLI.
- Configure Real-Time Ingestion using Amazon Kinesis Data Firehose.
- Analyze the tradeoffs between object, block, and file storage.
## Architecture Overview
This lab implements a hybrid ingestion pattern where batch data is uploaded directly to S3, while streaming data is buffered and delivered via Kinesis Firehose.
## Step-by-Step Instructions

### Step 1: Create the S3 Data Lake Foundation
First, we need a destination for our data. S3 provides high durability and scalability for raw ML datasets.
**CLI Method:**

```bash
# Create a unique bucket name
BUCKET_NAME=brainybee-lab-$(date +%s)

# Create the bucket (no LocationConstraint is needed in us-east-1)
aws s3api create-bucket --bucket $BUCKET_NAME --region us-east-1
```

**Console alternative:**

1. Navigate to **S3** in the AWS Console.
2. Click **Create bucket**.
3. Enter a unique name like `brainybee-lab-data-123`.
4. Keep default settings and click **Create bucket**.
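The `$(date +%s)` suffix appends the current Unix timestamp, which makes collisions in S3's global namespace unlikely. You can sanity-check the generated name locally before calling AWS (the length and character rules below are S3's bucket naming constraints):

```shell
# Generate a candidate bucket name with a Unix-timestamp suffix
BUCKET_NAME=brainybee-lab-$(date +%s)
echo "$BUCKET_NAME"

# S3 bucket names must be 3-63 characters: lowercase letters, digits,
# dots, and hyphens only
echo "${#BUCKET_NAME}"
```

If the name ever collides (for example, when two people run the lab in the same second), simply re-run the snippet to get a fresh suffix.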
### Step 2: Batch Ingestion of CSV Data
Simulate batch ingestion by moving a local dataset into your storage layer.
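If you would like more than one row to work with, a small loop can generate a larger synthetic file first. The file name `sample.csv` and its values are illustrative inventions; the lab's own commands use `data.csv`:

```shell
# Generate a synthetic dataset: 1 header line + 100 data rows
echo "id,timestamp,value" > sample.csv
i=1
while [ "$i" -le 100 ]; do
  echo "$i,2023-10-01,$((i % 50)).5" >> sample.csv
  i=$((i + 1))
done

# Count the lines we just wrote
wc -l < sample.csv
```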
**CLI Method:**

```bash
# Create a sample CSV file
echo "id,timestamp,value" > data.csv
echo "1,2023-10-01,10.5" >> data.csv

# Upload to S3
aws s3 cp data.csv s3://$BUCKET_NAME/batch/data.csv
```

### Step 3: Set Up Kinesis Data Firehose for Streaming
Kinesis Firehose allows us to ingest data in near real-time and deliver it to S3 automatically.
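Conceptually, Firehose buffers incoming records and flushes them to S3 when either a size or a time threshold is reached, whichever comes first. A minimal local sketch of that decision, using the 1 MB / 60 s figures this lab works with (these mirror the lab's buffer settings, not universal service defaults):

```shell
# Flush when EITHER threshold is met: >= 1 MiB buffered or >= 60 s elapsed
should_flush() {
  bytes=$1
  elapsed=$2
  [ "$bytes" -ge 1048576 ] || [ "$elapsed" -ge 60 ]
}

should_flush 2048 75 && echo "flush"           # tiny payload, but time is up
should_flush 2048 10 || echo "keep buffering"  # neither threshold met yet
```

This is why a single small record does not appear in S3 immediately: Firehose waits out the time threshold before writing.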
**Console Method:**

1. Navigate to **Kinesis > Delivery streams > Create delivery stream**.
2. Source: `Direct PUT`.
3. Destination: `Amazon S3`.
4. Delivery stream name: `lab-stream`.
5. S3 bucket: Browse and select the bucket you created in Step 1.
6. S3 bucket prefix: `streaming/`.
7. Click **Create delivery stream**.
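The same stream can also be created from the CLI. A hedged sketch: the JSON destination configuration is assembled locally below (the account ID and role name in `ROLE_ARN` are placeholders; you would need a real IAM role that allows Firehose to write to your bucket), and the commented `create-delivery-stream` call shows where it would be used:

```shell
# Assemble the S3 destination config locally. ROLE_ARN is a placeholder.
BUCKET_NAME=${BUCKET_NAME:-brainybee-lab-demo}
ROLE_ARN="arn:aws:iam::123456789012:role/firehose-lab-role"
CONFIG="{\"RoleARN\": \"$ROLE_ARN\", \"BucketARN\": \"arn:aws:s3:::$BUCKET_NAME\", \"Prefix\": \"streaming/\"}"
echo "$CONFIG"

# aws firehose create-delivery-stream --delivery-stream-name lab-stream \
#   --delivery-stream-type DirectPut \
#   --extended-s3-destination-configuration "$CONFIG"
```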
> [!NOTE]
> It may take 1-2 minutes for the stream to reach the `ACTIVE` state.
### Step 4: Simulate Streaming Data Ingestion
Now we will push individual records into the stream to simulate real-time event generation.
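Before sending, note that AWS CLI v2 expects blob parameters such as the record's `Data` field to be base64-encoded by default. A local round-trip shows the encoding involved (no AWS call is made here):

```shell
# The record we want to stream
PAYLOAD='{"id": 2, "timestamp": "2023-10-01T12:00:00Z", "value": 22.1}'

# Encode as the CLI would transmit it
ENCODED=$(printf '%s' "$PAYLOAD" | base64 | tr -d '\n')
echo "$ENCODED"

# Round-trip to confirm the payload survives encoding intact
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
[ "$DECODED" = "$PAYLOAD" ] && echo "round-trip OK"
```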
CLI Method:
aws firehose put-record \
--delivery-stream-name lab-stream \
--record '{"Data":"{\"id\": 2, \"timestamp\": \"2023-10-01T12:00:00Z\", \"value\": 22.1}"}'Checkpoints
### Checkpoint 1: Verify Batch Upload
Run the following to ensure your batch file is present:
```bash
aws s3 ls s3://$BUCKET_NAME/batch/
```

**Expected Result:** You should see `data.csv` listed.
### Checkpoint 2: Verify Streaming Delivery
Firehose buffers data for at least 60 seconds (or 1MB). Wait 2 minutes, then run:
```bash
aws s3 ls s3://$BUCKET_NAME/streaming/ --recursive
```

**Expected Result:** You should see a folder structure based on year/month/day containing the delivered streaming record.
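The year/month/day layout comes from Firehose's default S3 key naming, which partitions objects by UTC delivery time under your prefix. You can compute the prefix to expect for the current UTC hour locally (the commented `aws s3 ls` shows how you might use it):

```shell
# Firehose's default key layout is <prefix>YYYY/MM/DD/HH/ in UTC
EXPECTED_PREFIX="streaming/$(date -u +%Y/%m/%d/%H)/"
echo "$EXPECTED_PREFIX"

# aws s3 ls "s3://$BUCKET_NAME/$EXPECTED_PREFIX"   # where the record should land
```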
## Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| `AccessDenied` on S3 | Missing IAM permissions | Ensure your CLI user has `s3:CreateBucket` and `s3:PutObject`. |
| Firehose not delivering | Buffer threshold not met | Firehose waits for a time or size threshold before writing to S3. Wait 2-3 minutes. |
| `BucketAlreadyExists` | Global namespace conflict | S3 bucket names must be unique globally. Add a random suffix to your name. |
## Concept Review
AWS provides different storage types depending on the ML access pattern:
```latex
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, fill=blue!10, text centered, minimum width=3cm, minimum height=1cm}]
  \node (S3)  {Object (Amazon S3)};
  \node (EBS) [right of=S3, xshift=2.5cm]  {Block (Amazon EBS)};
  \node (EFS) [right of=EBS, xshift=2.5cm] {File (Amazon EFS)};
  \node (S3Desc)  [below of=S3]  {\small Scalable, Cheap, Latency High};
  \node (EBSDesc) [below of=EBS] {\small EC2 Instance, Latency Low};
  \node (EFSDesc) [below of=EFS] {\small Shared, NFS, Linux Support};
\end{tikzpicture}
```
| Storage Type | Ideal For | Scalability |
|---|---|---|
| S3 (Object) | Raw ML Data, Model Artifacts | Virtually Infinite |
| EBS (Block) | Training Instance OS, Databases | Fixed Volume Size |
| EFS (File) | Shared Training Code, Jupyter Notebooks | Elastic Expansion |
## Stretch Challenge
**Convert Data on Ingestion:** Use AWS Glue DataBrew to create a recipe that converts your CSV batch data into Apache Parquet format. Parquet is a columnar format that significantly speeds up ML training and reduces costs for S3 queries (for example, with Athena).
## Cost Estimate

- Amazon S3: $0.023 per GB-month (Free Tier includes 5 GB/month).
- Kinesis Firehose: $0.029 per GB ingested. For this lab (~1KB), the cost is <$0.01.
- Total Lab Cost: Within Free Tier or approximately $0.01 USD.
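The Firehose figure can be sanity-checked with quick arithmetic (the ~1 KB payload size is this lab's own estimate):

```shell
# Back-of-envelope: $0.029 per GB of Firehose ingest, applied to ~1 KB
COST=$(awk 'BEGIN { printf "%.10f", 0.029 * 1024 / (1024 * 1024 * 1024) }')
echo "$COST"
```

The result is a few hundredths of a microdollar, so the S3 storage and request charges dominate, and even those stay within the Free Tier.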
## Clean-Up / Teardown
To prevent ongoing charges, delete all resources created in this lab.
1. Delete the Firehose stream:

```bash
aws firehose delete-delivery-stream --delivery-stream-name lab-stream
```

2. Empty and delete the S3 bucket:

```bash
# Empty all objects first
aws s3 rm s3://$BUCKET_NAME --recursive

# Delete the bucket
aws s3 rb s3://$BUCKET_NAME
```