Hands-On Lab

Lab: Building a Scalable Data Ingestion Pipeline on AWS

This hands-on lab guides you through the process of setting up a data ingestion architecture, focusing on both batch and real-time streaming methods. You will learn to store data in Amazon S3, the foundation for most AWS Machine Learning workflows.

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges to your AWS account.

Prerequisites

  • AWS Account: Access to an AWS account with AdministratorAccess or equivalent permissions.
  • AWS CLI: Installed and configured on your local machine (aws configure).
  • IAM Knowledge: Basic understanding of IAM roles and S3 bucket policies.
  • Region: We will use us-east-1 for this lab.

Learning Objectives

  • Provision an Amazon S3 bucket for raw object storage.
  • Implement Batch Ingestion by uploading structured data (CSV) via CLI.
  • Configure Real-Time Ingestion using Amazon Kinesis Data Firehose.
  • Analyze the tradeoffs between object, block, and file storage.

Architecture Overview

This lab implements a hybrid ingestion pattern where batch data is uploaded directly to S3, while streaming data is buffered and delivered via Kinesis Firehose.

[Architecture diagram: batch files uploaded directly to the S3 batch/ prefix; streaming records sent to Kinesis Data Firehose, buffered, and delivered to the S3 streaming/ prefix.]

Step-by-Step Instructions

Step 1: Create the S3 Data Lake Foundation

First, we need a destination for our data. S3 provides high durability and scalability for raw ML datasets.

CLI Method:

```bash
# Create a unique bucket name
BUCKET_NAME=brainybee-lab-$(date +%s)

# Create the bucket in us-east-1
aws s3api create-bucket --bucket $BUCKET_NAME --region us-east-1
```

Console alternative:

  1. Navigate to S3 in the AWS Console.
  2. Click Create bucket.
  3. Enter a unique name like brainybee-lab-data-123.
  4. Keep default settings and click Create bucket.
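
Whichever method you used, you can confirm the bucket exists before moving on (if you created it in the console, set BUCKET_NAME to that name first):

```bash
# Exits with code 0 (and no output) if the bucket exists and you can access it
aws s3api head-bucket --bucket $BUCKET_NAME
```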

Step 2: Batch Ingestion of CSV Data

Simulate batch ingestion by moving a local dataset into your storage layer.

CLI Method:

```bash
# Create a sample CSV file
echo "id,timestamp,value" > data.csv
echo "1,2023-10-01,10.5" >> data.csv

# Upload to S3
aws s3 cp data.csv s3://$BUCKET_NAME/batch/data.csv
```
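
If you want a slightly larger batch file to work with, one way to generate a few more synthetic rows is sketched below (the exact values are arbitrary):

```bash
# Append a handful of synthetic rows to the sample CSV
for i in $(seq 2 10); do
  echo "$i,2023-10-01,$((RANDOM % 100)).0" >> data.csv
done

# Re-upload, overwriting the previous object
aws s3 cp data.csv s3://$BUCKET_NAME/batch/data.csv
```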

Step 3: Setup Kinesis Data Firehose for Streaming

Kinesis Firehose allows us to ingest data in near real-time and deliver it to S3 automatically.

Console Method:

  1. Navigate to Kinesis > Delivery streams > Create delivery stream.
  2. Source: Direct PUT.
  3. Destination: Amazon S3.
  4. Delivery stream name: lab-stream.
  5. S3 bucket: Browse and select the bucket you created in Step 1.
  6. S3 bucket prefix: streaming/.
  7. Click Create delivery stream.

[!NOTE] It may take 1-2 minutes for the stream to reach the ACTIVE state.
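
If you prefer to script this step, a rough CLI equivalent is sketched below. It assumes you have already created an IAM role that Firehose can assume to write into your bucket; the ARN shown is a placeholder, and creating that role is outside the scope of this lab.

```bash
# Placeholder ARN -- replace with a role that trusts firehose.amazonaws.com
# and grants write access to your bucket
FIREHOSE_ROLE_ARN=arn:aws:iam::123456789012:role/lab-firehose-role

aws firehose create-delivery-stream \
  --delivery-stream-name lab-stream \
  --delivery-stream-type DirectPut \
  --s3-destination-configuration "RoleARN=$FIREHOSE_ROLE_ARN,BucketARN=arn:aws:s3:::$BUCKET_NAME,Prefix=streaming/"
```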

Step 4: Simulate Streaming Data Ingestion

Now we will push individual records into the stream to simulate real-time event generation.

CLI Method:

```bash
# On AWS CLI v2, --cli-binary-format raw-in-base64-out ensures the Data blob
# is sent as raw text (drop this flag if you are on CLI v1)
aws firehose put-record \
  --delivery-stream-name lab-stream \
  --record '{"Data":"{\"id\": 2, \"timestamp\": \"2023-10-01T12:00:00Z\", \"value\": 22.1}"}' \
  --cli-binary-format raw-in-base64-out
```
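
To simulate a burst of events rather than a single record, you can either loop over put-record or send several records in one call with put-record-batch. A minimal sketch of the latter (again, the --cli-binary-format flag applies to AWS CLI v2):

```bash
# Send three synthetic records in a single API call
aws firehose put-record-batch \
  --delivery-stream-name lab-stream \
  --cli-binary-format raw-in-base64-out \
  --records '[
    {"Data":"{\"id\": 3, \"timestamp\": \"2023-10-01T12:01:00Z\", \"value\": 18.4}"},
    {"Data":"{\"id\": 4, \"timestamp\": \"2023-10-01T12:02:00Z\", \"value\": 25.0}"},
    {"Data":"{\"id\": 5, \"timestamp\": \"2023-10-01T12:03:00Z\", \"value\": 19.7}"}
  ]'
```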

Checkpoints

Checkpoint 1: Verify Batch Upload

Run the following to ensure your batch file is present:

```bash
aws s3 ls s3://$BUCKET_NAME/batch/
```

Expected Result: You should see data.csv listed.
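
You can also stream the object straight to your terminal to confirm its contents:

```bash
# Using "-" as the destination copies the object to stdout instead of a local file
aws s3 cp s3://$BUCKET_NAME/batch/data.csv -
```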

Checkpoint 2: Verify Streaming Delivery

Firehose buffers incoming data until a size or time threshold is reached before writing to S3 (for an S3 destination the defaults are 5 MiB or 300 seconds, and the minimum interval is 60 seconds). Wait a few minutes, then run:

```bash
aws s3 ls s3://$BUCKET_NAME/streaming/ --recursive
```

Expected Result: You should see a folder structure based on year/month/day containing the delivered streaming record.
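
Firehose writes the buffered records as plain objects under that date-based prefix. To inspect what was actually delivered, you can pull everything under the prefix down locally (a quick sketch):

```bash
# Download all delivered objects, then print their contents
aws s3 cp s3://$BUCKET_NAME/streaming/ ./streaming-output/ --recursive
find ./streaming-output -type f -exec cat {} \;
```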

Troubleshooting

| Issue | Possible Cause | Fix |
| --- | --- | --- |
| AccessDenied on S3 | Missing IAM permissions | Ensure your CLI user has s3:CreateBucket and s3:PutObject. |
| Firehose not delivering | Buffer threshold not met | Firehose waits until its size or time threshold is reached before writing to S3. Wait 2-3 minutes. |
| BucketAlreadyExists | Global namespace conflict | S3 bucket names must be unique globally. Add a random suffix to your name. |
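
If you hit the AccessDenied error, it often helps to first confirm which identity your CLI is actually using:

```bash
# Shows the account ID and IAM principal behind your current CLI credentials
aws sts get-caller-identity
```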

Concept Review

AWS provides different storage types depending on the ML access pattern:

| Storage Type | Ideal For | Scalability |
| --- | --- | --- |
| S3 (Object) | Raw ML Data, Model Artifacts | Virtually Infinite |
| EBS (Block) | Training Instance OS, Databases | Fixed Volume Size |
| EFS (File) | Shared Training Code, Jupyter Notebooks | Elastic Expansion |
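
If you want to see which of these you already have in your account, each service has a straightforward list call:

```bash
aws s3 ls                                                            # Object storage: S3 buckets
aws ec2 describe-volumes --query 'Volumes[].VolumeId'                # Block storage: EBS volumes
aws efs describe-file-systems --query 'FileSystems[].FileSystemId'   # File storage: EFS file systems
```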

Stretch Challenge

Convert Data on Ingestion: Use AWS Glue DataBrew to create a recipe that converts your CSV batch data into Apache Parquet format (a quick local sanity check is sketched below). Parquet is a columnar format that can significantly speed up ML training and reduces scan costs when querying the data in S3 with Amazon Athena.
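
Before building the DataBrew recipe, you can sanity-check the conversion locally. This is only an illustrative sketch, not the DataBrew solution; it assumes python3 with pandas and pyarrow installed.

```bash
# Convert the sample CSV to Parquet locally and upload it alongside the batch data
python3 -c "import pandas as pd; pd.read_csv('data.csv').to_parquet('data.parquet')"
aws s3 cp data.parquet s3://$BUCKET_NAME/batch/data.parquet
```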

Cost Estimate

  • Amazon S3: $0.023 per GB-month of Standard storage (Free Tier includes 5GB/month).
  • Kinesis Firehose: $0.029 per GB ingested. For this lab (~1KB), the cost is <$0.01.
  • Total Lab Cost: Within Free Tier or approximately $0.01 USD.

Clean-Up / Teardown

To prevent ongoing charges, delete all resources created in this lab.

  1. Delete the Firehose Stream:

```bash
aws firehose delete-delivery-stream --delivery-stream-name lab-stream
```

  2. Empty and Delete the S3 Bucket:

```bash
# Empty all objects first
aws s3 rm s3://$BUCKET_NAME --recursive

# Delete the bucket
aws s3 rb s3://$BUCKET_NAME
```
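
To double-check that nothing is left behind, list what remains (assuming the same shell session, with $BUCKET_NAME still set):

```bash
# Should no longer list lab-stream
aws firehose list-delivery-streams

# Should print "Bucket deleted" once the bucket is gone
aws s3 ls | grep "$BUCKET_NAME" || echo "Bucket deleted"
```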
