Hands-On Lab

Lab: Building a Scalable Data Ingestion Pipeline on AWS

This hands-on lab guides you through the process of setting up a data ingestion architecture, focusing on both batch and real-time streaming methods. You will learn to store data in Amazon S3, the foundation for most AWS Machine Learning workflows.

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges to your AWS account.

Prerequisites

  • AWS Account: Access to an AWS account with AdministratorAccess or equivalent permissions.
  • AWS CLI: Installed and configured on your local machine (aws configure).
  • IAM Knowledge: Basic understanding of IAM roles and S3 bucket policies.
  • Region: We will use us-east-1 for this lab.
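
As an optional pre-flight check, you can confirm the CLI can actually reach your account before starting. This is a sketch: it assumes `aws` is on your PATH and falls back to a message instead of failing.

```shell
# Optional pre-flight check (sketch): confirm the CLI can reach your
# account. Falls back to a message instead of aborting the script.
if command -v aws >/dev/null 2>&1; then
  # Prints your 12-digit account ID if credentials are valid
  ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text 2>/dev/null) \
    || ACCOUNT_ID="unconfigured"
else
  ACCOUNT_ID="aws-cli-missing"
fi
echo "Account: $ACCOUNT_ID"
```

If this prints `unconfigured`, run `aws configure` before continuing.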

Learning Objectives

  • Provision an Amazon S3 bucket for raw object storage.
  • Implement Batch Ingestion by uploading structured data (CSV) via CLI.
  • Configure Real-Time Ingestion using Amazon Kinesis Data Firehose.
  • Analyze the tradeoffs between object, block, and file storage.

Architecture Overview

This lab implements a hybrid ingestion pattern where batch data is uploaded directly to S3, while streaming data is buffered and delivered via Kinesis Firehose.


Step-by-Step Instructions

Step 1: Create the S3 Data Lake Foundation

First, we need a destination for our data. S3 provides high durability and scalability for raw ML datasets.

CLI Method:

```bash
# Create a unique bucket name
BUCKET_NAME=brainybee-lab-$(date +%s)

# Create the bucket in us-east-1
aws s3api create-bucket --bucket "$BUCKET_NAME" --region us-east-1
```
Console Alternative:

  1. Navigate to S3 in the AWS Console.
  2. Click Create bucket.
  3. Enter a unique name like brainybee-lab-data-123.
  4. Keep default settings and click Create bucket.
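
Because S3 bucket names live in a single global namespace, a name collision fails the `create-bucket` call. You can pre-validate a candidate name locally before hitting the API; this sketch covers only the basic naming rules (3-63 characters; lowercase letters, digits, dots, hyphens; must start and end with a letter or digit), not the full rule set.

```shell
# Pre-validate a candidate bucket name against S3's basic naming rules.
# Sketch only - this does not implement every S3 naming restriction.
is_valid_bucket_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

BUCKET_NAME="brainybee-lab-$(date +%s)"
if is_valid_bucket_name "$BUCKET_NAME"; then
  echo "ok: $BUCKET_NAME"
else
  echo "invalid: $BUCKET_NAME"
fi
```

The timestamp suffix from `date +%s` also sidesteps most global-namespace collisions.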

Step 2: Batch Ingestion of CSV Data

Simulate batch ingestion by moving a local dataset into your storage layer.

CLI Method:

```bash
# Create a sample CSV file
echo "id,timestamp,value" > data.csv
echo "1,2023-10-01,10.5" >> data.csv

# Upload to S3
aws s3 cp data.csv "s3://$BUCKET_NAME/batch/data.csv"
```

Step 3: Setup Kinesis Data Firehose for Streaming

Kinesis Firehose allows us to ingest data in near real-time and deliver it to S3 automatically.

Console Method:

  1. Navigate to Kinesis > Delivery streams > Create delivery stream.
  2. Source: Direct PUT.
  3. Destination: Amazon S3.
  4. Delivery stream name: lab-stream.
  5. S3 bucket: Browse and select the bucket you created in Step 1.
  6. S3 bucket prefix: streaming/.
  7. Click Create delivery stream.

[!NOTE] It may take 1-2 minutes for the stream to reach the ACTIVE state.
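
You can also watch the stream state from the CLI instead of the console. A small sketch, assuming the `lab-stream` name from this step; the helper prints `UNKNOWN` instead of aborting when the call fails (for example, before the stream exists).

```shell
# Check the delivery stream status from the CLI (sketch; assumes the
# "lab-stream" created in Step 3). Prints UNKNOWN when the call fails
# rather than aborting the script.
stream_status() {
  aws firehose describe-delivery-stream \
    --delivery-stream-name "$1" \
    --query 'DeliveryStreamDescription.DeliveryStreamStatus' \
    --output text 2>/dev/null || echo "UNKNOWN"
}

STATUS="$(stream_status lab-stream)"
echo "lab-stream status: $STATUS"
```

Re-run this until it prints `ACTIVE` before sending records in Step 4.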

Step 4: Simulate Streaming Data Ingestion

Now we will push individual records into the stream to simulate real-time event generation.

CLI Method:

```bash
aws firehose put-record \
  --delivery-stream-name lab-stream \
  --cli-binary-format raw-in-base64-out \
  --record '{"Data":"{\"id\": 2, \"timestamp\": \"2023-10-01T12:00:00Z\", \"value\": 22.1}"}'
```

[!NOTE] The `--cli-binary-format raw-in-base64-out` flag is required on AWS CLI v2, which otherwise expects the `Data` blob to be base64-encoded. Omit the flag on AWS CLI v1, which sends the blob as-is.
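
A note on encoding: AWS CLI v2 treats the `Data` field as a binary blob and, by default, expects it base64-encoded. If you prefer to encode the payload yourself instead of passing `--cli-binary-format raw-in-base64-out`, the round trip looks like this (the record values are made up):

```shell
# Round-trip a record payload through base64 (sketch; values are made up)
PAYLOAD='{"id": 3, "timestamp": "2023-10-01T12:05:00Z", "value": 19.8}'
ENCODED=$(printf '%s' "$PAYLOAD" | base64)
DECODED=$(printf '%s' "$ENCODED" | base64 -d)
echo "$DECODED"
```

You would then pass `$ENCODED` as the `Data` value in `--record` without any extra flag.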

Checkpoints

Checkpoint 1: Verify Batch Upload

Run the following to ensure your batch file is present:

```bash
aws s3 ls "s3://$BUCKET_NAME/batch/"
```

Expected Result: You should see data.csv listed.

Checkpoint 2: Verify Streaming Delivery

Firehose buffers incoming data before delivering it to S3. With default settings it flushes every 5 MB or 300 seconds, whichever comes first (the configurable minimums are 1 MB and 60 seconds). Wait up to 5 minutes, then run:

```bash
aws s3 ls "s3://$BUCKET_NAME/streaming/" --recursive
```

Expected Result: You should see a folder structure based on year/month/day containing the delivered streaming record.

Troubleshooting

| Issue | Possible Cause | Fix |
| --- | --- | --- |
| `AccessDenied` on S3 | Missing IAM permissions | Ensure your CLI user has `s3:CreateBucket` and `s3:PutObject`. |
| Firehose not delivering | Buffer threshold not met | Firehose waits for a time or size threshold before writing to S3. Wait up to 5 minutes. |
| `BucketAlreadyExists` | Global namespace conflict | S3 bucket names must be unique globally. Add a random suffix to your name. |

Concept Review

AWS provides different storage types depending on the ML access pattern:

```latex
\begin{tikzpicture}[node distance=2cm,
    every node/.style={rectangle, draw, fill=blue!10, text centered,
                       minimum width=3cm, minimum height=1cm}]
  \node (S3)  {Object (Amazon S3)};
  \node (EBS) [right of=S3, xshift=2.5cm]  {Block (Amazon EBS)};
  \node (EFS) [right of=EBS, xshift=2.5cm] {File (Amazon EFS)};

  \node (S3Desc)  [below of=S3]  {\small Scalable, Cheap, High Latency};
  \node (EBSDesc) [below of=EBS] {\small EC2-Attached, Low Latency};
  \node (EFSDesc) [below of=EFS] {\small Shared, NFS, Linux Support};
\end{tikzpicture}
```

| Storage Type | Ideal For | Scalability |
| --- | --- | --- |
| S3 (Object) | Raw ML Data, Model Artifacts | Virtually Infinite |
| EBS (Block) | Training Instance OS, Databases | Fixed Volume Size |
| EFS (File) | Shared Training Code, Jupyter Notebooks | Elastic Expansion |

Stretch Challenge

Convert Data on Ingestion: Use AWS Glue DataBrew to create a recipe that converts your CSV batch data into Apache Parquet format. Parquet is a columnar format that significantly speeds up ML training and reduces costs for S3 queries (Athena).

Cost Estimate

  • Amazon S3: $0.023 per GB (Free Tier includes 5GB/month).
  • Kinesis Firehose: $0.029 per GB ingested. For this lab (~1KB), the cost is <$0.01.
  • Total Lab Cost: Within Free Tier or approximately $0.01 USD.
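
To see why the streaming cost is negligible: Firehose bills each ingested record rounded up to the nearest 5 KB, so even our ~1 KB record is billed as 5 KB. A quick back-of-envelope check:

```shell
# Back-of-envelope ingestion cost: one record billed as 5 KB at $0.029/GB
COST=$(awk 'BEGIN { printf "%.8f", 0.029 * 5 / (1024 * 1024) }')
echo "Cost of one record: \$$COST"
```

That is a few hundredths of a microcent per record; the batch upload side is similarly dwarfed by the 5 GB Free Tier.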

Clean-Up / Teardown

To prevent ongoing charges, delete all resources created in this lab.

  1. Delete the Firehose stream:

```bash
aws firehose delete-delivery-stream --delivery-stream-name lab-stream
```

  2. Empty and delete the S3 bucket:

```bash
# Empty all objects first
aws s3 rm "s3://$BUCKET_NAME" --recursive

# Delete the bucket
aws s3 rb "s3://$BUCKET_NAME"
```
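
After teardown, you can confirm the bucket is really gone: `head-bucket` fails once the bucket no longer exists. A sketch, assuming `$BUCKET_NAME` may or may not still be set in your shell (the fallback name is a placeholder).

```shell
# Verify teardown (sketch): head-bucket fails once the bucket is deleted.
# The default value is a hypothetical placeholder for an unset variable.
if aws s3api head-bucket --bucket "${BUCKET_NAME:-brainybee-lab-placeholder}" 2>/dev/null; then
  RESULT="bucket still exists"
else
  RESULT="bucket gone or inaccessible"
fi
echo "$RESULT"
```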
