Hands-On Lab: Build a High-Performing Data Ingestion Pipeline with Kinesis Data Firehose
Determine high-performing data ingestion and transformation solutions
Welcome to this hands-on lab! Based on the AWS Certified Solutions Architect - Associate (SAA-C03) exam objectives, a core skill is determining high-performing data ingestion and transformation solutions. In this lab, you will build a scalable serverless ingestion pipeline using Amazon Kinesis Data Firehose to collect streaming data and deliver it securely to an Amazon S3 data lake.
Prerequisites
Before starting this lab, ensure you have the following ready:
- AWS Account: Administrator access to an AWS account.
- Command Line Tools: The AWS CLI (`aws`) installed and configured with your credentials.
- IAM Permissions: Ability to create S3 buckets, Kinesis delivery streams, and IAM roles.
- Knowledge: Basic understanding of JSON and streaming data concepts.
Learning Objectives
By completing this lab, you will be able to:
- Provision a centralized Amazon S3 bucket to act as the foundation of a data lake.
- Configure Amazon Kinesis Data Firehose to securely ingest streaming data.
- Establish IAM trust policies allowing AWS services to interact securely.
- Manually ingest mock telemetry data and verify its automated delivery.
Architecture Overview
This diagram illustrates the ingestion pipeline you will build. Data produced by the CLI is sent to Kinesis, which automatically buffers the stream and delivers it to S3.
Step-by-Step Instructions
Step 1: Create the Target S3 Data Lake Bucket
First, we need a highly durable storage destination for our ingested data.
```bash
aws s3 mb s3://brainybee-data-lake-<YOUR_ACCOUNT_ID> --region <YOUR_REGION>
```

📸 Screenshot: Terminal output showing `make_bucket: brainybee-data-lake-<YOUR_ACCOUNT_ID>`
▶Console alternative
- Navigate to the Amazon S3 console.
- Click Create bucket.
- Enter the bucket name `brainybee-data-lake-<YOUR_ACCOUNT_ID>`.
- Select your preferred region.
- Leave all other settings as default and click Create bucket.
Step 2: Create an IAM Role for Kinesis Data Firehose
Kinesis needs permission to write data into your newly created S3 bucket. We will create an IAM role and attach a policy.
First, create the trust policy document:
```bash
echo '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}' > trust-policy.json
```

Create the role using the trust policy:
```bash
aws iam create-role \
  --role-name brainybee-firehose-s3-role \
  --assume-role-policy-document file://trust-policy.json
```

Attach the necessary S3 access policy (Note: In production, use least-privilege scoping instead of Full Access):
```bash
aws iam attach-role-policy \
  --role-name brainybee-firehose-s3-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```

▶Console alternative
- Navigate to the IAM console > Roles.
- Click Create role.
- Select AWS service as the trusted entity and choose Kinesis (then Kinesis Firehose).
- In permissions, attach `AmazonS3FullAccess`.
- Name the role `brainybee-firehose-s3-role` and click Create role.
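If shell quoting of the `echo` heredoc above proves error-prone on your platform, the same `trust-policy.json` can be generated with Python's `json` module (a minimal sketch producing an equivalent file):

```python
import json

# Build the Firehose trust policy as a dict; json.dump guarantees valid
# JSON regardless of shell quoting rules.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "firehose.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

with open("trust-policy.json", "w") as f:
    json.dump(trust_policy, f, indent=2)

# Round-trip to confirm the file parses cleanly and names the right principal.
with open("trust-policy.json") as f:
    loaded = json.load(f)
print(loaded["Statement"][0]["Principal"]["Service"])
```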
Step 3: Create the Kinesis Data Firehose Delivery Stream
Now we provision the ingestion stream, instructing it to send data to our S3 bucket.
```bash
aws firehose create-delivery-stream \
  --delivery-stream-name brainybee-ingestion-stream \
  --s3-destination-configuration RoleARN=arn:aws:iam::<YOUR_ACCOUNT_ID>:role/brainybee-firehose-s3-role,BucketARN=arn:aws:s3:::brainybee-data-lake-<YOUR_ACCOUNT_ID>
```

[!TIP] By default, Kinesis Data Firehose buffers data for 5 minutes or 5 MB (whichever comes first) before delivering it to S3. This batching improves performance and reduces S3 API costs.
▶Console alternative
- Navigate to the Amazon Kinesis console.
- Select Data Firehose and click Create delivery stream.
- Source: Direct PUT / Destination: Amazon S3.
- Stream name: `brainybee-ingestion-stream`.
- Select your S3 bucket `brainybee-data-lake-<YOUR_ACCOUNT_ID>`.
- Under Advanced settings, ensure the newly created IAM role is selected.
- Click Create delivery stream.
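The size-or-time buffering behavior described in the tip above can be sketched as a small model. This is illustrative only (the constants mirror the documented defaults; real buffering is managed internally by the Firehose service):

```python
# Illustrative model of Firehose buffering: a delivery is triggered when
# either the size hint (5 MB) or the interval hint (5 minutes) is reached,
# whichever comes first.
SIZE_LIMIT_BYTES = 5 * 1024 * 1024
INTERVAL_LIMIT_SECONDS = 300

def should_flush(buffered_bytes: int, seconds_since_last_flush: float) -> bool:
    """Return True once either buffering threshold has been reached."""
    return (buffered_bytes >= SIZE_LIMIT_BYTES
            or seconds_since_last_flush >= INTERVAL_LIMIT_SECONDS)

# A small record arriving 10 s after the last flush stays buffered...
print(should_flush(2048, 10))                 # False
# ...but a full buffer, or an elapsed interval, triggers delivery to S3.
print(should_flush(6 * 1024 * 1024, 10))      # True
print(should_flush(2048, 300))                # True
```

This is why a single test record does not appear in S3 immediately: with low traffic, delivery waits for the interval threshold.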
Step 4: Ingest Streaming Data via CLI
Let's simulate a clickstream or IoT device sending telemetry data into our ingestion pipeline.
```bash
aws firehose put-record \
  --delivery-stream-name brainybee-ingestion-stream \
  --record '{"Data":"eyJ1c2VySWQiOiAiMTIzNDUiLCAiYWN0aW9uIjogImxvZ2luIiwgInRpbWVzdGFtcCI6ICIyMDIzLTEwLTAxVDEyOjAwOjAwWiJ9Cg=="}'
```

[!NOTE] The `Data` payload must be Base64 encoded. The string above decodes to `{"userId": "12345", "action": "login", "timestamp": "2023-10-01T12:00:00Z"}`.
Execute the command 3-5 times to simulate multiple incoming records.
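To produce your own Base64 payloads for `put-record` (for example, to vary the telemetry fields), standard-library Python is enough. A minimal sketch:

```python
import base64
import json

# Encode a telemetry record the way `put-record` expects: the Data field
# carries the Base64 encoding of the raw payload bytes.
record = {"userId": "12345", "action": "login",
          "timestamp": "2023-10-01T12:00:00Z"}
payload = json.dumps(record) + "\n"   # newline-delimit records for the S3 object
encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
print(encoded)

# Decoding recovers the original JSON, which is what lands in the bucket.
decoded = json.loads(base64.b64decode(encoded))
```

Paste the printed string into the `--record '{"Data":"..."}'` argument. (The AWS SDKs, such as `boto3`, handle this encoding for you when you pass raw bytes.)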
📸 Screenshot: Terminal showing the `RecordId` confirmation from AWS.
▶Console alternative
- Kinesis Data Firehose currently requires data to be sent via the CLI, SDKs, or agents (such as the Kinesis Agent); there is no direct "Test Event" button in the Firehose console for arbitrary payloads. Please use the CLI method above or an SDK such as Python's `boto3`.
Step 5: Verify Data Delivery in S3
Wait approximately 5 minutes for the buffer interval to complete, then check your S3 bucket for the delivered files.
```bash
aws s3 ls s3://brainybee-data-lake-<YOUR_ACCOUNT_ID>/ --recursive
```

▶Console alternative
- Navigate to the Amazon S3 console.
- Open `brainybee-data-lake-<YOUR_ACCOUNT_ID>`.
- Navigate through the automatically generated `YYYY/MM/DD/HH` folder structure.
- Download the object and open it in a text editor to verify the ingested JSON data.
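The `YYYY/MM/DD/HH` folder structure is simply the UTC delivery time of each batch. A short sketch reproducing the default prefix for a given timestamp (illustrative; custom prefixes can be configured on the stream):

```python
from datetime import datetime, timezone

def firehose_prefix(ts: datetime) -> str:
    """Default Firehose S3 key prefix: UTC delivery time as YYYY/MM/DD/HH/."""
    return ts.astimezone(timezone.utc).strftime("%Y/%m/%d/%H/")

# A batch delivered at 12:00 UTC on 2023-10-01 lands under this prefix.
example = datetime(2023, 10, 1, 12, 0, tzinfo=timezone.utc)
print(firehose_prefix(example))   # 2023/10/01/12/
```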
Checkpoints
Verify your progress by running these commands:
- Checkpoint 1 (After Step 3): Run `aws firehose describe-delivery-stream --delivery-stream-name brainybee-ingestion-stream`. Ensure `DeliveryStreamStatus` shows as `ACTIVE`.
- Checkpoint 2 (After Step 5): Run `aws s3 ls s3://brainybee-data-lake-<YOUR_ACCOUNT_ID>/ --recursive`. You should see at least one file path resembling `2023/10/01/12/brainybee-ingestion-stream-1-2023-10-01-12...`.
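If you want to script Checkpoint 1, the stream status can be pulled out of the `describe-delivery-stream` JSON response. A minimal sketch (the sample response below is a hypothetical excerpt; the real response contains many more fields):

```python
import json

# Hypothetical excerpt of `aws firehose describe-delivery-stream` output.
response_json = """
{
  "DeliveryStreamDescription": {
    "DeliveryStreamName": "brainybee-ingestion-stream",
    "DeliveryStreamStatus": "ACTIVE"
  }
}
"""

# The status lives under DeliveryStreamDescription.DeliveryStreamStatus.
status = json.loads(response_json)["DeliveryStreamDescription"]["DeliveryStreamStatus"]
print(status)   # ACTIVE
```

In practice you would pipe the CLI output into this, or use `--query 'DeliveryStreamDescription.DeliveryStreamStatus'` directly on the CLI.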
Troubleshooting
| Issue | Probable Cause | Solution |
|---|---|---|
| `ResourceNotFoundException` during put-record | Stream is still in the CREATING state. | Wait 1-2 minutes and check stream status before retrying. |
| `AccessDenied` when creating stream | IAM role missing S3 permissions or trust relationship is incorrect. | Verify `trust-policy.json` contains `firehose.amazonaws.com` as the principal. |
| No data appearing in S3 | Buffer time hasn't elapsed. | Wait a full 5 minutes for the Firehose buffer to flush to the S3 bucket. |
Clean-Up / Teardown
[!WARNING] Remember to run the teardown commands to avoid ongoing charges. While S3 storage is cheap, leaving idle resources is bad practice.
Execute the following commands to destroy all provisioned resources:
1. Delete the Firehose Delivery Stream:

```bash
aws firehose delete-delivery-stream --delivery-stream-name brainybee-ingestion-stream
```

2. Empty and Delete the S3 Bucket:

```bash
aws s3 rm s3://brainybee-data-lake-<YOUR_ACCOUNT_ID> --recursive
aws s3 rb s3://brainybee-data-lake-<YOUR_ACCOUNT_ID>
```

3. Detach Policy and Delete IAM Role:

```bash
aws iam detach-role-policy --role-name brainybee-firehose-s3-role --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam delete-role --role-name brainybee-firehose-s3-role
```
Lab complete! You have successfully implemented a serverless, highly scalable data ingestion pipeline using AWS purpose-built services.