Hands-On Lab: Build a High-Performing Data Ingestion Pipeline with Kinesis Data Firehose
Determine high-performing data ingestion and transformation solutions
Welcome to this hands-on lab! Based on the AWS Certified Solutions Architect - Associate (SAA-C03) exam objectives, a core skill is determining high-performing data ingestion and transformation solutions. In this lab, you will build a scalable serverless ingestion pipeline using Amazon Kinesis Data Firehose to collect streaming data and deliver it securely to an Amazon S3 data lake.
Prerequisites
Before starting this lab, ensure you have the following ready:
- AWS Account: Administrator access to an AWS account.
- Command Line Tools: The AWS CLI (`aws`) installed and configured with your credentials.
- IAM Permissions: Ability to create S3 buckets, Kinesis delivery streams, and IAM roles.
- Knowledge: Basic understanding of JSON and streaming data concepts.
Learning Objectives
By completing this lab, you will be able to:
- Provision a centralized Amazon S3 bucket to act as the foundation of a data lake.
- Configure Amazon Kinesis Data Firehose to securely ingest streaming data.
- Establish IAM trust policies allowing AWS services to interact securely.
- Manually ingest mock telemetry data and verify its automated delivery.
Architecture Overview
This diagram illustrates the ingestion pipeline you will build. Data produced by the CLI is sent to Kinesis, which automatically buffers the stream and delivers it to S3.
Step-by-Step Instructions
Step 1: Create the Target S3 Data Lake Bucket
First, we need a highly durable storage destination for our ingested data.
```bash
aws s3 mb s3://brainybee-data-lake-<YOUR_ACCOUNT_ID> --region <YOUR_REGION>
```

📸 Screenshot: Terminal output showing `make_bucket: brainybee-data-lake-<YOUR_ACCOUNT_ID>`
▶Console alternative
- Navigate to the Amazon S3 console.
- Click Create bucket.
- Enter the bucket name `brainybee-data-lake-<YOUR_ACCOUNT_ID>`.
- Select your preferred region.
- Leave all other settings as default and click Create bucket.
Step 2: Create an IAM Role for Kinesis Data Firehose
Kinesis needs permission to write data into your newly created S3 bucket. We will create an IAM role and attach a policy.
First, create the trust policy document:
```bash
echo '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "firehose.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}' > trust-policy.json
```

Create the role using the trust policy:
```bash
aws iam create-role \
  --role-name brainybee-firehose-s3-role \
  --assume-role-policy-document file://trust-policy.json
```

Attach the necessary S3 access policy (Note: In production, use least-privilege scoping instead of Full Access):
```bash
aws iam attach-role-policy \
  --role-name brainybee-firehose-s3-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```

▶Console alternative
- Navigate to the IAM console > Roles.
- Click Create role.
- Select AWS service as the trusted entity and choose Kinesis (then Kinesis Firehose).
- In permissions, attach `AmazonS3FullAccess`.
- Name the role `brainybee-firehose-s3-role` and click Create role.
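If shell quoting of the `echo` heredoc above proves error-prone on your platform, the same `trust-policy.json` can be generated with Python's `json` module (a minimal sketch producing an equivalent file):

```python
import json

# Build the Firehose trust policy as a dict; json.dump guarantees valid
# JSON regardless of shell quoting rules.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "firehose.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

with open("trust-policy.json", "w") as f:
    json.dump(trust_policy, f, indent=2)

# Round-trip to confirm the file parses cleanly and names the right principal.
with open("trust-policy.json") as f:
    loaded = json.load(f)
print(loaded["Statement"][0]["Principal"]["Service"])
```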
Step 3: Create the Kinesis Data Firehose Delivery Stream
Now we provision the ingestion stream, instructing it to send data to our S3 bucket.
```bash
aws firehose create-delivery-stream \
  --delivery-stream-name brainybee-ingestion-stream \
  --s3-destination-configuration RoleARN=arn:aws:iam::<YOUR_ACCOUNT_ID>:role/brainybee-firehose-s3-role,BucketARN=arn:aws:s3:::brainybee-data-lake-<YOUR_ACCOUNT_ID>
```

[!TIP] By default, Kinesis Data Firehose buffers data for 5 minutes or 5 MB (whichever comes first) before delivering it to S3. This batching improves performance and reduces S3 API costs.
▶Console alternative
- Navigate to the Amazon Kinesis console.
- Select Data Firehose and click Create delivery stream.
- Source: Direct PUT / Destination: Amazon S3.
- Stream name: `brainybee-ingestion-stream`.
- Select your S3 bucket `brainybee-data-lake-<YOUR_ACCOUNT_ID>`.
- Under Advanced settings, ensure the newly created IAM role is selected.
- Click Create delivery stream.
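The size-or-time buffering behavior described in the tip above can be sketched as a small model. This is illustrative only (the constants mirror the documented defaults; real buffering is managed internally by the Firehose service):

```python
# Illustrative model of Firehose buffering: a delivery is triggered when
# either the size hint (5 MB) or the interval hint (5 minutes) is reached,
# whichever comes first.
SIZE_LIMIT_BYTES = 5 * 1024 * 1024
INTERVAL_LIMIT_SECONDS = 300

def should_flush(buffered_bytes: int, seconds_since_last_flush: float) -> bool:
    """Return True once either buffering threshold has been reached."""
    return (buffered_bytes >= SIZE_LIMIT_BYTES
            or seconds_since_last_flush >= INTERVAL_LIMIT_SECONDS)

# A small record arriving 10 s after the last flush stays buffered...
print(should_flush(2048, 10))                 # False
# ...but a full buffer, or an elapsed interval, triggers delivery to S3.
print(should_flush(6 * 1024 * 1024, 10))      # True
print(should_flush(2048, 300))                # True
```

This is why a single test record does not appear in S3 immediately: with low traffic, delivery waits for the interval threshold.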
Step 4: Ingest Streaming Data via CLI
Let's simulate a clickstream or IoT device sending telemetry data into our ingestion pipeline.
```bash
aws firehose put-record \
  --delivery-stream-name brainybee-ingestion-stream \
  --record '{"Data":"eyJ1c2VySWQiOiAiMTIzNDUiLCAiYWN0aW9uIjogImxvZ2luIiwgInRpbWVzdGFtcCI6ICIyMDIzLTEwLTAxVDEyOjAwOjAwWiJ9Cg=="}'
```

[!NOTE] The `Data` payload must be Base64 encoded. The string above decodes to `{"userId": "12345", "action": "login", "timestamp": "2023-10-01T12:00:00Z"}`.
Execute the command 3-5 times to simulate multiple incoming records.
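To produce your own Base64 payloads for `put-record` (for example, to vary the telemetry fields), standard-library Python is enough. A minimal sketch:

```python
import base64
import json

# Encode a telemetry record the way `put-record` expects: the Data field
# carries the Base64 encoding of the raw payload bytes.
record = {"userId": "12345", "action": "login",
          "timestamp": "2023-10-01T12:00:00Z"}
payload = json.dumps(record) + "\n"   # newline-delimit records for the S3 object
encoded = base64.b64encode(payload.encode("utf-8")).decode("ascii")
print(encoded)

# Decoding recovers the original JSON, which is what lands in the bucket.
decoded = json.loads(base64.b64decode(encoded))
```

Paste the printed string into the `--record '{"Data":"..."}'` argument. (The AWS SDKs, such as `boto3`, handle this encoding for you when you pass raw bytes.)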
📸 Screenshot: Terminal showing the `RecordId` confirmation from AWS.
▶Console alternative
- Kinesis Data Firehose currently requires data to be sent via the CLI, SDKs, or agents (such as the Kinesis Agent); there is no direct "Test Event" button in the Firehose console for arbitrary payloads. Please use the CLI method above or an SDK such as Python's `boto3`.
Step 5: Verify Data Delivery in S3
Wait approximately 5 minutes for the buffer interval to complete, then check your S3 bucket for the delivered files.
```bash
aws s3 ls s3://brainybee-data-lake-<YOUR_ACCOUNT_ID>/ --recursive
```

▶Console alternative
- Navigate to the Amazon S3 console.
- Open `brainybee-data-lake-<YOUR_ACCOUNT_ID>`.
- Navigate through the automatically generated `YYYY/MM/DD/HH` folder structure.
- Download the object and open it in a text editor to verify the ingested JSON data.
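The `YYYY/MM/DD/HH` folder structure is simply the UTC delivery time of each batch. A short sketch reproducing the default prefix for a given timestamp (illustrative; custom prefixes can be configured on the stream):

```python
from datetime import datetime, timezone

def firehose_prefix(ts: datetime) -> str:
    """Default Firehose S3 key prefix: UTC delivery time as YYYY/MM/DD/HH/."""
    return ts.astimezone(timezone.utc).strftime("%Y/%m/%d/%H/")

# A batch delivered at 12:00 UTC on 2023-10-01 lands under this prefix.
example = datetime(2023, 10, 1, 12, 0, tzinfo=timezone.utc)
print(firehose_prefix(example))   # 2023/10/01/12/
```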
Checkpoints
Verify your progress by running these commands:
- Checkpoint 1 (After Step 3): Run `aws firehose describe-delivery-stream --delivery-stream-name brainybee-ingestion-stream`. Ensure `DeliveryStreamStatus` shows as `ACTIVE`.
- Checkpoint 2 (After Step 5): Run `aws s3 ls s3://brainybee-data-lake-<YOUR_ACCOUNT_ID>/ --recursive`. You should see at least one file path resembling `2023/10/01/12/brainybee-ingestion-stream-1-2023-10-01-12...`.
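If you want to script Checkpoint 1, the stream status can be pulled out of the `describe-delivery-stream` JSON response. A minimal sketch (the sample response below is a hypothetical excerpt; the real response contains many more fields):

```python
import json

# Hypothetical excerpt of `aws firehose describe-delivery-stream` output.
response_json = """
{
  "DeliveryStreamDescription": {
    "DeliveryStreamName": "brainybee-ingestion-stream",
    "DeliveryStreamStatus": "ACTIVE"
  }
}
"""

# The status lives under DeliveryStreamDescription.DeliveryStreamStatus.
status = json.loads(response_json)["DeliveryStreamDescription"]["DeliveryStreamStatus"]
print(status)   # ACTIVE
```

In practice you would pipe the CLI output into this, or use `--query 'DeliveryStreamDescription.DeliveryStreamStatus'` directly on the CLI.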
Troubleshooting
| Issue | Probable Cause | Solution |
|---|---|---|
| `ResourceNotFoundException` during put-record | Stream is still in the CREATING state. | Wait 1-2 minutes and check stream status before retrying. |
| `AccessDenied` when creating stream | IAM role missing S3 permissions or trust relationship is incorrect. | Verify `trust-policy.json` contains `firehose.amazonaws.com` as the principal. |
| No data appearing in S3 | Buffer time hasn't elapsed. | Wait a full 5 minutes for the Firehose buffer to flush to the S3 bucket. |
Clean-Up / Teardown
[!WARNING] Remember to run the teardown commands to avoid ongoing charges. While S3 storage is cheap, leaving idle resources is bad practice.
Execute the following commands to destroy all provisioned resources:
1. Delete the Firehose Delivery Stream:

```bash
aws firehose delete-delivery-stream --delivery-stream-name brainybee-ingestion-stream
```

2. Empty and Delete the S3 Bucket:

```bash
aws s3 rm s3://brainybee-data-lake-<YOUR_ACCOUNT_ID> --recursive
aws s3 rb s3://brainybee-data-lake-<YOUR_ACCOUNT_ID>
```

3. Detach Policy and Delete IAM Role:

```bash
aws iam detach-role-policy --role-name brainybee-firehose-s3-role --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam delete-role --role-name brainybee-firehose-s3-role
```
Lab complete! You have successfully implemented a serverless, highly scalable data ingestion pipeline using AWS purpose-built services.