Lab: Building a Serverless Data Lake with AWS Glue and Amazon Athena
Data Analysis and Querying Using AWS Services
This lab provides a hands-on introduction to the AWS serverless analytics ecosystem. You will learn how to store raw data in Amazon S3, use AWS Glue to automatically discover the schema, and perform SQL-based analysis using Amazon Athena—all without managing a single server.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges for stored data or metadata. This lab is mostly Free Tier eligible, but Athena queries cost $5 per TB scanned (prorated).
Prerequisites
Before starting, ensure you have:
- An AWS account with administrative access.
- AWS CLI installed and configured (`aws configure`).
- Basic knowledge of SQL (`SELECT`, `FROM`, `WHERE`).
- A sample dataset (save a few lines of CSV data as `customer_data.csv`).
Learning Objectives
By the end of this lab, you will be able to:
- Configure S3 as a landing zone for raw structured data.
- Deploy an AWS Glue Crawler to populate the Glue Data Catalog.
- Execute SQL queries against S3 data using the Athena serverless engine.
- Understand the cost and performance benefits of serverless analytics.
Architecture Overview
Raw data is uploaded to S3, indexed by the Glue Data Catalog, and queried in place by Athena; the diagram under "Visualizing Concepts" below shows this flow.
Step-by-Step Instructions
Step 1: Create the Data Lake Storage
First, we need an S3 bucket to act as our data lake storage and another for Athena query results.
```shell
# Replace <UNIQUE_ID> with a random string
aws s3 mb s3://brainybee-lab-data-<UNIQUE_ID>
aws s3 mb s3://brainybee-lab-results-<UNIQUE_ID>
```

▶ Console alternative
- Navigate to S3 in the AWS Console.
- Click Create bucket.
- Bucket name: `brainybee-lab-data-<YOUR_NAME>`.
- Click Create bucket at the bottom. Repeat for the results bucket.
📸 Screenshot: S3 Bucket list showing both your data and results buckets.
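S3 bucket names must be globally unique and follow strict rules: 3-63 characters, lowercase letters, digits, hyphens, and dots only, starting and ending with a letter or digit. If `aws s3 mb` rejects your chosen `<UNIQUE_ID>`, a quick local check can save a round trip. This is an illustrative helper, not part of any AWS tooling, and it covers only the most common rules:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough local check of common S3 bucket-naming rules.
    Illustrative and not exhaustive -- AWS remains the authority."""
    if not 3 <= len(name) <= 63:
        return False
    # Lowercase letters, digits, dots, hyphens; starts/ends alphanumeric.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name):
        return False
    return ".." not in name  # no consecutive dots

print(is_valid_bucket_name("brainybee-lab-data-x7f3"))  # True
print(is_valid_bucket_name("BrainyBee_Lab"))            # False: uppercase and underscore
```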
Step 2: Upload Sample Data
Create a file named `customer_data.csv` with these contents:

```csv
id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
```

Upload it to a "folder" (prefix) in your bucket:

```shell
aws s3 cp customer_data.csv s3://brainybee-lab-data-<UNIQUE_ID>/customers/customer_data.csv
```

Step 3: Create an IAM Role for Glue
AWS Glue needs permission to read your S3 bucket and write to the Data Catalog.
```shell
# Create the trust policy file
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "glue.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > glue-trust-policy.json

# Create the role
aws iam create-role --role-name BrainyBeeGlueRole --assume-role-policy-document file://glue-trust-policy.json

# Attach the Glue service policy
aws iam attach-role-policy --role-name BrainyBeeGlueRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# The managed policy only grants S3 access to buckets named aws-glue-*,
# so add an inline policy granting read access to the lab's data bucket
aws iam put-role-policy --role-name BrainyBeeGlueRole --policy-name S3DataAccess \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:GetObject","s3:ListBucket"],"Resource":["arn:aws:s3:::brainybee-lab-data-<UNIQUE_ID>","arn:aws:s3:::brainybee-lab-data-<UNIQUE_ID>/*"]}]}'
```

Step 4: Catalog Data with Glue Crawler
We will create a crawler to scan the S3 bucket and create a table in the Glue Data Catalog.
```shell
# Create a database in the catalog
aws glue create-database --database-input '{"Name": "brainybee_db"}'

# Create the crawler
aws glue create-crawler --name brainybee-crawler \
  --role BrainyBeeGlueRole \
  --database-name brainybee_db \
  --targets '{"S3Targets": [{"Path": "s3://brainybee-lab-data-<UNIQUE_ID>/customers/"}]}'

# Run the crawler
aws glue start-crawler --name brainybee-crawler
```

▶ Console alternative
- Navigate to AWS Glue > Crawlers.
- Click Create crawler.
- Name: `brainybee-crawler`. Click Next.
- Add a data source: select S3 and choose your `customers/` path. Click Next.
- Choose the IAM role `BrainyBeeGlueRole`. Click Next.
- Target database: create `brainybee_db`. Click Finish.
- Select the crawler and click Run crawler.
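The crawler's core job is schema inference: it samples rows from S3 and guesses column names and types for the catalog table. The idea can be sketched over the lab's CSV in a few lines of plain Python. This is a toy simplification; real Glue classifiers handle many formats, compression, and richer type detection:

```python
import csv
import io

SAMPLE = """id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
"""

def infer_type(values):
    """Toy type inference: bigint if every value parses as int, else string."""
    try:
        for v in values:
            int(v)
        return "bigint"
    except ValueError:
        return "string"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)
# {'id': 'bigint', 'name': 'string', 'signup_date': 'string', 'plan': 'string'}
```

This mirrors the table you should see after the crawler runs: `id` detected as a numeric type, the rest as strings.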
Step 5: Query Data with Athena
Once the crawler finishes, Athena can query the new table using standard SQL.
```shell
# Run the query, writing results to the results bucket
aws athena start-query-execution \
  --query-string "SELECT * FROM customers WHERE plan = 'Premium';" \
  --query-execution-context Database=brainybee_db \
  --result-configuration OutputLocation=s3://brainybee-lab-results-<UNIQUE_ID>/
```

The command is asynchronous: it returns a `QueryExecutionId` immediately, and you can fetch the output with `aws athena get-query-results --query-execution-id <ID>` once the query finishes.

Checkpoints
- S3 verification: run `aws s3 ls s3://brainybee-lab-data-<UNIQUE_ID>/customers/`. You should see `customer_data.csv`.
- Glue verification: check the Glue Console. Under Databases > Tables, you should see a table named `customers` with columns `id`, `name`, `signup_date`, and `plan` correctly identified.
- Athena verification: in the Athena Console, run `SELECT count(*) FROM customers;`. The result should be `3`.
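If you want to sanity-check the expected results before paying for a scan, the Step 5 filter can be reproduced locally. Plain Python stands in for Athena's SQL here, over the lab's three-row dataset:

```python
import csv
import io

CSV_DATA = """id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
"""

# Equivalent of: SELECT * FROM customers WHERE plan = 'Premium';
rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
premium = [r for r in rows if r["plan"] == "Premium"]

print([r["name"] for r in premium])  # ['Alice', 'Charlie']
print(len(rows))                     # 3, matching the count(*) checkpoint
```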
Concept Review
| Feature | Amazon Athena | Amazon Redshift | AWS Glue |
|---|---|---|---|
| Type | Serverless Query Engine | Data Warehouse (Cluster) | Serverless ETL & Catalog |
| Primary Use | Ad-hoc SQL on S3 | Complex, high-perf Analytics | Data Prep and Metadata |
| Scaling | Automatic | Manual or Auto-scaling | Serverless Spark/Python |
| Pricing | Per-Query (Data Scanned) | Hourly per Node | Per DPU-Hour |
Visualizing Concepts: The Serverless Data Lake
```latex
\begin{tikzpicture}[node distance=2cm]
  \draw[thick, fill=blue!10] (0,0) rectangle (3,2) node[pos=.5] {S3 (Storage)};
  \draw[thick, fill=green!10] (4,0) rectangle (7,2) node[pos=.5] {Glue (Metadata)};
  \draw[thick, fill=orange!10] (8,0) rectangle (11,2) node[pos=.5] {Athena (Query)};
  \draw[->, >=stealth, thick] (3,1) -- (4,1) node[midway, above] {Index};
  \draw[->, >=stealth, thick] (7,1) -- (8,1) node[midway, above] {SQL};
  \node at (5.5, -1) {No Servers to Manage (Serverless)};
\end{tikzpicture}
```
Troubleshooting
| Error | Likely Cause | Fix |
|---|---|---|
| `Access Denied` when running Crawler | IAM role lacks S3 permissions | Ensure the role has read access to the data bucket; the managed `AWSGlueServiceRole` policy alone only covers buckets named `aws-glue-*`. |
| Athena: `Table not found` | Crawler hasn't finished | Wait for the crawler status to return to Ready before querying. |
| Athena: zero records returned | S3 pathing issue | Ensure the crawler points to the folder `customers/`, not the specific file. |
Cost Estimate
- S3 Storage: ~$0.023 per GB-month. (For this lab: < $0.01)
- Glue Crawler: $0.44 per DPU-Hour. (Minimum 0.25 DPUs for 1 min: < $0.10)
- Amazon Athena: $5.00 per TB scanned. (This lab scans bytes: < $0.01)
- Total Estimated Cost: <$0.20
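Athena's per-query charge is easy to estimate: $5 per TB scanned, with each query billed at a 10 MB minimum (per published pricing at the time of writing; verify against the current AWS pricing page). A small sketch of the arithmetic:

```python
PRICE_PER_TB = 5.00            # USD per TB scanned
MIN_BYTES = 10 * 1024**2       # 10 MB minimum charge per query
TB = 1024**4

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a single Athena query."""
    billed = max(bytes_scanned, MIN_BYTES)
    return billed / TB * PRICE_PER_TB

# This lab's ~100-byte scan is still billed at the 10 MB minimum:
print(athena_query_cost(100))            # a tiny fraction of a cent
print(round(athena_query_cost(2 * TB), 2))  # 10.0 -- two full terabytes cost $10
```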
Challenge
Can you convert the data to Apache Parquet format? Parquet is a columnar format that makes Athena queries faster and cheaper because it reduces the amount of data scanned.
▶ Hint
Use an AWS Glue Studio visual job to read from the customers table and write to a new S3 folder with the format set to "Parquet". Then, update your Athena query to point to the Parquet table.
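To see why a columnar layout cuts Athena's bill, compare how many bytes a query touching a single column must read under each layout. This toy stdlib illustration ignores the compression and encoding that Parquet adds on top, which reduce the scan further:

```python
import csv
import io

CSV_DATA = """id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
"""

# Row layout: a query filtering on `plan` still scans the whole file.
row_bytes = len(CSV_DATA.encode())

# Columnar layout: each column is stored contiguously, so the engine
# reads only the `plan` column's bytes.
rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
plan_bytes = sum(len(r["plan"].encode()) for r in rows)

print(f"row-oriented scan:    {row_bytes} bytes")
print(f"columnar `plan` scan: {plan_bytes} bytes")
```

Since Athena bills per byte scanned, the columnar version of the same query is proportionally cheaper.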
Clean-Up / Teardown
To avoid ongoing charges, delete the resources created:
```shell
# Delete S3 objects and buckets
aws s3 rb s3://brainybee-lab-data-<UNIQUE_ID> --force
aws s3 rb s3://brainybee-lab-results-<UNIQUE_ID> --force

# Delete Glue resources
aws glue delete-crawler --name brainybee-crawler
aws glue delete-database --name brainybee_db

# Delete the IAM role (remove its policies first; delete-role-policy errors
# harmlessly if no inline S3 policy was ever attached)
aws iam delete-role-policy --role-name BrainyBeeGlueRole --policy-name S3DataAccess
aws iam detach-role-policy --role-name BrainyBeeGlueRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name BrainyBeeGlueRole
```