Lab: Building a Serverless Data Lake with AWS Glue and Amazon Athena
Data Analysis and Querying Using AWS Services
This lab provides a hands-on introduction to the AWS serverless analytics ecosystem. You will learn how to store raw data in Amazon S3, use AWS Glue to automatically discover the schema, and perform SQL-based analysis using Amazon Athena—all without managing a single server.
> [!WARNING]
> Remember to run the teardown commands at the end of this lab to avoid ongoing charges for stored data or metadata. This lab is mostly Free Tier eligible, but Athena queries cost $5 per TB scanned (prorated).
Prerequisites
Before starting, ensure you have:
- An AWS account with administrative access.
- AWS CLI installed and configured (`aws configure`).
- Basic knowledge of SQL (`SELECT`, `FROM`, `WHERE`).
- A sample dataset (save a few lines of CSV data as `customer_data.csv`).
Learning Objectives
By the end of this lab, you will be able to:
- Configure S3 as a landing zone for raw structured data.
- Deploy an AWS Glue Crawler to populate the Glue Data Catalog.
- Execute SQL queries against S3 data using the Athena serverless engine.
- Understand the cost and performance benefits of serverless analytics.
Architecture Overview
Raw data is uploaded to S3, indexed by the Glue Data Catalog, and queried in place by Athena; the diagram under "Visualizing Concepts" below shows this flow.
Step-by-Step Instructions
Step 1: Create the Data Lake Storage
First, we need an S3 bucket to act as our data lake storage and another for Athena query results.
```shell
# Replace <UNIQUE_ID> with a random string
aws s3 mb s3://brainybee-lab-data-<UNIQUE_ID>
aws s3 mb s3://brainybee-lab-results-<UNIQUE_ID>
```

▶ Console alternative
- Navigate to S3 in the AWS Console.
- Click Create bucket.
- Bucket name: `brainybee-lab-data-<YOUR_NAME>`.
- Click Create bucket at the bottom. Repeat for the results bucket.
📸 Screenshot: S3 Bucket list showing both your data and results buckets.
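S3 bucket names must be globally unique and follow strict rules: 3-63 characters, lowercase letters, digits, hyphens, and dots only, starting and ending with a letter or digit. If `aws s3 mb` rejects your chosen `<UNIQUE_ID>`, a quick local check can save a round trip. This is an illustrative helper, not part of any AWS tooling, and it covers only the most common rules:

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough local check of common S3 bucket-naming rules.
    Illustrative and not exhaustive -- AWS remains the authority."""
    if not 3 <= len(name) <= 63:
        return False
    # Lowercase letters, digits, dots, hyphens; starts/ends alphanumeric.
    if not re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name):
        return False
    return ".." not in name  # no consecutive dots

print(is_valid_bucket_name("brainybee-lab-data-x7f3"))  # True
print(is_valid_bucket_name("BrainyBee_Lab"))            # False: uppercase and underscore
```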
Step 2: Upload Sample Data
Create a file named `customer_data.csv` with these contents:

```csv
id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
```

Upload it to a "folder" (prefix) in your bucket:

```shell
aws s3 cp customer_data.csv s3://brainybee-lab-data-<UNIQUE_ID>/customers/customer_data.csv
```

Step 3: Create an IAM Role for Glue
AWS Glue needs permission to read your S3 bucket and write to the Data Catalog.
```shell
# Create the trust policy file
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "glue.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > glue-trust-policy.json

# Create the role
aws iam create-role --role-name BrainyBeeGlueRole --assume-role-policy-document file://glue-trust-policy.json

# Attach the Glue service policy
aws iam attach-role-policy --role-name BrainyBeeGlueRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# The managed policy only grants S3 access to buckets named aws-glue-*,
# so add an inline policy granting read access to the lab's data bucket
aws iam put-role-policy --role-name BrainyBeeGlueRole --policy-name S3DataAccess \
  --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":["s3:GetObject","s3:ListBucket"],"Resource":["arn:aws:s3:::brainybee-lab-data-<UNIQUE_ID>","arn:aws:s3:::brainybee-lab-data-<UNIQUE_ID>/*"]}]}'
```

Step 4: Catalog Data with Glue Crawler
We will create a crawler to scan the S3 bucket and create a table in the Glue Data Catalog.
```shell
# Create a database in the catalog
aws glue create-database --database-input '{"Name": "brainybee_db"}'

# Create the crawler
aws glue create-crawler --name brainybee-crawler \
  --role BrainyBeeGlueRole \
  --database-name brainybee_db \
  --targets '{"S3Targets": [{"Path": "s3://brainybee-lab-data-<UNIQUE_ID>/customers/"}]}'

# Run the crawler
aws glue start-crawler --name brainybee-crawler
```

▶ Console alternative
- Navigate to AWS Glue > Crawlers.
- Click Create crawler.
- Name: `brainybee-crawler`. Click Next.
- Add a data source: select S3 and choose your `customers/` path. Click Next.
- Choose the IAM role `BrainyBeeGlueRole`. Click Next.
- Target database: create `brainybee_db`. Click Finish.
- Select the crawler and click Run crawler.
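The crawler's core job is schema inference: it samples rows from S3 and guesses column names and types for the catalog table. The idea can be sketched over the lab's CSV in a few lines of plain Python. This is a toy simplification; real Glue classifiers handle many formats, compression, and richer type detection:

```python
import csv
import io

SAMPLE = """id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
"""

def infer_type(values):
    """Toy type inference: bigint if every value parses as int, else string."""
    try:
        for v in values:
            int(v)
        return "bigint"
    except ValueError:
        return "string"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)
# {'id': 'bigint', 'name': 'string', 'signup_date': 'string', 'plan': 'string'}
```

This mirrors the table you should see after the crawler runs: `id` detected as a numeric type, the rest as strings.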
Step 5: Query Data with Athena
Once the crawler finishes, Athena can query the new table using standard SQL.
```shell
# Run the query, writing results to the results bucket
aws athena start-query-execution \
  --query-string "SELECT * FROM customers WHERE plan = 'Premium';" \
  --query-execution-context Database=brainybee_db \
  --result-configuration OutputLocation=s3://brainybee-lab-results-<UNIQUE_ID>/
```

The command is asynchronous: it returns a `QueryExecutionId` immediately, and you can fetch the output with `aws athena get-query-results --query-execution-id <ID>` once the query finishes.

Checkpoints
- S3 verification: run `aws s3 ls s3://brainybee-lab-data-<UNIQUE_ID>/customers/`. You should see `customer_data.csv`.
- Glue verification: check the Glue Console. Under Databases > Tables, you should see a table named `customers` with columns `id`, `name`, `signup_date`, and `plan` correctly identified.
- Athena verification: in the Athena Console, run `SELECT count(*) FROM customers;`. The result should be `3`.
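If you want to sanity-check the expected results before paying for a scan, the Step 5 filter can be reproduced locally. Plain Python stands in for Athena's SQL here, over the lab's three-row dataset:

```python
import csv
import io

CSV_DATA = """id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
"""

# Equivalent of: SELECT * FROM customers WHERE plan = 'Premium';
rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
premium = [r for r in rows if r["plan"] == "Premium"]

print([r["name"] for r in premium])  # ['Alice', 'Charlie']
print(len(rows))                     # 3, matching the count(*) checkpoint
```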
Concept Review
| Feature | Amazon Athena | Amazon Redshift | AWS Glue |
|---|---|---|---|
| Type | Serverless Query Engine | Data Warehouse (Cluster) | Serverless ETL & Catalog |
| Primary Use | Ad-hoc SQL on S3 | Complex, high-perf Analytics | Data Prep and Metadata |
| Scaling | Automatic | Manual or Auto-scaling | Serverless Spark/Python |
| Pricing | Per-Query (Data Scanned) | Hourly per Node | Per DPU-Hour |
Visualizing Concepts: The Serverless Data Lake
```latex
\begin{tikzpicture}[node distance=2cm]
  \draw[thick, fill=blue!10] (0,0) rectangle (3,2) node[pos=.5] {S3 (Storage)};
  \draw[thick, fill=green!10] (4,0) rectangle (7,2) node[pos=.5] {Glue (Metadata)};
  \draw[thick, fill=orange!10] (8,0) rectangle (11,2) node[pos=.5] {Athena (Query)};
  \draw[->, >=stealth, thick] (3,1) -- (4,1) node[midway, above] {Index};
  \draw[->, >=stealth, thick] (7,1) -- (8,1) node[midway, above] {SQL};
  \node at (5.5, -1) {No Servers to Manage (Serverless)};
\end{tikzpicture}
```
Troubleshooting
| Error | Likely Cause | Fix |
|---|---|---|
| `Access Denied` when running Crawler | IAM role lacks S3 permissions | Ensure the role has read access to the data bucket; the managed `AWSGlueServiceRole` policy alone only covers buckets named `aws-glue-*`. |
| Athena: `Table not found` | Crawler hasn't finished | Wait for the crawler status to return to Ready before querying. |
| Athena: zero records returned | S3 pathing issue | Ensure the crawler points to the folder `customers/`, not the specific file. |
Cost Estimate
- S3 Storage: ~$0.023 per GB-month. (For this lab: < $0.01)
- Glue Crawler: $0.44 per DPU-Hour. (Minimum 0.25 DPUs for 1 min: < $0.10)
- Amazon Athena: $5.00 per TB scanned. (This lab scans bytes: < $0.01)
- Total Estimated Cost: <$0.20
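Athena's per-query charge is easy to estimate: $5 per TB scanned, with each query billed at a 10 MB minimum (per published pricing at the time of writing; verify against the current AWS pricing page). A small sketch of the arithmetic:

```python
PRICE_PER_TB = 5.00            # USD per TB scanned
MIN_BYTES = 10 * 1024**2       # 10 MB minimum charge per query
TB = 1024**4

def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a single Athena query."""
    billed = max(bytes_scanned, MIN_BYTES)
    return billed / TB * PRICE_PER_TB

# This lab's ~100-byte scan is still billed at the 10 MB minimum:
print(athena_query_cost(100))            # a tiny fraction of a cent
print(round(athena_query_cost(2 * TB), 2))  # 10.0 -- two full terabytes cost $10
```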
Challenge
Can you convert the data to Apache Parquet format? Parquet is a columnar format that makes Athena queries faster and cheaper because it reduces the amount of data scanned.
▶ Hint
Use an AWS Glue Studio visual job to read from the customers table and write to a new S3 folder with the format set to "Parquet". Then, update your Athena query to point to the Parquet table.
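To see why a columnar layout cuts Athena's bill, compare how many bytes a query touching a single column must read under each layout. This toy stdlib illustration ignores the compression and encoding that Parquet adds on top, which reduce the scan further:

```python
import csv
import io

CSV_DATA = """id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
"""

# Row layout: a query filtering on `plan` still scans the whole file.
row_bytes = len(CSV_DATA.encode())

# Columnar layout: each column is stored contiguously, so the engine
# reads only the `plan` column's bytes.
rows = list(csv.DictReader(io.StringIO(CSV_DATA)))
plan_bytes = sum(len(r["plan"].encode()) for r in rows)

print(f"row-oriented scan:    {row_bytes} bytes")
print(f"columnar `plan` scan: {plan_bytes} bytes")
```

Since Athena bills per byte scanned, the columnar version of the same query is proportionally cheaper.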
Clean-Up / Teardown
To avoid ongoing charges, delete the resources created:
```shell
# Delete S3 objects and buckets
aws s3 rb s3://brainybee-lab-data-<UNIQUE_ID> --force
aws s3 rb s3://brainybee-lab-results-<UNIQUE_ID> --force

# Delete Glue resources
aws glue delete-crawler --name brainybee-crawler
aws glue delete-database --name brainybee_db

# Delete the IAM role (remove its policies first; delete-role-policy errors
# harmlessly if no inline S3 policy was ever attached)
aws iam delete-role-policy --role-name BrainyBeeGlueRole --policy-name S3DataAccess
aws iam detach-role-policy --role-name BrainyBeeGlueRole --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
aws iam delete-role --role-name BrainyBeeGlueRole
```