
Lab: Building a Serverless Data Lake with AWS Glue and Amazon Athena

Data Analysis and Querying Using AWS Services

This lab provides a hands-on introduction to the AWS serverless analytics ecosystem. You will learn how to store raw data in Amazon S3, use AWS Glue to automatically discover the schema, and perform SQL-based analysis using Amazon Athena—all without managing a single server.

[!WARNING] Remember to run the teardown commands at the end of this lab to avoid ongoing charges for stored data or metadata. This lab is mostly Free Tier eligible, but Athena charges $5 per TB scanned, with a 10 MB minimum per query.

Prerequisites

Before starting, ensure you have:

  • An AWS Account with administrative access.
  • AWS CLI installed and configured (aws configure).
  • Basic knowledge of SQL (SELECT, FROM, WHERE).
  • A sample dataset (save a few lines of CSV data as customer_data.csv).

Learning Objectives

By the end of this lab, you will be able to:

  1. Configure S3 as a landing zone for raw structured data.
  2. Deploy an AWS Glue Crawler to populate the Glue Data Catalog.
  3. Execute SQL queries against S3 data using the Athena serverless engine.
  4. Understand the cost and performance benefits of serverless analytics.

Architecture Overview

The data flow has three stages: raw data is uploaded to S3, indexed by a Glue crawler, and queried with Athena. (See "Visualizing Concepts" below for a diagram.)

Step-by-Step Instructions

Step 1: Create the Data Lake Storage

First, we need an S3 bucket to act as our data lake storage and another for Athena query results.

```bash
# Replace <UNIQUE_ID> with a random string
aws s3 mb s3://brainybee-lab-data-<UNIQUE_ID>
aws s3 mb s3://brainybee-lab-results-<UNIQUE_ID>
```
Console alternative
  1. Navigate to S3 in the AWS Console.
  2. Click Create bucket.
  3. Bucket name: brainybee-lab-data-<YOUR_NAME>.
  4. Click Create bucket at the bottom. Repeat for the results bucket.

📸 Screenshot: S3 Bucket list showing both your data and results buckets.

Step 2: Upload Sample Data

Create a file named customer_data.csv with these contents:

```csv
id,name,signup_date,plan
1,Alice,2023-01-01,Premium
2,Bob,2023-02-15,Free
3,Charlie,2023-03-10,Premium
```

Upload it to a "folder" (prefix) in your bucket:

```bash
aws s3 cp customer_data.csv s3://brainybee-lab-data-<UNIQUE_ID>/customers/customer_data.csv
```

Step 3: Create an IAM Role for Glue

AWS Glue needs permission to read your S3 bucket and write to the Data Catalog.

```bash
# Create the trust policy file
echo '{"Version": "2012-10-17","Statement": [{"Effect": "Allow","Principal": {"Service": "glue.amazonaws.com"},"Action": "sts:AssumeRole"}]}' > glue-trust-policy.json

# Create the role
aws iam create-role --role-name BrainyBeeGlueRole \
  --assume-role-policy-document file://glue-trust-policy.json

# Attach the Glue service managed policy
aws iam attach-role-policy --role-name BrainyBeeGlueRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole

# AWSGlueServiceRole only grants S3 access to buckets named aws-glue-*,
# so also grant read access to the lab bucket (broad, but fine for a lab)
aws iam attach-role-policy --role-name BrainyBeeGlueRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```
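Before creating the crawler, it can help to sanity-check that the role exists and has a policy attached. Both are standard IAM CLI calls:

```shell
# Print the role's ARN; this errors if the role doesn't exist
aws iam get-role --role-name BrainyBeeGlueRole \
  --query 'Role.Arn' --output text

# List the managed policies attached to the role
aws iam list-attached-role-policies --role-name BrainyBeeGlueRole \
  --query 'AttachedPolicies[].PolicyName' --output text
```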

Step 4: Catalog Data with Glue Crawler

We will create a crawler to scan the S3 bucket and create a table in the Glue Data Catalog.

```bash
# Create a database in the catalog
aws glue create-database --database-input '{"Name": "brainybee_db"}'

# Create the crawler
aws glue create-crawler --name brainybee-crawler \
  --role BrainyBeeGlueRole \
  --database-name brainybee_db \
  --targets '{"S3Targets": [{"Path": "s3://brainybee-lab-data-<UNIQUE_ID>/customers/"}]}'

# Run the crawler
aws glue start-crawler --name brainybee-crawler
```
Console alternative
  1. Navigate to AWS Glue > Crawlers.
  2. Click Create crawler.
  3. Name: brainybee-crawler. Click Next.
  4. Add a data source: Select S3, choose your customers/ path. Click Next.
  5. Choose the IAM role BrainyBeeGlueRole. Click Next.
  6. Target database: Create brainybee_db. Click Finish.
  7. Select the crawler and click Run crawler.
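The crawler runs asynchronously, so poll its state before moving on to Athena. A minimal check using standard Glue CLI calls:

```shell
# Poll the crawler state; it returns to READY when the crawl is complete
aws glue get-crawler --name brainybee-crawler \
  --query 'Crawler.State' --output text

# Once READY, confirm the crawler created the table in the catalog
aws glue get-tables --database-name brainybee_db \
  --query 'TableList[].Name' --output text
```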

Step 5: Querying Data with Athena

Once the crawler finishes, Athena can query the new table using standard SQL.

```bash
# Run the query, writing results to the results bucket
aws athena start-query-execution \
  --query-string "SELECT * FROM customers WHERE plan = 'Premium';" \
  --query-execution-context Database=brainybee_db \
  --result-configuration OutputLocation=s3://brainybee-lab-results-<UNIQUE_ID>/
```
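Note that start-query-execution is asynchronous: it returns a query ID, not rows. One way to retrieve the output, sketched with standard Athena CLI calls:

```shell
# Capture the query ID when starting the query
QUERY_ID=$(aws athena start-query-execution \
  --query-string "SELECT * FROM customers WHERE plan = 'Premium';" \
  --query-execution-context Database=brainybee_db \
  --result-configuration OutputLocation=s3://brainybee-lab-results-<UNIQUE_ID>/ \
  --query 'QueryExecutionId' --output text)

# Check the state; repeat until it leaves QUEUED/RUNNING
aws athena get-query-execution --query-execution-id "$QUERY_ID" \
  --query 'QueryExecution.Status.State' --output text

# Once SUCCEEDED, print the result values (first row is the header)
aws athena get-query-results --query-execution-id "$QUERY_ID" \
  --query 'ResultSet.Rows[].Data[].VarCharValue' --output text
```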

Checkpoints

  1. S3 Verification: Run aws s3 ls s3://brainybee-lab-data-<UNIQUE_ID>/customers/. You should see customer_data.csv.
  2. Glue Verification: Check the Glue Console. Under Databases > Tables, you should see a table named customers with columns id, name, signup_date, and plan correctly identified.
  3. Athena Verification: In the Athena Console, run SELECT count(*) FROM customers. The result should be 3.

Concept Review

| Feature | Amazon Athena | Amazon Redshift | AWS Glue |
|---|---|---|---|
| Type | Serverless query engine | Data warehouse (cluster) | Serverless ETL & catalog |
| Primary use | Ad-hoc SQL on S3 | Complex, high-performance analytics | Data prep and metadata |
| Scaling | Automatic | Manual or auto-scaling | Serverless Spark/Python |
| Pricing | Per query (data scanned) | Hourly per node | Per DPU-hour |

Visualizing Concepts: The Serverless Data Lake

\begin{tikzpicture}[node distance=2cm]
  \draw[thick, fill=blue!10] (0,0) rectangle (3,2) node[pos=.5] {S3 (Storage)};
  \draw[thick, fill=green!10] (4,0) rectangle (7,2) node[pos=.5] {Glue (Metadata)};
  \draw[thick, fill=orange!10] (8,0) rectangle (11,2) node[pos=.5] {Athena (Query)};
  \draw[->, >=stealth, thick] (3,1) -- (4,1) node[midway, above] {Index};
  \draw[->, >=stealth, thick] (7,1) -- (8,1) node[midway, above] {SQL};
  \node at (5.5, -1) {No Servers to Manage (Serverless)};
\end{tikzpicture}

Troubleshooting

| Error | Likely Cause | Fix |
|---|---|---|
| Access Denied when running crawler | IAM role lacks S3 permissions | Ensure AWSGlueServiceRole is attached and the role can read your data bucket. |
| Athena: Table not found | Crawler hasn't finished | Wait for the crawler status to return to Ready before querying. |
| Athena: Zero records returned | S3 pathing issue | Point the crawler at the customers/ folder, not the specific file. |

Cost Estimate

  • S3 Storage: ~$0.023 per GB/month. (This lab stores well under 1 GB: < $0.01)
  • Glue Crawler: $0.44 per DPU-hour, with a 10-minute minimum per crawl. (One short run: < $0.10)
  • Amazon Athena: $5.00 per TB scanned, 10 MB minimum per query. (This lab scans kilobytes: < $0.01)
  • Total Estimated Cost: < $0.20
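The Athena line above can be sanity-checked with a small helper. This is a sketch: `estimate_athena_cost` is a made-up name, and it assumes the $5/TB rate with Athena's 10 MB per-query minimum, treating 1 TB as 2^40 bytes:

```shell
# Estimate the cost of one Athena query from the bytes it scans
estimate_athena_cost() {
  awk -v b="$1" 'BEGIN {
    min = 10 * 1024 * 1024        # 10 MB per-query minimum
    if (b < min) b = min
    tb = b / (1024 ^ 4)           # bytes -> TB (2^40 bytes)
    printf "%.8f\n", tb * 5.00    # $5.00 per TB scanned
  }'
}

estimate_athena_cost 1048576   # 1 MB scanned still bills the 10 MB minimum
```

Even a full terabyte of scanning costs only $5, which is why ad-hoc analysis on S3 is so cheap compared to running a cluster.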

Challenge

Can you convert the data to Apache Parquet format? Parquet is a columnar format that makes Athena queries faster and cheaper because it reduces the amount of data scanned.

Hint

Use an AWS Glue Studio visual job to read from the customers table and write to a new S3 folder with the format set to "Parquet". Then, update your Athena query to point to the Parquet table.
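As an alternative to a Glue Studio job, Athena itself can do the conversion with a CTAS (CREATE TABLE AS SELECT) statement run from the query editor. A sketch, where customers_parquet and the output prefix are placeholder names:

```sql
-- Write the customers table out as Parquet and register a new table over it
CREATE TABLE customers_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://brainybee-lab-data-<UNIQUE_ID>/customers_parquet/'
) AS
SELECT * FROM customers;
```

Querying customers_parquet should then scan far fewer bytes than the CSV-backed table.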

Clean-Up / Teardown

To avoid ongoing charges, delete the resources created:

```bash
# Delete S3 objects and buckets
aws s3 rb s3://brainybee-lab-data-<UNIQUE_ID> --force
aws s3 rb s3://brainybee-lab-results-<UNIQUE_ID> --force

# Delete Glue resources
aws glue delete-crawler --name brainybee-crawler
aws glue delete-database --name brainybee_db

# Detach policies, then delete the IAM role
aws iam detach-role-policy --role-name BrainyBeeGlueRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
# If you attached an S3 read policy (e.g. AmazonS3ReadOnlyAccess), detach it too
aws iam detach-role-policy --role-name BrainyBeeGlueRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
aws iam delete-role --role-name BrainyBeeGlueRole
```
