# Lab: Mastering Schema Evolution with AWS Glue Crawlers
*Cataloging and Schema Evolution*
This lab provides hands-on experience in managing a technical data catalog using AWS Glue. You will learn how to automate the discovery of data schemas in Amazon S3 and handle real-world schema evolution (changing data structures) gracefully.
## Prerequisites

Before starting, ensure you have:

- An AWS Account with administrative access.
- AWS CLI installed and configured (`aws configure`).
- Basic knowledge of SQL and JSON/CSV formats.
- IAM permissions for S3, Glue, and Athena.

> [!IMPORTANT]
> Replace `<YOUR_ACCOUNT_ID>` and `<YOUR_REGION>` with your actual AWS details throughout the lab.
## Learning Objectives
By the end of this lab, you will be able to:
- Create and configure an AWS Glue Crawler to discover metadata.
- Build a Technical Data Catalog representing S3 datasets.
- Implement Schema Evolution by adding new columns and re-crawling data.
- Use Amazon Athena to verify metadata updates via SQL.
## Architecture Overview

The lab follows a simple flow: raw CSV files land in Amazon S3, an AWS Glue Crawler scans them and records table metadata in the Glue Data Catalog, and Amazon Athena queries that metadata with standard SQL.

## Step-by-Step Instructions

### Step 1: Create the Data Source (S3)

First, we need a landing zone for our raw data.
```bash
# Generate a unique bucket name
export BUCKET_NAME="brainybee-lab-$(date +%s)"

# Create the bucket
aws s3 mb s3://$BUCKET_NAME

# Create initial data file (Version 1)
echo "product_id,product_name,price" > data_v1.csv
echo "101,Laptop,1200.00" >> data_v1.csv
echo "102,Mouse,25.00" >> data_v1.csv

# Upload to S3
aws s3 cp data_v1.csv s3://$BUCKET_NAME/products/data_v1.csv
```

**Console alternative:**
- Navigate to S3 Console > Create bucket.
- Name it `brainybee-lab-[timestamp]`.
- Create a folder named `products`.
- Upload a CSV file with columns: `product_id,product_name,price`.
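If you build the CSV with a script as above, a quick local check of the header guards against column-order mistakes before uploading. A minimal sketch, reusing the v1 rows from this step:

```bash
# Recreate the v1 file locally (same rows as in Step 1)
echo "product_id,product_name,price" > data_v1.csv
echo "101,Laptop,1200.00" >> data_v1.csv
echo "102,Mouse,25.00" >> data_v1.csv

# Verify the header before uploading
expected="product_id,product_name,price"
header=$(head -n 1 data_v1.csv)
if [ "$header" = "$expected" ]; then
  echo "header OK"
else
  echo "unexpected header: $header" >&2
fi
```

For the file created above this prints `header OK`; a mismatch would surface before the crawler ever sees the data.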
### Step 2: Create the Glue Database and Crawler

We will now set up the "librarian" (the Crawler) to index our data.

```bash
# Create the Glue database
aws glue create-database --database-input '{"Name":"lab_catalog_db"}'

# Create an IAM Role for Glue (simplified for this lab).
# In production, use the 'AWSGlueServiceRole' managed policy.

# Create the Crawler
aws glue create-crawler --name "product-crawler" \
  --role "service-role/AWSGlueServiceRoleDefault" \
  --database-name "lab_catalog_db" \
  --targets '{"S3Targets": [{"Path": "s3://'$BUCKET_NAME'/products/"}]}'
```

### Step 3: Run the Crawler and Verify Initial Schema
```bash
# Start the crawler
aws glue start-crawler --name "product-crawler"

# Monitor status (wait until the state is READY)
aws glue get-crawler --name "product-crawler" --query "Crawler.State"
```

**Checkpoints**
| Verification Step | Expected Result |
|---|---|
| Check Glue Table | Run aws glue get-table --database-name lab_catalog_db --name products. You should see 3 columns: product_id, product_name, price. |
| Query in Athena | Run SELECT * FROM products in Athena. You should see 2 rows of data. |
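The Athena checkpoint can also be driven from the CLI. A sketch, with two assumptions of ours: Athena needs a results location, so we reuse the lab bucket under an `athena-results/` prefix, and the `RUN_ATHENA` guard keeps the snippet safe to paste without credentials:

```bash
# Query and execution context for the lab database
query="SELECT * FROM products"

# Athena requires an output location for results (prefix is our choice)
results="s3://${BUCKET_NAME:-your-bucket}/athena-results/"

# Guarded behind RUN_ATHENA so nothing runs unless you opt in
if [ -n "${RUN_ATHENA:-}" ]; then
  qid=$(aws athena start-query-execution \
    --query-string "$query" \
    --query-execution-context Database=lab_catalog_db \
    --result-configuration OutputLocation="$results" \
    --query "QueryExecutionId" --output text)
  # In practice, wait for the query to reach SUCCEEDED before fetching
  aws athena get-query-results --query-execution-id "$qid"
fi
```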
### Step 4: Evolve the Schema

Now, simulate a business change: we need to track `category` and `stock_count`.
```bash
# Create evolved data file (Version 2)
echo "product_id,product_name,price,category,stock_count" > data_v2.csv
echo "103,Keyboard,45.00,Electronics,50" >> data_v2.csv
echo "104,Desk Lamp,15.00,Furniture,20" >> data_v2.csv

# Upload the new file to the same S3 prefix
aws s3 cp data_v2.csv s3://$BUCKET_NAME/products/data_v2.csv

# Run the Crawler again
aws glue start-crawler --name "product-crawler"
```

> [!TIP]
> By default, Glue Crawlers update the table definition when new columns are detected at the end of the schema.
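Each `start-crawler` call returns immediately, so the status check from Step 3 must be repeated until the run finishes. That wait can be scripted; a minimal sketch (the `wait_for_crawler` helper name is ours, the crawler name is from this lab):

```bash
# Poll the Glue API until the crawler returns to the READY state.
# Defined as a function so nothing runs until it is called.
wait_for_crawler() {
  name="$1"
  state=""
  while [ "$state" != "READY" ]; do
    state=$(aws glue get-crawler --name "$name" --query "Crawler.State" --output text)
    echo "Crawler state: $state"
    [ "$state" = "READY" ] || sleep 15
  done
}

# Usage: wait_for_crawler "product-crawler"
```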
## Visualizing Schema Evolution

Below is a representation of how the metadata table evolves between crawls: the crawler detects the two new trailing columns and updates the table definition in place (new columns in bold).

| Table Schema V1 | Table Schema V2 (after re-crawl) |
|---|---|
| product_id (int) | product_id (int) |
| product_name (string) | product_name (string) |
| price (double) | price (double) |
|  | **category (string)** |
|  | **stock_count (int)** |
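The "new columns at the end" condition can be checked mechanically: the change is safe for the default crawler policy exactly when the v2 header starts with the v1 header. A sketch using the two headers from this lab:

```bash
# The two header rows from data_v1.csv and data_v2.csv
h1="product_id,product_name,price"
h2="product_id,product_name,price,category,stock_count"

# If v2 starts with v1, the change is purely additive at the tail --
# the case the crawler's default update behavior merges cleanly
case "$h2" in
  "$h1",*) echo "v2 appends: ${h2#"$h1",}" ;;
  *) echo "columns were reordered or removed" ;;
esac
```

For these headers it prints `v2 appends: category,stock_count`.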
## Concept Review
| Feature | Description | Use Case |
|---|---|---|
| Schema Inference | Glue identifies data types automatically (e.g., string vs int). | When source data formats are unknown or change frequently. |
| Incremental Crawl | Scans only new partitions added since the last run. | Optimizing performance for large, frequently updated S3 datasets. |
| Classification | Rules (Grok/JSON) used to determine file format. | Handling custom or legacy data formats. |
| Schema Evolution | Handling additions or deletions in data structure. | Managing data integrity across downstream applications. |
## Troubleshooting
| Error/Issue | Potential Cause | Solution |
|---|---|---|
| Crawler fails with 403 Access Denied | IAM Role lacks S3 permissions. | Ensure the Glue service role has s3:GetObject and s3:ListBucket on your bucket. |
| No tables created in Catalog | Incorrect S3 path in Crawler targets. | Verify the S3 path ends with a / and contains files the crawler can read. |
| Columns not updating | Crawler configuration set to "Ignore change". | Check Crawler Settings: ensure it is set to "Update the table definition in the data catalog". |
## Challenge

**Modify the Crawler to handle deleted columns.**

By default, Glue keeps deleted columns in the catalog to prevent breaking queries. Change the Crawler settings to "Delete tables and columns from the data catalog" when they are no longer found in S3. Run the crawler after removing `data_v2.csv` and see what happens to the table schema.
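One way to make this change from the CLI is the crawler's `SchemaChangePolicy` (part of the Glue `UpdateCrawler` API). A sketch, assuming the crawler from this lab; the `RUN_AWS` guard is ours so the snippet is safe to paste without credentials:

```bash
# DELETE_FROM_DATABASE drops catalog entries for objects no longer in S3;
# UPDATE_IN_DATABASE keeps the default behavior for new columns
policy='{"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DELETE_FROM_DATABASE"}'

# Guarded so nothing runs unless you opt in (our convention, not the lab's)
if [ -n "${RUN_AWS:-}" ]; then
  aws glue update-crawler --name "product-crawler" --schema-change-policy "$policy"

  # Remove the v2 file and re-crawl to watch the schema shrink
  aws s3 rm "s3://$BUCKET_NAME/products/data_v2.csv"
  aws glue start-crawler --name "product-crawler"
fi
```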
## Cost Estimate

- S3 Storage: negligible (<$0.01 for this volume).
- Glue Crawler: $0.44 per DPU-hour (minimum 10-minute billing). This lab usually costs ~$0.15.
- Athena: $5 per TB scanned. This lab scans <1 KB, so it is essentially free.
## Clean-Up / Teardown

> [!WARNING]
> Failure to delete these resources may result in small recurring charges for Glue metadata storage and S3 storage.

```bash
# 1. Delete the S3 bucket and all objects
aws s3 rb s3://$BUCKET_NAME --force

# 2. Delete the Glue Crawler
aws glue delete-crawler --name "product-crawler"

# 3. Delete the Glue Database (and its tables)
aws glue delete-database --name "lab_catalog_db"
```