# Lab: Mastering Schema Evolution with AWS Glue Crawlers
*Cataloging and Schema Evolution*
This lab provides hands-on experience in managing a technical data catalog using AWS Glue. You will learn how to automate the discovery of data schemas in Amazon S3 and handle real-world schema evolution (changing data structures) gracefully.
## Prerequisites

Before starting, ensure you have:

- An AWS Account with administrative access.
- AWS CLI installed and configured (`aws configure`).
- Basic knowledge of SQL and JSON/CSV formats.
- IAM permissions for S3, Glue, and Athena.

> [!IMPORTANT]
> Replace `<YOUR_ACCOUNT_ID>` and `<YOUR_REGION>` with your actual AWS details throughout the lab.
## Learning Objectives
By the end of this lab, you will be able to:
- Create and configure an AWS Glue Crawler to discover metadata.
- Build a Technical Data Catalog representing S3 datasets.
- Implement Schema Evolution by adding new columns and re-crawling data.
- Use Amazon Athena to verify metadata updates via SQL.
## Architecture Overview

The lab follows a simple flow: raw CSV files land in Amazon S3, an AWS Glue Crawler scans them and records table metadata in the Glue Data Catalog, and Amazon Athena queries that metadata with standard SQL.

## Step-by-Step Instructions

### Step 1: Create the Data Source (S3)

First, we need a landing zone for our raw data.
```bash
# Generate a unique bucket name
export BUCKET_NAME="brainybee-lab-$(date +%s)"

# Create the bucket
aws s3 mb s3://$BUCKET_NAME

# Create initial data file (Version 1)
echo "product_id,product_name,price" > data_v1.csv
echo "101,Laptop,1200.00" >> data_v1.csv
echo "102,Mouse,25.00" >> data_v1.csv

# Upload to S3
aws s3 cp data_v1.csv s3://$BUCKET_NAME/products/data_v1.csv
```

**Console alternative:**
- Navigate to S3 Console > Create bucket.
- Name it `brainybee-lab-[timestamp]`.
- Create a folder named `products`.
- Upload a CSV file with columns: `product_id,product_name,price`.
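If you build the CSV with a script as above, a quick local check of the header guards against column-order mistakes before uploading. A minimal sketch, reusing the v1 rows from this step:

```bash
# Recreate the v1 file locally (same rows as in Step 1)
echo "product_id,product_name,price" > data_v1.csv
echo "101,Laptop,1200.00" >> data_v1.csv
echo "102,Mouse,25.00" >> data_v1.csv

# Verify the header before uploading
expected="product_id,product_name,price"
header=$(head -n 1 data_v1.csv)
if [ "$header" = "$expected" ]; then
  echo "header OK"
else
  echo "unexpected header: $header" >&2
fi
```

For the file created above this prints `header OK`; a mismatch would surface before the crawler ever sees the data.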
### Step 2: Create the Glue Database and Crawler

We will now set up the "librarian" (the Crawler) to index our data.

```bash
# Create the Glue database
aws glue create-database --database-input '{"Name":"lab_catalog_db"}'

# Create an IAM Role for Glue (simplified for this lab).
# In production, use the 'AWSGlueServiceRole' managed policy.

# Create the Crawler
aws glue create-crawler --name "product-crawler" \
  --role "service-role/AWSGlueServiceRoleDefault" \
  --database-name "lab_catalog_db" \
  --targets '{"S3Targets": [{"Path": "s3://'$BUCKET_NAME'/products/"}]}'
```

### Step 3: Run the Crawler and Verify Initial Schema
```bash
# Start the crawler
aws glue start-crawler --name "product-crawler"

# Monitor status (wait until the state is READY)
aws glue get-crawler --name "product-crawler" --query "Crawler.State"
```

**Checkpoints**
| Verification Step | Expected Result |
|---|---|
| Check Glue Table | Run aws glue get-table --database-name lab_catalog_db --name products. You should see 3 columns: product_id, product_name, price. |
| Query in Athena | Run SELECT * FROM products in Athena. You should see 2 rows of data. |
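The Athena checkpoint can also be driven from the CLI. A sketch, with two assumptions of ours: Athena needs a results location, so we reuse the lab bucket under an `athena-results/` prefix, and the `RUN_ATHENA` guard keeps the snippet safe to paste without credentials:

```bash
# Query and execution context for the lab database
query="SELECT * FROM products"

# Athena requires an output location for results (prefix is our choice)
results="s3://${BUCKET_NAME:-your-bucket}/athena-results/"

# Guarded behind RUN_ATHENA so nothing runs unless you opt in
if [ -n "${RUN_ATHENA:-}" ]; then
  qid=$(aws athena start-query-execution \
    --query-string "$query" \
    --query-execution-context Database=lab_catalog_db \
    --result-configuration OutputLocation="$results" \
    --query "QueryExecutionId" --output text)
  # In practice, wait for the query to reach SUCCEEDED before fetching
  aws athena get-query-results --query-execution-id "$qid"
fi
```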
### Step 4: Evolve the Schema

Now, simulate a business change: we need to track `category` and `stock_count`.
```bash
# Create evolved data file (Version 2)
echo "product_id,product_name,price,category,stock_count" > data_v2.csv
echo "103,Keyboard,45.00,Electronics,50" >> data_v2.csv
echo "104,Desk Lamp,15.00,Furniture,20" >> data_v2.csv

# Upload the new file to the same S3 prefix
aws s3 cp data_v2.csv s3://$BUCKET_NAME/products/data_v2.csv

# Run the Crawler again
aws glue start-crawler --name "product-crawler"
```

> [!TIP]
> By default, Glue Crawlers update the table definition when new columns are detected at the end of the schema.
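Each `start-crawler` call returns immediately, so the status check from Step 3 must be repeated until the run finishes. That wait can be scripted; a minimal sketch (the `wait_for_crawler` helper name is ours, the crawler name is from this lab):

```bash
# Poll the Glue API until the crawler returns to the READY state.
# Defined as a function so nothing runs until it is called.
wait_for_crawler() {
  name="$1"
  state=""
  while [ "$state" != "READY" ]; do
    state=$(aws glue get-crawler --name "$name" --query "Crawler.State" --output text)
    echo "Crawler state: $state"
    [ "$state" = "READY" ] || sleep 15
  done
}

# Usage: wait_for_crawler "product-crawler"
```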
## Visualizing Schema Evolution

Below is a representation of how the metadata table evolves between crawls: the crawler detects the two new trailing columns and updates the table definition in place (new columns in bold).

| Table Schema V1 | Table Schema V2 (after re-crawl) |
|---|---|
| product_id (int) | product_id (int) |
| product_name (string) | product_name (string) |
| price (double) | price (double) |
|  | **category (string)** |
|  | **stock_count (int)** |
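The "new columns at the end" condition can be checked mechanically: the change is safe for the default crawler policy exactly when the v2 header starts with the v1 header. A sketch using the two headers from this lab:

```bash
# The two header rows from data_v1.csv and data_v2.csv
h1="product_id,product_name,price"
h2="product_id,product_name,price,category,stock_count"

# If v2 starts with v1, the change is purely additive at the tail --
# the case the crawler's default update behavior merges cleanly
case "$h2" in
  "$h1",*) echo "v2 appends: ${h2#"$h1",}" ;;
  *) echo "columns were reordered or removed" ;;
esac
```

For these headers it prints `v2 appends: category,stock_count`.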
## Concept Review
| Feature | Description | Use Case |
|---|---|---|
| Schema Inference | Glue identifies data types automatically (e.g., string vs int). | When source data formats are unknown or change frequently. |
| Incremental Crawl | Scans only new partitions added since the last run. | Optimizing performance for large, frequently updated S3 datasets. |
| Classification | Rules (Grok/JSON) used to determine file format. | Handling custom or legacy data formats. |
| Schema Evolution | Handling additions or deletions in data structure. | Managing data integrity across downstream applications. |
## Troubleshooting
| Error/Issue | Potential Cause | Solution |
|---|---|---|
| Crawler fails with 403 Access Denied | IAM Role lacks S3 permissions. | Ensure the Glue service role has s3:GetObject and s3:ListBucket on your bucket. |
| No tables created in Catalog | Incorrect S3 path in Crawler targets. | Verify the S3 path ends with a / and contains files the crawler can read. |
| Columns not updating | Crawler configuration set to "Ignore change". | Check Crawler Settings: ensure it is set to "Update the table definition in the data catalog". |
## Challenge

**Modify the Crawler to handle deleted columns.**

By default, Glue keeps deleted columns in the catalog to prevent breaking queries. Change the Crawler settings to "Delete tables and columns from the data catalog" when they are no longer found in S3. Run the crawler after removing `data_v2.csv` and see what happens to the table schema.
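One way to make this change from the CLI is the crawler's `SchemaChangePolicy` (part of the Glue `UpdateCrawler` API). A sketch, assuming the crawler from this lab; the `RUN_AWS` guard is ours so the snippet is safe to paste without credentials:

```bash
# DELETE_FROM_DATABASE drops catalog entries for objects no longer in S3;
# UPDATE_IN_DATABASE keeps the default behavior for new columns
policy='{"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DELETE_FROM_DATABASE"}'

# Guarded so nothing runs unless you opt in (our convention, not the lab's)
if [ -n "${RUN_AWS:-}" ]; then
  aws glue update-crawler --name "product-crawler" --schema-change-policy "$policy"

  # Remove the v2 file and re-crawl to watch the schema shrink
  aws s3 rm "s3://$BUCKET_NAME/products/data_v2.csv"
  aws glue start-crawler --name "product-crawler"
fi
```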
## Cost Estimate

- S3 Storage: negligible (<$0.01 for this volume).
- Glue Crawler: $0.44 per DPU-hour (minimum 10-minute billing). This lab usually costs ~$0.15.
- Athena: $5 per TB scanned. This lab scans <1 KB, so it is essentially free.
## Clean-Up / Teardown

> [!WARNING]
> Failure to delete these resources may result in small recurring charges for Glue metadata storage and S3 storage.

```bash
# 1. Delete the S3 bucket and all objects
aws s3 rb s3://$BUCKET_NAME --force

# 2. Delete the Glue Crawler
aws glue delete-crawler --name "product-crawler"

# 3. Delete the Glue Database (and its tables)
aws glue delete-database --name "lab_catalog_db"
```