
Lab: Mastering Schema Evolution with AWS Glue Crawlers


This lab provides hands-on experience in managing a technical data catalog using AWS Glue. You will learn how to automate the discovery of data schemas in Amazon S3 and handle real-world schema evolution (changing data structures) gracefully.

Prerequisites

Before starting, ensure you have:

  • An AWS Account with administrative access.
  • AWS CLI installed and configured (aws configure).
  • Basic knowledge of SQL and JSON/CSV formats.
  • IAM permissions for S3, Glue, and Athena.

[!IMPORTANT] Replace <YOUR_ACCOUNT_ID> and <YOUR_REGION> with your actual AWS details throughout the lab.

Learning Objectives

By the end of this lab, you will be able to:

  1. Create and configure an AWS Glue Crawler to discover metadata.
  2. Build a Technical Data Catalog representing S3 datasets.
  3. Implement Schema Evolution by adding new columns and re-crawling data.
  4. Use Amazon Athena to verify metadata updates via SQL.

Architecture Overview

Raw CSV files land in an Amazon S3 bucket. An AWS Glue Crawler scans the bucket, infers the schema, and writes table definitions to the Glue Data Catalog. Amazon Athena then queries the data in S3 through those catalog tables.

Step-by-Step Instructions

Step 1: Create the Data Source (S3)

First, we need a landing zone for our raw data.

```bash
# Generate a unique bucket name
export BUCKET_NAME="brainybee-lab-$(date +%s)"

# Create the bucket
aws s3 mb s3://$BUCKET_NAME

# Create initial data file (Version 1)
echo "product_id,product_name,price" > data_v1.csv
echo "101,Laptop,1200.00" >> data_v1.csv
echo "102,Mouse,25.00" >> data_v1.csv

# Upload to S3
aws s3 cp data_v1.csv s3://$BUCKET_NAME/products/data_v1.csv
```
Console alternative
  1. Navigate to S3 Console > Create bucket.
  2. Name it brainybee-lab-[timestamp].
  3. Create a folder named products.
  4. Upload a CSV file with columns: product_id, product_name, price.

Step 2: Create the Glue Database and Crawler

We will now set up the "librarian" (Crawler) to index our data.

```bash
# Create the Glue Database
aws glue create-database --database-input '{"Name":"lab_catalog_db"}'

# Create an IAM Role for Glue (simplified for this lab)
# In production, use the 'AWSGlueServiceRole' managed policy.

# Create the Crawler
aws glue create-crawler --name "product-crawler" \
  --role "service-role/AWSGlueServiceRoleDefault" \
  --database-name "lab_catalog_db" \
  --targets '{"S3Targets": [{"Path": "s3://'$BUCKET_NAME'/products/"}]}'
```

Step 3: Run the Crawler and Verify Initial Schema

```bash
# Start the crawler
aws glue start-crawler --name "product-crawler"

# Monitor status (wait until the state is READY)
aws glue get-crawler --name "product-crawler" --query "Crawler.State"
```

Checkpoints

| Verification Step | Expected Result |
|---|---|
| Check Glue Table | Run `aws glue get-table --database-name lab_catalog_db --name products`. You should see 3 columns: `product_id`, `product_name`, `price`. |
| Query in Athena | Run `SELECT * FROM products;` in Athena. You should see 2 rows of data. |

Step 4: Evolve the Schema

Now, simulate a business change: we need to track category and stock_count.

```bash
# Create evolved data file (Version 2)
echo "product_id,product_name,price,category,stock_count" > data_v2.csv
echo "103,Keyboard,45.00,Electronics,50" >> data_v2.csv
echo "104,Desk Lamp,15.00,Furniture,20" >> data_v2.csv

# Upload the new file to the same S3 prefix
aws s3 cp data_v2.csv s3://$BUCKET_NAME/products/data_v2.csv

# Run the Crawler again
aws glue start-crawler --name "product-crawler"
```

[!TIP] By default, Glue Crawlers update the table definition when new columns are detected at the end of the schema.
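After the second crawl, the V1 rows return NULL for the new columns when queried in Athena. The Python sketch below simulates that behavior by projecting V1 rows onto the evolved schema; it is a toy illustration, not Glue's actual merge logic.

```python
import csv
import io

# Rows written under the V1 schema have no values for the columns added in
# V2, so readers projecting onto the evolved schema see None/NULL for them.
v1_data = "product_id,product_name,price\n101,Laptop,1200.00\n102,Mouse,25.00\n"
v2_fields = ["product_id", "product_name", "price", "category", "stock_count"]

rows = [
    {field: row.get(field) for field in v2_fields}
    for row in csv.DictReader(io.StringIO(v1_data))
]

print(rows[0])
# {'product_id': '101', 'product_name': 'Laptop', 'price': '1200.00',
#  'category': None, 'stock_count': None}
```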

Visualizing Schema Evolution

Below is a representation of how the metadata table evolves between crawls.

```text
Table Schema V1                      Table Schema V2
---------------------                ---------------------
product_id   (int)                   product_id   (int)
product_name (string)     ==>        product_name (string)
price        (double)   Crawler      price        (double)
                        Detection    category     (string)  <- new
                                     stock_count  (int)     <- new
```

Concept Review

| Feature | Description | Use Case |
|---|---|---|
| Schema Inference | Glue identifies data types automatically (e.g., `string` vs `int`). | When source data formats are unknown or change frequently. |
| Incremental Crawl | Scans only new partitions added since the last run. | Optimizing performance for large, frequently updated S3 datasets. |
| Classification | Rules (Grok/JSON) used to determine file format. | Handling custom or legacy data formats. |
| Schema Evolution | Handling additions or deletions in data structure. | Managing data integrity across downstream applications. |
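Schema inference can be made concrete with a toy version of type detection: try the narrowest type first and fall back to `string`. This is a simplification for intuition only, not Glue's actual classifier logic.

```python
# Toy schema inference: return the narrowest type that fits every sample value.
def infer_type(values):
    for type_name, cast in (("int", int), ("double", float)):
        try:
            for v in values:
                cast(v)  # raises ValueError if the value does not fit
            return type_name
        except ValueError:
            continue
    return "string"

print(infer_type(["101", "102"]))        # int
print(infer_type(["1200.00", "25.00"]))  # double
print(infer_type(["Laptop", "Mouse"]))   # string
```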

Troubleshooting

| Error/Issue | Potential Cause | Solution |
|---|---|---|
| Crawler fails with `403 Access Denied` | IAM role lacks S3 permissions. | Ensure the Glue service role has `s3:GetObject` and `s3:ListBucket` on your bucket. |
| No tables created in Catalog | Incorrect S3 path in Crawler targets. | Verify the S3 path ends with a `/` and contains files the crawler can read. |
| Columns not updating | Crawler configuration set to "Ignore change". | Check Crawler Settings: ensure it is set to "Update the table definition in the data catalog". |

Challenge

Modify the Crawler to handle deleted columns. By default, Glue keeps deleted columns in the catalog to prevent breaking queries. Change the Crawler settings to "Delete tables and columns from the data catalog" when they are no longer found in S3. Run the crawler after removing data_v2.csv and see what happens to the table schema.
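One way to approach this from the CLI is to update the crawler's schema-change policy, where `DELETE_FROM_DATABASE` tells the crawler to drop objects no longer found in S3 (verify the flag against your CLI version). The sketch below validates the policy JSON locally and leaves the AWS calls commented so they only run when you intend them to:

```shell
# Desired policy: update changed columns, delete columns/tables missing from S3
POLICY='{"UpdateBehavior":"UPDATE_IN_DATABASE","DeleteBehavior":"DELETE_FROM_DATABASE"}'
echo "$POLICY" > schema-change-policy.json

# Sanity-check the JSON locally
python3 -m json.tool schema-change-policy.json

# With credentials configured, apply the policy and re-run the crawler:
# aws glue update-crawler --name "product-crawler" --schema-change-policy "$POLICY"
# aws glue start-crawler --name "product-crawler"
```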

Cost Estimate

  • S3 Storage: Negligible (<$0.01 for this volume).
  • Glue Crawler: $0.44 per DPU-Hour (Minimum 10-minute billing). This lab usually costs ~$0.15.
  • Athena: $5 per TB scanned. This lab scans <1KB, essentially free.
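The crawler figure can be sanity-checked with a back-of-envelope calculation, assuming the crawler runs on 2 DPUs (an assumption; confirm against your region's Glue pricing page) billed at the 10-minute minimum:

```python
# Crawler cost ~= DPUs x hourly rate x billed fraction of an hour.
# 2 DPUs is an assumption; $0.44/DPU-hour and the 10-minute minimum
# come from the estimate above.
dpus = 2
rate_per_dpu_hour = 0.44
billed_hours = 10 / 60

cost = dpus * rate_per_dpu_hour * billed_hours
print(f"~${cost:.2f} per crawl")  # ~$0.15 per crawl
```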

Clean-Up / Teardown

[!WARNING] Failure to delete these resources may result in small recurring charges for Glue metadata storage and S3 storage.

```bash
# 1. Delete the S3 bucket and all objects
aws s3 rb s3://$BUCKET_NAME --force

# 2. Delete the Glue Crawler
aws glue delete-crawler --name "product-crawler"

# 3. Delete the Glue Database (and its tables)
aws glue delete-database --name "lab_catalog_db"
```
