Lab: Managing Schema Evolution with AWS Glue and Athena
Data Models and Schema Evolution
This lab provides a hands-on experience in managing data models and schema evolution within the AWS ecosystem. You will learn how to use AWS Glue Crawlers to discover schemas, handle changes in data structure using Glue DynamicFrames, and query evolved datasets using Amazon Athena.
Prerequisites
- An AWS Account with administrative access or permissions for IAM, S3, Glue, and Athena.
- AWS CLI installed and configured on your local machine.
- Basic knowledge of JSON and SQL.
- An S3 bucket named `brainybee-lab-data-<YOUR_ACCOUNT_ID>` in your preferred region.
Learning Objectives
- Provision an S3-based data lake with initial and evolved data versions.
- Automate metadata discovery using AWS Glue Crawlers.
- Analyze the impact of schema changes (adding columns, changing types) on the Glue Data Catalog.
- Implement schema resolution strategies using Glue DynamicFrames and the `resolveChoice` method.
- Query multi-versioned schemas using Amazon Athena.
Architecture Overview
Raw JSON files land in S3, a Glue Crawler catalogs them into the Glue Data Catalog, a Glue ETL job resolves schema conflicts and writes Parquet to a transformed S3 prefix, and Athena queries the cataloged data directly.
Step-by-Step Instructions
Step 1: Set Up the S3 Data Lake
First, we create the folder structure and upload the "Initial" data (Version 1).
# Create the bucket (change region if necessary)
aws s3 mb s3://brainybee-lab-data-<YOUR_ACCOUNT_ID> --region us-east-1
# Create a local file for Version 1
echo '{"id": 1, "product": "Laptop", "price": 1200}' > v1.json
# Upload to S3
aws s3 cp v1.json s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/v1.json

Console alternative:
1. Open the S3 Console.
2. Click Create bucket and name it `brainybee-lab-data-<YOUR_ACCOUNT_ID>`.
3. Create a folder named `raw/product_data/`.
4. Upload a JSON file with content: `{"id": 1, "product": "Laptop", "price": 1200}`.
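If you prefer to generate the sample file programmatically, a minimal Python sketch (writing the same `v1.json` the CLI step uploads) is:

```python
import json

# Version 1 record: the initial schema, where price is an integer
record_v1 = {"id": 1, "product": "Laptop", "price": 1200}

# Write as a single-line JSON object, the layout the JSON SerDe expects
with open("v1.json", "w") as f:
    f.write(json.dumps(record_v1))

# Sanity check: the file round-trips as valid JSON
with open("v1.json") as f:
    assert json.load(f) == record_v1
```

This also guards against a common crawler failure mode: a file that is not valid JSON is silently classified as UNKNOWN.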
Step 2: Catalog Data with Glue Crawler
We need to create a technical data catalog for this data.
# Create a Glue Database
aws glue create-database --database-input '{"Name": "lab_db"}'
# Create the Crawler
aws glue create-crawler --name "product-crawler" \
--role "AWSGlueServiceRole" \
--database-name "lab_db" \
--targets '{"S3Targets": [{"Path": "s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/"}]}'
# Run the Crawler
aws glue start-crawler --name "product-crawler"

[!NOTE] It takes about 1-2 minutes for the crawler to transition from `STARTING` to `READY`. This process populates `lab_db` with a table representing your JSON schema.
Step 3: Introduce Schema Evolution
Now we upload a second file with an extra column (`discount`) and a changed data type for `price` (from integer to double) to simulate evolution.
# Create local file for Version 2 (Evolved Schema)
echo '{"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10}' > v2.json
# Upload to the same prefix
aws s3 cp v2.json s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/v2.json
# Re-run the crawler to detect changes
aws glue start-crawler --name "product-crawler"

Step 4: Handle Schema Conflict in Glue Job
To handle the potential conflict between `price` as an integer and `price` as a double, we use a Glue ETL job with `resolveChoice`.
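Before running the job, it helps to see what the conflict actually looks like. The sketch below is not Glue code; it is a plain-Python illustration of how a DynamicFrame ends up with a `Choice` type when the same column holds an `int` in one record and a `float` in another, and what `cast:double` resolution does to it:

```python
# Illustrative only: simulates the type conflict Glue tracks as a Choice type.
records = [
    {"id": 1, "product": "Laptop", "price": 1200},                          # v1: price is int
    {"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10},  # v2: price is float
]

# Collect the set of observed types per column -- a two-way set is a "choice"
observed = {}
for rec in records:
    for col, val in rec.items():
        observed.setdefault(col, set()).add(type(val).__name__)

# observed["price"] is {"int", "float"}: the crawler/DynamicFrame sees two candidate types

# "cast:double" resolution: coerce every price to a float, collapsing the choice
resolved = [{**rec, "price": float(rec["price"])} for rec in records]
assert all(isinstance(r["price"], float) for r in resolved)
```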
# Snippet for Glue Script
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table cataloged by the crawler
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database="lab_db", table_name="product_data"
)

# Resolve choice: cast price to double where multiple types were found
resolved_df = dynamic_frame.resolveChoice(specs=[("price", "cast:double")])

# Sink to Parquet for performance optimization
glue_context.write_dynamic_frame.from_options(
    frame=resolved_df,
    connection_type="s3",
    connection_options={"path": "s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/transformed/"},
    format="parquet",
)

Checkpoints
- Crawler Status: Verify the crawler has finished. You should see "1 table created" or "1 table updated" in the Glue Console.
- Schema Check: In the Glue Catalog, view the
product_datatable. Does it contain thediscountcolumn? Ispricemarked asdoubleorchoice? - Athena Query: Go to Athena and run:
SELECT * FROM "lab_db"."product_data". Ensure both rows appear and thediscountfor the first row isnull(Schema-on-read).
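The schema-on-read behavior in the last checkpoint can be sketched locally: at query time the merged (evolved) schema is applied to every stored row, so the v1 row simply has no `discount` value and surfaces as null. A hypothetical illustration, not Athena code:

```python
# Simulate schema-on-read: apply the merged schema to rows written under both versions.
merged_columns = ["id", "product", "price", "discount"]

stored_rows = [
    {"id": 1, "product": "Laptop", "price": 1200},                          # written under schema v1
    {"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10},  # written under schema v2
]

# Missing columns are filled with None at read time (rendered as null by Athena)
result = [{col: row.get(col) for col in merged_columns} for row in stored_rows]

assert result[0]["discount"] is None
assert result[1]["discount"] == 0.10
```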
Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| Crawler fails | IAM Role missing permissions | Attach AWSGlueServiceRole and S3 Access to the role. |
| Athena returns 0 records | Metadata not synced | Ensure the Crawler finished and the S3 path in the table properties is correct. |
| price column is empty | Data type mismatch | Use resolveChoice in Glue to explicitly cast types. |
Clean-Up / Teardown
[!WARNING] Failure to delete these resources will result in ongoing charges for S3 storage and potential Glue catalog costs.
# Delete S3 objects and bucket
aws s3 rb s3://brainybee-lab-data-<YOUR_ACCOUNT_ID> --force
# Delete Glue resources
aws glue delete-crawler --name "product-crawler"
aws glue delete-table --database-name "lab_db" --name "product_data"
aws glue delete-database --name "lab_db"

Cost Estimate
- AWS Glue: $0.44 per DPU-Hour (Minimum 10 minutes per run). Total expected: ~$0.15.
- Amazon S3: $0.023 per GB. Total expected: <$0.01.
- Amazon Athena: $5.00 per TB scanned. Total expected: <$0.01.
- Total: Effectively free within the AWS Free Tier, otherwise under $0.20.
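The Glue figure can be sanity-checked with simple arithmetic, assuming a small 2-DPU job (an assumption for this lab, not a Glue default) billed for the 10-minute minimum stated above:

```python
# Glue cost = DPUs * hours * rate-per-DPU-hour (2-DPU job is a hypothetical sizing)
dpus = 2
minimum_minutes = 10
rate_per_dpu_hour = 0.44

cost = dpus * (minimum_minutes / 60) * rate_per_dpu_hour
assert round(cost, 2) == 0.15  # matches the ~$0.15 estimate above
```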
Stretch Challenge
Task: Implement Partition Projection.
Organize your S3 data by date (e.g., .../year=2023/month=10/v1.json). Update your Glue table to use Partition Projection instead of standard metadata retrieval to speed up Athena queries on large datasets.
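As a starting point for laying out the data, a small helper that derives Hive-style date-partitioned S3 keys of the form the challenge describes (the function name is illustrative, not part of any AWS API):

```python
# Build a Hive-style partitioned key: <prefix>/year=YYYY/month=MM/<filename>
def partitioned_key(prefix: str, year: int, month: int, filename: str) -> str:
    return f"{prefix}/year={year}/month={month:02d}/{filename}"

key = partitioned_key("raw/product_data", 2023, 10, "v1.json")
assert key == "raw/product_data/year=2023/month=10/v1.json"
```

The zero-padded `month` keeps keys lexically sortable, which matches the integer-range format partition projection expects.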
Concept Review
Schema Evolution Strategies
| Strategy | Description | Best For |
|---|---|---|
| Backward Compatible | New schema can read old data. | Adding optional columns. |
| Forward Compatible | Old schema can read new data. | Removing columns. |
| Full/Breaking | Schema changes fundamentally. | Requires data migration/conversion. |
Key Terms
- Glue DynamicFrame: An extension of Spark DataFrames that handles "dirty" data and schema-on-the-fly via the `Choice` type.
- Technical Catalog: A centralized repository (like the Glue Data Catalog) storing metadata about data location, format, and schema.
- resolveChoice: A Glue-specific function used to handle columns with multiple data types by either casting, projecting, or creating separate columns.