Lab: Managing Schema Evolution with AWS Glue and Athena
Data Models and Schema Evolution
This lab provides a hands-on experience in managing data models and schema evolution within the AWS ecosystem. You will learn how to use AWS Glue Crawlers to discover schemas, handle changes in data structure using Glue DynamicFrames, and query evolved datasets using Amazon Athena.
Prerequisites
- An AWS Account with administrative access or permissions for IAM, S3, Glue, and Athena.
- AWS CLI installed and configured on your local machine.
- Basic knowledge of JSON and SQL.
- An S3 bucket named `brainybee-lab-data-<YOUR_ACCOUNT_ID>` in your preferred region.
Learning Objectives
- Provision an S3-based data lake with initial and evolved data versions.
- Automate metadata discovery using AWS Glue Crawlers.
- Analyze the impact of schema changes (adding columns, changing types) on the Glue Data Catalog.
- Implement schema resolution strategies using Glue DynamicFrames and the `resolveChoice` method.
- Query multi-versioned schemas using Amazon Athena.
Architecture Overview
Raw JSON files land in S3, a Glue Crawler catalogs them into the Glue Data Catalog, a Glue ETL job resolves schema conflicts and writes Parquet to a transformed S3 prefix, and Athena queries the cataloged data directly.
Step-by-Step Instructions
Step 1: Set Up the S3 Data Lake
First, we create the folder structure and upload the "Initial" data (Version 1).
# Create the bucket (change region if necessary)
aws s3 mb s3://brainybee-lab-data-<YOUR_ACCOUNT_ID> --region us-east-1
# Create a local file for Version 1
echo '{"id": 1, "product": "Laptop", "price": 1200}' > v1.json
# Upload to S3
aws s3 cp v1.json s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/v1.json

Console alternative:
1. Open the S3 Console.
2. Click Create bucket and name it `brainybee-lab-data-<YOUR_ACCOUNT_ID>`.
3. Create a folder named `raw/product_data/`.
4. Upload a JSON file with content: `{"id": 1, "product": "Laptop", "price": 1200}`.
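If you prefer to generate the sample file programmatically, a minimal Python sketch (writing the same `v1.json` the CLI step uploads) is:

```python
import json

# Version 1 record: the initial schema, where price is an integer
record_v1 = {"id": 1, "product": "Laptop", "price": 1200}

# Write as a single-line JSON object, the layout the JSON SerDe expects
with open("v1.json", "w") as f:
    f.write(json.dumps(record_v1))

# Sanity check: the file round-trips as valid JSON
with open("v1.json") as f:
    assert json.load(f) == record_v1
```

This also guards against a common crawler failure mode: a file that is not valid JSON is silently classified as UNKNOWN.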
Step 2: Catalog Data with Glue Crawler
We need to create a technical data catalog for this data.
# Create a Glue Database
aws glue create-database --database-input '{"Name": "lab_db"}'
# Create the Crawler
aws glue create-crawler --name "product-crawler" \
--role "AWSGlueServiceRole" \
--database-name "lab_db" \
--targets '{"S3Targets": [{"Path": "s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/"}]}'
# Run the Crawler
aws glue start-crawler --name "product-crawler"

[!NOTE] It takes about 1-2 minutes for the crawler to transition from `STARTING` to `READY`. This process populates `lab_db` with a table representing your JSON schema.
Step 3: Introduce Schema Evolution
Now we upload a second file with an extra column (`discount`) and a changed data type for `price` (from integer to double) to simulate evolution.
# Create local file for Version 2 (Evolved Schema)
echo '{"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10}' > v2.json
# Upload to the same prefix
aws s3 cp v2.json s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/v2.json
# Re-run the crawler to detect changes
aws glue start-crawler --name "product-crawler"

Step 4: Handle Schema Conflict in Glue Job
To handle the potential conflict between `price` as an integer and `price` as a double, we use a Glue ETL job with `resolveChoice`.
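Before running the job, it helps to see what the conflict actually looks like. The sketch below is not Glue code; it is a plain-Python illustration of how a DynamicFrame ends up with a `Choice` type when the same column holds an `int` in one record and a `float` in another, and what `cast:double` resolution does to it:

```python
# Illustrative only: simulates the type conflict Glue tracks as a Choice type.
records = [
    {"id": 1, "product": "Laptop", "price": 1200},                          # v1: price is int
    {"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10},  # v2: price is float
]

# Collect the set of observed types per column -- a two-way set is a "choice"
observed = {}
for rec in records:
    for col, val in rec.items():
        observed.setdefault(col, set()).add(type(val).__name__)

# observed["price"] is {"int", "float"}: the crawler/DynamicFrame sees two candidate types

# "cast:double" resolution: coerce every price to a float, collapsing the choice
resolved = [{**rec, "price": float(rec["price"])} for rec in records]
assert all(isinstance(r["price"], float) for r in resolved)
```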
# Snippet for Glue Script
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the table cataloged by the crawler
dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database="lab_db", table_name="product_data"
)

# Resolve choice: cast price to double where multiple types were found
resolved_df = dynamic_frame.resolveChoice(specs=[("price", "cast:double")])

# Sink to Parquet for performance optimization
glue_context.write_dynamic_frame.from_options(
    frame=resolved_df,
    connection_type="s3",
    connection_options={"path": "s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/transformed/"},
    format="parquet",
)

Checkpoints
- Crawler Status: Verify the crawler has finished. You should see "1 table created" or "1 table updated" in the Glue Console.
- Schema Check: In the Glue Catalog, view the
product_datatable. Does it contain thediscountcolumn? Ispricemarked asdoubleorchoice? - Athena Query: Go to Athena and run:
SELECT * FROM "lab_db"."product_data". Ensure both rows appear and thediscountfor the first row isnull(Schema-on-read).
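The schema-on-read behavior in the last checkpoint can be sketched locally: at query time the merged (evolved) schema is applied to every stored row, so the v1 row simply has no `discount` value and surfaces as null. A hypothetical illustration, not Athena code:

```python
# Simulate schema-on-read: apply the merged schema to rows written under both versions.
merged_columns = ["id", "product", "price", "discount"]

stored_rows = [
    {"id": 1, "product": "Laptop", "price": 1200},                          # written under schema v1
    {"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10},  # written under schema v2
]

# Missing columns are filled with None at read time (rendered as null by Athena)
result = [{col: row.get(col) for col in merged_columns} for row in stored_rows]

assert result[0]["discount"] is None
assert result[1]["discount"] == 0.10
```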
Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| Crawler fails | IAM Role missing permissions | Attach AWSGlueServiceRole and S3 Access to the role. |
| Athena returns 0 records | Metadata not synced | Ensure the Crawler finished and the S3 path in the table properties is correct. |
| price column is empty | Data type mismatch | Use resolveChoice in Glue to explicitly cast types. |
Clean-Up / Teardown
[!WARNING] Failure to delete these resources will result in ongoing charges for S3 storage and potential Glue catalog costs.
# Delete S3 objects and bucket
aws s3 rb s3://brainybee-lab-data-<YOUR_ACCOUNT_ID> --force
# Delete Glue resources
aws glue delete-crawler --name "product-crawler"
aws glue delete-table --database-name "lab_db" --name "product_data"
aws glue delete-database --name "lab_db"

Cost Estimate
- AWS Glue: $0.44 per DPU-Hour (Minimum 10 minutes per run). Total expected: ~$0.15.
- Amazon S3: $0.023 per GB. Total expected: <$0.01.
- Amazon Athena: $5.00 per TB scanned. Total expected: <$0.01.
- Total: Effectively free within the AWS Free Tier, otherwise under $0.20.
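The Glue figure can be sanity-checked with simple arithmetic, assuming a small 2-DPU job (an assumption for this lab, not a Glue default) billed for the 10-minute minimum stated above:

```python
# Glue cost = DPUs * hours * rate-per-DPU-hour (2-DPU job is a hypothetical sizing)
dpus = 2
minimum_minutes = 10
rate_per_dpu_hour = 0.44

cost = dpus * (minimum_minutes / 60) * rate_per_dpu_hour
assert round(cost, 2) == 0.15  # matches the ~$0.15 estimate above
```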
Stretch Challenge
Task: Implement Partition Projection.
Organize your S3 data by date (e.g., .../year=2023/month=10/v1.json). Update your Glue table to use Partition Projection instead of standard metadata retrieval to speed up Athena queries on large datasets.
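As a starting point for laying out the data, a small helper that derives Hive-style date-partitioned S3 keys of the form the challenge describes (the function name is illustrative, not part of any AWS API):

```python
# Build a Hive-style partitioned key: <prefix>/year=YYYY/month=MM/<filename>
def partitioned_key(prefix: str, year: int, month: int, filename: str) -> str:
    return f"{prefix}/year={year}/month={month:02d}/{filename}"

key = partitioned_key("raw/product_data", 2023, 10, "v1.json")
assert key == "raw/product_data/year=2023/month=10/v1.json"
```

The zero-padded `month` keeps keys lexically sortable, which matches the integer-range format partition projection expects.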
Concept Review
Schema Evolution Strategies
| Strategy | Description | Best For |
|---|---|---|
| Backward Compatible | New schema can read old data. | Adding optional columns. |
| Forward Compatible | Old schema can read new data. | Removing columns. |
| Full/Breaking | Schema changes fundamentally. | Requires data migration/conversion. |
Key Terms
- Glue DynamicFrame: An extension of Spark DataFrames that handles "dirty" data and schema-on-the-fly via the `Choice` type.
- Technical Catalog: A centralized repository (like the Glue Data Catalog) storing metadata about data location, format, and schema.
- resolveChoice: A Glue-specific function used to handle columns with multiple data types by either casting, projecting, or creating separate columns.