
Lab: Managing Schema Evolution with AWS Glue and Athena

Data Models and Schema Evolution


This lab provides a hands-on experience in managing data models and schema evolution within the AWS ecosystem. You will learn how to use AWS Glue Crawlers to discover schemas, handle changes in data structure using Glue DynamicFrames, and query evolved datasets using Amazon Athena.

Prerequisites

  • An AWS Account with administrative access or permissions for IAM, S3, Glue, and Athena.
  • AWS CLI installed and configured on your local machine.
  • Basic knowledge of JSON and SQL.
  • An S3 bucket named brainybee-lab-data-<YOUR_ACCOUNT_ID> in your preferred region.

Learning Objectives

  • Provision an S3-based data lake with initial and evolved data versions.
  • Automate metadata discovery using AWS Glue Crawlers.
  • Analyze the impact of schema changes (adding columns, changing types) on the Glue Data Catalog.
  • Implement schema resolution strategies using Glue DynamicFrames and the resolveChoice method.
  • Query multi-versioned schemas using Amazon Athena.


Step-by-Step Instructions

Step 1: Set Up the S3 Data Lake

First, we create the folder structure and upload the "Initial" data (Version 1).

```bash
# Create the bucket (change region if necessary)
aws s3 mb s3://brainybee-lab-data-<YOUR_ACCOUNT_ID> --region us-east-1

# Create a local file for Version 1
echo '{"id": 1, "product": "Laptop", "price": 1200}' > v1.json

# Upload to S3
aws s3 cp v1.json s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/v1.json
```
**Console alternative**

  1. Open the S3 Console.
  2. Click Create bucket and name it `brainybee-lab-data-<YOUR_ACCOUNT_ID>`.
  3. Create a folder named raw/product_data/.
  4. Upload a JSON file with the content `{"id": 1, "product": "Laptop", "price": 1200}`.

Step 2: Catalog Data with Glue Crawler

Next, we register the dataset in a technical data catalog (the Glue Data Catalog) so its schema can be discovered and queried.

```bash
# Create a Glue Database
aws glue create-database --database-input '{"Name": "lab_db"}'

# Create the Crawler
aws glue create-crawler --name "product-crawler" \
  --role "AWSGlueServiceRole" \
  --database-name "lab_db" \
  --targets '{"S3Targets": [{"Path": "s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/"}]}'

# Run the Crawler
aws glue start-crawler --name "product-crawler"
```

[!NOTE] It takes about 1-2 minutes for the crawler to transition from STARTING to READY. This process populates the lab_db with a table representing your JSON schema.

Step 3: Introduce Schema Evolution

Now we upload a second file with an extra column (discount) and a changed data type for price (from integer to double) to simulate schema evolution.

```bash
# Create a local file for Version 2 (evolved schema)
echo '{"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10}' > v2.json

# Upload to the same prefix
aws s3 cp v2.json s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/v2.json

# Re-run the crawler to detect changes
aws glue start-crawler --name "product-crawler"
```
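Before re-running the crawler, you can reproduce the conflict it will detect with plain Python — a local sketch over the two records above, with no AWS calls involved:

```python
import json

# The two record versions uploaded to the raw/product_data/ prefix
v1 = json.loads('{"id": 1, "product": "Laptop", "price": 1200}')
v2 = json.loads('{"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10}')

# price parses as int in v1 and float in v2 -- the mismatch the
# crawler records as a Choice type (int, double)
print(type(v1["price"]).__name__)  # int
print(type(v2["price"]).__name__)  # float

# discount exists only in v2; readers of v1 see it as missing (null)
print("discount" in v1, "discount" in v2)  # False True
```

This is exactly the ambiguity that `resolveChoice` will settle in Step 4.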

Step 4: Handle Schema Conflict in Glue Job

To handle the potential conflict between price as an integer and price as a double, we use a Glue ETL Job with resolveChoice.

```python
# Snippet for Glue Script
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dynamic_frame = glue_context.create_dynamic_frame.from_catalog(
    database="lab_db", table_name="product_data"
)

# Resolve choice: cast price to double if multiple types are found
resolved_df = dynamic_frame.resolveChoice(specs=[("price", "cast:double")])

# Sink to Parquet for performance optimization
glue_context.write_dynamic_frame.from_options(
    frame=resolved_df,
    connection_type="s3",
    connection_options={"path": "s3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/transformed/"},
    format="parquet",
)
```
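Outside of Glue, the effect of `cast:double` is simply to coerce every value in the column to one numeric type. A minimal local sketch of that behavior (plain Python, not the DynamicFrame API):

```python
# Mixed price types as the crawler sees them across v1 and v2
prices = [1200, 799.99]  # int from v1, float from v2

# Equivalent in spirit to resolveChoice(specs=[("price", "cast:double")]):
# coerce every value to double so downstream consumers see one type
resolved = [float(p) for p in prices]

print(resolved)                              # [1200.0, 799.99]
print({type(p).__name__ for p in resolved})  # {'float'}
```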

Checkpoints

  1. Crawler Status: Verify the crawler has finished. You should see "1 table created" or "1 table updated" in the Glue Console.
  2. Schema Check: In the Glue Catalog, view the product_data table. Does it contain the discount column? Is price marked as double or choice?
  3. Athena Query: Go to Athena and run: SELECT * FROM "lab_db"."product_data". Ensure both rows appear and that discount for the first row is NULL (schema-on-read).
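The schema-on-read behavior in checkpoint 3 — older rows exposing NULL for the new column — can be sketched locally (hypothetical reader logic, not Athena itself):

```python
import json

# Raw records as stored in S3 (v1 lacks the discount field)
raw_rows = [
    '{"id": 1, "product": "Laptop", "price": 1200}',
    '{"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10}',
]

# Schema-on-read: project every record onto the latest catalog schema,
# filling columns absent from older records with None (SQL NULL)
schema = ["id", "product", "price", "discount"]
rows = []
for r in raw_rows:
    rec = json.loads(r)
    rows.append({col: rec.get(col) for col in schema})

print(rows[0]["discount"])  # None -- what Athena renders as null for v1
print(rows[1]["discount"])  # 0.1
```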

Troubleshooting

| Issue | Possible Cause | Fix |
|---|---|---|
| Crawler fails | IAM role missing permissions | Attach the AWSGlueServiceRole policy and S3 access to the role. |
| Athena returns 0 records | Metadata not synced | Ensure the crawler finished and the S3 path in the table properties is correct. |
| price column is empty | Data type mismatch | Use resolveChoice in Glue to explicitly cast types. |

Clean-Up / Teardown

[!WARNING] Failure to delete these resources will result in ongoing charges for S3 storage and potential Glue catalog costs.

```bash
# Delete S3 objects and bucket
aws s3 rb s3://brainybee-lab-data-<YOUR_ACCOUNT_ID> --force

# Delete Glue resources
aws glue delete-crawler --name "product-crawler"
aws glue delete-table --database-name "lab_db" --name "product_data"
aws glue delete-database --name "lab_db"
```

Cost Estimate

  • AWS Glue: $0.44 per DPU-Hour (Minimum 10 minutes per run). Total expected: ~$0.15.
  • Amazon S3: $0.023 per GB. Total expected: <$0.01.
  • Amazon Athena: $5.00 per TB scanned. Total expected: <$0.01.
  • Total: Effectively free within the AWS Free Tier, otherwise under $0.20.

Stretch Challenge

Task: Implement Partition Projection. Organize your S3 data by date (e.g., .../year=2023/month=10/v1.json). Update your Glue table to use Partition Projection instead of standard metadata retrieval to speed up Athena queries on large datasets.
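One way to attack this challenge is to set projection properties on the table so Athena computes partition values instead of looking them up in the catalog. A sketch of the relevant Athena DDL — the year/month ranges here are assumptions; adjust them to match your data, and note that `year` and `month` must also be declared as partition keys on the table:

```sql
-- Enable partition projection on the lab table (Athena DDL sketch)
ALTER TABLE lab_db.product_data SET TBLPROPERTIES (
  'projection.enabled'        = 'true',
  'projection.year.type'      = 'integer',
  'projection.year.range'     = '2023,2025',
  'projection.month.type'     = 'integer',
  'projection.month.range'    = '1,12',
  'projection.month.digits'   = '2',
  'storage.location.template' =
    's3://brainybee-lab-data-<YOUR_ACCOUNT_ID>/raw/product_data/year=${year}/month=${month}/'
);
```

With projection enabled, Athena no longer calls the Glue catalog to enumerate partitions, which removes a common bottleneck on heavily partitioned tables.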

Concept Review

Schema Evolution Strategies

| Strategy | Description | Best For |
|---|---|---|
| Backward compatible | New schema can read old data. | Adding optional columns. |
| Forward compatible | Old schema can read new data. | Removing columns. |
| Full/Breaking | Schema changes fundamentally. | Requires data migration/conversion. |
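The first two rows map directly onto this lab's data. A minimal illustration of both directions (plain Python, with hypothetical reader functions):

```python
old_record = {"id": 1, "product": "Laptop", "price": 1200}
new_record = {"id": 2, "product": "Smartphone", "price": 799.99, "discount": 0.10}

# Backward compatible: a reader on the NEW schema handles OLD data
# because the added column is optional (defaults to None)
def read_with_new_schema(rec):
    return {**rec, "discount": rec.get("discount")}

# Forward compatible: a reader on the OLD schema handles NEW data
# by ignoring columns it does not know about
OLD_COLUMNS = {"id", "product", "price"}
def read_with_old_schema(rec):
    return {k: v for k, v in rec.items() if k in OLD_COLUMNS}

print(read_with_new_schema(old_record)["discount"])  # None
print(read_with_old_schema(new_record))              # discount key dropped
```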

Key Terms

  • Glue DynamicFrame: An extension of Spark DataFrames that handles "dirty" data and ambiguous schemas on the fly via the Choice type.
  • Technical Catalog: A centralized repository (like Glue Data Catalog) storing metadata about data location, format, and schema.
  • resolveChoice: A Glue-specific function used to handle columns with multiple data types by either casting, projecting, or creating separate columns.
