Mastering AWS Glue Crawlers and Data Catalogs
Discover schemas and use AWS Glue crawlers to populate data catalogs
This study guide covers the core competencies for discovering schemas and managing metadata using AWS Glue, a critical component of the AWS Certified Data Engineer - Associate (DEA-C01) exam.
Learning Objectives
By the end of this chapter, you should be able to:
- Explain the role of AWS Glue Crawlers in automated metadata discovery.
- Configure and use Classifiers to handle standard and custom data formats.
- Populate and maintain the AWS Glue Data Catalog as a central metadata repository.
- Manage Schema Evolution and partition synchronization for high-performance querying.
- Distinguish between Technical Catalogs (Glue) and Business Catalogs (DataZone).
Key Terms & Glossary
- AWS Glue Data Catalog: A persistent, centralized metadata store that indexes data location, schema, and runtime metrics.
- Crawler: An automated program that connects to a data store, progresses through a prioritized list of classifiers to determine the schema, and populates the Data Catalog.
- Classifier: A logic block that reads the beginning of a file to recognize its format (e.g., CSV, Parquet, JSON) and returns a certainty score used to infer the schema.
- Schema Evolution: The process of managing changes to data structures (e.g., added columns) over time without breaking downstream applications.
- Partitioning: A method of organizing data (usually in S3 folders) to improve query performance by limiting the amount of data scanned.
The "Big Idea"
Metadata is the connective tissue of a modern data lake. Without it, your data is just a collection of "dark files" in S3. AWS Glue Crawlers act as the automated librarians of your architecture; they scan raw storage, translate physical bits into logical tables, and register them in a central index. This allows serverless engines like Amazon Athena and Amazon Redshift Spectrum to query S3 data as if it were a structured SQL database.
Formula / Concept Box
| Feature | Configuration / Rule |
|---|---|
| Update Behavior | Crawlers can "Update the table definition" or "Add new columns only". |
| Deletion Behavior | Options: "Delete tables/partitions from catalog", "Mark as deprecated", or "Ignore". |
| Crawler Frequency | Can be triggered on-demand, by a Cron schedule, or via EventBridge (event-driven). |
| Standard Classifiers | Built-in support for CSV, JSON, Parquet, ORC, Avro, XML, and common log formats (e.g., Apache and MySQL logs). |
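The update, deletion, and scheduling options in the table map directly onto parameters of the Glue `CreateCrawler` API. The following is a minimal sketch of a boto3-style request; the crawler name, role ARN, bucket path, and database name are placeholders, not values from this guide:

```python
# Sketch of a Glue crawler definition covering the configuration knobs above.
# The role ARN, bucket path, and names are placeholders; the request shape
# follows the parameters of boto3's glue.create_crawler().

def build_crawler_request(name, role_arn, s3_path, database):
    """Assemble the keyword arguments for glue.create_crawler()."""
    return {
        "Name": name,
        "Role": role_arn,                      # IAM role the crawler assumes
        "DatabaseName": database,              # target Data Catalog database
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Update behavior: rewrite the table definition when the schema changes
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",  # "Mark as deprecated"
        },
        # Incremental crawl: only visit S3 folders added since the last run
        "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
        # Cron schedule (daily at 02:00 UTC); omit to run on-demand
        "Schedule": "cron(0 2 * * ? *)",
    }

request = build_crawler_request(
    "daily-sales-crawler",
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "s3://my-sales-bucket/raw/",
    "sales_db",
)
# A real call would then be: boto3.client("glue").create_crawler(**request)
```

The same `SchemaChangePolicy` and `RecrawlPolicy` structures are accepted by `update_crawler`, so an existing crawler's behavior can be changed without recreating it.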
Hierarchical Outline
- I. Data Discovery & Crawlers
- A. Automation Mechanism: Crawlers scan subsets of data to infer structure.
- B. Classifiers: Built-in (default) vs. Custom (Grok patterns or XML/JSON paths).
- C. Connection Types: S3, DynamoDB, JDBC (Redshift, RDS, Snowflake), and MongoDB.
- II. AWS Glue Data Catalog
- A. Databases & Tables: Logical containers for metadata.
- B. Technical Metadata: Column names, data types, and S3 paths.
- C. Partition Management: Synchronizing new partitions automatically via incremental crawls.
- III. Schema Evolution & Performance
- A. Handling Changes: Merging schemas or creating new table versions.
- B. Optimization: Computing column statistics (min/max/nulls) to help Athena/Redshift query planning.
- C. Partition Indexes: Speeding up metadata retrieval for tables with millions of partitions.
Visual Anchors
The Discovery Pipeline: data stores (S3, JDBC) → Crawler (runs classifiers) → AWS Glue Data Catalog → query engines (Athena, Redshift Spectrum)
Data Catalog Architecture
\begin{tikzpicture}[node distance=1.5cm, every node/.style={draw, rectangle, fill=blue!10, text centered, rounded corners}]
  \node (catalog) [fill=orange!20] {AWS Glue Data Catalog};
  \node (db) [below of=catalog] {Database (Logical Grouping)};
  \node (table) [below of=db] {Table (Schema + Location)};
  \node (part) [below of=table] {Partitions (e.g., year/month/day)};
  \draw[->, thick] (catalog) -- (db);
  \draw[->, thick] (db) -- (table);
  \draw[->, thick] (table) -- (part);
  \node (s3) [right of=part, xshift=3cm, fill=green!10] {Actual Data (Amazon S3)};
  \draw[dashed, ->] (part) -- (s3) node[midway, above] {\small Pointer};
\end{tikzpicture}
Definition-Example Pairs
- Crawler Connection: A set of properties required to connect to a data store.
- Example: A JDBC connection string and IAM role credentials used to scan an on-premises PostgreSQL database.
- Incremental Crawl: A crawler setting that only processes new S3 folders rather than the entire bucket.
- Example: A daily crawler that only scans the `year=2023/month=10/day=27` folder in S3, ignoring previous months to save cost.
- Custom Classifier: A user-defined rule for data formats not natively recognized by Glue.
- Example: Using a Grok pattern to parse a proprietary legacy log file where fields are separated by multiple spaces and pipes.
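In Glue, the Grok pattern itself would be registered as a custom classifier; the idea is that named captures in the pattern become columns in the inferred schema. As a local illustration only (the log format and field names below are invented, and plain Python regex named groups stand in for Grok syntax):

```python
import re

# The real Grok pattern would live in a Glue custom classifier. Here an
# equivalent regex with named groups extracts fields from a legacy log line
# with pipe-separated fields. The log format is invented for illustration.

LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*\|\s*"
    r"(?P<level>[A-Z]+)\s*\|\s*"
    r"(?P<message>.*)"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_line("2023-10-27 08:15:00 | ERROR | payment gateway timeout")
assert record["level"] == "ERROR"
```

In a crawler run, each named field (timestamp, level, message) would surface as a column of the cataloged table.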
Worked Examples
Scenario: Handling a New Column in a Daily CSV Feed
Problem: A vendor adds a discount_code column to the daily sales CSV files stored in S3. Your Glue Data Catalog needs to reflect this without losing historical data.
Step-by-Step Breakdown:
- Configure Crawler: Set the crawler's "Schema change policy" to "Update the table definition in the data catalog".
- Run Crawler: The crawler scans the newest file and detects the extra column.
- Schema Inference: Glue compares the new schema with the existing version in the catalog.
- Version Update: Glue creates a new Table Version. Current queries in Athena will now see the new column, while old data rows will show `null` for that column (ensuring backward compatibility).
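The null-backfill behavior in the final step can be sketched with plain Python. The column names and row values below are invented; the point is that reading every row through the latest schema fills missing columns with null:

```python
# Toy model of the worked example: after a new table version adds
# discount_code, old rows read as null for the new column while new rows
# carry values. Columns and row data are invented for illustration.

old_schema = ["order_id", "amount"]
new_schema = old_schema + ["discount_code"]   # the new table version

old_rows = [{"order_id": 1, "amount": 9.99}]
new_rows = [{"order_id": 2, "amount": 4.50, "discount_code": "FALL10"}]

def project(rows, schema):
    """Read every row through the latest schema, null-filling missing columns."""
    return [{col: row.get(col) for col in schema} for row in rows]

results = project(old_rows + new_rows, new_schema)
assert results[0]["discount_code"] is None      # historical row: null
assert results[1]["discount_code"] == "FALL10"  # new row: populated
```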
Comparison Tables
| Feature | Technical Catalog (AWS Glue) | Business Catalog (Amazon DataZone) |
|---|---|---|
| Primary Audience | Data Engineers / Developers | Business Analysts / Data Stewards |
| Content | Column types, S3 paths, SerDe info | Business terms, ownership, usage policies |
| Searchability | Search by table name/attributes | Search by business glossary and tags |
| Integration | Direct integration with ETL jobs/Athena | Governance-focused (Publish/Subscribe) |
Checkpoint Questions
- What is the benefit of using an "Incremental Crawl" for a data lake with millions of objects?
- True or False: A Glue Crawler can update existing table schemas if new columns are added to the source data.
- Which AWS service provides a business glossary to map technical Glue attributes to business terms?
- What happens to a Data Catalog table if the source S3 data is deleted and the crawler is set to "Mark as deprecated"?
Muddy Points & Cross-Refs
- Crawler vs. Manual `MSCK REPAIR TABLE`: If you know your schema hasn't changed and you only added new partitions, running `MSCK REPAIR TABLE` in Athena is faster and cheaper than running a full Glue Crawler.
- Partition Projection: For highly partitioned data (e.g., IoT data by minute), even the Data Catalog can become a bottleneck. In these cases, look into Partition Projection to calculate partition locations in memory at query time.
- Cost Gotcha: Crawlers are billed per second with a 10-minute minimum per crawl. Running a crawler every 5 minutes on a small dataset can become unexpectedly expensive; use EventBridge to trigger crawlers only when new data arrives.
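Partition Projection is configured through table properties in Athena rather than through a crawler. The helper below is a hypothetical sketch that assembles those properties for a date partition key; the bucket path and column name are placeholders, while the property keys (`projection.enabled`, `projection.<col>.type`, etc.) are Athena's partition-projection settings:

```python
# Sketch of the Athena table properties that enable partition projection for
# a date-partitioned table, so partition locations are computed at query time
# instead of being looked up in the Data Catalog. Bucket and column names
# are placeholders.

def projection_properties(column, start_date, location_template):
    """Build the table-property dict for date-based partition projection."""
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start_date},NOW",
        f"projection.{column}.format": "yyyy-MM-dd",
        # ${<column>} in the template is replaced with each projected value
        "storage.location.template": location_template,
    }

props = projection_properties("dt", "2023-01-01", "s3://iot-bucket/data/${dt}/")
```

With these properties set, Athena never calls `GetPartitions` for the table, which also means new partitions need no crawler run or `MSCK REPAIR TABLE` to become queryable.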