
Mastering AWS Glue Crawlers and Data Catalogs

Discover schemas and use AWS Glue crawlers to populate data catalogs

This study guide covers the core competencies for discovering schemas and managing metadata using AWS Glue, a critical component of the AWS Certified Data Engineer - Associate (DEA-C01) exam.

Learning Objectives

By the end of this chapter, you should be able to:

  • Explain the role of AWS Glue Crawlers in automated metadata discovery.
  • Configure and use Classifiers to handle standard and custom data formats.
  • Populate and maintain the AWS Glue Data Catalog as a central metadata repository.
  • Manage Schema Evolution and partition synchronization for high-performance querying.
  • Distinguish between Technical Catalogs (Glue) and Business Catalogs (DataZone).

Key Terms & Glossary

  • AWS Glue Data Catalog: A persistent, centralized metadata store that indexes data location, schema, and runtime metrics.
  • Crawler: An automated program that connects to a data store, progresses through a prioritized list of classifiers to determine the schema, and populates the Data Catalog.
  • Classifier: Logic that reads data from a store to recognize its format (e.g., CSV, Parquet, JSON) and returns a certainty score; the crawler works through its classifiers in priority order and uses the first confident match to infer the schema.
  • Schema Evolution: The process of managing changes to data structures (e.g., added columns) over time without breaking downstream applications.
  • Partitioning: A method of organizing data (usually in S3 folders) to improve query performance by limiting the amount of data scanned.
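
To make the partitioning entry concrete, here is a minimal, catalog-free sketch of how Hive-style S3 key layouts enable pruning; the bucket layout and key names are hypothetical:

```python
# Hypothetical Hive-style S3 keys: partition values are encoded in the path,
# which is what a crawler registers as partitions in the Data Catalog.
keys = [
    "sales/year=2023/month=09/day=30/part-0000.parquet",
    "sales/year=2023/month=10/day=27/part-0000.parquet",
    "sales/year=2023/month=10/day=27/part-0001.parquet",
]

def partition_values(key: str) -> dict:
    """Parse key=value path segments, as a crawler would register them."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)

# A query filtering on month=10 only needs to scan the matching objects;
# everything else is pruned without being read.
to_scan = [k for k in keys if partition_values(k).get("month") == "10"]
```

This is the mechanism behind "limiting the amount of data scanned": the engine resolves the filter against partition metadata first, then reads only the surviving S3 prefixes.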

The "Big Idea"

Metadata is the connective tissue of a modern data lake. Without it, your data is just a collection of "dark files" in S3. AWS Glue Crawlers act as the automated librarians of your architecture; they scan raw storage, translate physical bits into logical tables, and register them in a central index. This allows serverless engines like Amazon Athena and Amazon Redshift Spectrum to query S3 data as if it were a structured SQL database.

Formula / Concept Box

| Feature | Configuration / Rule |
| --- | --- |
| Update behavior | Crawlers can "Update the table definition" or "Add new columns only". |
| Deletion behavior | "Delete tables/partitions from catalog", "Mark as deprecated", or "Ignore". |
| Crawler frequency | Triggered on-demand, on a cron schedule, or via EventBridge (event-driven). |
| Standard classifiers | Built-in support for CSV, JSON, Parquet, ORC, Avro, and common log formats. |
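
These update, deletion, and scheduling options correspond to the `SchemaChangePolicy`, `RecrawlPolicy`, and `Schedule` parameters of the Glue `CreateCrawler` API. The sketch below assembles such a request in the shape boto3's `create_crawler` expects; the crawler name, role ARN, and bucket are hypothetical, and the API call itself is shown but not executed:

```python
# Sketch of a boto3 create_crawler request (names/ARNs are hypothetical).
crawler_request = {
    "Name": "daily-sales-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales_db",
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/sales/"}]},
    # "Update the table definition" / "Mark as deprecated":
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "DEPRECATE_IN_DATABASE",
    },
    # CRAWL_EVERYTHING is the default; CRAWL_NEW_FOLDERS_ONLY
    # switches the crawler to incremental crawls.
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVERYTHING"},
    # Cron schedule (UTC); omit this and use EventBridge for event-driven runs.
    "Schedule": "cron(0 6 * * ? *)",
}

# import boto3
# boto3.client("glue").create_crawler(**crawler_request)
```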

Hierarchical Outline

  • I. Data Discovery & Crawlers
    • A. Automation Mechanism: Crawlers scan subsets of data to infer structure.
    • B. Classifiers: Built-in (default) vs. Custom (Grok patterns or XML/JSON paths).
    • C. Connection Types: S3, DynamoDB, JDBC (Redshift, RDS, Snowflake), and MongoDB.
  • II. AWS Glue Data Catalog
    • A. Databases & Tables: Logical containers for metadata.
    • B. Technical Metadata: Column names, data types, and S3 paths.
    • C. Partition Management: Synchronizing new partitions automatically via incremental crawls.
  • III. Schema Evolution & Performance
    • A. Handling Changes: Merging schemas or creating new table versions.
    • B. Optimization: Computing column statistics (min/max/nulls) to help Athena/Redshift query planning.
    • C. Partition Indexes: Speeding up metadata retrieval for tables with millions of partitions.
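
Point III.C maps to the Glue `CreatePartitionIndex` API. A hedged sketch of the request, with hypothetical database, table, and index names:

```python
# Sketch of a boto3 create_partition_index request (names are hypothetical).
# An index on the partition keys lets GetPartitions filter server-side
# instead of paging through millions of partition entries.
partition_index_request = {
    "DatabaseName": "sales_db",
    "TableName": "daily_sales",
    "PartitionIndex": {
        "Keys": ["year", "month", "day"],  # must be existing partition keys
        "IndexName": "by_date",
    },
}

# import boto3
# boto3.client("glue").create_partition_index(**partition_index_request)
```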

Visual Anchors

The Discovery Pipeline

(Diagram placeholder: S3 data store → Crawler → Classifiers → AWS Glue Data Catalog → Athena / Redshift Spectrum.)

Data Catalog Architecture

\begin{tikzpicture}[node distance=1.5cm, every node/.style={draw, rectangle, fill=blue!10, text centered, rounded corners}]
  \node (catalog) [fill=orange!20] {AWS Glue Data Catalog};
  \node (db) [below of=catalog] {Database (Logical Grouping)};
  \node (table) [below of=db] {Table (Schema + Location)};
  \node (part) [below of=table] {Partitions (e.g., year/month/day)};
  \draw[->, thick] (catalog) -- (db);
  \draw[->, thick] (db) -- (table);
  \draw[->, thick] (table) -- (part);
  \node (s3) [right of=part, xshift=3cm, fill=green!10] {Actual Data (Amazon S3)};
  \draw[dashed, ->] (part) -- (s3) node[midway, above] {\small Pointer};
\end{tikzpicture}

Definition-Example Pairs

  • Crawler Connection: A set of properties required to connect to a data store.
    • Example: A JDBC connection string and IAM role credentials used to scan an on-premises PostgreSQL database.
  • Incremental Crawl: A crawler setting that only processes new S3 folders rather than the entire bucket.
    • Example: A daily crawler that only scans the year=2023/month=10/day=27 folder in S3, ignoring previous months to save cost.
  • Custom Classifier: A user-defined rule for data formats not natively recognized by Glue.
    • Example: Using a Grok pattern to parse a proprietary legacy log file where fields are separated by multiple spaces and pipes.
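
The custom-classifier example can be sketched as a `CreateClassifier` request plus a plain-regex equivalent that shows what the Grok pattern would capture; the log format, names, and pattern are hypothetical:

```python
import re

# Sketch of a boto3 create_classifier request for a hypothetical legacy log
# where fields are separated by pipes and variable whitespace.
classifier_request = {
    "GrokClassifier": {
        "Name": "legacy-pipe-log",
        "Classification": "legacy_log",
        "GrokPattern": r"%{TIMESTAMP_ISO8601:ts}\s*\|\s*%{WORD:level}\s*\|\s*%{GREEDYDATA:message}",
    }
}
# import boto3
# boto3.client("glue").create_classifier(**classifier_request)

# The same idea as a plain regex, to show the columns Glue would derive:
line = "2023-10-27T06:00:01 |  WARN | disk usage above threshold"
regex = r"(?P<ts>\S+)\s*\|\s*(?P<level>\w+)\s*\|\s*(?P<message>.+)"
fields = re.match(regex, line).groupdict()
```

Each named capture (`ts`, `level`, `message`) becomes a column in the table the crawler registers.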

Worked Examples

Scenario: Handling a New Column in a Daily CSV Feed

Problem: A vendor adds a discount_code column to the daily sales CSV files stored in S3. Your Glue Data Catalog needs to reflect this without losing historical data.

Step-by-Step Breakdown:

  1. Configure Crawler: Set the crawler's "Schema change policy" to "Update the table definition in the data catalog".
  2. Run Crawler: The crawler scans the newest file and detects the extra column.
  3. Schema Inference: Glue compares the new schema with the existing version in the catalog.
  4. Version Update: Glue creates a new Table Version. Subsequent queries in Athena see the new column, while rows from historical files return null for it (preserving backward compatibility).
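
The backfill behavior in step 4 can be illustrated without touching the catalog: when the merged schema gains a column, rows written before the change simply resolve to null. A pure-Python sketch with hypothetical column names:

```python
# Old files lack discount_code; the vendor's new files include it.
old_schema = ["order_id", "amount"]
new_schema = ["order_id", "amount", "discount_code"]
merged = old_schema + [c for c in new_schema if c not in old_schema]

old_row = {"order_id": 1, "amount": 9.99}                          # pre-change file
new_row = {"order_id": 2, "amount": 5.00, "discount_code": "FALL10"}

# Reading both files under the merged schema: missing columns become null.
resolved = [{col: row.get(col) for col in merged} for row in (old_row, new_row)]
```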

Comparison Tables

| Feature | Technical Catalog (AWS Glue) | Business Catalog (Amazon DataZone) |
| --- | --- | --- |
| Primary audience | Data engineers / developers | Business analysts / data stewards |
| Content | Column types, S3 paths, SerDe info | Business terms, ownership, usage policies |
| Searchability | By table name/attributes | By business glossary and tags |
| Integration | Direct integration with ETL jobs/Athena | Governance-focused (publish/subscribe) |

Checkpoint Questions

  1. What is the benefit of using an "Incremental Crawl" for a data lake with millions of objects?
  2. True or False: A Glue Crawler can update existing table schemas if new columns are added to the source data.
  3. Which AWS service provides a business glossary to map technical Glue attributes to business terms?
  4. What happens to a Data Catalog table if the source S3 data is deleted and the crawler is set to "Mark as deprecated"?

Muddy Points & Cross-Refs

  • Crawler vs. Manual MSCK REPAIR TABLE: If you know your schema hasn't changed and you only added new partitions, running MSCK REPAIR TABLE in Athena is faster and cheaper than running a full Glue Crawler.
  • Partition Projection: For highly partitioned data (e.g., IoT data by minute), even the Data Catalog can become a bottleneck. In these cases, look into Partition Projection to calculate partition locations in-memory during query time.
  • Cost Gotcha: Crawlers are billed per DPU-hour in one-second increments, with a 10-minute minimum per run. Running a crawler every 5 minutes on a small dataset can become unexpectedly expensive; use EventBridge to trigger crawlers only when new data arrives.
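
The event-driven pattern from the cost gotcha can be sketched as a small Lambda handler that starts the crawler only when a relevant object lands. The crawler name and key prefix are hypothetical, and the Glue client is passed in so the logic can be exercised without AWS credentials:

```python
# Hypothetical crawler name; in a real deployment this would come from
# an environment variable set on the Lambda function.
CRAWLER_NAME = "daily-sales-crawler"

def handler(event, glue_client):
    """EventBridge target for S3 'Object Created' events."""
    key = event.get("detail", {}).get("object", {}).get("key", "")
    # Only react to the data feed, not manifests or temp files.
    if not key.startswith("sales/"):
        return "skipped"
    try:
        glue_client.start_crawler(Name=CRAWLER_NAME)
        return "started"
    except glue_client.exceptions.CrawlerRunningException:
        # A run is already in flight; the new file is picked up next run.
        return "already-running"
```

Injecting `glue_client` (rather than constructing it inside the handler) keeps the trigger logic unit-testable with a stub client.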
