AWS Glue: Source and Target Connections for Data Cataloging

Create new source or target connections for cataloging (for example, AWS Glue)

This study guide covers the essential skills for the AWS Certified Data Engineer – Associate exam (DEA-C01) regarding metadata management, AWS Glue connections, and automated discovery through crawlers.

Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between Built-in, Custom, and Marketplace connectors.
  • Configure JDBC connections with appropriate VPC and Security Group settings.
  • Implement AWS Glue Crawlers to automate schema discovery and partition synchronization.
  • Understand the role of the AWS Glue Data Catalog as a central metadata repository.
  • Address schema evolution and data freshness using classifiers.

Key Terms & Glossary

  • AWS Glue Data Catalog: A central metadata repository that stores table definitions, job definitions, and other control information to manage your AWS Glue environment.
  • Glue Crawler: A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.
  • Glue Connection: A Data Catalog object that stores connection information, such as login credentials, URI strings, and VPC subnet/security group details for a particular data store.
  • Classifier: Logic used by a crawler to determine the format and schema of the data (e.g., CSV, JSON, Parquet).
  • Partition: A way of organizing data into a hierarchical structure to improve query performance (e.g., /year=2023/month=10/day=01/).
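To make the partition idea concrete, here is a small illustrative sketch of how Hive-style `key=value` path segments map to partition columns. `parse_partitions` is a hypothetical helper written for this guide, not part of any AWS SDK; Glue Crawlers perform this mapping for you.

```python
# Illustrative sketch: how Hive-style partition paths map to column values.
# parse_partitions is a hypothetical helper, not part of any AWS SDK.
def parse_partitions(s3_key: str) -> dict:
    """Extract key=value partition pairs from an S3 object key."""
    parts = {}
    for segment in s3_key.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(parse_partitions("logs/year=2023/month=10/day=01/events.parquet"))
# {'year': '2023', 'month': '10', 'day': '01'}
```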

The "Big Idea"

Metadata management is the foundation of a modern data lake. Without a catalog, your data is just a "data swamp"—unstructured and unsearchable. AWS Glue acts as the automated cartographer of your data ecosystem. By creating Connections and running Crawlers, you transform raw, unorganized files in S3 or tables in RDS into a queryable technical catalog that tools like Athena, Redshift Spectrum, and EMR can immediately consume.

Formula / Concept Box

| Connector Category | Source Examples | Description |
| --- | --- | --- |
| Built-in | S3, JDBC (MySQL, Postgres, Oracle, SQL Server), MongoDB | Native AWS support; easy to configure via console/CLI. |
| Custom | Snowflake, Teradata, SAP HANA | Developed using Spark, Athena, or JDBC; requires manual code/JARs. |
| Marketplace | Salesforce, Google BigQuery, MongoDB Atlas | Third-party managed connectors available via AWS Marketplace. |

[!IMPORTANT] Note on Type "UNKNOWN": Connections created using Custom or Marketplace connectors in AWS Glue Studio often appear in the Glue console with the type set to UNKNOWN.

Hierarchical Outline

  1. AWS Glue Connections
    • Functionality: Bridges the gap between transformation jobs and data stores.
    • Required Credentials: Stored securely; integrates with AWS Secrets Manager.
    • Networking: Requires VPC, Subnet, and Security Group info for data stores inside a private network.
  2. AWS Glue Crawlers
    • Automation: Automatically discovers schemas and populates the Data Catalog.
    • Synchronization: Updates the catalog when partitions are added or the schema evolves.
    • Schedule: Can run on-demand or via Cron expressions.
  3. Networking & Security
    • Inbound Rules: JDBC connections require a security group rule that allows the Glue service to communicate with the data store.
    • Self-Referencing SGs: A common best practice is to allow all inbound traffic from the same security group attached to the Glue job.
  4. Data Cataloging Systems
    • Technical Catalogs: Glue Data Catalog, Apache Hive Metastore.
    • Business Catalogs: Amazon SageMaker Catalog (for ML-specific metadata).
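The crawler behaviors in the outline above (scheduling via cron expressions, automatic schema updates) can be sketched as the request payload that AWS Glue's `CreateCrawler` API accepts (boto3's `glue.create_crawler()` takes these fields as keyword arguments). Every name and ARN below is a placeholder, not a real resource.

```python
# Sketch of a CreateCrawler request payload (the same fields boto3's
# glue.create_crawler() accepts as keyword arguments). Names and ARNs
# below are placeholders.
crawler_request = {
    "Name": "my-s3-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "my_catalog_db",
    "Targets": {"S3Targets": [{"Path": "s3://my-bucket/logs/"}]},
    # Run daily at 02:00 UTC; Glue schedules use cron() expressions.
    "Schedule": "cron(0 2 * * ? *)",
    # Let the crawler handle schema evolution and partition sync automatically.
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}

# With boto3 installed and credentials configured, this would be submitted as:
#   boto3.client("glue").create_crawler(**crawler_request)
```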

Visual Anchors

The Data Discovery Pipeline

(Diagram: source data store (S3 or JDBC) → Glue Crawler → Classifiers → Glue Data Catalog → query engines such as Athena, Redshift Spectrum, and EMR.)

JDBC Connectivity Architecture

% Requires \usetikzlibrary{positioning, fit}
\begin{tikzpicture}[node distance=2cm,
    every node/.style={draw, rectangle, rounded corners, align=center, fill=blue!10}]
  \node (glue) {AWS Glue Service};
  \node (db) [below=of glue, fill=green!10] {Target Database \\ (Redshift/RDS)};
  \node (sg) [right=of db, draw=red, fill=red!5] {Security Group \\ Inbound: Port 5439/3306};
  \node (vpc) [draw=orange, dashed, fill=none, inner sep=15pt,
               fit=(db) (sg), label={above:VPC Subnet}] {};
  \draw[->, thick] (glue) -- (db) node[midway, right] {ENI Connection};
  \draw[<->, dashed] (db) -- (sg);
\end{tikzpicture}

Definition-Example Pairs

  • Term: JDBC URL
    • Definition: A string used by a Java application to connect to a database.
    • Example: jdbc:redshift://my-cluster.123456789.us-east-1.redshift.amazonaws.com:5439/dev
  • Term: Partition Projection
    • Definition: A mechanism to calculate partition information from a configuration rather than crawling the S3 file system.
    • Example: Using partition projection for highly granular logs (hourly) to avoid crawler latency.
  • Term: Schema Evolution
    • Definition: The ability of a data system to handle changes in the structure of the data over time.
    • Example: Adding a new_customer_tier column to a CSV file in S3; the Glue Crawler detects the change and updates the table definition without breaking existing jobs.
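The Partition Projection pair above can be made concrete. Projection is configured through table properties (set as `Parameters` on the Glue Data Catalog table); a minimal sketch for hourly-style logs, with an illustrative bucket name and ranges, looks like this:

```python
# Sketch: table properties that enable Athena partition projection, set as
# Parameters on a Glue Data Catalog table instead of running a crawler.
# Bucket name and ranges are illustrative.
projection_params = {
    "projection.enabled": "true",
    "projection.year.type": "integer",
    "projection.year.range": "2020,2030",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "storage.location.template": "s3://my-bucket/logs/year=${year}/month=${month}/",
}
```

Because Athena computes partition locations from these properties at query time, no crawler run is needed when new partitions land in S3.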

Worked Example: Connecting to Amazon Redshift

To enable AWS Glue to write data to a Redshift cluster, follow these steps:

  1. Collect Details: Obtain the Redshift JDBC URL from the Redshift console (General Information section).
  2. Network Config: Identify the VPC, Subnet, and Security Group used by the Redshift cluster.
  3. Security Group Update:
    • Edit the Redshift cluster's security group.
    • Add an Inbound Rule for the database port (e.g., 5439 for Redshift).
    • Set the Source to the same Security Group ID (self-referencing) to allow Glue to enter the VPC via an ENI.
  4. Create Connection in Glue:
    • Navigate to AWS Glue > Connections > Create connection.
    • Select JDBC as the source type.
    • Paste the URL and provide credentials (ideally via Secrets Manager).
  5. Verification: Select the connection in the Glue console and click Test Connection to ensure the IAM role and networking are configured correctly.
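The steps above can be sketched as the `ConnectionInput` payload that boto3's `glue.create_connection(ConnectionInput=...)` accepts. The subnet, security group, secret name, and cluster endpoint are all placeholders for your own resources.

```python
# Sketch of the ConnectionInput payload for steps 1-4 above (boto3:
# glue.create_connection(ConnectionInput=...)). All identifiers are
# placeholders for your own cluster, subnet, and secret.
connection_input = {
    "Name": "redshift-target",
    "ConnectionType": "JDBC",
    "ConnectionProperties": {
        "JDBC_CONNECTION_URL": "jdbc:redshift://my-cluster.123456789.us-east-1.redshift.amazonaws.com:5439/dev",
        # Prefer a Secrets Manager secret over inline USERNAME/PASSWORD.
        "SECRET_ID": "redshift/glue-connection",
    },
    "PhysicalConnectionRequirements": {
        "SubnetId": "subnet-0abc1234",
        "SecurityGroupIdList": ["sg-0def5678"],  # self-referencing SG (step 3)
        "AvailabilityZone": "us-east-1a",
    },
}
```

The `PhysicalConnectionRequirements` block is what lets Glue place its ENI inside the cluster's VPC, which is why the self-referencing security group rule matters.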

Checkpoint Questions

  1. What happens to the Glue Data Catalog when a crawler detects a new column in an S3-based dataset?
  2. Which connector type should you use for a non-natively supported data store like Snowflake?
  3. Why is a self-referencing security group rule often necessary for Glue JDBC connections?
  4. True or False: Glue Crawlers can synchronize partitions in the Data Catalog automatically.

Answers

  1. The Crawler updates the table metadata in the Data Catalog to reflect the new schema (Schema Evolution).
  2. A Custom Connector or a Marketplace Connector.
  3. Because Glue creates Elastic Network Interfaces (ENIs) within your VPC; the database must allow traffic from those ENIs (which share the security group).
  4. True. This is one of the primary functions of a Crawler.

Comparison Tables

Glue Crawler vs. Manual Entry

| Feature | Glue Crawler | Manual Entry |
| --- | --- | --- |
| Effort | Low (automated) | High (manual SQL/DDL) |
| Accuracy | High for standard formats | High (if maintained) |
| Schema Evolution | Automatic detection | Manual updates required |
| Best For | Discovery of unknown schemas | Fixed, strictly governed schemas |

Muddy Points & Cross-Refs

  • Crawler vs. Partition Projection: Crawlers physically list S3 objects, which can be slow for millions of files. Partition Projection (configured via table properties) is faster for predictable, high-volume partitions.
  • IAM Permissions: A common failure point. The IAM role used by the Crawler must have s3:GetObject and s3:ListBucket for S3 sources, plus glue:GetSecurityConfiguration if encryption is used.
  • Data Quality (DQDL): Remember that Glue now supports Data Quality Definition Language to validate data during the crawl/ETL process.
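The IAM permissions point above can be sketched as a minimal policy document for the crawler role, reading one S3 source. The bucket name is a placeholder; in practice this sits alongside the AWS-managed `AWSGlueServiceRole` policy.

```python
# Minimal-sketch IAM policy for a crawler's role reading an S3 source.
# Bucket name is a placeholder; attach alongside the AWSGlueServiceRole
# managed policy. Note ListBucket targets the bucket, GetObject its objects.
crawler_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-data-bucket",
                "arn:aws:s3:::my-data-bucket/*",
            ],
        }
    ],
}
```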
