Data Integration Mastery: Combining Multiple Sources for AWS Data Engineering
Integrate data from multiple sources
Data Integration Mastery: Combining Multiple Sources
This guide explores the strategies and AWS services used to aggregate, transform, and unify data from disparate sources—including SaaS applications, on-premises databases, and multi-cloud environments—to provide a comprehensive business view.
Learning Objectives
After studying this chapter, you should be able to:
- Differentiate between ETL-based integration (Glue/DataBrew) and Query Federation (Athena).
- Identify supported data sources for Amazon QuickSight and AWS Glue DataBrew.
- Explain the mechanics of Amazon Athena Federated Queries using Lambda connectors.
- Determine when to use
UNIONversusJOINtransformations for dataset combination. - Manage complex data structures using nesting and unnesting functions in AWS Glue DataBrew.
The "Big Idea"
In modern data engineering, data is rarely in one place. The "Big Idea" is Data Gravity vs. Unified Access. While data naturally settles in specific silos (SaaS for CRM, On-prem for legacy ERP, S3 for logs), business value is only unlocked when these silos are bridged. AWS provides two paths: Physical Integration (moving data into a central lake/warehouse) and Virtual Integration (querying data where it lives). Mastering when to move data and when to query it in place is the core of the Data Engineer's role.
Key Terms & Glossary
- Federated Query: A method of querying data across multiple disparate sources (databases, object stores) using a single SQL interface without moving the data.
- Push-down Predicate: An optimization where the query engine (Athena) sends the filtering logic (e.g.,
WHERE age > 21) directly to the source database to reduce the amount of data transferred. - Recipe: In AWS Glue DataBrew, a set of instructions or transformations (steps) applied to a dataset to clean and prepare it.
- Union: A transformation that combines two or more datasets vertically by appending rows with similar schemas.
- Join: A transformation that combines datasets horizontally based on a related column (key).
Formula / Concept Box
| Transformation | Logical Direction | Requirement | Real-World Use Case |
|---|---|---|---|
| Union | Vertical (Rows) | Matching column headers/types | Combining monthly sales logs from Jan and Feb into one year-to-date file. |
| Join | Horizontal (Columns) | Common "Key" column | Attaching "Customer Names" from a SQL DB to "Transaction IDs" from a CSV file. |
| Unnesting | Flattening | Nested structure (Array/Map) | Expanding a JSON array of "Items Purchased" into individual rows for analysis. |
Hierarchical Outline
- I. Data Ingestion Sources
- AWS Services: S3 (CSV, JSON, Parquet), Redshift, RDS.
- SaaS Platforms: Salesforce, ServiceNow, Jira, GitHub (via AppFlow/QuickSight).
- External/Hybrid: On-premises via Direct Connect/VPC, Multi-cloud (Azure ADLS, Google GCS).
- II. Visual Transformation (AWS Glue DataBrew)
- Nesting:
NEST_TO_ARRAY,NEST_TO_MAP,NEST_TO_STRUCT. - Formatting: Case modification, date standardization, regex extraction.
- Nesting:
- III. Query Federation (Amazon Athena)
- Lambda Connectors: Translates SQL to source-specific APIs.
- SDK: Ability to build custom connectors for proprietary sources.
- IV. Orchestration & Sequencing
- Step Functions: Coordination of multi-step ETL workflows.
- EventBridge: Triggering integrations based on S3 uploads or schedules.
Visual Anchors
Data Integration Flow
Athena Federated Query Logic
Definition-Example Pairs
- Nesting: Consolidating multiple related columns into a single complex field.
- Example: Taking
home_phone,work_phone, andcell_phoneand nesting them into a single JSON map calledcontact_info.
- Example: Taking
- Push-down Predicates: Sending the "Filter" to the data, rather than bringing the data to the filter.
- Example: When querying a 1TB MySQL table via Athena, the
WHERE status = 'active'clause is sent to MySQL so only 'active' rows are sent over the network.
- Example: When querying a 1TB MySQL table via Athena, the
Worked Examples
Example 1: Combining Disparate E-commerce Data
Scenario: A company has customer profiles in a CSV on S3, order history in a JSON on S3, and product reviews in an on-premises database. They need a unified view in Redshift with minimal coding.
- Tool Selection: AWS Glue DataBrew.
- Connection: Connect DataBrew to the S3 bucket and the on-premises DB via a JDBC connection (VPC required).
- Transformation:
- Apply a Join between the Customer CSV and Order JSON using
customer_id. - Use Format Functions to capitalize all names.
- Use Unnest on the product reviews if they are stored as a JSON array.
- Apply a Join between the Customer CSV and Order JSON using
- Output: Configure the DataBrew job to output the final recipe results directly to an Amazon Redshift table.
Example 2: Multi-Cloud Querying
Scenario: An analyst needs to join a 10GB table in Amazon S3 with a 500MB table residing in Google Cloud Storage (GCS) for a one-time report.
- Tool Selection: Amazon Athena with Federated Query.
- Setup: Deploy the Athena GCS Connector (Lambda function) from the AWS Serverless Application Repository.
- Execution: Run a standard SQL query in the Athena console:
sql
SELECT s3.customer_name, gcs.loyalty_score FROM "s3_catalog"."default"."customers" s3 JOIN "gcs_catalog"."default"."scores" gcs ON s3.id = gcs.customer_id
Checkpoint Questions
- Which AWS service allows you to query data in Google BigQuery or Snowflake using standard SQL without moving the data?
- What is the difference between
NEST_TO_ARRAYandNEST_TO_STRUCTin DataBrew? - True or False: To use Athena Federated Queries, you must write custom Java code for every data source.
- Which transformation should you use to combine a "January_Sales" table and a "February_Sales" table that have identical columns?
▶Click to see answers
- Amazon Athena (using Federated Queries/Lambda connectors).
NEST_TO_ARRAYcreates a list of values, whileNEST_TO_STRUCTcreates a structured object with defined sub-fields.- False. AWS provides pre-built connectors for many sources; custom code is only needed for unsupported/proprietary sources.
- UNION.
Comparison Tables
| Feature | Amazon Athena Federation | AWS Glue DataBrew | Amazon AppFlow |
|---|---|---|---|
| Primary Goal | Querying in place | Visual data cleaning/ETL | SaaS data ingestion |
| Coding Level | SQL Only | No-code (Visual) | No-code (Configuration) |
| Data Movement | None (Virtual) | Moves data to target | Moves data to S3/Redshift |
| Best For | Ad-hoc cross-source analysis | Complex data preparation | Scheduled SaaS syncing |
Muddy Points & Cross-Refs
- Athena vs. Redshift Spectrum: Both query S3, but Athena is for general ad-hoc queries, while Spectrum is specifically for extending Redshift tables into S3. Use Athena Federation when the data isn't in S3 at all (e.g., in a different cloud).
- VPC Requirements: When integrating with on-premises databases, remember that your Glue or Lambda functions must be configured with VPC access, Security Groups, and potentially a NAT Gateway or Direct Connect.
- Cost Check: Athena Federated Queries charge for the data scanned + the Lambda execution time. For high-frequency queries, it is often cheaper to perform a one-time ETL into S3/Redshift.