Study Guide1,050 words

Data Integration Mastery: Combining Multiple Sources for AWS Data Engineering

Integrate data from multiple sources

Data Integration Mastery: Combining Multiple Sources

This guide explores the strategies and AWS services used to aggregate, transform, and unify data from disparate sources—including SaaS applications, on-premises databases, and multi-cloud environments—to provide a comprehensive business view.

Learning Objectives

After studying this chapter, you should be able to:

  • Differentiate between ETL-based integration (Glue/DataBrew) and Query Federation (Athena).
  • Identify supported data sources for Amazon QuickSight and AWS Glue DataBrew.
  • Explain the mechanics of Amazon Athena Federated Queries using Lambda connectors.
  • Determine when to use UNION versus JOIN transformations for dataset combination.
  • Manage complex data structures using nesting and unnesting functions in AWS Glue DataBrew.

The "Big Idea"

In modern data engineering, data is rarely in one place. The "Big Idea" is Data Gravity vs. Unified Access. While data naturally settles in specific silos (SaaS for CRM, On-prem for legacy ERP, S3 for logs), business value is only unlocked when these silos are bridged. AWS provides two paths: Physical Integration (moving data into a central lake/warehouse) and Virtual Integration (querying data where it lives). Mastering when to move data and when to query it in place is the core of the Data Engineer's role.

Key Terms & Glossary

  • Federated Query: A method of querying data across multiple disparate sources (databases, object stores) using a single SQL interface without moving the data.
  • Push-down Predicate: An optimization where the query engine (Athena) sends the filtering logic (e.g., WHERE age > 21) directly to the source database to reduce the amount of data transferred.
  • Recipe: In AWS Glue DataBrew, a set of instructions or transformations (steps) applied to a dataset to clean and prepare it.
  • Union: A transformation that combines two or more datasets vertically by appending rows with similar schemas.
  • Join: A transformation that combines datasets horizontally based on a related column (key).

Formula / Concept Box

TransformationLogical DirectionRequirementReal-World Use Case
UnionVertical (Rows)Matching column headers/typesCombining monthly sales logs from Jan and Feb into one year-to-date file.
JoinHorizontal (Columns)Common "Key" columnAttaching "Customer Names" from a SQL DB to "Transaction IDs" from a CSV file.
UnnestingFlatteningNested structure (Array/Map)Expanding a JSON array of "Items Purchased" into individual rows for analysis.

Hierarchical Outline

  • I. Data Ingestion Sources
    • AWS Services: S3 (CSV, JSON, Parquet), Redshift, RDS.
    • SaaS Platforms: Salesforce, ServiceNow, Jira, GitHub (via AppFlow/QuickSight).
    • External/Hybrid: On-premises via Direct Connect/VPC, Multi-cloud (Azure ADLS, Google GCS).
  • II. Visual Transformation (AWS Glue DataBrew)
    • Nesting: NEST_TO_ARRAY, NEST_TO_MAP, NEST_TO_STRUCT.
    • Formatting: Case modification, date standardization, regex extraction.
  • III. Query Federation (Amazon Athena)
    • Lambda Connectors: Translates SQL to source-specific APIs.
    • SDK: Ability to build custom connectors for proprietary sources.
  • IV. Orchestration & Sequencing
    • Step Functions: Coordination of multi-step ETL workflows.
    • EventBridge: Triggering integrations based on S3 uploads or schedules.

Visual Anchors

Data Integration Flow

Loading Diagram...

Athena Federated Query Logic

Compiling TikZ diagram…
Running TeX engine…
This may take a few seconds

Definition-Example Pairs

  • Nesting: Consolidating multiple related columns into a single complex field.
    • Example: Taking home_phone, work_phone, and cell_phone and nesting them into a single JSON map called contact_info.
  • Push-down Predicates: Sending the "Filter" to the data, rather than bringing the data to the filter.
    • Example: When querying a 1TB MySQL table via Athena, the WHERE status = 'active' clause is sent to MySQL so only 'active' rows are sent over the network.

Worked Examples

Example 1: Combining Disparate E-commerce Data

Scenario: A company has customer profiles in a CSV on S3, order history in a JSON on S3, and product reviews in an on-premises database. They need a unified view in Redshift with minimal coding.

  1. Tool Selection: AWS Glue DataBrew.
  2. Connection: Connect DataBrew to the S3 bucket and the on-premises DB via a JDBC connection (VPC required).
  3. Transformation:
    • Apply a Join between the Customer CSV and Order JSON using customer_id.
    • Use Format Functions to capitalize all names.
    • Use Unnest on the product reviews if they are stored as a JSON array.
  4. Output: Configure the DataBrew job to output the final recipe results directly to an Amazon Redshift table.

Example 2: Multi-Cloud Querying

Scenario: An analyst needs to join a 10GB table in Amazon S3 with a 500MB table residing in Google Cloud Storage (GCS) for a one-time report.

  1. Tool Selection: Amazon Athena with Federated Query.
  2. Setup: Deploy the Athena GCS Connector (Lambda function) from the AWS Serverless Application Repository.
  3. Execution: Run a standard SQL query in the Athena console:
    sql
    SELECT s3.customer_name, gcs.loyalty_score FROM "s3_catalog"."default"."customers" s3 JOIN "gcs_catalog"."default"."scores" gcs ON s3.id = gcs.customer_id

Checkpoint Questions

  1. Which AWS service allows you to query data in Google BigQuery or Snowflake using standard SQL without moving the data?
  2. What is the difference between NEST_TO_ARRAY and NEST_TO_STRUCT in DataBrew?
  3. True or False: To use Athena Federated Queries, you must write custom Java code for every data source.
  4. Which transformation should you use to combine a "January_Sales" table and a "February_Sales" table that have identical columns?
Click to see answers
  1. Amazon Athena (using Federated Queries/Lambda connectors).
  2. NEST_TO_ARRAY creates a list of values, while NEST_TO_STRUCT creates a structured object with defined sub-fields.
  3. False. AWS provides pre-built connectors for many sources; custom code is only needed for unsupported/proprietary sources.
  4. UNION.

Comparison Tables

FeatureAmazon Athena FederationAWS Glue DataBrewAmazon AppFlow
Primary GoalQuerying in placeVisual data cleaning/ETLSaaS data ingestion
Coding LevelSQL OnlyNo-code (Visual)No-code (Configuration)
Data MovementNone (Virtual)Moves data to targetMoves data to S3/Redshift
Best ForAd-hoc cross-source analysisComplex data preparationScheduled SaaS syncing

Muddy Points & Cross-Refs

  • Athena vs. Redshift Spectrum: Both query S3, but Athena is for general ad-hoc queries, while Spectrum is specifically for extending Redshift tables into S3. Use Athena Federation when the data isn't in S3 at all (e.g., in a different cloud).
  • VPC Requirements: When integrating with on-premises databases, remember that your Glue or Lambda functions must be configured with VPC access, Security Groups, and potentially a NAT Gateway or Direct Connect.
  • Cost Check: Athena Federated Queries charge for the data scanned + the Lambda execution time. For high-frequency queries, it is often cheaper to perform a one-time ETL into S3/Redshift.

Ready to study AWS Certified Data Engineer - Associate (DEA-C01)?

Practice tests, flashcards, and all study notes — free, no sign-up needed.

Start Studying — Free