Amazon Redshift: Data Migration and Remote Access Methods

This guide covers the critical methods for accessing and analyzing data in Amazon Redshift without traditional, heavy ETL processes. Specifically, we focus on Redshift Spectrum, Federated Queries, and Materialized Views as tools for a modern "Zero-ETL" architecture.

Learning Objectives

After studying this guide, you should be able to:

Differentiate between Redshift Spectrum and Federated Queries based on data source (S3 vs. RDS/Aurora).
Configure external schemas and tables for data lake analysis.
Implement Materialized Views to optimize complex query performance.
Understand the "Zero-ETL" philosophy and its benefits for data freshness and reduced maintenance.

Key Terms & Glossary

Redshift Spectrum: A feature allowing Redshift to query data directly from Amazon S3 using open formats (Parquet, CSV, etc.) without loading it.
Federated Query: The ability to query live data in operational databases (Amazon RDS or Aurora) directly from Redshift.
Materialized View (MV): A database object that contains the results of a query, precomputed and stored for fast retrieval.
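External Schema: A logical namespace in Redshift that points at data outside the cluster. For Spectrum it references table metadata in an external catalog (like AWS Glue); for Federated Queries it references a schema in the remote RDS or Aurora database.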
Partition Pruning: An optimization where Redshift skips scanning irrelevant S3 folders based on query filters.

The "Big Idea"

The modern data strategy shifts away from "copying everything into the warehouse" toward Data Gravity and Zero-ETL. Instead of moving massive datasets (which adds latency and cost), Redshift acts as a central hub that "reaches out" to where the data lives, whether that is cold data in an S3 data lake or hot operational data in an Aurora database. This gives a unified view of the organization's data with minimal infrastructure overhead.

Formula / Concept Box

| Command / Concept | Syntax / Rule | Purpose |
| --- | --- | --- |
| Spectrum Schema | `CREATE EXTERNAL SCHEMA s3_data FROM DATA CATALOG...` | Links Redshift to S3 metadata |
| Federated Schema | `CREATE EXTERNAL SCHEMA rds_data FROM POSTGRES...` | Links Redshift to an RDS instance |
| Materialized View | `CREATE MATERIALIZED VIEW mv_name AS SELECT...` | Precomputes and caches results |
| MV Refresh | `REFRESH MATERIALIZED VIEW mv_name;` | Updates the cached data |
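
The Federated Schema row elides most of its clauses. The sketch below shows what a fuller statement might look like, assuming an RDS or Aurora PostgreSQL source. The database name, endpoint, role ARN, secret ARN, and the `orders` and `customer_dim` tables are hypothetical placeholders, not values from this guide.

```sql
-- Hypothetical names throughout: replace the endpoint, ARNs, and tables.
-- The secret is assumed to hold the remote database user name and password.
CREATE EXTERNAL SCHEMA rds_data
FROM POSTGRES
DATABASE 'appdb' SCHEMA 'public'
URI 'my-aurora.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:my-rds-secret-AbCdEf';

-- Join live operational rows (remote) with a local warehouse table.
-- The filter on o.status is a candidate for pushdown to the remote database.
SELECT c.customer_name, COUNT(*) AS open_orders
FROM rds_data.orders o
JOIN customer_dim c ON c.customer_id = o.customer_id
WHERE o.status = 'OPEN'
GROUP BY c.customer_name;
```

Running `EXPLAIN` on the second statement is one way to see which parts of the plan are sent to the remote database.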

Hierarchical Outline

Remote Access: Amazon Redshift Spectrum
- Data Lake Integration: Queries S3 directly (Parquet, ORC, JSON, CSV).
- Decoupled Storage: Scale compute (Redshift) and storage (S3) independently.
- Performance: Uses a dedicated Spectrum fleet; leverages Partitioning to limit data scans.
Remote Access: Federated Queries
- OLTP Integration: Directly queries Amazon RDS and Aurora.
- Unified Analysis: Join Redshift warehouse tables with live operational data.
- Computational Pushdown: Redshift pushes part of the query execution to the remote database to minimize data transfer.
Performance Optimization: Materialized Views
- Precomputation: Ideal for expensive joins and aggregations used in BI dashboards.
- Automatic Query Rewriting: Redshift can automatically route queries to an MV even if the user queries the base table.
- Incremental Refresh: When the view's query is eligible, only processes data that has changed since the last refresh; otherwise Redshift recomputes the view in full.

Visual Anchors

Data Flow: Spectrum vs. Federated Queries

[Diagram omitted]

Materialized View Mechanism

[Diagram omitted]

Definition-Example Pairs

Partition Pruning: The ability to skip files based on directory structure.
- Example: If S3 data is stored as /year=2023/month=01/, a query filtered for January 2023 will ignore all other year and month folders, saving costs.
Computational Pushdown: Moving the query logic to the source data.
- Example: In a Federated Query to RDS, Redshift sends the WHERE clause to RDS so only the relevant rows are sent back over the network.
Data Sharing: Sharing data across Redshift clusters without copying.
- Example: A central "Producer" cluster shares sales data with a "Consumer" cluster used by the marketing team to isolate workloads.
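
The producer/consumer arrangement in the Data Sharing example could be set up roughly as follows. This is a minimal sketch: the share name, the `public.sales` table, and the namespace identifiers are hypothetical, and the two halves run on different clusters.

```sql
-- On the producer cluster (hypothetical share, schema, and table names).
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster: expose the share as a local database and query it.
CREATE DATABASE sales_db FROM DATASHARE sales_share OF NAMESPACE '<producer-namespace-guid>';
SELECT COUNT(*) FROM sales_db.public.sales;
```

The consumer reads the producer's data in place, so the marketing workload runs on its own compute without a copy of the table.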

Worked Examples

1. Querying the Data Lake (Spectrum)

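To query Parquet files in S3, you first create an external schema that points at the data catalog and then query an external table registered in that catalog (sales_external below is assumed to exist there already).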

```sql
-- Create an external schema backed by the data catalog
-- (also creates the catalog database if it does not exist yet)
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'external_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Query the external table joined with a local dimension table
SELECT s.product_id, l.product_name, SUM(s.units) AS total_units
FROM spectrum_schema.sales_external s
JOIN local_product_dim l ON s.product_id = l.product_id
WHERE s.sale_date > '2023-01-01'
GROUP BY 1, 2;
```
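
The query above relies on an external table that the example never defines. As a hedged sketch, it might be declared as a partitioned Parquet table like this. The bucket name, column types, and the `sale_year` / `sale_month` partition columns are assumptions made for illustration.

```sql
-- Hypothetical definition of the external table; bucket and columns are assumed.
CREATE EXTERNAL TABLE spectrum_schema.sales_external (
    product_id INTEGER,
    units      INTEGER,
    sale_date  DATE
)
PARTITIONED BY (sale_year CHAR(4), sale_month CHAR(2))
STORED AS PARQUET
LOCATION 's3://my-example-bucket/sales/';

-- Register one partition and map it to its S3 prefix.
ALTER TABLE spectrum_schema.sales_external
ADD IF NOT EXISTS PARTITION (sale_year = '2023', sale_month = '01')
LOCATION 's3://my-example-bucket/sales/year=2023/month=01/';

-- Filtering on the partition columns lets Spectrum skip every other prefix.
SELECT product_id, SUM(units) AS total_units
FROM spectrum_schema.sales_external
WHERE sale_year = '2023' AND sale_month = '01'
GROUP BY product_id;
```

Pruning applies only to filters on the partition columns. A filter on `sale_date` alone would still list every registered partition.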

2. Creating an Optimized Materialized View

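```sql
-- Precompute daily revenue per store; BACKUP YES includes the view in snapshots
CREATE MATERIALIZED VIEW mv_daily_revenue
BACKUP YES
AS
SELECT
    transaction_date,
    store_id,
    SUM(amount) AS daily_total
FROM sales_transactions
GROUP BY 1, 2;

-- Refresh the view after the base table changes
REFRESH MATERIALIZED VIEW mv_daily_revenue;
```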
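
A manual refresh is easy to forget, so a variant with automatic refresh plus a staleness check may be useful. The sketch below assumes the same `sales_transactions` table; the view name `mv_daily_revenue_auto` is hypothetical, and the system views are queried with the columns I believe they expose on provisioned clusters.

```sql
-- Variant that asks Redshift to refresh the view automatically.
CREATE MATERIALIZED VIEW mv_daily_revenue_auto
AUTO REFRESH YES
AS
SELECT transaction_date, store_id, SUM(amount) AS daily_total
FROM sales_transactions
GROUP BY transaction_date, store_id;

-- Check whether the views are stale and whether auto refresh is on.
SELECT name, is_stale, state, autorefresh
FROM stv_mv_info
WHERE name LIKE 'mv_daily_revenue%';

-- Inspect the most recent refresh attempts and whether they were incremental.
SELECT mv_name, status, refresh_type, starttime
FROM svl_mv_refresh_status
WHERE mv_name LIKE 'mv_daily_revenue%'
ORDER BY starttime DESC
LIMIT 5;
```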

Checkpoint Questions

Which feature would you use to query historical logs stored in .json format on S3 without loading them into Redshift local storage?
What is the main performance benefit of using a Materialized View for a dashboard that runs every minute?
True or False: Federated Queries require you to use AWS Glue to move data from RDS to Redshift.
Which feature supports "Automatic Query Rewriting"?

Comparison Tables

| Feature | Source | Primary Use Case | Performance Strategy |
| --- | --- | --- | --- |
| Spectrum | Amazon S3 | Massive data lake analysis | Partition pruning, external fleet |
| Federated Query | RDS / Aurora | Real-time operational data | Computational pushdown |
| Materialized View | Redshift Tables | High-speed dashboards | Precomputing results/cache |
| Standard Tables | Redshift Local | Hot data, frequent updates | Sort keys, Dist styles |

Muddy Points & Cross-Refs

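MV Refresh: Remember that MVs are not "live" like standard views; they return data as of the last refresh. Refresh them manually with REFRESH MATERIALIZED VIEW, or enable automatic refresh (AUTO REFRESH YES), which works only when the view and its base tables meet Redshift's eligibility criteria (e.g., some older versions excluded views built on Spectrum tables).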
Cost Management: Spectrum is billed per Terabyte scanned. Always use compressed columnar formats like Parquet and implement Partitioning to keep costs low.
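Permissions: Spectrum and Federated Queries require an IAM role attached to the Redshift cluster. For Spectrum, the role needs access to the S3 data and the external catalog (such as AWS Glue); for Federated Queries, it needs access to the Secrets Manager secret that holds the database credentials. Materialized views over local tables need only ordinary database privileges.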
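
Because Spectrum cost follows bytes scanned, it can help to check what recent queries actually read. The sketch below assumes a provisioned cluster where the `SVL_S3QUERY_SUMMARY` system view is available; the one-day window and the ten-row limit are arbitrary choices.

```sql
-- Largest Spectrum scans in the last day, by bytes read from S3.
SELECT query,
       external_table_name,
       is_partitioned,
       s3_scanned_rows,
       s3_scanned_bytes / (1024.0 * 1024 * 1024) AS scanned_gb,
       files
FROM svl_s3query_summary
WHERE starttime > DATEADD(day, -1, GETDATE())
ORDER BY s3_scanned_bytes DESC
LIMIT 10;
```

Large scans against tables where `is_partitioned` is false are natural candidates for partitioning or conversion to Parquet.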