SQL Querying and Data Transformation: Amazon Redshift & Athena

This guide covers the core competencies required to query data and create views using SQL in AWS's primary analytical engines: Amazon Redshift (data warehouse) and Amazon Athena (serverless query engine).

Learning Objectives

After studying this guide, you should be able to:

Differentiate between Amazon Redshift and Amazon Athena for specific analytical use cases.
Implement Materialized Views in Redshift to optimize dashboard performance.
Use CTAS (Create Table As Select) in Athena to transform data into optimized formats like Parquet.
Manage metadata and partitions using DDL and MSCK REPAIR TABLE.
Apply cost-optimization strategies such as columnar data selection and query result reuse.

Key Terms & Glossary

CTAS (Create Table As Select): An Athena operation that creates a new table in the Glue Data Catalog from the results of a SELECT statement, often used for ETL.
MPP (Massively Parallel Processing): The architecture used by Redshift to distribute query execution across multiple compute nodes.
MSCK REPAIR TABLE: A command used in Athena to synchronize partition metadata in the Glue Data Catalog with files on S3.
DPU (Data Processing Unit): The unit of compute for Athena Provisioned Capacity (1 DPU = 4 vCPUs, 16 GB memory).
Materialized View (MV): A database object that contains the precomputed results of a query, used to improve performance for complex joins and aggregations.

The "Big Idea"

While both services use standard SQL, they serve different ends of the "Data Spectrum." Amazon Athena is the go-to for "Data Lake" analysis—it is serverless, queries data directly on S3, and is ideal for ad-hoc exploration. Amazon Redshift is the "Enterprise Data Warehouse"—it is optimized for complex, repeatable, and high-performance analytical reporting at petabyte scale using provisioned resources.

Formula / Concept Box

Feature	Amazon Athena	Amazon Redshift
Model	Serverless (Pay-per-scan)	Provisioned or Serverless (MPP)
Data Source	S3 (via Glue Data Catalog)	Local Managed Storage / S3 (via Spectrum)
Best For	Ad-hoc, log analysis, quick ETL	Complex BI, high-concurrency dashboards
Optimization	Parquet/ORC, Partitioning	Sort Keys, Dist Keys, Materialized Views

Hierarchical Outline

I. Amazon Redshift: High-Performance SQL
- Architecture: Massively Parallel Processing (MPP) for scale.
- Querying: Redshift Query Editor v2 (web-based IDE).
- Materialized Views:
  - Caching: Stores precomputed snapshots for dashboards.
  - Incremental Refresh: Updates only changed data (for RMS or external S3 tables).
  - Auto-Rewriting: Optimizer automatically uses MVs even if the query calls the base table.
II. Amazon Athena: Serverless S3 Querying
- SQL Dialect: ANSI SQL based on Presto/Trino.
- Data Definition (DDL):
  - CREATE/ALTER/DROP for Glue Data Catalog objects.
  - DESCRIBE to view column metadata.
- Data Transformation (ETL):
  - CTAS: Converts data formats (e.g., CSV to Parquet) while creating a new table.
  - INSERT INTO: Appends new data to existing tables.
III. Performance & Cost Optimization
- Athena Query Result Reuse: Reduces cost by returning cached results if query/data are identical.
- Columnar Format: Using Parquet/ORC reduces scanned bytes (and therefore cost).
- Athena Workgroups: Used to isolate teams and set per-query data limits.

Visual Anchors

Athena CTAS Workflow

Loading Diagram...

Redshift MPP Architecture

Compiling TikZ diagram…

⏳

Running TeX engine…

This may take a few seconds

Definition-Example Pairs

Materialized View: A stored query result used to speed up slow dashboards.
- Example: Creating a daily_sales_summary MV that aggregates millions of sales_transactions so the dashboard loads in 2 seconds instead of 2 minutes.
Partition Syncing: Updating the data catalog to recognize new folders in S3.
- Example: After daily logs are uploaded to s3://logs/year=2023/month=10/day=27/, running MSCK REPAIR TABLE logs ensures Athena can "see" the Oct 27th data.
Columnar Selection: Requesting specific attributes rather than the whole record.
- Example: Using SELECT user_id, email instead of SELECT * in Athena to save money by not scanning 50 other unused columns.

Worked Examples

Example 1: Redshift Materialized View

To create a dashboard cache for regional sales:

sql

CREATE MATERIALIZED VIEW mv_regional_sales AS
SELECT 
    region, 
    SUM(sale_amount) as total_revenue
FROM sales_data
GROUP BY region;

-- To refresh the data manually
REFRESH MATERIALIZED VIEW mv_regional_sales;

Example 2: Athena CTAS (CSV to Parquet)

To optimize a raw CSV dataset for performance and cost:

sql

CREATE TABLE optimized_inventory
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/optimized-data/'
)
AS SELECT 
    item_id, 
    quantity, 
    last_updated
FROM raw_csv_inventory
WHERE status = 'active';

Checkpoint Questions

Which Redshift feature allows the query optimizer to use a Materialized View even if the user didn't explicitly name it in their SQL?
Why is SELECT * considered an anti-pattern in Amazon Athena?
What command must you run if you manually upload new folders (partitions) to S3 and want Athena to see them?
What are the two billing models available for Amazon Athena?

▶Click to view answers

Automatic Query Rewriting.
Athena charges based on the amount of data scanned; SELECT * scans every column in the row, maximizing cost.
`MSCK REPAIR TABLE

`. 4. Per-Query Billing (bytes scanned) and Provisioned Capacity (DPUs).

Comparison Tables

DDL vs. DML in Cloud SQL

Command Category	Examples	Usage in Athena/Redshift
DDL (Definition)	`CREATE`, `ALTER`, `DROP`	Used to manage table structures and Glue Catalog metadata.
DML (Manipulation)	`INSERT`, `UPDATE`, `DELETE`	Used to modify data content (Note: Athena is primarily for `INSERT`/`SELECT`).
Metadata	`DESCRIBE`, `SHOW`	Used to inspect column types and existing tables.

Muddy Points & Cross-Refs

Athena vs. Redshift Spectrum: Both query S3. Use Athena for serverless, ad-hoc jobs. Use Redshift Spectrum if you already have a Redshift cluster and want to join S3 data with your local data warehouse tables.
MSCK REPAIR vs. Glue Crawlers: MSCK REPAIR is a SQL command for Hive-compatible paths. Glue Crawlers are automated AWS services that can infer schemas and partitions without manual SQL commands.
Partition Projection: A more advanced Athena feature (not detailed here) that allows Athena to "calculate" partitions instead of looking them up in the Glue Catalog, further speeding up queries on highly partitioned data.