SQL Querying and Data Transformation: Amazon Redshift & Athena
Use SQL in Amazon Redshift and Athena to query data or to create views
SQL Querying and Data Transformation: Amazon Redshift & Athena
This guide covers the core competencies required to query data and create views using SQL in AWS's primary analytical engines: Amazon Redshift (data warehouse) and Amazon Athena (serverless query engine).
Learning Objectives
After studying this guide, you should be able to:
- Differentiate between Amazon Redshift and Amazon Athena for specific analytical use cases.
- Implement Materialized Views in Redshift to optimize dashboard performance.
- Use CTAS (Create Table As Select) in Athena to transform data into optimized formats like Parquet.
- Manage metadata and partitions using DDL and
MSCK REPAIR TABLE. - Apply cost-optimization strategies such as columnar data selection and query result reuse.
Key Terms & Glossary
- CTAS (Create Table As Select): An Athena operation that creates a new table in the Glue Data Catalog from the results of a SELECT statement, often used for ETL.
- MPP (Massively Parallel Processing): The architecture used by Redshift to distribute query execution across multiple compute nodes.
- MSCK REPAIR TABLE: A command used in Athena to synchronize partition metadata in the Glue Data Catalog with files on S3.
- DPU (Data Processing Unit): The unit of compute for Athena Provisioned Capacity (1 DPU = 4 vCPUs, 16 GB memory).
- Materialized View (MV): A database object that contains the precomputed results of a query, used to improve performance for complex joins and aggregations.
The "Big Idea"
While both services use standard SQL, they serve different ends of the "Data Spectrum." Amazon Athena is the go-to for "Data Lake" analysis—it is serverless, queries data directly on S3, and is ideal for ad-hoc exploration. Amazon Redshift is the "Enterprise Data Warehouse"—it is optimized for complex, repeatable, and high-performance analytical reporting at petabyte scale using provisioned resources.
Formula / Concept Box
| Feature | Amazon Athena | Amazon Redshift |
|---|---|---|
| Model | Serverless (Pay-per-scan) | Provisioned or Serverless (MPP) |
| Data Source | S3 (via Glue Data Catalog) | Local Managed Storage / S3 (via Spectrum) |
| Best For | Ad-hoc, log analysis, quick ETL | Complex BI, high-concurrency dashboards |
| Optimization | Parquet/ORC, Partitioning | Sort Keys, Dist Keys, Materialized Views |
Hierarchical Outline
- I. Amazon Redshift: High-Performance SQL
- Architecture: Massively Parallel Processing (MPP) for scale.
- Querying: Redshift Query Editor v2 (web-based IDE).
- Materialized Views:
- Caching: Stores precomputed snapshots for dashboards.
- Incremental Refresh: Updates only changed data (for RMS or external S3 tables).
- Auto-Rewriting: Optimizer automatically uses MVs even if the query calls the base table.
- II. Amazon Athena: Serverless S3 Querying
- SQL Dialect: ANSI SQL based on Presto/Trino.
- Data Definition (DDL):
CREATE/ALTER/DROPfor Glue Data Catalog objects.DESCRIBEto view column metadata.
- Data Transformation (ETL):
- CTAS: Converts data formats (e.g., CSV to Parquet) while creating a new table.
- INSERT INTO: Appends new data to existing tables.
- III. Performance & Cost Optimization
- Athena Query Result Reuse: Reduces cost by returning cached results if query/data are identical.
- Columnar Format: Using Parquet/ORC reduces scanned bytes (and therefore cost).
- Athena Workgroups: Used to isolate teams and set per-query data limits.
Visual Anchors
Athena CTAS Workflow
Redshift MPP Architecture
\begin{tikzpicture}[node distance=1.5cm, every node/.style={draw, rectangle, fill=blue!10, text centered, rounded corners}] \node (leader) [fill=green!20] {\textbf{Leader Node} \ (Query Parsing / Execution Plan)}; \node (node1) [below left of=leader, xshift=-1cm] {Compute Node 1}; \node (node2) [below of=leader] {Compute Node 2}; \node (node3) [below right of=leader, xshift=1cm] {Compute Node 3};
\draw[<->, thick] (leader) -- (node1);
\draw[<->, thick] (leader) -- (node2);
\draw[<->, thick] (leader) -- (node3);
\node[draw=none, fill=none, below of=node2, yshift=0.5cm] {\textit{Parallel Execution of Slices}};\end{tikzpicture}
Definition-Example Pairs
- Materialized View: A stored query result used to speed up slow dashboards.
- Example: Creating a
daily_sales_summaryMV that aggregates millions ofsales_transactionsso the dashboard loads in 2 seconds instead of 2 minutes.
- Example: Creating a
- Partition Syncing: Updating the data catalog to recognize new folders in S3.
- Example: After daily logs are uploaded to
s3://logs/year=2023/month=10/day=27/, runningMSCK REPAIR TABLE logsensures Athena can "see" the Oct 27th data.
- Example: After daily logs are uploaded to
- Columnar Selection: Requesting specific attributes rather than the whole record.
- Example: Using
SELECT user_id, emailinstead ofSELECT *in Athena to save money by not scanning 50 other unused columns.
- Example: Using
Worked Examples
Example 1: Redshift Materialized View
To create a dashboard cache for regional sales:
CREATE MATERIALIZED VIEW mv_regional_sales AS
SELECT
region,
SUM(sale_amount) as total_revenue
FROM sales_data
GROUP BY region;
-- To refresh the data manually
REFRESH MATERIALIZED VIEW mv_regional_sales;Example 2: Athena CTAS (CSV to Parquet)
To optimize a raw CSV dataset for performance and cost:
CREATE TABLE optimized_inventory
WITH (
format = 'PARQUET',
external_location = 's3://my-bucket/optimized-data/'
)
AS SELECT
item_id,
quantity,
last_updated
FROM raw_csv_inventory
WHERE status = 'active';Checkpoint Questions
- Which Redshift feature allows the query optimizer to use a Materialized View even if the user didn't explicitly name it in their SQL?
- Why is
SELECT *considered an anti-pattern in Amazon Athena? - What command must you run if you manually upload new folders (partitions) to S3 and want Athena to see them?
- What are the two billing models available for Amazon Athena?
▶Click to view answers
- Automatic Query Rewriting.
- Athena charges based on the amount of data scanned;
SELECT *scans every column in the row, maximizing cost. - `MSCK REPAIR TABLE
`. 4. Per-Query Billing (bytes scanned) and Provisioned Capacity (DPUs).
Comparison Tables
DDL vs. DML in Cloud SQL
| Command Category | Examples | Usage in Athena/Redshift |
|---|---|---|
| DDL (Definition) | CREATE, ALTER, DROP | Used to manage table structures and Glue Catalog metadata. |
| DML (Manipulation) | INSERT, UPDATE, DELETE | Used to modify data content (Note: Athena is primarily for INSERT/SELECT). |
| Metadata | DESCRIBE, SHOW | Used to inspect column types and existing tables. |
Muddy Points & Cross-Refs
- Athena vs. Redshift Spectrum: Both query S3. Use Athena for serverless, ad-hoc jobs. Use Redshift Spectrum if you already have a Redshift cluster and want to join S3 data with your local data warehouse tables.
- MSCK REPAIR vs. Glue Crawlers:
MSCK REPAIRis a SQL command for Hive-compatible paths. Glue Crawlers are automated AWS services that can infer schemas and partitions without manual SQL commands. - Partition Projection: A more advanced Athena feature (not detailed here) that allows Athena to "calculate" partitions instead of looking them up in the Glue Catalog, further speeding up queries on highly partitioned data.