Study Guide925 words

SQL Querying and Data Transformation: Amazon Redshift & Athena

Use SQL in Amazon Redshift and Athena to query data or to create views

SQL Querying and Data Transformation: Amazon Redshift & Athena

This guide covers the core competencies required to query data and create views using SQL in AWS's primary analytical engines: Amazon Redshift (data warehouse) and Amazon Athena (serverless query engine).


Learning Objectives

After studying this guide, you should be able to:

  • Differentiate between Amazon Redshift and Amazon Athena for specific analytical use cases.
  • Implement Materialized Views in Redshift to optimize dashboard performance.
  • Use CTAS (Create Table As Select) in Athena to transform data into optimized formats like Parquet.
  • Manage metadata and partitions using DDL and MSCK REPAIR TABLE.
  • Apply cost-optimization strategies such as columnar data selection and query result reuse.

Key Terms & Glossary

  • CTAS (Create Table As Select): An Athena operation that creates a new table in the Glue Data Catalog from the results of a SELECT statement, often used for ETL.
  • MPP (Massively Parallel Processing): The architecture used by Redshift to distribute query execution across multiple compute nodes.
  • MSCK REPAIR TABLE: A command used in Athena to synchronize partition metadata in the Glue Data Catalog with files on S3.
  • DPU (Data Processing Unit): The unit of compute for Athena Provisioned Capacity (1 DPU = 4 vCPUs, 16 GB memory).
  • Materialized View (MV): A database object that contains the precomputed results of a query, used to improve performance for complex joins and aggregations.

The "Big Idea"

While both services use standard SQL, they serve different ends of the "Data Spectrum." Amazon Athena is the go-to for "Data Lake" analysis—it is serverless, queries data directly on S3, and is ideal for ad-hoc exploration. Amazon Redshift is the "Enterprise Data Warehouse"—it is optimized for complex, repeatable, and high-performance analytical reporting at petabyte scale using provisioned resources.

Formula / Concept Box

FeatureAmazon AthenaAmazon Redshift
ModelServerless (Pay-per-scan)Provisioned or Serverless (MPP)
Data SourceS3 (via Glue Data Catalog)Local Managed Storage / S3 (via Spectrum)
Best ForAd-hoc, log analysis, quick ETLComplex BI, high-concurrency dashboards
OptimizationParquet/ORC, PartitioningSort Keys, Dist Keys, Materialized Views

Hierarchical Outline

  • I. Amazon Redshift: High-Performance SQL
    • Architecture: Massively Parallel Processing (MPP) for scale.
    • Querying: Redshift Query Editor v2 (web-based IDE).
    • Materialized Views:
      • Caching: Stores precomputed snapshots for dashboards.
      • Incremental Refresh: Updates only changed data (for RMS or external S3 tables).
      • Auto-Rewriting: Optimizer automatically uses MVs even if the query calls the base table.
  • II. Amazon Athena: Serverless S3 Querying
    • SQL Dialect: ANSI SQL based on Presto/Trino.
    • Data Definition (DDL):
      • CREATE/ALTER/DROP for Glue Data Catalog objects.
      • DESCRIBE to view column metadata.
    • Data Transformation (ETL):
      • CTAS: Converts data formats (e.g., CSV to Parquet) while creating a new table.
      • INSERT INTO: Appends new data to existing tables.
  • III. Performance & Cost Optimization
    • Athena Query Result Reuse: Reduces cost by returning cached results if query/data are identical.
    • Columnar Format: Using Parquet/ORC reduces scanned bytes (and therefore cost).
    • Athena Workgroups: Used to isolate teams and set per-query data limits.

Visual Anchors

Athena CTAS Workflow

Loading Diagram...

Redshift MPP Architecture

\begin{tikzpicture}[node distance=1.5cm, every node/.style={draw, rectangle, fill=blue!10, text centered, rounded corners}] \node (leader) [fill=green!20] {\textbf{Leader Node} \ (Query Parsing / Execution Plan)}; \node (node1) [below left of=leader, xshift=-1cm] {Compute Node 1}; \node (node2) [below of=leader] {Compute Node 2}; \node (node3) [below right of=leader, xshift=1cm] {Compute Node 3};

code
\draw[<->, thick] (leader) -- (node1); \draw[<->, thick] (leader) -- (node2); \draw[<->, thick] (leader) -- (node3); \node[draw=none, fill=none, below of=node2, yshift=0.5cm] {\textit{Parallel Execution of Slices}};

\end{tikzpicture}

Definition-Example Pairs

  • Materialized View: A stored query result used to speed up slow dashboards.
    • Example: Creating a daily_sales_summary MV that aggregates millions of sales_transactions so the dashboard loads in 2 seconds instead of 2 minutes.
  • Partition Syncing: Updating the data catalog to recognize new folders in S3.
    • Example: After daily logs are uploaded to s3://logs/year=2023/month=10/day=27/, running MSCK REPAIR TABLE logs ensures Athena can "see" the Oct 27th data.
  • Columnar Selection: Requesting specific attributes rather than the whole record.
    • Example: Using SELECT user_id, email instead of SELECT * in Athena to save money by not scanning 50 other unused columns.

Worked Examples

Example 1: Redshift Materialized View

To create a dashboard cache for regional sales:

sql
CREATE MATERIALIZED VIEW mv_regional_sales AS SELECT region, SUM(sale_amount) as total_revenue FROM sales_data GROUP BY region; -- To refresh the data manually REFRESH MATERIALIZED VIEW mv_regional_sales;

Example 2: Athena CTAS (CSV to Parquet)

To optimize a raw CSV dataset for performance and cost:

sql
CREATE TABLE optimized_inventory WITH ( format = 'PARQUET', external_location = 's3://my-bucket/optimized-data/' ) AS SELECT item_id, quantity, last_updated FROM raw_csv_inventory WHERE status = 'active';

Checkpoint Questions

  1. Which Redshift feature allows the query optimizer to use a Materialized View even if the user didn't explicitly name it in their SQL?
  2. Why is SELECT * considered an anti-pattern in Amazon Athena?
  3. What command must you run if you manually upload new folders (partitions) to S3 and want Athena to see them?
  4. What are the two billing models available for Amazon Athena?
Click to view answers
  1. Automatic Query Rewriting.
  2. Athena charges based on the amount of data scanned; SELECT * scans every column in the row, maximizing cost.
  3. `MSCK REPAIR TABLE

`. 4. Per-Query Billing (bytes scanned) and Provisioned Capacity (DPUs).

Comparison Tables

DDL vs. DML in Cloud SQL

Command CategoryExamplesUsage in Athena/Redshift
DDL (Definition)CREATE, ALTER, DROPUsed to manage table structures and Glue Catalog metadata.
DML (Manipulation)INSERT, UPDATE, DELETEUsed to modify data content (Note: Athena is primarily for INSERT/SELECT).
MetadataDESCRIBE, SHOWUsed to inspect column types and existing tables.

Muddy Points & Cross-Refs

  • Athena vs. Redshift Spectrum: Both query S3. Use Athena for serverless, ad-hoc jobs. Use Redshift Spectrum if you already have a Redshift cluster and want to join S3 data with your local data warehouse tables.
  • MSCK REPAIR vs. Glue Crawlers: MSCK REPAIR is a SQL command for Hive-compatible paths. Glue Crawlers are automated AWS services that can infer schemas and partitions without manual SQL commands.
  • Partition Projection: A more advanced Athena feature (not detailed here) that allows Athena to "calculate" partitions instead of looking them up in the Glue Catalog, further speeding up queries on highly partitioned data.

Ready to study AWS Certified Data Engineer - Associate (DEA-C01)?

Practice tests, flashcards, and all study notes — free, no sign-up needed.

Start Studying — Free