AWS Data Processing: EMR, Redshift, and Glue
Use the features of AWS services to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue)
This guide covers the core AWS services used to transform raw data into optimized formats for analytics, focusing on the selection, configuration, and orchestration of processing workloads.
Learning Objectives
After studying this guide, you should be able to:
- Select the appropriate transformation service (Glue, EMR, Redshift, or Lambda) based on workload requirements.
- Implement data format conversions, specifically optimizing storage by converting CSV to Apache Parquet.
- Design serverless and provisioned workflows using orchestration tools like Step Functions and MWAA.
- Apply visual data preparation techniques for non-technical users using AWS Glue DataBrew.
- Optimize processing costs and performance through partitioning and indexing strategies.
Key Terms & Glossary
- ETL (Extract, Transform, Load): The process of gathering data from sources, changing its format/structure, and loading it into a target system.
- Parquet: A columnar storage file format that provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
- DPU (Data Processing Unit): A relative measure of processing power for AWS Glue jobs.
- RPU (Redshift Processing Unit): The unit used to measure compute capacity in Amazon Redshift Serverless.
- Stateful Transformation: A transformation that requires knowledge of previous events (e.g., windowing or running totals).
- Stateless Transformation: A transformation where each record is processed independently (e.g., masking or filtering).
The "Big Idea"
In the AWS ecosystem, data processing is not a "one size fits all" task. The goal is to move data from a high-entropy, raw state (S3 Data Lake) to a low-entropy, structured state (Redshift/Lake Formation) using the tool that best balances operational overhead, cost, and technical expertise. Glue is for serverless ETL; EMR is for massive, customizable big data clusters; and Redshift is for SQL-heavy analytical processing.
Formula / Concept Box
| Service | Scaling Unit | Management Type | Best For |
|---|---|---|---|
| AWS Glue | DPUs | Serverless | Python/Spark ETL, Auto-scaling workloads |
| Amazon EMR | Instance Nodes | Provisioned / Serverless | Open-source frameworks (Spark, Flink, Hive) |
| AWS Lambda | Memory (MB) | Serverless | Event-driven, short (<15 min) transformations |
| Amazon Redshift | RPUs / Nodes | Serverless / Provisioned | SQL-based transformations, Data Warehousing |
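The scaling units in the table above map directly to billing formulas. A back-of-the-envelope cost sketch (the per-unit rates below are illustrative assumptions; check current AWS pricing for your region, and note Glue also enforces a per-job minimum duration):

```python
# Rough cost formulas for the scaling units above.
# Rates are assumptions for illustration, not authoritative pricing.

def glue_job_cost(dpus: float, hours: float, rate_per_dpu_hour: float = 0.44) -> float:
    """Glue bills per DPU-hour: DPUs x runtime x rate."""
    return dpus * hours * rate_per_dpu_hour

def lambda_cost(memory_mb: int, duration_sec: float, invocations: int,
                rate_per_gb_sec: float = 0.0000166667) -> float:
    """Lambda bills on GB-seconds: (memory in GB) x duration x invocations."""
    gb_seconds = (memory_mb / 1024) * duration_sec * invocations
    return gb_seconds * rate_per_gb_sec

# A 10-DPU Glue job running for 2 hours at the assumed rate:
print(round(glue_job_cost(10, 2), 2))  # -> 8.8
```

The takeaway for service selection: Lambda's pricing favors short, frequent, event-driven transformations, while Glue's DPU-hour model favors sustained batch ETL.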
Hierarchical Outline
- I. Serverless Data Transformation (AWS Glue)
- Glue Jobs: Spark-based (Batch/Streaming) or Python Shell (Lightweight).
- Data Catalog: Centralized metadata repository populated by Crawlers.
- Glue DataBrew: 250+ pre-built transformations for non-technical users.
- II. Large-Scale Distributed Processing (Amazon EMR)
- Frameworks: Apache Spark, Hive, Flink, and Presto.
- Cluster Types: Provisioned (EC2-based) vs. EMR Serverless.
- Customization: Deep control over Hadoop ecosystem configurations.
- III. SQL-Based Transformation (Amazon Redshift)
- Redshift Spectrum: Query data directly in S3 using SQL.
- Federated Query: Access data in RDS/Aurora without moving it.
- Load/Unload: Moving data between S3 and Redshift tables efficiently.
- IV. Pipeline Orchestration
- AWS Step Functions: State machine for resilient, multi-step tasks.
- Amazon MWAA: Managed Apache Airflow for complex programmatic workflows.
- EventBridge: Triggering transformations based on schedules or S3 events.
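The EventBridge pattern in the outline can be made concrete: an "Object Created" event invokes a Lambda function, which starts a Glue job for the new file. A minimal sketch, assuming an EventBridge S3 notification event and a hypothetical Glue job named csv-to-parquet:

```python
def extract_s3_object(event):
    """Pull bucket and key out of an EventBridge S3 'Object Created' event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def handler(event, context):
    """EventBridge-invoked Lambda: kick off a Glue job for a new S3 object."""
    import boto3  # provided by the Lambda runtime

    bucket, key = extract_s3_object(event)
    glue = boto3.client("glue")
    glue.start_job_run(
        JobName="csv-to-parquet",  # hypothetical job name
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
```

For multi-step pipelines, the same event would more typically start a Step Functions execution rather than call Glue directly, so retries and error handling live in the state machine.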
Visual Anchors
Data Processing Flow
Provisioned vs. Serverless Architecture
\begin{tikzpicture}[scale=0.8]
  % Serverless side
  \draw[fill=blue!10, rounded corners] (0,0) rectangle (4,3);
  \node at (2,2.5) {\textbf{Serverless}};
  \node[draw, fill=white] (G) at (2,1.5) {AWS Glue};
  \node[draw, fill=white] (L) at (2,0.5) {Lambda};
  % Provisioned side
  \draw[fill=orange!10, rounded corners] (6,0) rectangle (10,3);
  \node at (8,2.5) {\textbf{Provisioned}};
  \node[draw, fill=white] (E) at (8,1.5) {EMR Cluster};
  \node[draw, fill=white] (R) at (8,0.5) {Redshift Nodes};
  % Center axis
  \draw[thick, <->] (4.5,0) -- (5.5,0) node[midway, below] {\small Control vs. Ease};
\end{tikzpicture}
Definition-Example Pairs
- Partition Projection: A Glue feature that calculates partition values from file paths rather than querying the Catalog.
- Example: A dataset organized as /year=2023/month=10/day=27/ can be queried faster because the partition locations are computed directly from the path pattern instead of looked up in the Catalog.
- SCT (Schema Conversion Tool): A tool to convert database schemas from one engine to another.
- Example: Converting a legacy Oracle database schema into a format compatible with Amazon Redshift.
- Dead Letter Queue (DLQ): A queue to store messages that cannot be processed.
- Example: An SQS queue capturing failed Lambda transformation events for manual debugging.
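The partition-projection idea above can be illustrated in code: instead of asking a catalog where each partition lives, the engine enumerates candidate paths from the query predicate. A minimal sketch using the Hive-style date layout from the example (the function name and base path are ours):

```python
from datetime import date, timedelta

def projected_paths(base: str, start: date, end: date):
    """Enumerate Hive-style partition paths for a date range,
    with no catalog lookup -- the essence of partition projection."""
    day = start
    while day <= end:
        yield f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        day += timedelta(days=1)

paths = list(projected_paths("s3://logs", date(2023, 10, 26), date(2023, 10, 27)))
# Two paths, the last being s3://logs/year=2023/month=10/day=27/
```

Because the paths are computed, adding a new day's data requires no crawler run or partition registration before it becomes queryable.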
Worked Examples
Converting CSV to Parquet using AWS Glue
Scenario: You have 1TB of raw CSV logs in S3 and need to convert them to Parquet to save costs on Athena queries.
- Crawl: Run a Glue Crawler on the S3 bucket to populate the Glue Data Catalog with the schema.
- Author: Create a Glue Spark Job. Use the visual editor or a script:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the raw CSV table registered by the Crawler
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="logs", table_name="raw_csv")

# Write it back to S3 in Parquet format
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://optimized-bucket/parquet/"},
    format="parquet")
```

- Optimize: Add partitionKeys (e.g., ["year", "month"]) to the connection options to ensure the output is structured for fast querying.
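A rough sense of why this conversion pays off under scan-based pricing. The $5/TB rate, the 6x compression ratio, and the fraction of columns read are illustrative assumptions; real numbers depend on region, schema, and query shape:

```python
# Scan-priced query engines (like Athena) bill per TB scanned.
PRICE_PER_TB = 5.00  # assumed rate, check current pricing

def query_cost(tb_scanned: float) -> float:
    return tb_scanned * PRICE_PER_TB

csv_tb = 1.0             # the 1 TB of raw CSV from the scenario
parquet_tb = csv_tb / 6  # assumed compression ratio
columns_read = 0.25      # a query touching 25% of columns

csv_cost = query_cost(csv_tb)                         # full scan every query
parquet_cost = query_cost(parquet_tb * columns_read)  # column pruning + compression
print(csv_cost, round(parquet_cost, 3))  # -> 5.0 0.208
```

Parquet compounds two savings: compression shrinks the bytes on disk, and the columnar layout lets the engine skip columns the query never touches.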
Checkpoint Questions
- When should you choose AWS Glue DataBrew over standard AWS Glue ETL?
- What is the primary advantage of using Amazon Redshift Spectrum for processing S3 data?
- Which orchestration service is best for a team that prefers Python-coded workflows with complex dependencies?
- Why is converting data from CSV to Parquet considered a cost-optimization strategy?
Comparison Tables
Orchestration: Step Functions vs. MWAA
| Feature | AWS Step Functions | Amazon MWAA (Airflow) |
|---|---|---|
| Model | State Machine (JSON-based) | Programmatic (Python) |
| Scaling | Highly elastic, serverless | Scalable workers, but managed instances |
| Use Case | Microservices, AWS service integration | Complex data engineering pipelines |
| Timeout | Up to 1 year | Configurable per task |
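The "State Machine (JSON-based)" row above can be made concrete. A minimal Amazon States Language definition for a crawl-then-transform pipeline, built as a Python dict; the crawler and job names are hypothetical placeholders:

```python
import json

# Minimal ASL sketch: start a Glue crawler, then run a Glue job
# synchronously with a retry. Names are placeholders.
state_machine = {
    "Comment": "Crawl raw CSV, then convert to Parquet",
    "StartAt": "StartCrawler",
    "States": {
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "raw-csv-crawler"},
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "csv-to-parquet"},
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "MaxAttempts": 2, "IntervalSeconds": 60}],
            "End": True,
        },
    },
}

definition = json.dumps(state_machine)  # the string you'd hand to Step Functions
```

The .sync suffix on startJobRun is what makes the state wait for the Glue job to finish; in MWAA the equivalent logic would be a Python DAG with task dependencies instead of a JSON document.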
Muddy Points & Cross-Refs
- Athena vs. Redshift Spectrum: Both query S3 via SQL. Use Athena for ad-hoc, serverless discovery. Use Spectrum when you already have a Redshift cluster and want to join S3 data with local warehouse tables.
- Glue vs. EMR: Use Glue for standard Spark ETL where you don't want to manage servers. Use EMR if you need specific versions of open-source software (e.g., a specific Hive version) or massive scale where custom EC2 instance types (such as high-memory instances) are needed.