AWS Data Processing: EMR, Redshift, and Glue
Use the features of AWS services to process data (for example, Amazon EMR, Amazon Redshift, AWS Glue)
This guide covers the core AWS services used to transform raw data into optimized formats for analytics, focusing on the selection, configuration, and orchestration of processing workloads.
Learning Objectives
After studying this guide, you should be able to:
- Select the appropriate transformation service (Glue, EMR, Redshift, or Lambda) based on workload requirements.
- Implement data format conversions, specifically optimizing storage by converting CSV to Apache Parquet.
- Design serverless and provisioned workflows using orchestration tools like Step Functions and MWAA.
- Apply visual data preparation techniques for non-technical users using AWS Glue DataBrew.
- Optimize processing costs and performance through partitioning and indexing strategies.
Key Terms & Glossary
- ETL (Extract, Transform, Load): The process of gathering data from sources, changing its format/structure, and loading it into a target system.
- Parquet: A columnar storage file format that provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
- DPU (Data Processing Unit): A relative measure of processing power for AWS Glue jobs.
- RPU (Redshift Processing Unit): The unit used to measure compute capacity in Amazon Redshift Serverless.
- Stateful Transformation: A transformation that requires knowledge of previous events (e.g., windowing or running totals).
- Stateless Transformation: A transformation where each record is processed independently (e.g., masking or filtering).
The "Big Idea"
In the AWS ecosystem, data processing is not a "one size fits all" task. The goal is to move data from a high-entropy, raw state (S3 Data Lake) to a low-entropy, structured state (Redshift/Lake Formation) using the tool that best balances operational overhead, cost, and technical expertise. Glue is for serverless ETL; EMR is for massive, customizable big data clusters; and Redshift is for SQL-heavy analytical processing.
Formula / Concept Box
| Service | Scaling Unit | Management Type | Best For |
|---|---|---|---|
| AWS Glue | DPUs | Serverless | Python/Spark ETL, Auto-scaling workloads |
| Amazon EMR | Instance Nodes | Provisioned / Serverless | Open-source frameworks (Spark, Flink, Hive) |
| AWS Lambda | Memory (MB) | Serverless | Event-driven, short (<15 min) transformations |
| Amazon Redshift | RPUs / Nodes | Serverless / Provisioned | SQL-based transformations, Data Warehousing |
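The scaling units in the table above map directly to billing formulas. A back-of-the-envelope cost sketch (the per-unit rates below are illustrative assumptions; check current AWS pricing for your region, and note Glue also enforces a per-job minimum duration):

```python
# Rough cost formulas for the scaling units above.
# Rates are assumptions for illustration, not authoritative pricing.

def glue_job_cost(dpus: float, hours: float, rate_per_dpu_hour: float = 0.44) -> float:
    """Glue bills per DPU-hour: DPUs x runtime x rate."""
    return dpus * hours * rate_per_dpu_hour

def lambda_cost(memory_mb: int, duration_sec: float, invocations: int,
                rate_per_gb_sec: float = 0.0000166667) -> float:
    """Lambda bills on GB-seconds: (memory in GB) x duration x invocations."""
    gb_seconds = (memory_mb / 1024) * duration_sec * invocations
    return gb_seconds * rate_per_gb_sec

# A 10-DPU Glue job running for 2 hours at the assumed rate:
print(round(glue_job_cost(10, 2), 2))  # -> 8.8
```

The takeaway for service selection: Lambda's pricing favors short, frequent, event-driven transformations, while Glue's DPU-hour model favors sustained batch ETL.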
Hierarchical Outline
- I. Serverless Data Transformation (AWS Glue)
- Glue Jobs: Spark-based (Batch/Streaming) or Python Shell (Lightweight).
- Data Catalog: Centralized metadata repository populated by Crawlers.
- Glue DataBrew: 250+ pre-built transformations for non-technical users.
- II. Large-Scale Distributed Processing (Amazon EMR)
- Frameworks: Apache Spark, Hive, Flink, and Presto.
- Cluster Types: Provisioned (EC2-based) vs. EMR Serverless.
- Customization: Deep control over Hadoop ecosystem configurations.
- III. SQL-Based Transformation (Amazon Redshift)
- Redshift Spectrum: Query data directly in S3 using SQL.
- Federated Query: Access data in RDS/Aurora without moving it.
- Load/Unload: Moving data between S3 and Redshift tables efficiently.
- IV. Pipeline Orchestration
- AWS Step Functions: State machine for resilient, multi-step tasks.
- Amazon MWAA: Managed Apache Airflow for complex programmatic workflows.
- EventBridge: Triggering transformations based on schedules or S3 events.
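The EventBridge pattern in the outline can be made concrete: an "Object Created" event invokes a Lambda function, which starts a Glue job for the new file. A minimal sketch, assuming an EventBridge S3 notification event and a hypothetical Glue job named csv-to-parquet:

```python
def extract_s3_object(event):
    """Pull bucket and key out of an EventBridge S3 'Object Created' event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def handler(event, context):
    """EventBridge-invoked Lambda: kick off a Glue job for a new S3 object."""
    import boto3  # provided by the Lambda runtime

    bucket, key = extract_s3_object(event)
    glue = boto3.client("glue")
    glue.start_job_run(
        JobName="csv-to-parquet",  # hypothetical job name
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
```

For multi-step pipelines, the same event would more typically start a Step Functions execution rather than call Glue directly, so retries and error handling live in the state machine.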
Visual Anchors
Data Processing Flow
Provisioned vs. Serverless Architecture
\begin{tikzpicture}[scale=0.8]
  % Serverless side
  \draw[fill=blue!10, rounded corners] (0,0) rectangle (4,3);
  \node at (2,2.5) {\textbf{Serverless}};
  \node[draw, fill=white] (G) at (2,1.5) {AWS Glue};
  \node[draw, fill=white] (L) at (2,0.5) {Lambda};
  % Provisioned side
  \draw[fill=orange!10, rounded corners] (6,0) rectangle (10,3);
  \node at (8,2.5) {\textbf{Provisioned}};
  \node[draw, fill=white] (E) at (8,1.5) {EMR Cluster};
  \node[draw, fill=white] (R) at (8,0.5) {Redshift Nodes};
  % Center axis
  \draw[thick, <->] (4.5,0) -- (5.5,0) node[midway, below] {\small Control vs. Ease};
\end{tikzpicture}
Definition-Example Pairs
- Partition Projection: A Glue feature that calculates partition values from file paths rather than querying the Catalog.
- Example: A dataset organized as /year=2023/month=10/day=27/ can be queried faster because the partition locations are computed directly from the path pattern instead of looked up in the Catalog.
- SCT (Schema Conversion Tool): A tool to convert database schemas from one engine to another.
- Example: Converting a legacy Oracle database schema into a format compatible with Amazon Redshift.
- Dead Letter Queue (DLQ): A queue to store messages that cannot be processed.
- Example: An SQS queue capturing failed Lambda transformation events for manual debugging.
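The partition-projection idea above can be illustrated in code: instead of asking a catalog where each partition lives, the engine enumerates candidate paths from the query predicate. A minimal sketch using the Hive-style date layout from the example (the function name and base path are ours):

```python
from datetime import date, timedelta

def projected_paths(base: str, start: date, end: date):
    """Enumerate Hive-style partition paths for a date range,
    with no catalog lookup -- the essence of partition projection."""
    day = start
    while day <= end:
        yield f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"
        day += timedelta(days=1)

paths = list(projected_paths("s3://logs", date(2023, 10, 26), date(2023, 10, 27)))
# Two paths, the last being s3://logs/year=2023/month=10/day=27/
```

Because the paths are computed, adding a new day's data requires no crawler run or partition registration before it becomes queryable.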
Worked Examples
Converting CSV to Parquet using AWS Glue
Scenario: You have 1TB of raw CSV logs in S3 and need to convert them to Parquet to save costs on Athena queries.
- Crawl: Run a Glue Crawler on the S3 bucket to populate the Glue Data Catalog with the schema.
- Author: Create a Glue Spark Job. Use the visual editor or a script:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the raw CSV table registered by the Crawler
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="logs", table_name="raw_csv")

# Write it back to S3 in Parquet format
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://optimized-bucket/parquet/"},
    format="parquet")
```

- Optimize: Add partitionKeys (e.g., ["year", "month"]) to the connection options to ensure the output is structured for fast querying.
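A rough sense of why this conversion pays off under scan-based pricing. The $5/TB rate, the 6x compression ratio, and the fraction of columns read are illustrative assumptions; real numbers depend on region, schema, and query shape:

```python
# Scan-priced query engines (like Athena) bill per TB scanned.
PRICE_PER_TB = 5.00  # assumed rate, check current pricing

def query_cost(tb_scanned: float) -> float:
    return tb_scanned * PRICE_PER_TB

csv_tb = 1.0             # the 1 TB of raw CSV from the scenario
parquet_tb = csv_tb / 6  # assumed compression ratio
columns_read = 0.25      # a query touching 25% of columns

csv_cost = query_cost(csv_tb)                         # full scan every query
parquet_cost = query_cost(parquet_tb * columns_read)  # column pruning + compression
print(csv_cost, round(parquet_cost, 3))  # -> 5.0 0.208
```

Parquet compounds two savings: compression shrinks the bytes on disk, and the columnar layout lets the engine skip columns the query never touches.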
Checkpoint Questions
- When should you choose AWS Glue DataBrew over standard AWS Glue ETL?
- What is the primary advantage of using Amazon Redshift Spectrum for processing S3 data?
- Which orchestration service is best for a team that prefers Python-coded workflows with complex dependencies?
- Why is converting data from CSV to Parquet considered a cost-optimization strategy?
Comparison Tables
Orchestration: Step Functions vs. MWAA
| Feature | AWS Step Functions | Amazon MWAA (Airflow) |
|---|---|---|
| Model | State Machine (JSON-based) | Programmatic (Python) |
| Scaling | Highly elastic, serverless | Scalable workers, but managed instances |
| Use Case | Microservices, AWS service integration | Complex data engineering pipelines |
| Timeout | Up to 1 year | Configurable per task |
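The "State Machine (JSON-based)" row above can be made concrete. A minimal Amazon States Language definition for a crawl-then-transform pipeline, built as a Python dict; the crawler and job names are hypothetical placeholders:

```python
import json

# Minimal ASL sketch: start a Glue crawler, then run a Glue job
# synchronously with a retry. Names are placeholders.
state_machine = {
    "Comment": "Crawl raw CSV, then convert to Parquet",
    "StartAt": "StartCrawler",
    "States": {
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "raw-csv-crawler"},
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "csv-to-parquet"},
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "MaxAttempts": 2, "IntervalSeconds": 60}],
            "End": True,
        },
    },
}

definition = json.dumps(state_machine)  # the string you'd hand to Step Functions
```

The .sync suffix on startJobRun is what makes the state wait for the Glue job to finish; in MWAA the equivalent logic would be a Python DAG with task dependencies instead of a JSON document.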
Muddy Points & Cross-Refs
- Athena vs. Redshift Spectrum: Both query S3 via SQL. Use Athena for ad-hoc, serverless discovery. Use Spectrum when you already have a Redshift cluster and want to join S3 data with local warehouse tables.
- Glue vs. EMR: Use Glue for standard Spark ETL where you don't want to manage servers. Use EMR if you need specific versions of open-source software (e.g., a specific Hive version) or massive scale where custom EC2 instance types (such as high-memory instances) are needed.