Selecting Compute Options for Data Processing: Amazon EMR, AWS Glue, and AWS Batch
This guide focuses on Domains 3 and 4 of the SAA-C03 exam (Design High-Performing Architectures and Design Cost-Optimized Architectures) as they apply to data-intensive workloads. Selecting the right compute for data processing requires balancing management overhead, cost, and the specific nature of the data (e.g., streaming vs. batch, structured vs. unstructured).
Learning Objectives
By the end of this module, you should be able to:
- Distinguish between Amazon EMR, AWS Glue, and AWS Batch for different data processing workloads.
- Select appropriate compute resources based on scaling requirements (horizontal vs. vertical).
- Optimize cost using Spot Instances and Savings Plans for long-running or interruptible data jobs.
- Identify the best tool for ETL (Extract, Transform, Load) versus complex big-data analytics.
Key Terms & Glossary
- Amazon EMR (Elastic MapReduce): A managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data.
- AWS Glue: A fully managed, serverless ETL service that makes it simple and cost-effective to categorize, clean, enrich, and move data between various data stores.
- AWS Batch: A regional service that simplifies running batch computing workloads on AWS by automatically provisioning compute resources based on the volume and specific requirements of the jobs.
- ETL (Extract, Transform, Load): A three-phase process where data is extracted from a source, transformed into a different format/structure, and loaded into a destination (e.g., a data warehouse).
- Spot Instances: Spare EC2 capacity available at a significant discount (up to 90%), ideal for stateless, fault-tolerant data processing jobs.
The "Big Idea"
The central challenge in cloud architecture is not just "making it work," but choosing the right level of abstraction. For data processing, AWS provides a spectrum: from low-level control with Amazon EC2, to managed cluster frameworks like Amazon EMR, to fully serverless options like AWS Glue. The goal is to minimize management overhead ("undifferentiated heavy lifting") while maximizing performance per dollar spent.
Formula / Concept Box
Compute Selection Matrix
| Feature | Amazon EMR | AWS Glue | AWS Batch | Amazon Athena |
|---|---|---|---|---|
| Model | Managed Cluster (EC2) | Serverless | Managed Container/EC2 | Serverless Query |
| Primary Use | Complex Big Data/Spark | ETL / Data Cataloging | Parallel Batch Jobs | SQL on S3 |
| Cost Driver | Instance Hours + EMR Fee | DPUs (Data Processing Units) | Underlying EC2/Fargate | TB Scanned |
| Scaling | Manual/Auto-scaling nodes | Automatic | Automatic (Jobs in Queue) | Automatic |
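The matrix above can be condensed into a simple decision helper. The service names are real, but the selection logic is a simplified study heuristic, not an official AWS rule:

```python
def pick_compute_service(serverless_required: bool,
                         containerized_batch: bool,
                         sql_only: bool) -> str:
    """Simplified heuristic mirroring the compute selection matrix."""
    if sql_only:
        return "Amazon Athena"   # ad hoc SQL directly on S3, pay per TB scanned
    if serverless_required:
        return "AWS Glue"        # serverless ETL, pay per DPU
    if containerized_batch:
        return "AWS Batch"       # Docker jobs in a managed queue
    return "Amazon EMR"          # full control of a Spark/Hadoop cluster

# Example: a serverless ETL requirement resolves to Glue
print(pick_compute_service(serverless_required=True,
                           containerized_batch=False,
                           sql_only=False))  # -> AWS Glue
```

Real architectures weigh more factors (data volume, latency, team skills), but this captures the first-order decision the exam tests.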
Hierarchical Outline
- Distributed Computing on AWS
- Horizontal Scaling: Adding more nodes to a cluster (Standard for EMR).
- Vertical Scaling: Increasing the RAM/CPU of existing nodes.
- Amazon EMR (Elastic MapReduce)
- Architecture: Master Node (coordinates), Core Nodes (store data + compute), Task Nodes (compute only).
- Use Cases: Petabyte-scale analysis, machine learning using Apache Spark, and interactive querying with Presto.
- Optimization: Use Spot Instances for Task Nodes to reduce costs.
- AWS Glue (Serverless ETL)
- Glue Data Catalog: A central repository to store structural and operational metadata.
- Crawlers: Automatically scan data in S3 to infer schemas.
- Job Types: Python shell or Apache Spark (Serverless).
- AWS Batch (High Performance Computing)
- Components: Job Definitions, Job Queues, and Compute Environments.
- Docker Integration: Best for workloads packaged as containers that need to run in parallel.
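The "Spot Instances for Task Nodes" optimization above is expressed through EMR instance fleets. A minimal sketch of the request structure, using field names from the EMR `RunJobFlow` API; capacities and instance types are illustrative, and the dicts are shown as plain data rather than an actual boto3 call:

```python
# Task fleet: all capacity on Spot, diversified across two instance
# types so a Spot reclaim of one pool does not stall the whole job.
task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 8,          # illustrative capacity target
    "TargetOnDemandCapacity": 0,
    "InstanceTypeConfigs": [
        {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
        {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
    ],
}

# Master fleet: the coordination node stays On-Demand, since losing
# it terminates the cluster.
master_fleet = {
    "InstanceFleetType": "MASTER",
    "TargetOnDemandCapacity": 1,
    "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
}

# In a real job this would be passed as
# Instances={"InstanceFleets": [...]} to boto3's emr.run_job_flow.
instances = {"InstanceFleets": [master_fleet, task_fleet]}
print(len(instances["InstanceFleets"]))
```

Core nodes (omitted here) usually also stay On-Demand because they hold HDFS data.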
Visual Anchors
Decision Flow for Compute Selection
EMR Cluster Architecture
\begin{tikzpicture}[node distance=2cm, every node/.style={rectangle, draw, fill=blue!10, text width=2.5cm, align=center, minimum height=1cm}]
  % Nodes
  \node (master) {\textbf{Master Node} \\ (Coordination)};
  \node (core1) [below left of=master, xshift=-1cm] {\textbf{Core Node} \\ (HDFS + Compute)};
  \node (core2) [below right of=master, xshift=1cm] {\textbf{Core Node} \\ (HDFS + Compute)};
  \node (task1) [below of=core1] {\textbf{Task Node} \\ (Compute Only)};
  \node (task2) [below of=core2] {\textbf{Task Node} \\ (Compute Only)};
  \node (s3) [below of=master, yshift=-3cm, fill=green!10] {\textbf{Amazon S3} \\ (Persistent Data)};
  % Arrows
  \draw[->, thick] (master) -- (core1);
  \draw[->, thick] (master) -- (core2);
  \draw[->, thick] (core1) -- (task1);
  \draw[->, thick] (core2) -- (task2);
  \draw[<->, dashed] (core1) -- (s3);
  \draw[<->, dashed] (core2) -- (s3);
  \draw[<->, dashed] (task1) -- (s3);
  \draw[<->, dashed] (task2) -- (s3);
\end{tikzpicture}
Definition-Example Pairs
- Term: Data Transformation
  - Definition: The process of converting data from its source format into a format required by the destination system.
  - Example: Using AWS Glue to convert raw `.csv` logs stored in S3 into `.parquet` format (columnar storage) to make queries in Amazon Athena faster and cheaper.
- Term: Spot Instance Diversification
  - Definition: Using multiple instance types in a Spot Fleet to minimize the impact of AWS reclaiming a specific instance type.
  - Example: Configuring an Amazon EMR cluster to use both `m5.xlarge` and `r5.xlarge` instances for its task nodes so that if one type is unavailable, the other can continue the job.
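The CSV-to-Parquet example is ultimately a cost argument. A back-of-the-envelope sketch, assuming Athena's published $5-per-TB-scanned pricing and that columnar Parquet lets a typical query scan roughly 10% of the raw CSV bytes (both figures illustrative, not guarantees):

```python
# Rough cost math behind converting raw CSV logs to Parquet for Athena.
# $5/TB is Athena's published scan price at time of writing; the 10%
# scan ratio is an illustrative assumption for column pruning + compression.
PRICE_PER_TB = 5.00

def athena_scan_cost(tb_scanned: float) -> float:
    """Athena charges per terabyte of data scanned by a query."""
    return tb_scanned * PRICE_PER_TB

raw_csv_tb = 2.0                 # full-table scan over raw CSV
parquet_tb = raw_csv_tb * 0.10   # column pruning + compression

print(f"CSV:     ${athena_scan_cost(raw_csv_tb):.2f}")   # $10.00
print(f"Parquet: ${athena_scan_cost(parquet_tb):.2f}")   # $1.00
```

The same transformation also speeds up queries, since less data is read per query.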
Worked Examples
Scenario 1: The Cost-Conscious Startup
Problem: A company needs to process 50 TB of genomic data once a week. The job takes 8 hours. They have a strict budget and the job can be restarted if interrupted.
- Solution: Use Amazon EMR with Spot Instances for Task nodes.
- Why: Spot instances provide the necessary scale at the lowest cost. Since the job is fault-tolerant (restartable), the risk of Spot interruption is acceptable.
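To see why Spot is compelling here, a quick illustrative calculation. Node count and the On-Demand hourly rate are assumptions (not sized from the 50 TB workload), and the discount is modeled conservatively at 70% since "up to 90%" is a ceiling, not a guarantee:

```python
# Back-of-the-envelope for Scenario 1: an 8-hour weekly job on Spot
# task nodes. All figures below are illustrative assumptions.
NODES = 40
HOURS = 8
ON_DEMAND_RATE = 0.192   # assumed $/hr per node (m5.xlarge-class)
SPOT_DISCOUNT = 0.70     # conservative; Spot discounts vary by pool

on_demand_cost = NODES * HOURS * ON_DEMAND_RATE
spot_cost = on_demand_cost * (1 - SPOT_DISCOUNT)

print(f"On-Demand: ${on_demand_cost:.2f}")  # $61.44
print(f"Spot:      ${spot_cost:.2f}")       # $18.43
```

Because the job is restartable, an occasional Spot interruption costs only rerun time, not data.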
Scenario 2: Modernizing the ETL Pipeline
Problem: A data engineer needs to extract data from an RDS PostgreSQL database daily, mask PII (Personally Identifiable Information), and load it into Redshift. They want zero server management.
- Solution: Use AWS Glue.
- Why: Glue is serverless (no EC2 instances to manage) and provides built-in transforms for PII masking and schema mapping.
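To make the masking step concrete, here is a plain-Python illustration of the idea. This is NOT the AWS Glue API; in a real Glue job you would use its built-in transforms and sensitive data detection. The regex below only catches simple email addresses, as a minimal example of redaction:

```python
import re

# Illustrative PII masking: redact email-like strings in a record.
# A hypothetical stand-in for what a Glue transform does at scale.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(record: dict) -> dict:
    """Return a copy of the record with email-like values redacted."""
    return {k: EMAIL_RE.sub("***MASKED***", v) if isinstance(v, str) else v
            for k, v in record.items()}

row = {"id": 7, "note": "contact alice@example.com for details"}
print(mask_pii(row))  # note becomes 'contact ***MASKED*** for details'
```

Production PII handling needs far broader detection (names, SSNs, addresses), which is exactly the undifferentiated heavy lifting Glue's built-in transforms absorb.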
Checkpoint Questions
- Which node type in an Amazon EMR cluster is responsible for storing data in HDFS?
- You have a Docker-based image processing application that needs to scale to thousands of parallel jobs. Which service is the best fit?
- True or False: Amazon Athena is the best choice for complex ETL transformations that require multi-stage data cleaning.
- How can you reduce costs for an EMR cluster that runs non-critical, long-term analytics?
- What is the purpose of a Glue Crawler?
Answers
- Core Nodes. (Master nodes coordinate; Task nodes only process).
- AWS Batch. It is specifically designed for containerized batch workloads.
- False. Athena is a query engine. AWS Glue is designed for ETL transformations.
- Use Spot Instances for task nodes and Instance Fleets to diversify instance types.
- To infer the schema of raw data in S3 and populate the Glue Data Catalog.