Fundamentals of Distributed Computing for Data Engineering
Welcome to this comprehensive study guide on distributed computing, a core pillar of the AWS Certified Data Engineer – Associate curriculum. This guide explores how processing transitions from single machines to massive clusters, enabling the analysis of big data at scale.
Learning Objectives
After studying this guide, you should be able to:
- Define distributed computing and distinguish it from centralized processing.
- Identify the components of the Hadoop ecosystem, specifically HDFS and MapReduce.
- Compare distributed processing frameworks including Apache Spark and Apache Flink.
- Explain data consistency models (Strong, Eventual, Causal) and their trade-offs.
- Differentiate between micro-batching and continuous streaming architectures.
Key Terms & Glossary
- Distributed Computing: A model where components of a software system are shared among multiple computers (nodes) to improve efficiency and performance.
- Horizontal Scaling: Increasing capacity by adding more machines (nodes) to a cluster, rather than increasing the power of a single machine (Vertical Scaling).
- Fault Tolerance: The ability of a system to continue operating properly in the event of the failure of one or more of its components.
- HDFS (Hadoop Distributed File System): A distributed file system that provides high-throughput access to application data across a cluster.
- Micro-batching: A technique where data streams are processed in small, discrete chunks (batches) rather than one by one.
- Exactly-Once Processing: A guarantee that each incoming event is reflected exactly once in the final state, even if failures occur.
The "Big Idea"
In the era of Big Data, the volume, velocity, and variety of information exceed the physical limits of any single server. The Big Idea behind distributed computing is the "Divide and Conquer" strategy. By breaking a massive dataset into chunks and distributing them across a cluster of commodity hardware, we can process petabytes of data in parallel. This shifts the bottleneck from processor speed to network bandwidth and coordination logic, necessitating specialized frameworks like Spark and Flink to manage the complexity of node failures and data synchronization.
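The divide-and-conquer pattern can be sketched in a few lines of plain Python. This is an illustration only: threads stand in for worker nodes, and a small in-memory list stands in for a 1 TB file split into chunks.

```python
from concurrent.futures import ThreadPoolExecutor

def count_errors(chunk):
    """Process one chunk independently -- the 'divide' step."""
    return sum(1 for line in chunk if "ERROR" in line)

# Toy stand-in for a huge log file already split into fixed-size chunks.
logs = ["INFO ok", "ERROR disk", "ERROR net", "INFO ok"] * 1000
chunk_size = 500
chunks = [logs[i:i + chunk_size] for i in range(0, len(logs), chunk_size)]

# Each worker handles one chunk; partial results are merged -- the 'conquer' step.
with ThreadPoolExecutor(max_workers=4) as pool:
    total_errors = sum(pool.map(count_errors, chunks))

print(total_errors)  # 2000
```

In a real cluster the chunks live on different machines, so the merge step crosses the network; that is exactly why the bottleneck shifts to bandwidth and coordination.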
Formula / Concept Box
| Concept | Description | Metric of Concern |
|---|---|---|
| Throughput | The total volume of data processed in a given time period. | Records per second (High is better) |
| Latency | The time taken for a single piece of data to travel through the system. | Milliseconds/Seconds (Low is better) |
| Availability | The percentage of time the system is operational. | "Nines" (e.g., 99.99%) |
| Scalability | The ability to handle growing amounts of work by adding resources. | Cost per TB processed |
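To make these metrics concrete, here is a quick back-of-the-envelope calculation (illustrative numbers, not benchmarks):

```python
# Throughput: total records processed per unit of time.
records = 10_000_000
elapsed_seconds = 250
throughput = records / elapsed_seconds
print(f"{throughput:,.0f} records/sec")  # 40,000 records/sec

# Availability: even "four nines" permits measurable downtime.
availability = 0.9999
minutes_per_year = 365 * 24 * 60          # 525,600
downtime_minutes = (1 - availability) * minutes_per_year
print(f"{downtime_minutes:.1f} minutes of downtime per year")  # 52.6
```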
Hierarchical Outline
- Distributed Processing Fundamentals
- Cluster Architecture: Master nodes (coordination) vs. Worker nodes (execution).
- Resource Management: Allocating CPU/Memory across the cluster.
- The Hadoop Ecosystem
- Storage: HDFS (replicates data for fault tolerance).
- Processing: MapReduce (the original distributed paradigm).
- Modern Frameworks
- Apache Spark: In-memory processing, micro-batching for streams.
- Apache Flink: Native continuous streaming, stateful computations.
- Data Consistency Models
- Strong: Reads always see the latest write (e.g., Amazon S3).
- Eventual: Reads may be stale but will eventually catch up.
- Causal: Guarantees order for related events.
Visual Anchors
Cluster Architecture
The MapReduce Workflow
```latex
\begin{tikzpicture}
  \node[draw, rectangle, fill=blue!10]   (Input)   at (0,0)  {Input Data (HDFS)};
  \node[draw, circle, fill=green!10]     (Map)     at (3,1)  {Map};
  \node[draw, circle, fill=green!10]     (Map2)    at (3,-1) {Map};
  \node[draw, rectangle, fill=orange!10] (Shuffle) at (6,0)  {Shuffle \& Sort};
  \node[draw, circle, fill=red!10]       (Reduce)  at (9,0)  {Reduce};
  \node[draw, rectangle, fill=blue!10]   (Output)  at (12,0) {Final Output};
  \draw[->] (Input) -- (Map);
  \draw[->] (Input) -- (Map2);
  \draw[->] (Map) -- (Shuffle);
  \draw[->] (Map2) -- (Shuffle);
  \draw[->] (Shuffle) -- (Reduce);
  \draw[->] (Reduce) -- (Output);
\end{tikzpicture}
```
Definition-Example Pairs
- Term: Map Process
- Definition: A function that transforms input data into intermediate key-value pairs.
- Example: In a word count application, the Map process takes a line of text and outputs `("apple", 1)` for every instance of the word "apple."
- Term: Reduce Process
- Definition: A function that aggregates intermediate values associated with the same key.
- Example: After mapping, the Reduce process takes all `("apple", 1)` entries and sums them to output `("apple", 42)`.
- Term: Strong Consistency
- Definition: A consistency model where any read request following a successful write returns the updated value.
- Example: Amazon S3 provides strong read-after-write consistency for PUTs and DELETEs. If you overwrite an image, the very next GET request will return the new version.
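The difference between strong and eventual consistency can be simulated with a toy in-memory "replica" (a hypothetical model, not a real storage API): a strongly consistent read always hits the latest write, while an eventually consistent read may return stale data until the replicas sync.

```python
class ReplicatedStore:
    """Toy model: one primary copy plus one replica that syncs lazily."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, key, value):
        # Writes land on the primary immediately.
        self.primary[key] = value

    def sync(self):
        # In a real system this replication happens asynchronously.
        self.replica.update(self.primary)

    def read_strong(self, key):
        # Strong consistency: always read the latest write.
        return self.primary.get(key)

    def read_eventual(self, key):
        # Eventual consistency: the replica may lag behind.
        return self.replica.get(key)

store = ReplicatedStore()
store.put("profile.jpg", "v2")

print(store.read_strong("profile.jpg"))    # v2   -- sees the new write
print(store.read_eventual("profile.jpg"))  # None -- replica not yet synced

store.sync()
print(store.read_eventual("profile.jpg"))  # v2   -- replicas caught up
```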
Worked Examples
Example 1: Word Count Logic in MapReduce
Scenario: You have a 1 TB file of server logs. You need to count the occurrences of error codes.
- Splitting: The framework splits the 1 TB file into 128 MB chunks across the cluster.
- Mapping: Each node reads its chunk and generates `(ErrorCode, 1)` pairs.
- Combiner (Optional): A node with five `(404, 1)` entries localizes them into one `(404, 5)` entry to save network bandwidth.
- Shuffle: All entries for code `404` are sent to the same Reducer node.
- Reduce: The Reducer sums all `404` counts and writes the total to HDFS.
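The five steps above can be sketched in plain Python. This is a single-process stand-in for a cluster (the function names and tiny log sample are illustrative); in real MapReduce each chunk, mapper, and reducer runs on a different node.

```python
from collections import Counter, defaultdict

# Toy log sample; the last token of each line is the status code.
log_lines = ["GET /a 404", "GET /b 200", "GET /c 404", "GET /d 500"] * 3

# 1. Splitting: carve the input into fixed-size chunks.
chunks = [log_lines[i:i + 4] for i in range(0, len(log_lines), 4)]

def map_phase(chunk):
    """2. Mapping: emit (error_code, 1) for every log line."""
    return [(line.split()[-1], 1) for line in chunk]

def combine(pairs):
    """3. Combiner: pre-aggregate locally to save network bandwidth."""
    return list(Counter(code for code, _ in pairs).items())

# 4. Shuffle: route all pairs with the same key to the same "reducer".
shuffled = defaultdict(list)
for chunk in chunks:
    for code, count in combine(map_phase(chunk)):
        shuffled[code].append(count)

# 5. Reduce: sum the partial counts for each key.
totals = {code: sum(counts) for code, counts in shuffled.items()}
print(totals)  # {'404': 6, '200': 3, '500': 3}
```

Note how the combiner shrinks each chunk's output to one entry per code before the shuffle; on a cluster that directly reduces the bytes sent over the network.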
Example 2: Bounded vs. Unbounded Data
- Bounded: Processing a CSV file containing all sales from January 1st. This is Batch Processing.
- Unbounded: Processing a live stream of GPS coordinates from 10,000 delivery trucks. This is Stream Processing.
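The bounded/unbounded split also previews the Spark-versus-Flink latency gap: a micro-batcher waits to fill a small batch before processing, while a continuous processor hands each event onward as it arrives. A toy sketch (generator functions and sample events are illustrative only):

```python
events = [("truck-1", 40.7), ("truck-2", 41.2), ("truck-3", 39.9),
          ("truck-1", 40.8), ("truck-2", 41.3)]

def micro_batches(stream, batch_size):
    """Spark-style: group the stream into small discrete batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # processed together -> seconds of latency
            batch = []
    if batch:
        yield batch              # flush the final partial batch

def continuous(stream):
    """Flink-style: process each event immediately."""
    for event in stream:
        yield [event]            # processed alone -> millisecond latency

print([len(b) for b in micro_batches(events, 2)])  # [2, 2, 1]
print(len(list(continuous(events))))               # 5 -- one per event
```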
Comparison Tables
Spark vs. Flink
| Feature | Apache Spark | Apache Flink |
|---|---|---|
| Core Philosophy | Batch-first (Streams are micro-batches) | Streaming-first (Batch is a special case) |
| Latency | Seconds (higher) | Milliseconds (lower) |
| Delivery Model | At-least-once (default) | Exactly-once (via checkpoints) |
| Best For | ETL, Machine Learning, Interactive Query | Real-time fraud detection, Complex Event Processing |
Consistency Models Comparison
| Model | Latency | Consistency Level | Use Case |
|---|---|---|---|
| Strong | Highest | Highest | Financial transactions, Metadata updates |
| Eventual | Lowest | Lowest | Social media "likes," DNS records |
| Causal | Medium | Medium | Comment threads (replies must follow posts) |
Checkpoint Questions
- What is the main difference between horizontal and vertical scaling?
- In the MapReduce framework, what is the role of the "Combiner"?
- Why does Apache Spark generally have higher latency than Apache Flink for stream processing?
- If Amazon S3 is strongly consistent, what happens if two processes perform a PUT on the same key simultaneously?
- Which distributed storage layer is commonly used in Hadoop clusters?
[!TIP] Answers: 1. Horizontal adds nodes; Vertical adds power to one node. 2. Acts as a mini-reducer on a single node to save bandwidth. 3. Spark uses micro-batching; Flink uses continuous processing. 4. The latest timestamp wins. 5. HDFS (Hadoop Distributed File System).
Muddy Points & Cross-Refs
- Micro-batching vs. Continuous: Learners often confuse Spark Streaming with "real-time." Remember that Spark's RDD architecture requires a small "collection time" (the batch), while Flink processes events immediately as they arrive.
- The CAP Theorem: Distributed systems are governed by the CAP theorem (Consistency, Availability, Partition Tolerance). Since network partitions are unavoidable in practice, a system must choose during a partition between staying consistent and staying available; it cannot guarantee both.
- Cross-Reference: For more on how these frameworks are managed on AWS, see the Amazon EMR (Elastic MapReduce) and Amazon Managed Service for Apache Flink sections in Unit 3.