Mastery Guide: Distributed Compute Strategies and Edge Processing
This guide explores how to design high-performance, cost-effective architectures using AWS distributed compute strategies. We cover everything from tightly coupled High-Performance Computing (HPC) to low-latency edge processing with Amazon CloudFront.
Learning Objectives
- Distinguish between loosely coupled and tightly coupled distributed systems.
- Evaluate the use of Elastic Fabric Adapter (EFA) for low-latency networking.
- Explain the role of edge locations and CloudFront in reducing latency.
- Determine appropriate compute options (EC2, Lambda, Fargate) for distributed workloads.
- Analyze the impact of cluster placement groups on network performance.
Key Terms & Glossary
- Edge Processing: The practice of performing data processing at the edge of the network, closer to the source of the data or the user, to reduce latency.
- Tightly Coupled: An architecture where instances must work in concert as a single unit, requiring high-speed, low-latency interconnects.
- Loosely Coupled: An architecture where components are independent; if one fails, others continue to work. Often managed via queues (e.g., SQS).
- Elastic Fabric Adapter (EFA): A network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications.
- Cluster Placement Group: A logical grouping of instances within a single Availability Zone that provides low-latency network performance.
- Libfabric: An API that allows HPC applications to bypass the OS kernel and communicate directly with the network hardware (EFA).
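The loose-coupling pattern defined above can be sketched with Python's standard-library queue standing in for SQS. This is a local simulation, not an AWS API call; the worker logic, message names, and simulated failure are illustrative. The key behavior it demonstrates is that a failed consumer returns its message to the queue (like an SQS visibility timeout expiring) so a healthy consumer can finish the work.

```python
import queue
import threading

# In-memory stand-in for an SQS queue: producers and consumers
# never talk to each other directly, only through the queue.
jobs = queue.Queue()
results = []
results_lock = threading.Lock()

def worker(worker_id, fail_first=False):
    failed_once = False
    while True:
        try:
            msg = jobs.get(timeout=0.5)
        except queue.Empty:
            return  # queue drained, worker exits
        try:
            if fail_first and not failed_once:
                failed_once = True
                raise RuntimeError("simulated crash")
            with results_lock:
                results.append((worker_id, msg))
        except RuntimeError:
            # Like an SQS visibility timeout: the unprocessed
            # message becomes visible to other consumers again.
            jobs.put(msg)
        finally:
            jobs.task_done()

for i in range(5):
    jobs.put(f"image-{i}.png")

# Worker 0 fails on its first message; the message is re-queued
# and eventually processed by a healthy worker.
threads = [threading.Thread(target=worker, args=(0, True)),
           threading.Thread(target=worker, args=(1,))]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(msg for _, msg in results))
```

Because no component holds a direct reference to another, the crash of one worker delays a single message rather than stalling the whole pipeline, which is the fault-isolation property the table below attributes to loosely coupled systems.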
The "Big Idea"
Distributed computing is the shift from a single "super-server" to a swarm of coordinated resources. By strategically placing compute power, whether physically close together for massive calculations (HPC) or geographically close to the user (edge), architects can work around the hard limits of signal propagation and single-machine hardware to deliver seamless global experiences.
Formula / Concept Box
| Feature | Tightly Coupled (HPC) | Loosely Coupled (Distributed) |
|---|---|---|
| Network Need | Ultra-low latency / High throughput | Standard network / Asynchronous |
| AWS Feature | EFA / Cluster Placement Groups | SQS / SNS / Auto Scaling |
| Failure Impact | High (one node can stall the cluster) | Low (isolated failures) |
| Use Case | Weather modeling, CFD, ML training | Web apps, image processing, microservices |
| Interface | Libfabric / MPI | REST API / Message Queues |
Hierarchical Outline
- Distributed Compute Fundamentals
- Loose Coupling: Use of asynchronous messaging; independent scaling of components.
- Tight Coupling: Synchronous inter-dependency; requires Cluster Placement Groups.
- High-Performance Computing (HPC)
- Elastic Fabric Adapter (EFA): Bypasses TCP/IP stack for better throughput.
- Infrastructure Requirements: All instances must be in the same Subnet and Security Group for EFA.
- Edge Processing Strategies
- Amazon CloudFront: Caching static and dynamic content at Edge Locations.
- Lambda@Edge / CloudFront Functions: Executing code closer to the user to modify requests/responses.
- Compute Optimization
- Instance Selection: Matching workload to family (e.g., M5 for general, P2 for GPU/ML).
- Serverless: Using Lambda or Fargate to maximize "server density" and cost-efficiency.
Visual Anchors
Edge Processing Flow
Tight vs. Loose Coupling
\begin{tikzpicture}
  % Tight coupling
  \draw[thick, fill=blue!10] (-1,0) rectangle (1,1) node[midway] {Node A};
  \draw[thick, fill=blue!10] (2,0) rectangle (4,1) node[midway] {Node B};
  \draw[<->, ultra thick, red] (1,0.5) -- (2,0.5) node[midway, above] {\small EFA / Low Latency};
  \node at (1.5, -0.5) {\textbf{Tightly Coupled}};
  % Loose coupling
  \draw[thick, fill=green!10] (6,0) rectangle (8,1) node[midway] {Node C};
  \draw[thick, fill=orange!10] (9,0.5) circle (0.5) node {SQS};
  \draw[thick, fill=green!10] (10.5,0) rectangle (12.5,1) node[midway] {Node D};
  \draw[->] (8,0.5) -- (9,0.5);
  \draw[->] (9.5,0.5) -- (10.5,0.5);
  \node at (9.25, -0.5) {\textbf{Loosely Coupled}};
\end{tikzpicture}
Definition-Example Pairs
- Term: Partitioning (Sharding)
  - Definition: Breaking a large database into smaller, faster, more easily managed parts called shards.
  - Example: Splitting a global user database so that users with IDs 1-10000 live on Database A and users with IDs 10001-20000 live on Database B, preventing a single-node performance bottleneck.
- Term: Edge Caching
  - Definition: Storing copies of data at locations geographically closer to users to reduce the distance data must travel.
  - Example: A user in Tokyo requests a video file stored in a Virginia S3 bucket; CloudFront caches the file at a Tokyo Edge Location so the next local request is served from there.
- Term: Server Density
  - Definition: Maximizing the number of applications or tasks running on a single physical or virtual host.
  - Example: Using Docker containers on an Amazon ECS cluster to run 50 microservices on 5 large EC2 instances instead of 50 small instances.
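The range-based sharding example above can be expressed directly in code. This is a minimal sketch: the shard boundaries mirror the Database A / Database B split described in the example, and the shard names are illustrative.

```python
# Route each user ID to a shard by ID range, mirroring the
# Database A / Database B split in the sharding example.
SHARD_RANGES = [
    (1, 10000, "database-a"),
    (10001, 20000, "database-b"),
]

def shard_for(user_id: int) -> str:
    """Return the shard responsible for the given user ID."""
    for low, high, name in SHARD_RANGES:
        if low <= user_id <= high:
            return name
    raise KeyError(f"no shard covers user_id {user_id}")

print(shard_for(42))      # database-a
print(shard_for(15000))   # database-b
```

Range-based routing keeps related IDs together but can create hot shards; hash-based routing is a common alternative when access patterns are skewed.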
Worked Examples
Example 1: Optimizing for Low-Latency Inter-node Communication
Scenario: A research firm needs to run a fluid dynamics simulation across 20 EC2 instances. Every instance must share its state with all others every few milliseconds.
- Solution:
- Launch instances in a Cluster Placement Group to ensure they are physically close within the data center.
- Use instance types that support Elastic Fabric Adapter (EFA).
- Attach the EFA at launch and ensure the Security Group allows all internal traffic.
- Use the Libfabric API to bypass the OS kernel for data transfer.
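The launch configuration described in the steps above can be sketched as the shape of an EC2 `RunInstances` request. The parameters are shown as plain data rather than a live boto3 call, and the AMI ID, instance type, placement-group name, subnet, and security-group IDs are all illustrative placeholders; with boto3 this dict would be passed as keyword arguments to `ec2_client.run_instances(**params)`.

```python
# Shape of an EC2 RunInstances request for EFA-enabled HPC nodes.
# All resource IDs below are illustrative placeholders.
params = {
    "ImageId": "ami-0123456789abcdef0",         # illustrative AMI
    "InstanceType": "c5n.18xlarge",             # an EFA-capable instance type
    "MinCount": 20,
    "MaxCount": 20,
    "Placement": {"GroupName": "cfd-cluster"},  # cluster placement group
    "NetworkInterfaces": [{
        "DeviceIndex": 0,
        "InterfaceType": "efa",                 # attach the EFA at launch
        "SubnetId": "subnet-0abc",              # all nodes share one subnet
        "Groups": ["sg-0abc"],                  # one SG allowing all internal traffic
    }],
}
```

Note how the request encodes the infrastructure requirements from the outline: a single placement group, a single subnet, and a single security group shared by every node.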
Example 2: Global Content Delivery with Edge Logic
Scenario: A streaming service wants to show different advertisements to users based on their country, without adding latency by routing every request to a central server.
- Solution:
- Deploy an Amazon CloudFront distribution.
- Use Lambda@Edge to intercept the request at the edge location.
- The Lambda function checks the user's header for location and dynamically modifies the request to fetch the correct localized advertisement from an S3 bucket.
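A minimal Lambda@Edge handler for the solution above might look like the following sketch. The `/ads/...` key layout in S3 is an assumption for illustration, and the `CloudFront-Viewer-Country` header only appears on the request if the distribution is configured to forward it; the fallback country here is also illustrative.

```python
def handler(event, context):
    # Lambda@Edge request event: rewrite the URI so each viewer
    # country fetches its localized ad object from the S3 origin.
    request = event["Records"][0]["cf"]["request"]
    headers = request["headers"]
    # CloudFront supplies this header (lowercased) when configured
    # to forward it; fall back to "US" if it is absent.
    country = headers.get(
        "cloudfront-viewer-country", [{"value": "US"}]
    )[0]["value"]
    request["uri"] = f"/ads/{country.lower()}/banner.png"
    return request

# Local smoke test with a trimmed-down CloudFront event.
event = {"Records": [{"cf": {"request": {
    "uri": "/ads/banner.png",
    "headers": {"cloudfront-viewer-country": [{"value": "JP"}]},
}}}]}
print(handler(event, None)["uri"])  # /ads/jp/banner.png
```

Because the rewrite happens at the edge location, the localized object can also be cached per country there, so repeat viewers in the same region never reach the origin.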
Checkpoint Questions
- What is the primary difference between an Elastic Network Adapter (ENA) and an Elastic Fabric Adapter (EFA)?
- Why must all instances using an EFA be in the same security group and subnet?
- How does loose coupling improve the fault tolerance of a distributed system?
- Which AWS service would you use to automatically analyze compute resources and identify potential cost savings over a 14-day period?
- True or False: Tightly coupled workloads can easily be distributed across multiple AWS Regions for higher availability.
Answers
- EFA supports the Libfabric API and OS bypass, significantly reducing latency for HPC applications compared to standard ENA.
- EFA traffic is not routable; it requires the proximity provided by the same subnet and the specific permissions of a unified security group.
- In a loosely coupled system (e.g., using SQS), if one component fails, the messages stay in the queue until another component processes them, preventing the entire system from crashing.
- AWS Compute Optimizer.
- False. Tightly coupled workloads require the ultra-low latency found only within a single Availability Zone (Cluster Placement Group).