Mastering MTU: Troubleshooting Packet Size Mismatches in AWS VPCs
Troubleshooting packet size mismatches in a VPC to restore network connectivity
Learning Objectives
After completing this guide, you should be able to:
- Identify the symptoms of Path MTU Discovery (PMTUD) failure in a VPC environment.
- Distinguish between standard Ethernet frames (1500 bytes) and Jumbo frames (9001 bytes).
- Configure appropriate MTU settings for EC2 instances, VPNs, and Direct Connect.
- Utilize AWS-native tools like VPC Flow Logs and Traffic Mirroring to diagnose packet-level issues.
Key Terms & Glossary
- MTU (Maximum Transmission Unit): The size of the largest protocol data unit (PDU) that can be communicated in a single network layer transaction.
- Example: Standard Ethernet MTU is 1500 bytes.
- Jumbo Frames: Ethernet frames with more than 1500 bytes of payload, typically 9001 bytes in AWS.
- PMTUD (Path MTU Discovery): A technique for determining the MTU size on the network path between two IP hosts to avoid IP fragmentation.
- MSS (Maximum Segment Size): The largest amount of TCP payload, specified in bytes, that a host will accept in a single segment. It excludes the IP and TCP headers and is advertised by each side in the TCP SYN.
- ICMP Type 3 Code 4: The "Destination Unreachable; Fragmentation Needed and DF set" message essential for PMTUD to function.
The "Big Idea"
Network connectivity isn't just about "up or down"; it also depends on how large a packet each hop along the path will carry. A common "silent killer" of network performance is the MTU mismatch. When a source sends a packet with the DF bit set that is larger than an intermediate hop can handle, that hop drops the packet and replies with ICMP Type 3 Code 4. If the reply never reaches the source (often because over-aggressive firewalls block ICMP), the source keeps retransmitting at the same size and the connection simply hangs. This is known as a "Black Hole" connection: small packets (like a TCP handshake) pass through, but large data transfers fail.
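The behaviour described above can be illustrated with a small simulation. This is a minimal sketch rather than a model of any real network stack: the function name `send_with_pmtud`, its parameters, and the hop values are all assumptions chosen for illustration.

```python
def send_with_pmtud(packet_size, hop_mtus, icmp_allowed=True, max_attempts=10):
    """Simulate a sender with the DF bit set crossing a list of hop MTUs.

    Returns (delivered, final_packet_size).
    """
    size = packet_size
    for _ in range(max_attempts):
        # Find the first hop whose MTU is smaller than the packet.
        bottleneck = next((mtu for mtu in hop_mtus if mtu < size), None)
        if bottleneck is None:
            return True, size      # every hop forwarded the packet
        if not icmp_allowed:
            return False, size     # silent drop: the sender never learns why
        size = bottleneck          # ICMP Type 3 Code 4 reports the smaller MTU
    return False, size


path = [9001, 1500, 1436]  # illustrative hop MTUs
print(send_with_pmtud(9001, path))                      # (True, 1436)
print(send_with_pmtud(9001, path, icmp_allowed=False))  # (False, 9001)
print(send_with_pmtud(64, path, icmp_allowed=False))    # (True, 64)
```

The last two calls show the black-hole pattern: with ICMP blocked, the small packet still gets through while the large one never does.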
Formula / Concept Box
| Connection Type | Maximum MTU | Notes |
|---|---|---|
| Internet Gateway | 1500 bytes | Jumbo frames are NOT supported over the internet. |
| Inter-VPC (Peering) | 9001 bytes | Supported within the same region. |
| AWS VPN | 1436 bytes | Reduced due to IPsec encapsulation overhead. |
| Direct Connect | 1500, 8500, or 9001 bytes | Jumbo frames: 9001 on private VIFs, 8500 on transit VIFs. |
| Transit Gateway | 8500 bytes | MTU for traffic between VPCs and TGW. |
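The usable MTU for a flow is the smallest limit among the connection types it crosses. The sketch below uses a few values from the table; the dictionary keys and the `path_mtu` helper are illustrative labels, not AWS identifiers.

```python
# Maximum MTU per connection type, copied from the table above.
# The keys are illustrative labels, not AWS API identifiers.
MAX_MTU = {
    "internet_gateway": 1500,
    "vpc_peering_same_region": 9001,
    "site_to_site_vpn": 1436,
    "transit_gateway": 8500,
}


def path_mtu(hops):
    """Return the largest packet size that every hop in the path can carry."""
    if not hops:
        raise ValueError("path must contain at least one hop")
    return min(MAX_MTU[hop] for hop in hops)


print(path_mtu(["vpc_peering_same_region", "transit_gateway"]))  # 8500
print(path_mtu(["transit_gateway", "site_to_site_vpn"]))         # 1436
```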
Hierarchical Outline
- I. Understanding MTU in AWS
- Standard MTU (1500): Default for most internet-bound traffic.
- Jumbo Frames (9001): Used for high-throughput, low-latency requirements (e.g., HPC, Big Data).
- II. Troubleshooting Path MTU Discovery (PMTUD)
- The Role of ICMP: Why blocking all ICMP breaks network connectivity.
- The Don't Fragment (DF) Bit: How it forces PMTUD.
- III. Diagnostic Tools
- VPC Flow Logs: Checking for packet size and `REJECT` status.
- VPC Traffic Mirroring: Capturing raw packets for Wireshark analysis.
- Reachability Analyzer: Verifying the path and identifying blocking Security Groups.
- IV. Resolution Strategies
- MSS Clamping: Adjusting the TCP handshake to prevent large packets.
- Security Group Rules: Allowing ICMP Type 3 Code 4.
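As a hedged illustration of the Security Group item in the outline, the sketch below builds an ingress rule for ICMP Type 3 Code 4 in the shape used by the EC2 `authorize_security_group_ingress` call, where ICMP rules carry the type in `FromPort` and the code in `ToPort`. The helper names, the description text, and the idea of passing the client in as an argument are assumptions made for this example.

```python
def frag_needed_permission(cidr="0.0.0.0/0"):
    """Build an ingress rule for ICMP Type 3 Code 4 (Fragmentation Needed).

    For ICMP rules, EC2 reuses FromPort for the ICMP type and ToPort for the code.
    """
    return {
        "IpProtocol": "icmp",
        "FromPort": 3,  # ICMP type: Destination Unreachable
        "ToPort": 4,    # ICMP code: Fragmentation Needed and DF set
        "IpRanges": [{"CidrIp": cidr, "Description": "Allow PMTUD replies"}],
    }


def allow_frag_needed(ec2_client, group_id, cidr="0.0.0.0/0"):
    """Add the rule using an EC2 client object supplied by the caller."""
    return ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[frag_needed_permission(cidr)],
    )


print(frag_needed_permission())
```

With boto3 installed and credentials configured, a call might look like `allow_frag_needed(boto3.client("ec2"), "sg-0123456789abcdef0")`, where the group ID is a placeholder.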
Visual Anchors
PMTUD Failure Logic
Packet Encapsulation Overhead
\begin{tikzpicture}[scale=0.8]
  \draw[fill=blue!10] (0,0) rectangle (8,1) node[pos=.5] {Original IP Packet (e.g., 1436 bytes)};
  \draw[fill=red!20] (-2,0) rectangle (0,1) node[pos=.5] {ESP/IPsec};
  \draw[fill=green!20] (-4,0) rectangle (-2,1) node[pos=.5] {New IP Header};
  \draw[<->, thick] (-4,-0.5) -- (8,-0.5) node[midway, below] {Total Frame Size must fit MTU (1500)};
  \node at (2, 2) {\textbf{VPN Encapsulation adds $\sim$64 bytes of overhead}};
\end{tikzpicture}
Definition-Example Pairs
- MSS Clamping: A technique used by routers to alter the Maximum Segment Size in the TCP SYN packet.
- Example: A VPN concentrator reduces the MSS of incoming SYN packets to 1396 bytes so that the resulting 1436-byte IP packet fits perfectly inside a 1500-byte MTU after adding encryption headers.
- Black Hole Router: A router that drops packets exceeding the MTU without sending an ICMP response.
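- Example: A Security Group or firewall that blocks all inbound ICMP traffic discards the "Fragmentation Needed" message before it reaches the sending EC2 instance, so the path behaves like a Black Hole for PMTUD even if the router did send the ICMP response.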
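The arithmetic behind the MSS clamping example can be sketched as follows. The header sizes assume IPv4 and TCP without options, and the 64-byte tunnel overhead is the approximate figure used in this guide; real overhead varies with the IPsec algorithms in use. The function name `clamped_mss` is hypothetical.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def clamped_mss(link_mtu, tunnel_overhead, advertised_mss):
    """Return the MSS a clamping device would write into a TCP SYN."""
    inner_mtu = link_mtu - tunnel_overhead            # room left for the inner IP packet
    max_mss = inner_mtu - IPV4_HEADER - TCP_HEADER    # room left for TCP payload
    return min(advertised_mss, max_mss)               # clamping never raises the MSS


print(clamped_mss(1500, 64, 1460))  # 1396
print(clamped_mss(1500, 64, 1200))  # 1200
```

The second call shows that a host already advertising a smaller MSS is left unchanged.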
Worked Examples
Scenario: The "Hanging" SSH Connection
Issue: A user can connect to an EC2 instance via SSH, but when they run a command that produces a lot of output (like cat large_file.txt), the session freezes.
Step-by-Step Diagnosis:
- Test Ping with Size: On Linux, run `ping -s 1472 -M do <IP>`. (1472 bytes of payload + 28 bytes of IP and ICMP headers = 1500.)
- Observe Result: If the ping fails but a standard `ping <IP>` works, an MTU bottleneck exists.
- Check Flow Logs: Analyze VPC Flow Logs using the `packets`, `bytes`, and `action` fields. Dividing `bytes` by `packets` gives the average packet size for a flow. Flow Logs record metadata only, so use Traffic Mirroring when you need to inspect individual packets.
- Analyze Security Groups: Ensure the inbound rules allow ICMP Type 3, Code 4 from `0.0.0.0/0`.
- Solution: Enable Jumbo frames only within the VPC; ensure internet-bound traffic is capped at 1500 bytes via MSS clamping on the VPN/Router.
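The ping test above checks a single size. To find the actual path MTU, the same probe can be repeated in a binary search. The sketch below is one possible approach: the `probe` callable is hypothetical and supplied by the caller (for example, a wrapper that runs the DF-bit ping and returns whether it succeeded), and the search bounds are assumed defaults.

```python
ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_payload(mtu):
    """Payload size to pass to `ping -s` so the whole IP packet equals `mtu`."""
    return mtu - ICMP_OVERHEAD


def find_path_mtu(probe, low=576, high=9001):
    """Binary-search the largest MTU for which probe(mtu) returns True.

    Assumes probe is monotonic: if a size passes, every smaller size passes.
    """
    if not probe(low):
        raise ValueError("even the smallest probe size failed")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


# Stand-in probe simulating a path whose bottleneck is 1436 bytes.
print(find_path_mtu(lambda mtu: mtu <= 1436))  # 1436
print(ping_payload(1500))                      # 1472
```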
Checkpoint Questions
- What is the default MTU for an AWS Site-to-Site VPN connection?
- Why does blocking all ICMP traffic cause "Black Hole" connectivity issues?
- Which AWS tool allows you to see the actual bytes of a packet to verify if headers are being stripped?
- Can you use Jumbo Frames (9001 bytes) to send data from a VPC to an on-premises server over the public internet?
Muddy Points & Cross-Refs
- MSS vs. MTU: Many students confuse the two. Remember: MTU is the limit for the IP Layer (L3), while MSS is the limit for the TCP Data (L4).
- Where to change MTU: You can change MTU on the OS level (e.g., `ifconfig eth0 mtu 1500`), but it must match the path's capabilities.
- Cross-Ref: See Unit 4: Network Security for details on how NACLs (stateless) can inadvertently block ICMP responses needed for PMTUD.
Comparison Tables
MTU Support Comparison
| Feature | Standard Frame | Jumbo Frame |
|---|---|---|
| Byte Size | 1500 | 9001 |
| Efficiency | Lower (Higher overhead %) | Higher (More data per header) |
| Use Case | Internet, VPNs, General | Database clusters, HPC, Storage |
| Fragmentation | Less likely | Common if misconfigured |
| Compatibility | Universal | Within VPC / Direct Connect only |
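The "Efficiency" row can be made concrete with a quick calculation. This sketch counts only 40 bytes of IPv4 and TCP headers (no options) against the MTU and ignores Ethernet framing, so the percentages are a simplification.

```python
TCP_IP_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def payload_efficiency(mtu):
    """Fraction of each full-size packet that is TCP payload."""
    return (mtu - TCP_IP_HEADERS) / mtu


for mtu in (1500, 9001):
    print(f"MTU {mtu}: {payload_efficiency(mtu):.2%} payload")
# MTU 1500: 97.33% payload
# MTU 9001: 99.56% payload
```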