AWS Network Performance and Reachability Assessment Guide
Appropriate logs and metrics to assess network performance and reachability issues (for example, packet loss)
This guide focuses on the tools, logs, and metrics used to monitor, analyze, and optimize network traffic within AWS. Understanding the nuances between metadata (logs) and raw data (packets) is critical for passing the AWS Certified Advanced Networking Specialty (ANS-C01) exam.
Learning Objectives
After studying this guide, you should be able to:
- Identify appropriate metrics for measuring latency, jitter, and packet loss.
- Differentiate between VPC Flow Logs, CloudWatch Metrics, and VPC Traffic Mirroring use cases.
- Utilize Reachability Analyzer to troubleshoot configuration-based connectivity issues.
- Analyze Transit Gateway Network Manager and Network Performance Monitor (NPM) data for cross-account/region visibility.
Key Terms & Glossary
- Latency: The time delay (usually in milliseconds) for a data packet to travel from source to destination.
- Jitter: The variation or "noise" in the delay of received packets. High jitter can degrade real-time traffic like VoIP.
- Packet Loss: A condition where one or more packets of data traveling across a computer network fail to reach their destination.
- Throughput: The actual amount of data successfully transmitted over a network per unit of time (bits/sec).
- Promiscuous Mode: A configuration of a network interface that allows it to receive all traffic on a network segment, required for packet capture tools like Wireshark.
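The glossary terms above can be made concrete with a small calculation. The following is a minimal sketch, assuming a list of round-trip-time probes where a lost probe is recorded as `None`. The function name `summarize_probes` and the sample values are hypothetical, and jitter is computed here as the mean absolute difference between consecutive received samples, which is only one of several common definitions.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize probe results.

    rtts_ms: round-trip times in milliseconds, with None for a lost probe.
    Returns average latency, a simple jitter estimate, and packet loss percent.
    """
    if not rtts_ms:
        raise ValueError("at least one probe is required")

    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)

    # Latency: average delay of the probes that arrived.
    latency_ms = mean(received) if received else None

    # Jitter (simplified): mean absolute change between consecutive received samples.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0

    return {"latency_ms": latency_ms, "jitter_ms": jitter_ms, "loss_pct": loss_pct}


if __name__ == "__main__":
    # Hypothetical samples: five probes, the third one lost.
    print(summarize_probes([10, 14, None, 12, 18]))
```

Two links with the same average latency can behave very differently for real-time traffic if their jitter or loss figures differ, which is why the three values are reported separately.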
The "Big Idea"
Network performance in the cloud is not a binary "up or down" state. It is a spectrum of health defined by constraints. To solve complex issues, an engineer must move down the OSI model: starting with high-level metrics (CloudWatch), moving to metadata logs (VPC Flow Logs) to see if traffic is accepted/rejected, and finally performing Deep Packet Inspection (Traffic Mirroring) to see the actual contents and timing of the frames.
Formula / Concept Box
| Concept | Metric / Calculation | Significance |
|---|---|---|
| Throughput | $T = \frac{\text{Data Size}}{\text{Time}}$ | Measures the effective speed of the link. |
| NetworkOut | CloudWatch Metric (Bytes) | Amount of traffic sent out of an instance. |
| NetworkPacketsOut | CloudWatch Metric (Count) | Number of packets; high counts with low bytes suggest small-packet overhead. |
| PPS Limit | Packets Per Second | Per-instance packet rate allowance that depends on instance type and size; traffic exceeding it is queued or dropped. |
| MTU | Maximum Transmission Unit | Usually 1500 (Internet) or 9001 (Jumbo Frames within VPC). |
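
As a rough illustration of how the table's metrics combine, the sketch below derives throughput, packet rate, and average packet size from one CloudWatch period. It assumes the inputs are the `Sum` statistic of `NetworkOut` and `NetworkPacketsOut` over the same period; the function name `link_stats` and the sample numbers are hypothetical.

```python
def link_stats(network_out_bytes, network_packets_out, period_seconds):
    """Derive rates from the Sum of NetworkOut / NetworkPacketsOut over one period.

    Returns (throughput in bits/sec, packets/sec, average packet size in bytes).
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")

    # T = Data Size / Time, converted from bytes to bits.
    throughput_bps = network_out_bytes * 8 / period_seconds

    # Packet rate, to compare against the instance's PPS allowance.
    pps = network_packets_out / period_seconds

    # A small average size with a high packet rate suggests small-packet overhead.
    avg_packet_bytes = (
        network_out_bytes / network_packets_out if network_packets_out else 0.0
    )
    return throughput_bps, pps, avg_packet_bytes


if __name__ == "__main__":
    # Hypothetical 5-minute datapoint: 1.5 GB sent in 3 million packets.
    print(link_stats(1_500_000_000, 3_000_000, 300))
```

An instance can hit its packet rate allowance long before it saturates its bandwidth, so looking at bytes alone can hide the real constraint.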
Hierarchical Outline
- High-Level Monitoring (Metrics)
- CloudWatch EC2 Metrics: NetworkIn, NetworkOut, NetworkPacketsIn, NetworkPacketsOut.
- ELB Metrics: ActiveFlowCount, TCP_Client_Reset_Count.
- Metadata Logging (VPC Flow Logs)
- Captures: 5-tuple (Src/Dest IP, Src/Dest Port, Protocol).
- Usage: Determining whether security groups or network ACLs are rejecting traffic (each record is marked ACCEPT or REJECT).
- Network Analysis Tools
- Reachability Analyzer: Logic-based tool (no traffic sent) to check if a path exists.
- Transit Gateway Network Manager: Centralized visualization for global networks.
- Network Performance Monitor (NPM): Real-time visibility into packet loss and latency.
- Deep Packet Analysis
- VPC Traffic Mirroring: Copying ENI traffic to a destination for inspection.
- Wireshark: Analysis tool for troubleshooting obscure L4-L7 issues.
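To illustrate the metadata-logging layer of the outline, here is a minimal sketch that parses VPC Flow Log records and counts rejected flows. It assumes the default (version 2) space-separated record format; the helper names `parse_record` and `rejected_flows` are hypothetical, and the sample records are illustrative rather than taken from a real account.

```python
from collections import Counter, namedtuple

# Field order of the default (version 2) flow log format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()
FlowRecord = namedtuple("FlowRecord", FIELDS)


def parse_record(line):
    """Split one default-format flow log line into named fields."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return FlowRecord(*parts)


def rejected_flows(lines):
    """Count REJECT records per (source, destination, destination port, protocol)."""
    counts = Counter()
    for line in lines:
        rec = parse_record(line)
        if rec.action == "REJECT":
            counts[(rec.srcaddr, rec.dstaddr, rec.dstport, rec.protocol)] += 1
    return counts


# Illustrative sample records (not real traffic).
SAMPLE = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]

if __name__ == "__main__":
    for flow, count in rejected_flows(SAMPLE).items():
        print(flow, count)
```

Note that the record only carries addresses, ports, protocol, counters, and the action; nothing from the payload is present, which is the boundary between Flow Logs and Traffic Mirroring.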
Visual Anchors
Troubleshooting Flowchart
Network Constraint Visualization
\begin{tikzpicture}[scale=1.2, every node/.style={scale=0.8}]
% Axes
\draw [->, thick] (0,0) -- (5,0) node[right] {Time};
\draw [->, thick] (0,0) -- (0,4) node[above] {Data Arrival};
% Ideal Arrival (Steady)
\draw [blue, thick] (0.5, 0.5) circle (2pt);
\draw [blue, thick] (1.5, 0.5) circle (2pt);
\draw [blue, thick] (2.5, 0.5) circle (2pt);
\draw [blue, thick] (3.5, 0.5) circle (2pt);
\node [blue] at (2,-0.4) {Low Jitter (Ideal)};
% Jitter Arrival (Scattered)
\draw [red, thick] (0.5, 2.5) circle (2pt);
\draw [red, thick] (1.2, 2.5) circle (2pt);
\draw [red, thick] (2.8, 2.5) circle (2pt);
\draw [red, thick] (3.1, 2.5) circle (2pt);
\node [red] at (2, 2.1) {High Jitter (Inconsistent)};
% Packet Loss (Missing)
\draw [black, thick] (0.5, 3.5) circle (2pt);
\draw [dashed, gray] (1.5, 3.5) circle (2pt);
\draw [black, thick] (2.5, 3.5) circle (2pt);
\node at (2, 3.8) {Packet Loss (X)};
\end{tikzpicture}
Definition-Example Pairs
- Reachability Analyzer: A tool that performs static analysis of your VPC configuration to determine if a path exists between two points.
- Example: An EC2 instance is unreachable from the internet through its internet gateway. Reachability Analyzer can identify the specific route table entry or security group rule blocking the path without sending a single packet.
- VPC Traffic Mirroring: A feature that allows you to extract and send network traffic from an ENI to a security/monitoring appliance.
- Example: An application is experiencing mysterious TCP "reset" flags (RST). You mirror the traffic to a separate EC2 instance running Wireshark to inspect the TCP headers and identify the source of the resets.
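The Reachability Analyzer workflow described above can also be driven through the EC2 API. The following is a hedged sketch, assuming a boto3 EC2 client is passed in and that the caller has the required permissions. The function name `check_path`, the polling parameters, and any resource IDs are hypothetical; the three client calls are the boto3 EC2 operations for network insights paths and analyses.

```python
import time


def check_path(ec2, source_id, destination_id, port, poll_seconds=2.0, max_polls=30):
    """Run a Reachability Analyzer analysis and return (status, path_found, explanations).

    ec2: a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    # Define the path to analyze; no packets are sent at any point.
    path = ec2.create_network_insights_path(
        Source=source_id,
        Destination=destination_id,
        Protocol="tcp",
        DestinationPort=port,
    )
    path_id = path["NetworkInsightsPath"]["NetworkInsightsPathId"]

    started = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
    analysis_id = started["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # Poll until the analysis leaves the "running" state.
    for _ in range(max_polls):
        resp = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id]
        )
        analysis = resp["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            return (
                analysis["Status"],
                analysis.get("NetworkPathFound"),
                analysis.get("Explanations", []),
            )
        time.sleep(poll_seconds)
    raise TimeoutError("analysis did not finish within the polling budget")
```

When no path is found, the returned explanations are where a blocking component would be reported, so they are the first thing to inspect.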
Worked Examples
Scenario 1: Identifying Packet Loss on Transit Gateway
Problem: Users report intermittent connection drops between an on-premises data center and an AWS VPC connected via Transit Gateway (TGW).
Step-by-Step Solution:
- Open Transit Gateway Network Manager.
- Review the Packet Loss metric for the specific attachment connecting the VPN/Direct Connect.
- If loss is high, check Bandwidth Utilization. If utilization is near 100%, the TGW may be throttling traffic.
- Use CloudWatch Alarms to set a threshold for `PacketLoss` so that the operations team is notified before users complain.
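The alarm step above could be sketched as follows. This builds the keyword arguments for the CloudWatch `put_metric_alarm` call rather than calling AWS directly. The function name `packet_loss_alarm`, the period, and the evaluation settings are assumptions chosen for illustration, and the namespace and dimensions must be replaced with whatever your monitoring service actually publishes the `PacketLoss` metric under.

```python
def packet_loss_alarm(alarm_name, namespace, dimensions, threshold_pct, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    namespace and dimensions are placeholders supplied by the caller; they
    depend on which service publishes the PacketLoss metric in your account.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": "PacketLoss",
        "Dimensions": dimensions,
        "Statistic": "Average",
        "Period": 60,               # evaluate one-minute datapoints (assumed)
        "EvaluationPeriods": 3,     # require three consecutive breaches (assumed)
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Usage sketch (requires boto3 and AWS credentials; all values are hypothetical):
#   import boto3
#   params = packet_loss_alarm("tgw-loss", "<namespace>", [], 1.0, "<sns-topic-arn>")
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring several consecutive breaching periods is a deliberate choice here: a single lossy minute is common on busy links, while sustained loss is what users actually notice.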
Scenario 2: Troubleshooting DNS Resolution Failures
Problem: Instances are failing to connect to external APIs by name, but IP-based connectivity works.
Step-by-Step Solution:
- Enable Route 53 Resolver Query Logging.
- Analyze the logs to see if the query is reaching the resolver.
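- Look for the `RCODE`. If it is `SERVFAIL`, the resolver could not obtain a valid answer, which usually points to a problem with the upstream DNS server. If it is `NXDOMAIN`, the queried name does not exist, so check the hostname for errors.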
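The log analysis step above can be sketched as a small tally of response codes per queried name. This assumes the query logs are available as JSON lines and that each record exposes `query_name` and `rcode` fields; the function names and the sample records are hypothetical and only contain the fields needed for the example.

```python
import json
from collections import Counter


def rcode_summary(json_lines):
    """Count (query_name, rcode) pairs across JSON-lines query log records."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        counts[(rec.get("query_name"), rec.get("rcode"))] += 1
    return counts


def failing_names(json_lines, bad_rcodes=("SERVFAIL", "NXDOMAIN")):
    """Return the sorted set of names that produced a failing response code."""
    summary = rcode_summary(json_lines)
    return sorted({name for (name, rcode) in summary if rcode in bad_rcodes})


# Hypothetical records, reduced to the two fields used above.
SAMPLE_LOG = [
    '{"query_name": "api.example.com.", "rcode": "NOERROR"}',
    '{"query_name": "api.exmaple.com.", "rcode": "NXDOMAIN"}',
    '{"query_name": "internal.example.com.", "rcode": "SERVFAIL"}',
    '{"query_name": "internal.example.com.", "rcode": "SERVFAIL"}',
]

if __name__ == "__main__":
    print(rcode_summary(SAMPLE_LOG))
    print(failing_names(SAMPLE_LOG))
```

If a name never appears in the logs at all, the query is not reaching the resolver, which shifts the investigation to the instance's DNS settings or the network path to the resolver.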
Checkpoint Questions
- Which tool would you use to verify that a security group is the reason for a blocked connection without generating traffic?
- What is the difference between `NetworkIn` and `NetworkPacketsIn` in CloudWatch?
- True or False: VPC Flow Logs capture the payload (data content) of the packets.
- Which service provides a global view of network topology and performance metrics like latency across regions?
Answers
- Reachability Analyzer.
- NetworkIn measures the volume (Bytes), while NetworkPacketsIn measures the count of packets.
- False (Flow logs only capture metadata/headers).
- Transit Gateway Network Manager.
Muddy Points & Cross-Refs
- Flow Logs vs. Traffic Mirroring: This is a common exam trap. Use Flow Logs for "Who/Where/Result" (Metadata). Use Traffic Mirroring for "What/Why" (Payload/Timing).
- Promiscuous Mode: Capture tools such as `tcpdump` usually place the interface in promiscuous mode themselves. For Traffic Mirroring, also remember that mirrored packets reach the target encapsulated in VXLAN (UDP port 4789), so the target's security group must allow that traffic and the capture tool must decode the VXLAN header.
- Global Accelerator: While it improves performance by moving traffic onto the AWS backbone earlier, it is monitored via its own set of CloudWatch metrics, not VPC Flow Logs (as it sits at the edge).
Comparison Tables
| Tool | Data Depth | Cost | Best Use Case |
|---|---|---|---|
| CloudWatch Metrics | Aggregate Statistics | Low | Trend analysis, alerting on threshold breaches. |
| VPC Flow Logs | 5-Tuple Metadata | Medium | Security auditing, rule verification (Accept/Reject). |
| Traffic Mirroring | Full Packet Content | High | Malware analysis, complex protocol troubleshooting. |
| Reachability Analyzer | Config Path Logic | Per-Path | "Why is this connection blocked?" (Static analysis). |