AWS Network Performance Analysis & Troubleshooting Study Guide
Analyzing tool output to assess network performance and troubleshoot connectivity (for example, VPC Flow Logs, Amazon CloudWatch Logs)
This guide covers the essential tools and techniques required to analyze network performance and troubleshoot connectivity within AWS, specifically focusing on the ANS-C01 curriculum.
Learning Objectives
By the end of this module, you should be able to:
- Configure and Interpret VPC Flow Logs to identify traffic patterns and security rejections.
- Utilize Amazon CloudWatch to create alarms and dashboards for network health monitoring.
- Perform Deep Packet Inspection (DPI) using VPC Traffic Mirroring for complex troubleshooting.
- Validate Connectivity Pathing using AWS Reachability Analyzer and Transit Gateway Network Manager.
- Identify Root Causes of connectivity failures such as Security Group/NACL misconfigurations or MTU mismatches.
Key Terms & Glossary
- VPC Flow Logs: A feature that enables you to capture information about the IP traffic going to and from network interfaces in your VPC.
- CloudWatch Logs: A managed service to monitor, store, and access log files from AWS resources.
- Traffic Mirroring: An Amazon VPC feature that you can use to copy network traffic from an elastic network interface (ENI).
- Reachability Analyzer: A configuration analysis tool that enables you to perform connectivity testing between a source and destination in your VPC.
- 5-Tuple: The five pieces of information that uniquely identify a network connection (Source IP, Destination IP, Source Port, Destination Port, Protocol).
The "Big Idea"
In cloud networking, visibility is often obscured by the shared responsibility model. You cannot plug a physical sniffer into an AWS rack. Therefore, troubleshooting depends on telemetry aggregation. By correlating VPC Flow Logs (Layer 4 metadata) with Reachability Analyzer (Control Plane logic) and Traffic Mirroring (Data Plane reality), you create a comprehensive observability stack to solve complex hybrid networking issues.
Formula / Concept Box
| Log/Metric Component | Purpose | Key Identifier |
|---|---|---|
| Flow Log Format | Core 5-tuple fields within each record | ${srcaddr} ${dstaddr} ${srcport} ${dstport} ${protocol} |
| Action Status | Result of security check | ACCEPT (Permitted) or REJECT (Denied) |
| Log Status | Quality of log data | OK (normal), NODATA (no traffic during the interval), or SKIPDATA (records skipped during the interval) |
| Reachability Status | Logical path check | Reachable or Unreachable (with failure point) |
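The fields in the table above can be pulled out of a raw record with a few lines of Python. This is a minimal sketch that assumes the default (version 2) space-delimited record layout; the sample record, account ID, and ENI ID are made up for illustration, and a custom log format would need a different field list.

```python
# Field order of the default (version 2) VPC Flow Log record.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

def parse_flow_record(line):
    """Split one space-delimited record into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(
            f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(DEFAULT_FIELDS, values))

def five_tuple(record):
    """Return the 5-tuple that identifies the flow (values stay strings)."""
    return (
        record["srcaddr"], record["dstaddr"],
        record["srcport"], record["dstport"], record["protocol"],
    )

# Illustrative record only: an SSH attempt (TCP is protocol 6) that was rejected.
sample = (
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 REJECT OK"
)
record = parse_flow_record(sample)
print(five_tuple(record), record["action"], record["log_status"])
```

Values are kept as strings because NODATA and SKIPDATA records carry `-` in place of the address, port, and action fields.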
Hierarchical Outline
- Network Observability Tools
  - VPC Flow Logs: Capture IP traffic metadata; can be sent to S3, CloudWatch Logs, or Amazon Data Firehose.
  - CloudWatch Metrics: Monitor throughput, latency, and packet loss.
  - CloudWatch Logs Insights: Query language for searching millions of log lines.
- Path Analysis & Routing
  - Reachability Analyzer: Tests logical paths without sending packets (Dry run).
  - TGW Network Manager: Visualizes global topology across regions.
- Advanced Troubleshooting
  - VPC Traffic Mirroring: Captures raw L2-L7 packets for Wireshark analysis.
  - MTU Verification: Troubleshooting "jumbo frame" issues (9001 bytes) vs standard internet MTU (1500 bytes).
Visual Anchors
Log Aggregation Workflow
Logical Reachability Check
Definition-Example Pairs
- REJECT in Flow Logs: Indicates traffic was blocked by a Security Group or NACL.
  - Example: A web server's flow log shows REJECT on port 22; this suggests the Security Group lacks an ingress rule for SSH, or that a NACL denies it.
- MTU Mismatch: Occurs when a packet is larger than the smallest MTU along the path and cannot be forwarded without fragmentation.
  - Example: A Direct Connect link drops packets larger than 1500 bytes because Jumbo Frames (9001) were enabled on the EC2 instance but not supported by the router.
- Packet Shaping/Throttling: The intentional slowing of traffic to meet limits.
  - Example: Monitoring the NetworkOut metric in CloudWatch to see if an instance is hitting its baseline bandwidth limit.
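The MTU mismatch in the list above comes down to simple arithmetic: the usable MTU of a path is the smallest MTU of any hop on it. The sketch below is illustrative only; the function names are hypothetical, and the header sizes assume IPv4 and TCP without options.

```python
IPV4_HEADER_BYTES = 20  # assumption: no IP options
TCP_HEADER_BYTES = 20   # assumption: no TCP options

def path_mtu(hop_mtus):
    """The usable MTU of a path is the smallest MTU of any hop on it."""
    return min(hop_mtus)

def max_tcp_payload(mtu):
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES

def passes_unfragmented(packet_bytes, hop_mtus):
    """True if a packet of this size crosses every hop without fragmentation."""
    return packet_bytes <= path_mtu(hop_mtus)

# Jumbo frames on the EC2 instance, standard MTU on the on-premises router.
hops = [9001, 1500]
print(path_mtu(hops))                   # 1500
print(max_tcp_payload(path_mtu(hops)))  # 1460
print(passes_unfragmented(9001, hops))  # False
```

A full-size jumbo packet from the instance fails the check, which matches the dropped-packet symptom in the example.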
Worked Examples
Example 1: The "Unreachable" Web Server
Scenario: An EC2 instance in a private subnet cannot reach a database in another VPC via VPC Peering.
- Step 1: Check VPC Flow Logs. If you see REJECT on the source side, the Security Group needs an update.
- Step 2: If Flow Logs show ACCEPT but traffic still fails, run Reachability Analyzer.
- Step 3: Reachability Analyzer reports "Unreachable" due to a missing route in the Route Table for the Peering Connection.
- Solution: Add the destination CIDR to the source subnet route table pointing to the pcx-xxxx ID.
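The missing-route failure in this example can be illustrated with a small longest-prefix-match lookup. This is a sketch under stated assumptions, not how Reachability Analyzer is implemented: the CIDR ranges and database address are hypothetical, and `pcx-xxxx` is the placeholder ID from the scenario.

```python
import ipaddress

def lookup_route(routes, dest_ip):
    """Return the target of the most specific matching route, or None."""
    dest = ipaddress.ip_address(dest_ip)
    matches = []
    for cidr, target in routes.items():
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route: traffic to this address is dropped
    # Longest prefix wins, as in a VPC route table.
    return max(matches)[1]

# Hypothetical source subnet route table: only the local VPC route exists.
routes = {"10.0.0.0/16": "local"}
db_ip = "10.1.5.20"  # hypothetical database address in the peered VPC

print(lookup_route(routes, db_ip))  # None

# The fix from the solution step: route the peer CIDR to the peering connection.
routes["10.1.0.0/16"] = "pcx-xxxx"
print(lookup_route(routes, db_ip))  # pcx-xxxx
```

The return path needs the same check: the database subnet's route table must also point the source CIDR at the peering connection.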
Example 2: Intermittent Latency on Hybrid Links
Scenario: A hybrid app via Direct Connect is experiencing high latency.
- Analyze: Use CloudWatch metrics for the Direct Connect connection (AWS/DX namespace).
- Discovery: ConnectionBpsEgress is peaking at the provisioned limit.
- Solution: Implement CloudFront for static assets or upgrade the DX connection bandwidth.
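Whether a link is "peaking at the provisioned limit" can be checked by comparing the bits-per-second samples against the connection's capacity. The sketch below uses made-up sample values and a hypothetical 1 Gbps connection; in practice the samples would come from the CloudWatch metric for the link, and the 90% threshold is an arbitrary choice for illustration.

```python
def peak_utilization(bps_samples, provisioned_bps):
    """Highest observed throughput as a fraction of provisioned capacity."""
    return max(bps_samples) / provisioned_bps

def share_near_limit(bps_samples, provisioned_bps, threshold=0.9):
    """Fraction of samples at or above `threshold` of provisioned capacity."""
    near = [s for s in bps_samples if s / provisioned_bps >= threshold]
    return len(near) / len(bps_samples)

provisioned = 1_000_000_000  # hypothetical 1 Gbps connection
samples = [420_000_000, 910_000_000, 990_000_000, 650_000_000]  # made-up data

print(f"peak utilization: {peak_utilization(samples, provisioned):.0%}")
print(f"samples near limit: {share_near_limit(samples, provisioned):.0%}")
```

A high peak with only a few samples near the limit suggests bursts; a high share near the limit suggests sustained saturation, which is the stronger case for a bandwidth upgrade.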
Checkpoint Questions
- What is the main difference between a REJECT recorded by a Security Group versus a NACL in Flow Logs? (Hint: Security groups are stateful; NACLs are stateless).
- Which tool would you use to verify if a packet is being malformed during transit?
- If CloudWatch Metrics show 0% packet loss but the application reports timeouts, which AWS tool should you use next?
Muddy Points & Cross-Refs
- Flow Logs vs. Traffic Mirroring: Remember, Flow Logs are metadata (like a phone bill: who called whom and for how long). Traffic Mirroring is the actual recording of the conversation. Use Flow Logs first; use Mirroring only for deep protocol errors.
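- Security Groups vs. NACLs: A REJECT in the Flow Log does not specify which one blocked the traffic. Check the Security Group first, then the NACL. Because Security Groups are stateful, an ACCEPT on the request followed by a REJECT on the response points to the NACL.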
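The stateful-versus-stateless distinction gives a rough way to narrow down the blocker from a pair of flow records. The function below is a hypothetical triage helper, not an AWS API; it assumes you have already matched the request record with the record for its response.

```python
def likely_blocker(request_action, response_action=None):
    """Guess which control blocked a flow from its request/response actions.

    request_action:  ACCEPT or REJECT for the initiating packet.
    response_action: ACCEPT or REJECT for the reply, or None if no record.
    """
    if request_action == "REJECT":
        # Either control can drop the first packet; the log cannot tell which.
        return "security group or NACL (request direction)"
    if request_action == "ACCEPT" and response_action == "REJECT":
        # A stateful security group allows replies automatically,
        # so a rejected reply points at the stateless NACL.
        return "NACL (response direction)"
    return "no block visible in flow logs"

print(likely_blocker("REJECT"))
print(likely_blocker("ACCEPT", "REJECT"))
print(likely_blocker("ACCEPT", "ACCEPT"))
```

When the request itself is rejected, the result stays ambiguous, which is why the check order above still applies.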
Comparison Tables
| Feature | VPC Flow Logs | Reachability Analyzer | VPC Traffic Mirroring |
|---|---|---|---|
| Layer | Layer 4 (Metadata) | Control Plane (Logic) | Layer 2-7 (Packets) |
| Cost | Low (per GB ingested) | Per analysis ($0.10) | High (per hour + throughput) |
| Use Case | Security auditing / Trending | Debugging pathing/routing | Forensic analysis / IDS |
| Real-time? | Delayed (1-10 mins) | Instant (On-demand) | Real-time stream |