Capturing Baseline Network Performance
Establishing a performance baseline is a critical task for AWS Network Engineers. It provides the "normal" profile of network behavior, allowing for proactive troubleshooting, capacity planning, and SLA validation.
Learning Objectives
- Define the role of baselining in network monitoring and capacity planning.
- Identify key AWS services used to collect performance data, including CloudWatch, VPC Flow Logs, and Transit Gateway Network Manager.
- Explain the process and requirements for VPC Traffic Mirroring and deep packet inspection.
- Compare different monitoring tools based on their level of visibility (flow-level vs. packet-level).
Key Terms & Glossary
- Network Baseline: A set of metrics representing the normal operating state of a network over a specific period.
- Promiscuous Mode: A configuration of a network interface that allows it to receive all traffic on a network segment, regardless of the destination MAC address.
- Jitter: The variation in the delay of received packets, often critical for voice and video traffic.
- Throughput: The actual amount of data successfully transferred over the network in a given time period.
- Flow Logs: Metadata records that capture information about IP traffic going to and from network interfaces in a VPC.
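To make the Jitter and Throughput terms concrete, here is a minimal Python sketch. It assumes jitter is taken as the mean absolute difference between consecutive packet delays (one common simplification; other definitions exist) and that throughput is reported in megabits per second. The function names and sample values are hypothetical.

```python
from statistics import mean


def mean_jitter_ms(delays_ms):
    """Mean absolute difference between consecutive packet delays (ms)."""
    if len(delays_ms) < 2:
        return 0.0
    diffs = [abs(later - earlier) for earlier, later in zip(delays_ms, delays_ms[1:])]
    return mean(diffs)


def throughput_mbps(bytes_transferred, seconds):
    """Throughput in megabits per second over the measurement interval."""
    return bytes_transferred * 8 / seconds / 1_000_000


if __name__ == "__main__":
    delays = [20.0, 22.0, 21.0, 25.0]  # hypothetical one-way delays in ms
    print(f"jitter: {mean_jitter_ms(delays):.2f} ms")
    print(f"throughput: {throughput_mbps(125_000_000, 10):.1f} Mbit/s")
```

Recording both values at regular intervals gives you two of the series a baseline is built from.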
The "Big Idea"
[!IMPORTANT] Baselines are the yardsticks of the cloud. Without knowing what is "normal," you cannot identify what is "broken." A baseline transforms raw metrics into actionable intelligence by highlighting anomalies that signify security breaches, misconfigurations, or the need for increased capacity.
Formula / Concept Box
| Concept | Metric / Requirement | Purpose |
|---|---|---|
| Utilization | | Identify saturation points and bottlenecks. |
| Packet Loss | | Measure link reliability and congestion. |
| Mirror Target | ENI in Promiscuous Mode | Required for receiving mirrored packet data. |
| CloudWatch Alarms | | Automate response to baseline deviations. |
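Utilization and Packet Loss are commonly expressed as percentages. The definitions below are the usual textbook forms, given here as an assumption rather than taken from this guide; the term names are illustrative.

```latex
\[
\text{Utilization (\%)} = \frac{\text{observed throughput}}{\text{link capacity}} \times 100
\qquad
\text{Packet loss (\%)} = \frac{\text{packets sent} - \text{packets received}}{\text{packets sent}} \times 100
\]
```

For example, a link carrying 400 Mbit/s on a 1 Gbit/s connection is at 40% utilization.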
Hierarchical Outline
- I. The Importance of Baselines
  - Usage Tracking: Understanding usage patterns over time (daily, weekly, monthly).
  - Anomaly Detection: Identifying metrics that exceed baseline ranges to trigger resolutions.
  - Predictive Maintenance: Addressing issues before they become critical failures.
- II. AWS Native Monitoring Tools
  - Amazon CloudWatch: Collects NetworkIn/Out and NetworkPacketsIn/Out metrics.
  - Transit Gateway Network Manager: Provides visibility into packet loss, latency, and global topology.
  - Route 53 Resolver Logs: Monitors DNS query latency and resolution failure rates.
- III. Deep Packet Inspection (DPI)
  - VPC Traffic Mirroring: Copies L2 traffic from a source ENI to a target device.
  - Analysis Tools: Using Wireshark for inspecting source/destination IPs and protocols.
  - QoS Adjustments: Using findings to prioritize delay-sensitive traffic (e.g., Voice vs. Storage).
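The anomaly-detection idea in the outline can be sketched in a few lines of Python. This is one possible approach, not a prescribed AWS method: it assumes the baseline is summarized as a mean and standard deviation of historical samples, and that a value more than `k` standard deviations away counts as anomalous. The sample values and the default `k=3.0` are hypothetical.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline_mean, baseline_std, k=3.0):
    """Flag a value that lies more than k standard deviations from the mean."""
    return abs(value - baseline_mean) > k * baseline_std


if __name__ == "__main__":
    # Hypothetical hourly NetworkOut totals in MB, collected over a normal week.
    history_mb = [480, 510, 495, 520, 505, 490, 500]
    mu, sigma = build_baseline(history_mb)
    for observed in (515, 5000):
        print(observed, "anomalous" if is_anomalous(observed, mu, sigma) else "normal")
```

A fixed band like this ignores daily and weekly seasonality, so in practice the baseline is usually computed per time window (for example, per hour of the week).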
Visual Anchors
Traffic Mirroring Architecture
Visualizing Performance Spikes
\begin{tikzpicture}[scale=0.8]
% Axes
\draw[->] (0,0) -- (6,0) node[right] {Time};
\draw[->] (0,0) -- (0,4) node[above] {Traffic Volume};
% Baseline (Dashed)
\draw[dashed, blue, thick] (0,1) .. controls (1,1.2) and (2,0.8) .. (3,1)
.. controls (4,1.2) and (5,0.8) .. (6,1);
\node[blue] at (5,0.5) {Baseline};
% Actual Traffic (Solid)
\draw[red, thick] (0,0.8) -- (2,0.9) -- (2.5,3.5) -- (3,1.2) -- (4,1.1) -- (6,1);
\node[red] at (2.5,3.8) {Anomaly};
% Threshold line
\draw[thick, gray] (0,2.5) -- (6,2.5);
\node[gray] at (5.5,2.7) {SLA};
\end{tikzpicture}
Definition-Example Pairs
- VPC Flow Logs: Metadata capture of IP traffic flows.
  - Example: Checking if a specific Security Group is dropping traffic by looking for `REJECT` records in the flow logs.
- Transit Gateway Network Manager: A centralized dashboard for global network health.
  - Example: Visualizing a 50 ms latency spike between a VPC in `us-east-1` and an on-premises data center via Direct Connect.
- Packet Shaping: Modifying the flow of data to optimize performance.
  - Example: Applying Quality of Service (QoS) rules to ensure VoIP packets are processed before background database backups.
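The Flow Logs example above can be illustrated with a short Python sketch that filters `REJECT` records. It assumes records in the default (version 2) space-separated format; custom log formats use different fields. The sample records, account ID, and ENI ID are made up for illustration.

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_flow_record(line):
    """Split one default-format flow log line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_flows(lines):
    """Return the parsed records whose action is REJECT."""
    return [rec for rec in map(parse_flow_record, lines) if rec["action"] == "REJECT"]


if __name__ == "__main__":
    sample = [  # hypothetical records
        "2 123456789012 eni-0abc1234 10.0.1.5 10.0.2.9 49152 443 6 10 8400 1700000000 1700000060 ACCEPT OK",
        "2 123456789012 eni-0abc1234 203.0.113.7 10.0.1.5 40000 22 6 1 40 1700000000 1700000060 REJECT OK",
    ]
    for rec in rejected_flows(sample):
        print(rec["srcaddr"], "->", rec["dstaddr"], "port", rec["dstport"])
```

Keep in mind that a `REJECT` record can result from either a security group or a network ACL, so the record alone does not say which one dropped the traffic.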
Worked Examples
Example 1: Calculating Baseline Deviation
Scenario: An EC2 instance usually has a NetworkOut average of 500 MB/hour. Suddenly, CloudWatch reports 5 GB/hour.
- Identify Baseline: 500 MB/hour.
- Compare Current: 5000 MB/hour.
- Calculation: The current load is $10 \times$ the baseline.
- Action: Investigate for data exfiltration or a misconfigured backup job.
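The arithmetic in Example 1 can be checked with a small Python helper. The function name is hypothetical; it returns both the multiple of the baseline and the percentage increase, since either may be used when setting alarm thresholds.

```python
def deviation(baseline, current):
    """Return (ratio to baseline, percentage change from baseline)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    ratio = current / baseline
    percent_change = (current - baseline) / baseline * 100
    return ratio, percent_change


if __name__ == "__main__":
    # Values from the scenario: 500 MB/hour baseline, 5000 MB/hour observed.
    ratio, pct = deviation(500, 5000)
    print(f"{ratio:.0f}x baseline, {pct:.0f}% increase")
```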
Example 2: Configuring Traffic Mirroring
Scenario: You need to inspect packets for an application that is intermittently dropping connections.
- Create Target: Deploy an EC2 instance with an ENI in the same VPC.
- Create Filter: Define a filter for the specific port and protocol used by the app.
- Create Session: Map the Source ENI to the Target ENI using the filter.
- Capture: Run `tcpdump` or Wireshark on the target instance to see the raw payloads.
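The target, filter, and session steps above might be scripted roughly as follows. This sketch assumes `ec2` is a boto3 EC2 client created elsewhere (for example with `boto3.client("ec2")`); the function name, rule number, and open CIDR ranges are illustrative choices, and only an inbound TCP rule is created.

```python
def configure_mirroring(ec2, source_eni, target_eni, port, session_number=1):
    """Create a mirror target, a single-port TCP filter, and a session.

    `ec2` is assumed to be a boto3 EC2 client; the ENI IDs are supplied by the caller.
    """
    # Step 1: register the capture instance's ENI as the mirror target.
    target = ec2.create_traffic_mirror_target(NetworkInterfaceId=target_eni)
    target_id = target["TrafficMirrorTarget"]["TrafficMirrorTargetId"]

    # Step 2: create a filter and accept inbound TCP traffic to the app port.
    mirror_filter = ec2.create_traffic_mirror_filter(
        Description=f"Mirror TCP/{port} for troubleshooting"
    )
    filter_id = mirror_filter["TrafficMirrorFilter"]["TrafficMirrorFilterId"]
    ec2.create_traffic_mirror_filter_rule(
        TrafficMirrorFilterId=filter_id,
        TrafficDirection="ingress",
        RuleNumber=100,            # illustrative rule priority
        RuleAction="accept",
        Protocol=6,                # IANA protocol number for TCP
        DestinationPortRange={"FromPort": port, "ToPort": port},
        SourceCidrBlock="0.0.0.0/0",
        DestinationCidrBlock="0.0.0.0/0",
    )

    # Step 3: tie the source ENI to the target through the filter.
    session = ec2.create_traffic_mirror_session(
        NetworkInterfaceId=source_eni,
        TrafficMirrorTargetId=target_id,
        TrafficMirrorFilterId=filter_id,
        SessionNumber=session_number,
    )
    return session["TrafficMirrorSession"]["TrafficMirrorSessionId"]
```

To also see the application's replies, an additional `egress` rule would be needed. Mirrored packets reach the target VXLAN-encapsulated on UDP port 4789, so a capture filter such as `udp port 4789` is a reasonable starting point on the target instance.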
Checkpoint Questions
- What is the main difference between VPC Flow Logs and VPC Traffic Mirroring?
- Why must the destination interface for Traffic Mirroring be in promiscuous mode?
- Which tool would you use to map the global topology of your AWS Transit Gateways?
- If an application is latency-sensitive, which metric in Transit Gateway Network Manager is most critical?
Muddy Points & Cross-Refs
- Flow Logs vs. Mirroring: Flow logs are cheap and capture metadata (IP/Port), whereas Mirroring is more expensive/complex but captures the actual data inside the packets.
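- Promiscuous Mode: Many students forget that the target instance OS must also support promiscuous mode to "see" the traffic redirected to it. AWS delivers mirrored packets VXLAN-encapsulated on UDP port 4789, so the target's security group must also allow that traffic.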
- Cross-Refs: See Chapter 6: Security for using Flow Logs in threat detection, and Unit 1: Design for implementing Direct Connect.
Comparison Tables
| Feature | VPC Flow Logs | VPC Traffic Mirroring | CloudWatch Metrics |
|---|---|---|---|
| Data Type | Metadata (Flows) | Full Packet (Payload) | Aggregated Metrics |
| Granularity | 1 min / 10 min | Real-time | 1 min (Standard) |
| Use Case | Security / Connectivity | Deep Troubleshooting | Capacity Planning |
| Cost | Low | High | Medium |
| Analysis Tool | CloudWatch Logs Insights | Wireshark / Suricata | CloudWatch Dashboards |