Capturing Baseline Network Performance

Establishing a performance baseline is a critical task for AWS Network Engineers. It provides the "normal" profile of network behavior, allowing for proactive troubleshooting, capacity planning, and SLA validation.

Learning Objectives

Define the role of baselining in network monitoring and capacity planning.
Identify key AWS services used to collect performance data, including CloudWatch, VPC Flow Logs, and Transit Gateway Network Manager.
Explain the process and requirements for VPC Traffic Mirroring and deep packet inspection.
Compare different monitoring tools based on their level of visibility (flow-level vs. packet-level).

Key Terms & Glossary

Network Baseline: A set of metrics representing the normal operating state of a network over a specific period.
Promiscuous Mode: A configuration of a network interface that allows it to receive all traffic on a network segment, regardless of the destination MAC address.
Jitter: The variation in the delay of received packets, often critical for voice and video traffic.
Throughput: The actual amount of data successfully transferred over the network in a given time period.
Flow Logs: Metadata records that capture information about IP traffic going to and from network interfaces in a VPC.
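The Jitter and Throughput definitions above can be made concrete with a small calculation. This is a minimal sketch that assumes delay samples in milliseconds. It uses a simplified jitter measure: the mean absolute difference between consecutive delays. RFC 3550 (RTP) instead defines a smoothed estimator. The function names are hypothetical.

```python
from statistics import mean


def simple_jitter_ms(delays_ms: list[float]) -> float:
    """Mean absolute change between consecutive delay samples, in ms.

    Simplified measure for illustration; RFC 3550 uses the smoothed
    estimator J = J + (|D| - J) / 16 instead.
    """
    if len(delays_ms) < 2:
        return 0.0  # jitter needs at least two samples
    diffs = [abs(later - earlier) for earlier, later in zip(delays_ms, delays_ms[1:])]
    return mean(diffs)


def throughput_mbps(bytes_transferred: int, seconds: float) -> float:
    """Data successfully transferred in an interval, in megabits per second."""
    return bytes_transferred * 8 / seconds / 1_000_000


if __name__ == "__main__":
    print(round(simple_jitter_ms([20.0, 22.0, 21.0, 25.0]), 2))  # 2.33
    print(throughput_mbps(75_000_000, 60))  # 10.0
```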

The "Big Idea"

[!IMPORTANT] Baselines are the yardsticks of the cloud. Without knowing what is "normal," you cannot identify what is "broken." A baseline transforms raw metrics into actionable intelligence by highlighting anomalies that signify security breaches, misconfigurations, or the need for increased capacity.

Formula / Concept Box

Concept	Metric / Requirement	Purpose
Utilization	$\frac{\text{Current Traffic}}{\text{Max Bandwidth}} \times 100$	Identify saturation points and bottlenecks.
Packet Loss	$\frac{\text{Packets Sent} - \text{Packets Received}}{\text{Packets Sent}}$	Measure link reliability and congestion.
Mirror Target	ENI, Network Load Balancer, or Gateway Load Balancer endpoint accepting VXLAN (UDP 4789)	Destination for the mirrored packet data.
CloudWatch Alarms	$\text{Metric} > \text{Threshold}$	Automate response to baseline deviations.
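The formulas in the box translate directly into code. This is a minimal sketch with hypothetical function names. It assumes both traffic values use the same units (here Mbps). It models only a single-datapoint alarm check. Real CloudWatch alarms evaluate a metric over one or more periods before changing state.

```python
def utilization_pct(current_traffic: float, max_bandwidth: float) -> float:
    """Utilization = current traffic / max bandwidth x 100 (same units, e.g. Mbps)."""
    if max_bandwidth <= 0:
        raise ValueError("max_bandwidth must be positive")
    return current_traffic / max_bandwidth * 100


def packet_loss_ratio(packets_sent: int, packets_received: int) -> float:
    """Packet loss = (sent - received) / sent, as a fraction from 0 to 1."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    return (packets_sent - packets_received) / packets_sent


def breaches_threshold(metric: float, threshold: float) -> bool:
    """Single-datapoint version of the alarm condition Metric > Threshold."""
    return metric > threshold


if __name__ == "__main__":
    util = utilization_pct(750, 1000)  # 750 Mbps on a 1 Gbps link
    loss = packet_loss_ratio(10_000, 9_950)
    print(util, loss)  # 75.0 0.005
    print(breaches_threshold(util, 80))  # False
```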

Hierarchical Outline

I. The Importance of Baselines
- Usage Tracking: Understanding usage patterns over time (daily, weekly, monthly).
- Anomaly Detection: Identifying metrics that exceed baseline ranges to trigger resolutions.
- Predictive Maintenance: Addressing issues before they become critical failures.
II. AWS Native Monitoring Tools
- Amazon CloudWatch: Collects NetworkIn/Out and NetworkPacketsIn/Out metrics.
- Transit Gateway Network Manager: Provides visibility into packet loss, latency, and global topology.
- Route 53 Resolver Logs: Monitors DNS query latency and resolution failure rates.
III. Deep Packet Inspection (DPI)
- VPC Traffic Mirroring: Copies packets from a source ENI, encapsulates them in VXLAN, and sends them to a target device.
- Analysis Tools: Using Wireshark for inspecting source/destination IPs and protocols.
- QoS Adjustments: Using findings to prioritize delay-sensitive traffic (e.g., Voice vs. Storage).
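One simple way to turn historical samples into a baseline range for anomaly detection is a mean ± k standard deviations band. This statistical technique is swapped in for illustration. It is not CloudWatch's own anomaly detection, which trains a model on the metric's history. The sample values, the choice of k = 3, and the function names are assumptions.

```python
from statistics import mean, stdev


def baseline_band(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Return (low, high) = mean -/+ k sample standard deviations of history."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def find_anomalies(history: list[float], current: list[float],
                   k: float = 3.0) -> list[tuple[int, float]]:
    """Return (index, value) for each current sample outside the baseline band."""
    low, high = baseline_band(history, k)
    return [(i, v) for i, v in enumerate(current) if v < low or v > high]


if __name__ == "__main__":
    # Hypothetical hourly NetworkOut totals in MB used as the baseline.
    history = [480, 510, 495, 505, 500, 490, 520]
    print(find_anomalies(history, [505, 5000, 470]))  # [(1, 5000)]
```

Widening k reduces false alarms but delays detection of gradual drift. Computing separate bands per hour of day or day of week captures the daily and weekly usage patterns mentioned above.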

Visual Anchors

[Figure: Traffic Mirroring Architecture]

[Figure: Visualizing Performance Spikes]

Definition-Example Pairs

VPC Flow Logs: Metadata capture of IP traffic flows.
- Example: Checking whether a security group or network ACL is dropping traffic by looking for REJECT records in the flow logs. Flow logs record the REJECT but not which of the two denied it.
Transit Gateway Network Manager: A centralized dashboard for global network health.
- Example: Visualizing a 50ms latency spike between a VPC in us-east-1 and an on-premises data center via Direct Connect.
Packet Shaping: Modifying the flow of data to optimize performance.
- Example: Applying Quality of Service (QoS) rules to ensure VoIP packets are processed before background database backups.
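The flow-log example above can be sketched as a parser. It reads records in the default (version 2) flow log format and counts REJECT records per destination IP and port. The sample records are illustrative, not real log data. Custom log formats with a different field order would need a different parser. Function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

# Field order of the default (version 2) VPC Flow Logs record format.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line: str) -> Optional[dict[str, str]]:
    """Split one space-separated record into named fields, or None if unusable."""
    parts = line.split()
    if len(parts) != len(DEFAULT_FIELDS):
        return None  # not in the default format
    record = dict(zip(DEFAULT_FIELDS, parts))
    if record["log_status"] != "OK":
        return None  # NODATA / SKIPDATA records carry no flow details
    return record


def rejects_by_destination(lines: Iterable[str]) -> Counter:
    """Count REJECT records per (destination IP, destination port)."""
    counts: Counter = Counter()
    for line in lines:
        record = parse_record(line)
        if record is not None and record["action"] == "REJECT":
            counts[(record["dstaddr"], record["dstport"])] += 1
    return counts


if __name__ == "__main__":
    sample = [  # illustrative records, not real log data
        "2 123456789012 eni-0abc 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
        "2 123456789012 eni-0abc 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
        "2 123456789012 eni-0abc 172.31.9.70 172.31.9.12 49762 3389 6 1 40 1418530010 1418530070 REJECT OK",
        "2 123456789012 eni-0abc - - - - - - - 1431280876 1431280934 - NODATA",
    ]
    print(rejects_by_destination(sample))  # Counter({('172.31.9.12', '3389'): 2})
```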

Worked Examples

Example 1: Calculating Baseline Deviation

Scenario: An EC2 instance usually has a NetworkOut average of 500 MB/hour. Suddenly, CloudWatch reports 5 GB/hour.

Identify Baseline: 500 MB/hour.
Compare Current: 5000 MB/hour.
Calculation: $5000 / 500 = 10$, so the current load is $10\times$ the baseline (a 900% increase).
Action: Investigate for data exfiltration or a misconfigured backup job.
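The deviation arithmetic in Example 1 fits in a small function. This sketch assumes the baseline and current values are in the same units. CloudWatch reports EC2 NetworkOut in bytes, so the sketch first converts an hourly Sum to decimal megabytes. Function names are hypothetical.

```python
BYTES_PER_MB = 1_000_000  # decimal megabytes, so 5 GB = 5000 MB as in the example


def hourly_mb(sum_bytes_over_hour: float) -> float:
    """Convert an hourly Sum of the CloudWatch NetworkOut metric (bytes) to MB."""
    return sum_bytes_over_hour / BYTES_PER_MB


def deviation(baseline: float, current: float) -> tuple[float, float]:
    """Return (multiple of baseline, percent change from baseline)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    multiple = current / baseline
    pct_change = (current - baseline) / baseline * 100
    return multiple, pct_change


if __name__ == "__main__":
    baseline_mb = 500.0
    current_mb = hourly_mb(5_000_000_000)  # 5 GB reported for the hour
    print(deviation(baseline_mb, current_mb))  # (10.0, 900.0)
```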

Example 2: Configuring Traffic Mirroring

Scenario: You need to inspect packets for an application that is intermittently dropping connections.

Create Target: Deploy an EC2 instance with an ENI in the same VPC, and allow inbound UDP port 4789 (VXLAN) in its security group.
Create Filter: Define a filter for the specific port and protocol used by the app.
Create Session: Map the Source ENI to the Target ENI using the filter.
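Capture: Run tcpdump or Wireshark on the target instance. Mirrored packets arrive VXLAN-encapsulated on UDP port 4789 (for example, tcpdump -i eth0 udp port 4789; the interface name varies by OS). Wireshark decodes the VXLAN header to expose the original packets and payloads.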
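The setup steps in Example 2 map onto EC2 Traffic Mirroring API actions: CreateTrafficMirrorTarget, CreateTrafficMirrorFilter, CreateTrafficMirrorFilterRule, and CreateTrafficMirrorSession. This sketch only builds the request parameters, so it runs without AWS credentials. The helper names, the ENI, filter, and target IDs, and port 8443 are hypothetical placeholders. It assumes a TCP application and mirrors both directions: inbound requests to the app port and outbound responses from it.

```python
from typing import Any

# Hypothetical helpers: each returns keyword arguments for one EC2 Traffic
# Mirroring API call. No AWS request is sent; all IDs are placeholders.


def target_request(target_eni_id: str) -> dict[str, Any]:
    """Arguments for CreateTrafficMirrorTarget (the analysis instance's ENI)."""
    return {"NetworkInterfaceId": target_eni_id, "Description": "packet analysis target"}


def filter_request(app_port: int) -> dict[str, Any]:
    """Arguments for CreateTrafficMirrorFilter; rules are added separately."""
    return {"Description": f"mirror TCP port {app_port}"}


def filter_rule_requests(filter_id: str, app_port: int,
                         protocol: int = 6) -> list[dict[str, Any]]:
    """Arguments for CreateTrafficMirrorFilterRule: one accept rule per direction."""
    port_range = {"FromPort": app_port, "ToPort": app_port}
    common = {
        "TrafficMirrorFilterId": filter_id,
        "RuleNumber": 100,  # rule numbers need only be unique within a direction
        "RuleAction": "accept",
        "Protocol": protocol,  # IANA protocol number; 6 = TCP
        "SourceCidrBlock": "0.0.0.0/0",
        "DestinationCidrBlock": "0.0.0.0/0",
    }
    inbound = {**common, "TrafficDirection": "ingress", "DestinationPortRange": port_range}
    outbound = {**common, "TrafficDirection": "egress", "SourcePortRange": port_range}
    return [inbound, outbound]


def session_request(source_eni_id: str, target_id: str, filter_id: str,
                    session_number: int = 1) -> dict[str, Any]:
    """Arguments for CreateTrafficMirrorSession linking source, filter, and target."""
    return {
        "NetworkInterfaceId": source_eni_id,
        "TrafficMirrorTargetId": target_id,
        "TrafficMirrorFilterId": filter_id,
        "SessionNumber": session_number,  # lower numbers are evaluated first on the source ENI
    }


if __name__ == "__main__":
    for rule in filter_rule_requests("tmf-placeholder", 8443):
        print(rule["TrafficDirection"], rule.get("DestinationPortRange") or rule.get("SourcePortRange"))
```

With boto3, each dictionary could be passed to the matching EC2 client method, for example ec2.create_traffic_mirror_target(**target_request("eni-placeholder")). The new IDs are then read from the responses (response["TrafficMirrorTarget"]["TrafficMirrorTargetId"] and response["TrafficMirrorFilter"]["TrafficMirrorFilterId"]) before creating the rules and the session.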

Checkpoint Questions

What is the main difference between VPC Flow Logs and VPC Traffic Mirroring?
Why doesn't the Traffic Mirroring target need to be in promiscuous mode, and what must its security group allow?
Which tool would you use to map the global topology of your AWS Transit Gateways?
If an application is latency-sensitive, which metric in Transit Gateway Network Manager is most critical?

Muddy Points & Cross-Refs

Flow Logs vs. Mirroring: Flow logs are inexpensive and capture only metadata (IPs, ports, protocol, byte counts, action). Traffic Mirroring costs more and is more complex to operate, but it captures the actual packet contents.
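Promiscuous Mode: Many students assume the Traffic Mirroring target must run in promiscuous mode. It does not. Mirrored packets are encapsulated in VXLAN and addressed to the target's own IP, so an ordinary interface receives them. What the target does need is a security group rule allowing inbound UDP port 4789.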
Cross-Refs: See Chapter 6: Security for using Flow Logs in threat detection, and Unit 1: Design for implementing Direct Connect.

Comparison Tables

Feature	VPC Flow Logs	VPC Traffic Mirroring	CloudWatch Metrics
Data Type	Metadata (Flows)	Full Packet (Payload)	Aggregated Metrics
Granularity	1 min or 10 min aggregation interval	Per packet (near real-time)	5 min (basic) / 1 min (detailed) for EC2
Use Case	Security / Connectivity	Deep Troubleshooting	Capacity Planning
Cost	Low	High	Medium
Analysis Tool	CloudWatch Logs Insights	Wireshark / Suricata	CloudWatch Dashboards