TCP Retransmissions: What They Mean and How to Find the Cause
A working guide to TCP retransmissions: what counts as a retransmit, why they happen (loss, congestion, MTU), how to detect them in a pcap, and which Linux commands point to the actual cause.
Every network engineer has opened a packet capture, spotted the orange "TCP Retransmission" rows in Wireshark, and thought "so something is lossy, but what?". Retransmissions are the single most useful signal a pcap gives you, because each one means the sender believed a segment was lost somewhere between two machines. This post walks through what a retransmit actually is, the four situations that cause them, and a short investigation playbook you can run the next time a connection feels slow.
What TCP retransmission actually means
TCP promises reliable delivery on top of an unreliable network. When the sender sends a segment, it starts a timer. If the receiver's ACK does not come back before the timer fires, the sender assumes the segment was lost and sends it again. That resend is a retransmission.
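That timer is adaptive: the sender smooths its RTT measurements and sets the timeout a few deviations above them. A minimal sketch of one update step of the standard RTO calculation from RFC 6298 (the 1-second floor is the RFC's recommendation; Linux in practice uses a 200 ms minimum):

```python
def rto_update(srtt, rttvar, rtt_sample, alpha=1/8, beta=1/4, min_rto=1.0):
    """One RFC 6298 update step. Returns (srtt, rttvar, rto) in seconds.

    srtt/rttvar are the smoothed RTT and RTT variance from the previous
    step, or None before the first measurement.
    """
    if srtt is None:
        # First RTT measurement: seed the estimators.
        srtt = rtt_sample
        rttvar = rtt_sample / 2
    else:
        # Variance first (uses the old srtt), then the smoothed RTT.
        rttvar = (1 - beta) * rttvar + beta * abs(srtt - rtt_sample)
        srtt = (1 - alpha) * srtt + alpha * rtt_sample
    rto = max(min_rto, srtt + 4 * rttvar)
    return srtt, rttvar, rto
```

If the RTT suddenly grows past srtt + 4 * rttvar, the timer fires even though nothing was lost, which is exactly the spurious-retransmit case described below.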
From the wire, a retransmission looks like a segment with the same sequence number and same payload length as one sent earlier in the same direction. That is exactly the definition our PCAP analyzer uses in its TCP quality table. Wireshark uses the same heuristic, plus a few extra rules for fast retransmit and spurious retransmit.
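That heuristic is easy to sketch. Assuming you have already extracted (sequence number, payload length) pairs for one direction of one flow, a minimal Python version of the check:

```python
def count_retransmits(segments):
    """Count retransmissions in one direction of one TCP flow.

    segments: list of (seq, payload_len) tuples in capture order.
    A retransmission is a data segment whose (seq, len) pair was
    already seen earlier in the same direction. Zero-length segments
    (pure ACKs) are skipped: a repeated ACK is not a retransmit.
    """
    seen = set()
    retx = 0
    for seq, length in segments:
        key = (seq, length)
        if length > 0 and key in seen:
            retx += 1
        seen.add(key)
    return retx
```

Real analyzers layer extra rules on top of this (fast retransmit, spurious retransmit, sequence wraparound), but the core test is exactly this set lookup.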
As a rule of thumb: a healthy network stays below 0.1 percent retransmits, and a stressed but working network sits between 0.1 and 1 percent. Between 1 and 3 percent, expect intermittent, hard-to-reproduce slowness. Anything over 3 percent sustained is a real problem that will show up as user-visible slowness.
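As a rough classifier (the "degraded" label for the 1 to 3 percent gray zone is my own shorthand, not a standard term):

```python
def classify_retx(pct):
    """Map a flow's retransmit percentage to a rough severity band."""
    if pct < 0.1:
        return "healthy"
    if pct <= 1.0:
        return "stressed but working"
    if pct <= 3.0:
        return "degraded"
    return "problem"
```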
The four things that cause retransmissions
Packet loss on the path. A router in the middle dropped the packet. This is the most common cause on the public internet, especially over wireless or congested links. Loss usually comes in bursts during congestion, not a steady trickle.
Buffer overflow at the receiver. If the receiving host is too slow to read from its socket, the kernel buffer fills up. The receiver's TCP window shrinks toward zero, and eventually segments get dropped. Look for zero-window or very small window values in the pcap.
MTU or MSS mismatch. A middle device cannot fragment the packet and the Don't Fragment bit is set, so it sends an ICMP "fragmentation needed" back. If that ICMP is filtered (very common on corporate firewalls) the sender never learns, and it retransmits the too-large segment forever. This is the classic "works for some sites, hangs for others" problem.
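The usual fix is MSS clamping: advertise an MSS small enough that every segment fits through the smallest link on the path. The arithmetic, for IPv4 with no IP or TCP options:

```python
def clamp_mss(path_mtu, ip_header=20, tcp_header=20):
    """Largest TCP payload per segment that fits in path_mtu.

    Assumes IPv4 with no IP or TCP options; with options (or IPv6,
    whose base header is 40 bytes), subtract more.
    """
    return path_mtu - ip_header - tcp_header

# A plain 1500-byte Ethernet path gives an MSS of 1460; a PPPoE link
# with a 1492-byte MTU needs the MSS clamped to 1452.
```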
Aggressive timeouts. Less a cause than a false positive: the sender's retransmit timer fired too early because the path RTT grew. The original packet was not lost, just slow. TCP has adaptive timers, but under sudden latency spikes you still see spurious retransmits.
How to detect them in a capture
1. Take a short capture at the host that is suffering, not at the far end. Loss shows up differently depending on where you are on the path. On Linux: sudo tcpdump -i any -w /tmp/cap.pcap -s 200 port 443 and host api.example.com. A 200-byte snaplen captures all the TCP headers without the payloads.
2. Keep the capture small. Two to three minutes during the slow period, not an hour. The default PCAPNG format from modern tcpdump is fine.
3. Upload to the analyzer. The PCAP analyzer reads both classic PCAP and PCAPNG natively and shows a TCP quality table. Each row is one one-way flow with its retransmit count and percent. A flow with 5 percent retransmits is the one to investigate.
4. Confirm with Wireshark. Open the same file in Wireshark and apply the display filter tcp.analysis.retransmission. The packets that match should be the same ones counted in the quality table.
How to find the actual cause
Once you know *which* flow is lossy, the cause is rarely in the capture itself. You have to look at the path and the hosts.
Step 1: Is it loss, or window starvation? In the capture, check the ACK packets from the receiver. If the advertised window stays large (tens of kilobytes) even during retransmits, it is path loss. If the window shrinks toward zero just before retransmits, the receiver is slow. Window starvation is an application problem (a slow reader) or an undersized receive buffer, not a network issue.
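A sketch of that check, assuming you have exported the receiver's advertised window from the ACKs just before each retransmit event (the 8 KiB "starved" cutoff is an arbitrary illustration, not a standard threshold):

```python
def diagnose_windows(windows, starved_bytes=8192):
    """Classify retransmits as path loss vs. a slow receiver.

    windows: advertised receive-window sizes (in bytes) taken from the
    ACKs immediately preceding each retransmit event. If most of them
    are tiny, the receiver was starving the sender; if they stay large,
    the network dropped the segments.
    """
    starved = sum(1 for w in windows if w < starved_bytes)
    return "slow receiver" if starved > len(windows) / 2 else "path loss"
```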
Step 2: Where on the path is the loss? Run mtr --report --report-cycles 100 api.example.com from the affected host. mtr sends probes and shows per-hop loss. If loss is 0 percent at hops 1 through 8 and 7 percent at hop 9, hop 9 is the suspect. Caveat: many routers deprioritize ICMP, so mtr loss at an intermediate hop does not always match TCP loss. But consistent loss at hop N and every hop after it is a real signal.
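The "loss at hop N and every hop after it" rule can be mechanized. A sketch, given mtr's per-hop loss percentages:

```python
def suspect_hop(loss_pct):
    """Find the first hop where real path loss starts.

    loss_pct: per-hop loss percentages from mtr, hop 1 first.
    Real loss appears at some hop and persists at every hop after it;
    isolated loss at a single intermediate hop is usually just a router
    deprioritizing ICMP responses. Returns a 1-based hop number, or
    None if no consistent loss pattern is found.
    """
    for i, loss in enumerate(loss_pct):
        if loss > 0 and all(later > 0 for later in loss_pct[i:]):
            return i + 1
    return None
```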
Step 3: Is it an MTU problem? Try ping -M do -s 1472 api.example.com on Linux (-M do sets the Don't Fragment bit; 1472 bytes of payload plus the 8-byte ICMP header and the 20-byte IP header adds up to the 1500-byte Ethernet MTU). If 1472 succeeds but 1473 fails, your path MTU is 1500. If 1472 also fails, the path MTU is smaller and ICMP fragmentation-needed is probably being filtered. Lower the MSS on the host or firewall.
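When the 1472-byte ping fails, you can binary-search the actual path MTU. A sketch with a stubbed probe function standing in for ping -M do (a real probe would shell out to ping with the right -s value and check the exit status):

```python
def find_path_mtu(probe, lo=576, hi=1500):
    """Binary-search the largest DF packet size that survives the path.

    probe(size) -> bool: True if a Don't-Fragment packet of `size`
    total bytes gets through. Assumes probe(lo) succeeds (576 is the
    old IPv4 minimum-reassembly floor, a safe lower bound in practice).
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias up so the loop terminates
        if probe(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

Knowing the real path MTU tells you exactly what MSS to clamp to on the host or firewall.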
Step 4: Is it a physical layer problem? On Linux, run ip -s link show on both ends of the lossy path. The RX errors, dropped, and overrun counters should be zero. Non-zero error counts on a local interface are usually a bad cable, a mismatched duplex setting, or a failing NIC. Non-zero drops are usually qdisc backpressure under heavy load.
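The same triage, sketched with the counters as a dict (the counter names match what ip -s link and /sys/class/net/<iface>/statistics expose):

```python
def flag_interface(stats):
    """Triage interface counters from `ip -s link`.

    stats: counter name -> value, e.g. {"rx_errors": 0, "rx_dropped": 3}.
    Errors point at layer 1/2 (cable, duplex, NIC); drops point at
    queueing backpressure under load.
    """
    if stats.get("rx_errors", 0) or stats.get("tx_errors", 0):
        return "physical problem"
    if stats.get("rx_dropped", 0) or stats.get("tx_dropped", 0):
        return "load / backpressure"
    return "clean"
```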
Step 5: Is it load? If retransmits correlate with traffic spikes, the path is saturated during those spikes. Fixes: lower traffic, upgrade the link, or use QoS / BBR congestion control to handle saturation more gracefully. The BBR algorithm (available on most modern Linux kernels with sysctl net.ipv4.tcp_congestion_control=bbr) handles bursty loss much better than the default CUBIC.
A 30-second triage you can actually remember
Capture 2 minutes on the sending host. Upload to the PCAP analyzer. Pick the flow with the highest retransmit percent. Check if the receiver's window stays open (loss) or collapses (slow receiver). Run mtr to the peer. If mtr shows loss at a specific hop, that is your answer. If mtr is clean, suspect MTU or interface errors.
That covers the vast majority of real-world cases. Fancy causes (middlebox ECN mangling, TCP offload bugs, faulty hardware) exist but are rare, and they usually show up as non-obvious patterns in the capture that an analyzer will flag as anomalies you can investigate further.
The point is: retransmissions are not random. They always mean something specific. A capture plus mtr plus interface counters tells you which one in about five minutes.