sysctl / modules / load order / nf_conntrack
Recently we ran into an issue where connections were unexpectedly aborted. Connections from a NAT-ed client (a Kubernetes pod) to a server would suddenly receive an old packet (judging by its sequence number) in the middle of the data. This triggered the Linux NAT box to issue a reset packet (RST). Setting the kernel flag that mitigates this behaviour required some knowledge of module load order during boot.
Spurious retransmits causing connection teardown
To start off: we observed that traffic from a pod to a server got
disconnected. We enabled debug logging on the Kubernetes host where
the pod resides. After running
modprobe nf_log_ipv4 and setting
net.netfilter.nf_conntrack_log_invalid=255, we saw this:
kernel: nf_ct_proto_6: SEQ is under the lower bound (already ACKed data retransmitted) IN= OUT= SRC=10.x.x.x DST=10.x.x.x LEN=1480 TOS=0x00 PREC=0x00 TTL=61 ID=53534 DF PROTO=TCP SPT=6379 DPT=26110 SEQ=4213094653 ACK=3402842193 WINDOW=509 RES=0x00 ACK PSH URGP=0 OPT (0101080A084C76F30D12DCAA)
In the middle of a sequence of several data packets from the server,
an apparently unrelated packet showed up. It carried data, though not
data intended for this stream, and it had the same source/destination
tuple but a sequence number that was more than 80K too low. (Wireshark
flags this packet as invalid with a
TCP Spurious Retransmission message.)
This triggered a reset (RST) by the Linux connection tracking module. And that in turn caused (unexpected) RSTs from the server.
POD <-> NAT <-> SRV
--------------------
<-- TCP seq 2000000 ack 5555 len 1400
<-- TCP seq 2000000 ack 5555 len 1400
<-- TCP seq 1200000 ack 5555 len 1234 (seq is _way_ off)
--> TCP RST seq 5555 len 0
<-- TCP seq 2001400 ack 5555 len 1000
<-- TCP seq 2001400 ack 5555 len 1000
(Made up numbers in the above table, but they illustrate the problem.)
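To make "under the lower bound" concrete, here is a minimal sketch of a wraparound-aware sequence number comparison, applied to the made-up numbers from the table. It is a deliberately simplified illustration, not the kernel's actual check, which also takes segment length and window sizes into account. The helper name seq_before and the lower_bound value are assumptions for this example.

```shell
#!/bin/sh
# seq_before A B: succeed if 32-bit sequence number A lies before B,
# taking wraparound at 2^32 into account. Hypothetical helper name.
seq_before() {
    delta=$(( ($1 - $2) & 0xFFFFFFFF ))
    # A difference in the upper half of the 32-bit space means A is behind B.
    [ "$delta" -ge 2147483648 ]
}

lower_bound=2000000   # made-up: data before this point was already ACKed

for seq in 2000000 1200000 2001400; do
    if seq_before "$seq" "$lower_bound"; then
        echo "seq $seq: under the lower bound"
    else
        echo "seq $seq: acceptable"
    fi
done
```

Only the packet with sequence number 1200000 is reported as under the lower bound; that is the one that provoked the RST.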
At this point, the non-rejected traffic still got forwarded back to the pod. However, the pod's ACKs back to the server were now rejected by the server with an RST of its own: as far as that end was concerned, the connection had already been torn down.
kernel: nf_ct_proto_6: invalid rst IN= OUT= SRC=10.x.x.x DST=10.x.x.x LEN=40 TOS=0x00 PREC=0x00 TTL=61 ID=0 DF PROTO=TCP SPT=6379 DPT=26110 SEQ=4213164625 ACK=0 WINDOW=0 RES=0x00 RST URGP=0
The next packet (sequence 2001400 in the above example) was fine, though. So if we could convince the Linux kernel to ignore the packet with the unexpected sequence number, our connections might survive.
Luckily there is such a flag: net.netfilter.nf_conntrack_tcp_be_liberal.
While this does not explain the root cause, setting said flag mitigates the problem. It makes the kernel ignore all spurious retransmits instead of treating them as invalid.
So, we placed
net.netfilter.nf_conntrack_tcp_be_liberal = 1 in
/etc/sysctl.conf and assumed the symptoms would be gone.
... or so we thought. Because after a reboot, the flag was unset again.
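One way to confirm the state of the flag after a reboot is to read it back from /proc/sys, where a dotted sysctl key like this one maps to a file path with the dots replaced by slashes. The following is a minimal sketch; the helper names sysctl_path and sysctl_state and the PROC_SYS override are made up for this example and are not part of any tool.

```shell
#!/bin/sh
# Hypothetical helpers. PROC_SYS can be overridden, e.g. for testing.
PROC_SYS=${PROC_SYS:-/proc/sys}

# Map a dotted sysctl key to its file under /proc/sys.
sysctl_path() {
    printf '%s/%s\n' "$PROC_SYS" "$(printf '%s' "$1" | tr . /)"
}

# Print the current value of the key, or "missing" if the file
# for that key does not exist (or is not readable).
sysctl_state() {
    path=$(sysctl_path "$1")
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo missing
    fi
}

sysctl_state net.netfilter.nf_conntrack_tcp_be_liberal
```

A result of 1 means the flag is set, 0 means it is unset, and "missing" means the key does not exist at all at that moment.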
sysctl.conf not picked up?
That's odd. Setting kernel parameters during boot should be done in
/etc/sysctl.conf (or sysctl.d). Why did it not get picked up?
The cause turned out to be this: this particular setting is not
built-in. It belongs to a module: nf_conntrack. And that
module is not necessarily loaded before sysctl settings are applied.
As long as the module is not loaded, the key does not exist and the
setting cannot be applied.
nf_conntrack was loaded on demand, and not in a particular
well-defined order. Luckily, the ordering of modules loaded through
/etc/modules-load.d is well defined, as you can see:
# systemd-analyze critical-chain systemd-sysctl.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

systemd-sysctl.service +44ms
└─systemd-modules-load.service @314ms +90ms
  └─systemd-journald.socket @242ms
    └─system.slice @220ms
      └─-.slice @220ms
Indeed, sysctl settings are applied after systemd-modules-load.service:
# systemctl show systemd-sysctl.service | grep ^After
After=systemd-journald.socket system.slice systemd-modules-load.service
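The same check can be scripted, for instance to assert the ordering on other hosts. This is a minimal sketch that parses the After= line in the format shown above; the function name ordered_after is an assumption for this example.

```shell
#!/bin/sh
# ordered_after DEP: read `systemctl show UNIT` output on stdin and
# succeed if DEP is listed in the unit's After= property.
# Hypothetical helper name.
ordered_after() {
    awk -F= -v dep="$1" '
        $1 == "After" {
            # The value is a space-separated list of unit names.
            n = split($2, units, " ")
            for (i = 1; i <= n; i++)
                if (units[i] == dep) found = 1
        }
        END { exit !found }
    '
}
```

Usage: systemctl show systemd-sysctl.service | ordered_after systemd-modules-load.service && echo ordered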
So, we can use the systemd-modules-load.service to ensure that the conntrack module is loaded before we attempt to set its parameters:
# cat /etc/modules-load.d/modules.conf
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
nf_conntrack
And that works. Now all settings are properly set during boot.
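For a quick sanity check after a reboot, something along these lines could confirm that the module is listed for early loading and is actually loaded. It is a rough sketch: the helper names module_listed and module_loaded are made up, and the optional directory arguments exist only so the functions can be pointed at a test directory.

```shell
#!/bin/sh
# Hypothetical helpers; the directory arguments default to the real
# locations but can be pointed elsewhere for testing.

# Succeed if MODULE appears on a line of its own (comment lines do
# not match) in any *.conf file in DIR.
module_listed() {
    module=$1
    dir=${2:-/etc/modules-load.d}
    for f in "$dir"/*.conf; do
        [ -f "$f" ] || continue
        if grep -qx "[[:space:]]*$module[[:space:]]*" "$f"; then
            return 0
        fi
    done
    return 1
}

# Succeed if MODULE has an entry under /sys/module, which is the case
# for loaded modules (some built-in modules appear there as well).
module_loaded() {
    [ -d "${2:-/sys/module}/$1" ]
}

for check in module_listed module_loaded; do
    if "$check" nf_conntrack; then
        echo "$check: yes"
    else
        echo "$check: no"
    fi
done
```

If both checks report yes, reading back net.netfilter.nf_conntrack_tcp_be_liberal should show the value from /etc/sysctl.conf.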
As for the spurious retransmissions: the last word has not yet been said on that...