sysctl / modules / load order / nf_conntrack

sysctl / modules / load order / nf_conntrack

  • Written by
    Walter Doekes
  • Published on

Recently we ran into an issue where connections were unexpectedly aborted. Connections from a NAT-ed client (a K8S pod) to a server would suddently get an old packet (according to the sequence number) in the middle of the data. This triggered the Linux NAT-box to issue a reset packet (RST). Setting the kernel flag to mitigate this behaviour required some knowledge of module load order during boot.

Spurious retransmits causing connection teardown

To start off: we observed that traffic from a pod to a server got disconnected. We enabled debug logging on the Kubernetes host where the pod resides. After enabling modprobe nf_log_ipv4 and net.netfilter.nf_conntrack_log_invalid=255, we saw this:

kernel: nf_ct_proto_6: SEQ is under the lower bound (already ACKed data retransmitted)
  IN= OUT= SRC=10.x.x.x DST=10.x.x.x LEN=1480 TOS=0x00 PREC=0x00 TTL=61 ID=53534 DF
  PROTO=TCP SPT=6379 DPT=26110 SEQ=4213094653 ACK=3402842193 WINDOW=509 RES=0x00
  ACK PSH URGP=0 OPT (0101080A084C76F30D12DCAA)

In the middle of a sequence of several packets of data from the server, an apparently unrelated packet — it had data, but not intended for this stream — but with the same source/destination tuples and yet a sequence number that was more than 80K too low. (Wireshark flags this packet as invalid with a TCP Spurious Retransmission message.)

This triggered a reset (RST) by the Linux connection tracking module. And that in turned caused (unexpected) RSTs from the server.

POD <-> NAT <-> SRV
--------------------
            <-- TCP seq 2000000 ack 5555 len 1400
    <-- TCP seq 2000000 ack 5555 len 1400

            <-- TCP seq 1200000 ack 5555 len 1234 (seq is _way_ off)
            --> TCP RST seq 5555 len 0

            <-- TCP seq 2001400 ack 5555 len 1000
    <-- TCP seq 2001400 ack 5555 len 1000

(Made up numbers in the above table, but they illustrate the problem.)

At this point, the non-rejected traffic still got forwarded back to the pod. Its ACKs back to the server were now however rejected by the server with an RST of its own — that end of the connection thinks it was tore down already after all.

kernel: nf_ct_proto_6: invalid rst
  IN= OUT= SRC=10.x.x.x DST=10.x.x.x LEN=40 TOS=0x00 PREC=0x00
  TTL=61 ID=0 DF PROTO=TCP SPT=6379 DPT=26110 SEQ=4213164625
  ACK=0 WINDOW=0 RES=0x00 RST URGP=0

The next packet (sequence 2001400 in the above example), was fine though. So if we could convince the Linux kernel to ignore the packet with the unexpected sequence number, our connections might survive.

Luckily there is such a flag: net.netfilter.nf_conntrack_tcp_be_liberal=1

While this does not explain the root cause, setting said flag mitigates the problem. It makes the the kernel ignore all spurious retransmits.

So, we placed net.netfilter.nf_conntrack_tcp_be_liberal = 1 in /etc/sysctl.conf and assumed the symptoms would be gone.

... or so we thought. Because after a reboot, the flag was unset again.

sysctl.conf not picked up?

That's odd. Setting kernel parameters during boot should be done in sysctl.conf (or sysctl.d). Why did it not get picked up?

The cause turned out to be this: this particular setting is not built-in. It belongs to a module; the nf_conntrack module. And that module is not necessarily loaded before sysctl settings are applied.

nf_conntrack was loaded on demand, and not in a particular well-defined order. Luckily, loading modules through /etc/modules-load.d is well defined, as you can see:

# systemd-analyze critical-chain systemd-sysctl.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

systemd-sysctl.service +44ms
└─systemd-modules-load.service @314ms +90ms
  └─systemd-journald.socket @242ms
    └─system.slice @220ms
      └─-.slice @220ms

Indeed, it sysctl settings are applied after systemd-modules-load.service:

# systemctl show systemd-sysctl.service | grep ^After
After=systemd-journald.socket system.slice systemd-modules-load.service

So, we can use the systemd-modules-load.service to ensure that the conntrack module is loaded before we attempt to set its parameters:

# cat /etc/modules-load.d/modules.conf
# /etc/modules: kernel modules to load at boot time.
#
# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with "#" are ignored.
nf_conntrack

And that works. Now all settings are properly set during boot.

As for the spurious retransmissions: the last word has not yet been said on that...


Back to overview Newer post: django 1.8 / python 3.10 Older post: recap 2022 - part 2 / stable diffusion