containerd / kubernetes / open file limits

Written by Walter Doekes

After upgrading a customer's Kubernetes cluster running on Ubuntu/Jammy, we ran into a snag. The vernemq instances wouldn't start completely; instead, they gave us EMFILE errors: Too many open files. This came as a surprise. After all, we hadn't touched any limits.

But it turned out that containerd v2.0 changed the LimitNOFILE setting from a very permissive infinity to the systemd default. The result? Processes inside Kubernetes get a measly soft limit of 1024 file descriptors by default.

containerd upgrades in Ubuntu

Normally Ubuntu is super conservative when migrating versions: once a stable LTS is released, you need to move heaven and earth to get a patch in. This time however, they bumped the containerd package from 1.7 to 2.2 within Jammy (22.04) and Noble (24.04) — in fact, Jammy has even seen two version bumps, as it started with version 1.5.

For such a big version jump, you might expect breaking changes. Canonical did not, or at least didn't foresee any impact.

LimitNOFILE and systemd

For processes started by systemd, systemd sets the nofile ulimit to the value configured in the application's service file. For example:

[Service]
# infinity: the value of /proc/sys/fs/nr_open, for both soft and hard
LimitNOFILE=infinity

Or the default:

[Service]
# 1024 soft, 524288 hard
LimitNOFILE=1024:524288

The systemd.exec manual recommends leaving the soft limit at 1024, because old programs still use select(2), which cannot handle file descriptor numbers of 1024 and above.
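To see which values a process actually ended up with, you can ask the kernel directly. The following is a minimal sketch, assuming Linux and Python 3; the function name describe_nofile_limits is made up for this example.

```python
import resource


def describe_nofile_limits():
    """Return the soft/hard RLIMIT_NOFILE of this process and fs.nr_open."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        with open("/proc/sys/fs/nr_open") as fp:
            nr_open = int(fp.read())
    except OSError:
        nr_open = None  # not Linux, or /proc is not mounted
    return {"soft": soft, "hard": hard, "nr_open": nr_open}


if __name__ == "__main__":
    limits = describe_nofile_limits()
    print("soft={soft} hard={hard} nr_open={nr_open}".format(**limits))
```

Run inside a container, this shows what the workload inherited, which is not necessarily what the containerd daemon itself is using.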

containerd ulimit history

At containerd they have had a hard time deciding what an appropriate value is:

Year  Version  Limit                 Commit   PR
2017  v1.0     LimitNOFILE=1048576   b009642  #1846
2018  v1.2     LimitNOFILE=infinity  4972e3f  #2601
2019  v1.3     LimitNOFILE=1048576   1a1f8f1  #3202
2020  v1.5     LimitNOFILE=infinity  c691c36  #4475
2023  v2.0     # unset               3ca39ef  #8924

The maximum number of open file descriptors available to a process has alternated between 1048576:1048576 and infinity, which in practice meant the same thing (except on pre-2019 systemd, where infinity meant 65536:65536). Depending on your systemd version and the containerd defaults, you got somewhere between 65536 and 1048576 as both the soft and the hard limit.

But since the 2023 change, released in containerd 2.0, you get the systemd default: 1024 soft and 524288 hard. The earlier adjustments had no practical effect, but this last change did. 65536 is enough in most cases, but 1024 definitely isn't.

Impact on workload

If your containerd instances are starting your Kubernetes containers you may suddenly notice that you're running out of file descriptors. Example:

!!!!
!!!! WARNING: ulimit -n is 1024; 65536 is the recommended minimum.
!!!!
Exec:  /vernemq/bin/../erts-11.1.8/bin/erlexec -boot /vernemq/bin/../releases/1.13.0/vernemq
...
11:27:23.787 [error] File operation error: emfile. Target: /vernemq/bin/../lib/mongodb-3.4.4/ebin/vmq_ql_query.beam. Function: get_file. Process: code_server.

For some processes, like this old vernemq here, this is fatal. For other processes, this results in degraded performance when cache files or database tables have to be closed more quickly than strictly necessary.

The systemd default value of 1024 fixes a real problem — but only for very old software. For Kubernetes it creates a problem to which few applications have an answer.

Solutions

The best solution is for every application to assess its file descriptor needs beforehand and raise its soft limit to an appropriate value. However, not all applications actually do this.

Programmatically, one has to call setrlimit(RLIMIT_NOFILE, ...). Or, you could start your application from a shell and call ulimit first. For instance, for this old vernemq statefulset I had to extract the ENTRYPOINT from the image and then manually set this in the container spec:

command: ["/bin/sh"]
args: ["-c", "ulimit -n 131072; exec /usr/sbin/start_vernemq"]

(Setting ulimit -n like that would typically be done from a Dockerfile entrypoint shell script because Kubernetes does not provide any means to set it from a spec.)
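For applications you control, the setrlimit(RLIMIT_NOFILE, ...) route mentioned above could look roughly like this. It is a sketch, not taken from any of the applications discussed here: it assumes Linux and Python 3, the function name raise_nofile_soft_limit is made up, and the default of 131072 is simply the value used in the ulimit example.

```python
import resource


def raise_nofile_soft_limit(wanted=131072):
    """Raise the soft RLIMIT_NOFILE to `wanted`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)  # unprivileged processes cannot exceed hard
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough; never lower the limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted


if __name__ == "__main__":
    print("soft nofile limit is now", raise_nofile_soft_limit())
```

Raising the soft limit up to the hard limit needs no privileges, which is why leaving the hard limit at 524288 still gives well-behaved applications room to help themselves.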

Alternatively, we could set LimitNOFILE=infinity on the containerd daemon ourselves via a systemd drop-in, or at least raise it above 1024.

The question: do we change every application, or do we change the defaults? And to what?

Checking current workload

Before deploying containerd 2.2 everywhere, we'll have to decide what to do. And because we have more than one Kubernetes cluster to examine for this particular issue, we'll use a script to get the details: find_ulimit_nofile.py

The script checks running processes, counts open file descriptors and reports the processes if (a) the soft limit appears unchanged, and (b) the open file descriptor count is getting close to or over 1024.

When run on a few containerd 1.7 nodes, we get output like:

# python3 ./find_ulimit_nofile.py
    PID    FDs        SOFT  COMM            EXE
 867893  16831     1048576  beam.smp        /vernemq/erts-11.1.8/bin/beam.smp
1843653   3064     1048576  argocd-applicat /usr/local/bin/argocd
4080120   2920     1048576  java            /usr/lib/jvm/java-17-openjdk-17.0.11.0.9-2.el8.x86_64/bin/java
   2391   1881     1048576  containerd      /usr/bin/containerd
2610690   1157     1048576  mysqld          /opt/bitnami/mariadb/sbin/mariadbd
2703298   1078     1048576  redis-server    /usr/local/bin/redis-server
1046409    990     1048576  postgres        /usr/lib/postgresql/16/bin/postgres
1046484    989     1048576  postgres        /usr/lib/postgresql/16/bin/postgres
1771834    972     1048576  dd-ipc-helper   /memfd:spawn_worker_trampoline (deleted)
1771390    834     1048576  dd-ipc-helper   /memfd:spawn_worker_trampoline (deleted)
1975742    777     1048576  nginx           /usr/local/nginx/sbin/nginx
1906067    777     1048576  nginx           /usr/local/nginx/sbin/nginx
 767917    771     1048576  java            /usr/share/elasticsearch/jdk/bin/java

This script output shows the LimitNOFILE=infinity situation, before the move to the heavily capped LimitNOFILE=1024:524288 situation. For each of these processes, we can check whether it is able to raise its own limit.
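The real find_ulimit_nofile.py is not reproduced here. As a rough idea of how such a check can work, here is a simplified sketch that only reports processes with many open descriptors alongside their soft limit. It assumes Linux with /proc mounted and root privileges to see other users' processes; the function names and the threshold of 768 are invented for this example.

```python
import os


def read_soft_nofile(pid):
    """Return the soft 'Max open files' limit of `pid`, or None if unlimited."""
    with open(f"/proc/{pid}/limits") as fp:
        for line in fp:
            if line.startswith("Max open files"):
                # Columns: Max, open, files, <soft>, <hard>, files
                soft = line.split()[3]
                return None if soft == "unlimited" else int(soft)
    raise LookupError(f"no 'Max open files' line for pid {pid}")


def scan(threshold=768):
    """Yield (pid, fd_count, soft_limit, comm) for processes with many fds."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            fd_count = len(os.listdir(f"/proc/{entry}/fd"))
            soft = read_soft_nofile(entry)
            with open(f"/proc/{entry}/comm") as fp:
                comm = fp.read().strip()
        except OSError:
            continue  # process exited, or we lack permission
        if fd_count >= threshold:
            yield int(entry), fd_count, soft, comm


if __name__ == "__main__":
    print(f"{'PID':>7} {'FDs':>6} {'SOFT':>11}  COMM")
    for pid, fds, soft, comm in sorted(scan(), key=lambda row: -row[1]):
        print(f"{pid:>7} {fds:>6} {str(soft):>11}  {comm}")
```

Unlike the script described above, this sketch does not try to detect whether the soft limit was changed by the application; it only lists the candidates worth a closer look.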

As one of the few examples, MariaDB actually does set RLIMIT_NOFILE dynamically at startup, according to the source. Postgres does not, but does read the value and dynamically limits how many tables it can keep open. Containerd sets its own limit (hard = soft), but does not propagate it to its children.

Those two nginx ingress controllers? Yes, they set it, provided worker_rlimit_nofile is configured, which it normally is.

Elasticsearch? I doubt it sets anything from Java, but maybe an entrypoint.sh script does. That other Java process? Kafka this time. No idea. After investigating a bunch of different processes you can probably tell my enthusiasm is wearing off.

Conclusions

I think we can draw two conclusions here.

One: trying to figure out whether every workload raises its own soft open file limit is a losing battle. Raising it per application only makes sense if you run a well-defined, limited set of applications.

Two: seeing that the highest open file descriptor count we found is just under 17k, and that most processes stay below 4k, we can instead raise the default limits to something generous but not ridiculous: 65536 or 131072.

The choice is a trade-off. We'll settle for this:

[Service]
LimitNOFILE=131072:524288
