containerd / kubernetes / open file limits
After upgrading a customer's Kubernetes cluster running on Ubuntu/Jammy, we ran into a snag. The vernemq instances wouldn't start completely; instead, they failed with EMFILE errors (Too many open files). This came as a surprise. After all, we hadn't touched any limits.
But it turned out containerd v1.8 changed the LimitNOFILE setting from a very permissive infinity to the systemd default. The result? Processes inside Kubernetes would get a measly maximum of 1024 file descriptors by default.
containerd upgrades in Ubuntu
Normally Ubuntu is super conservative when migrating versions: once a stable LTS is released, you need to move heaven and earth to get a patch in. This time however, they bumped the containerd package from 1.7 to 2.2 within Jammy (22.04) and Noble (24.04) — in fact, Jammy has even seen two version bumps, as it started with version 1.5.
For such a big version jump, you might expect breaking changes. Canonical apparently did not, or at least didn't foresee any impact.
LimitNOFILE and systemd
For processes started by systemd, systemd sets the nofile ulimit to the values configured in the application's service file. For example:
[Service]
# infinity: the value of /proc/sys/fs/nr_open for both soft and hard
LimitNOFILE=infinity
Or the default:
[Service]
# 1024 soft, 524288 hard
LimitNOFILE=1024:524288
The systemd exec manual recommends leaving the soft limit at 1024 because old processes may still use select(2), which cannot handle file descriptor numbers of 1024 and above.
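To see which values a process actually ended up with, something like the following Python sketch can help. It only uses the standard resource module and reads /proc/sys/fs/nr_open, the value the example above says infinity resolves to. The function names are made up for illustration, and the nr_open lookup assumes Linux.

```python
import resource


def nofile_limits():
    """Return (soft, hard) RLIMIT_NOFILE for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def nr_open():
    """Return the kernel ceiling from /proc/sys/fs/nr_open, or None if unavailable."""
    try:
        with open("/proc/sys/fs/nr_open") as fp:
            return int(fp.read())
    except OSError:
        # Not Linux, or /proc is not mounted.
        return None


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"soft={soft} hard={hard} nr_open={nr_open()}")
```

Run inside a container, this shows the limits inherited from containerd rather than those of your login shell.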
containerd ulimit history
The containerd project has had a hard time deciding what an appropriate value is:
| Year | Version | Limit | Commit | PR |
|---|---|---|---|---|
| 2017 | v1.0 | LimitNOFILE=1048576 | b009642 | #1846 |
| 2018 | v1.2 | LimitNOFILE=infinity | 4972e3f | #2601 |
| 2019 | v1.3 | LimitNOFILE=1048576 | 1a1f8f1 | #3202 |
| 2020 | v1.5 | LimitNOFILE=infinity | c691c36 | #4475 |
| 2023 | v1.8 | # unset | 3ca39ef | #8924 |
The maximum number of open file descriptors available to a process has alternated between 1048576:1048576 and infinity, which in practice meant the same thing (except on pre-2019 systemd, where infinity meant 65536:65536). Depending on your systemd version and the containerd defaults, you got between 65536 and 1048576 as both the soft and the hard limit.
But since 2023, with containerd 1.8 and higher, you get a default of 1024 and 524288 for the soft and hard limits respectively. The earlier adjustments had no practical effect, but this last change did. 65536 is enough in most cases, but 1024 definitely isn't.
Impact on workload
If containerd starts your Kubernetes containers, you may suddenly notice that you're running out of file descriptors. Example:
!!!!
!!!! WARNING: ulimit -n is 1024; 65536 is the recommended minimum.
!!!!
Exec: /vernemq/bin/../erts-11.1.8/bin/erlexec -boot /vernemq/bin/../releases/1.13.0/vernemq
...
11:27:23.787 [error] File operation error: emfile. Target: /vernemq/bin/../lib/mongodb-3.4.4/ebin/vmq_ql_query.beam. Function: get_file. Process: code_server.
For some processes, like this old vernemq here, this is fatal. For other processes, this results in degraded performance when cache files or database tables have to be closed more quickly than strictly necessary.
The systemd default value of 1024 fixes a real problem — but only for very old software. For Kubernetes it creates a problem to which few applications have an answer.
Solutions
The best solution is for every application to assess its file descriptor needs beforehand and raise its soft limit to an appropriate value. However, not all applications actually do this.
Programmatically, one has to call setrlimit(RLIMIT_NOFILE, ...). Or you could start your application from a shell and call ulimit first. For instance, for this old vernemq statefulset I had to extract the ENTRYPOINT from the image and then manually set this in the container spec:
command: ["/bin/sh"]
args: ["-c", "ulimit -n 131072; exec /usr/sbin/start_vernemq"]
(Setting ulimit -n like that would typically be done from a Dockerfile entrypoint shell script, because Kubernetes does not provide any means to set it from a spec.)
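For an application that wants to do this itself, the setrlimit(RLIMIT_NOFILE, ...) call mentioned above could look roughly like this in Python. This is a minimal sketch: the function name and the default of 131072 are illustrative assumptions, and an unprivileged process can only raise its soft limit up to the hard limit.

```python
import resource


def raise_nofile_soft_limit(wanted=131072):
    """Raise the soft RLIMIT_NOFILE towards `wanted`, never above the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Clamp the request to the hard limit, which we are not allowed to exceed.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    # Only ever raise the limit; leave an already generous soft limit alone.
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print(raise_nofile_soft_limit())
```

Calling this early at startup, before opening sockets or files, has the same effect as the ulimit -n wrapper, without needing a shell in the image.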
Alternatively, we could set LimitNOFILE=infinity on the containerd daemon ourselves via a systemd drop-in, or at least raise it above 1024.
The question: do we change every application, or do we change the defaults? And to what?
Checking current workload
Before deploying containerd 2.2 everywhere, we'll have to decide what to do. And because we have more than one Kubernetes cluster to examine for this particular issue, we'll use a script to get the details: find_ulimit_nofile.py
The script checks running processes, counts open file descriptors and reports the processes if (a) the soft limit appears unchanged, and (b) the open file descriptor count is getting close to or over 1024.
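The core of such a check can be sketched in a few lines of Python. This is not the actual find_ulimit_nofile.py, only an illustration of the idea, assuming Linux with /proc mounted. The function names and the threshold of 768 are hypothetical, and this sketch only covers the descriptor count, not the "soft limit appears unchanged" test. The output shown further down comes from the original script, not from this sketch.

```python
import os


def nofile_usage(pid):
    """Return (open_fd_count, soft_limit) for pid, or None if it cannot be read."""
    try:
        fds = len(os.listdir(f"/proc/{pid}/fd"))
        with open(f"/proc/{pid}/limits") as fp:
            for line in fp:
                if line.startswith("Max open files"):
                    # Fields: "Max", "open", "files", soft, hard, units
                    soft = line.split()[3]
                    return fds, (int(soft) if soft.isdigit() else float("inf"))
    except OSError:
        # Process exited in the meantime, or we lack permission to inspect it.
        return None
    return None


def busy_processes(threshold=768):
    """List (pid, fds, soft) for processes with at least `threshold` open fds."""
    rows = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        usage = nofile_usage(int(entry))
        if usage and usage[0] >= threshold:
            rows.append((int(entry), *usage))
    # Heaviest users first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for pid, fds, soft in busy_processes():
        print(pid, fds, soft)
```

Run it as root on a node; otherwise processes owned by other users are silently skipped.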
When run on a few containerd 1.7 nodes, we get output like:
# python3 ./find_ulimit_nofile.py
PID FDs SOFT COMM EXE
867893 16831 1048576 beam.smp /vernemq/erts-11.1.8/bin/beam.smp
1843653 3064 1048576 argocd-applicat /usr/local/bin/argocd
4080120 2920 1048576 java /usr/lib/jvm/java-17-openjdk-17.0.11.0.9-2.el8.x86_64/bin/java
2391 1881 1048576 containerd /usr/bin/containerd
2610690 1157 1048576 mysqld /opt/bitnami/mariadb/sbin/mariadbd
2703298 1078 1048576 redis-server /usr/local/bin/redis-server
1046409 990 1048576 postgres /usr/lib/postgresql/16/bin/postgres
1046484 989 1048576 postgres /usr/lib/postgresql/16/bin/postgres
1771834 972 1048576 dd-ipc-helper /memfd:spawn_worker_trampoline (deleted)
1771390 834 1048576 dd-ipc-helper /memfd:spawn_worker_trampoline (deleted)
1975742 777 1048576 nginx /usr/local/nginx/sbin/nginx
1906067 777 1048576 nginx /usr/local/nginx/sbin/nginx
767917 771 1048576 java /usr/share/elasticsearch/jdk/bin/java
This script output shows the LimitNOFILE=infinity situation, before moving to the heavily capped LimitNOFILE=1024:524288 situation. For each of these processes, we can check whether they are able to raise their own limits.
As one of the few examples, MariaDB actually does set RLIMIT_NOFILE dynamically at startup, according to the source. Postgres does not, but does read the value and dynamically limits how many tables it can keep open. Containerd sets its own limit (hard = soft), but does not propagate it to its children.
Those two nginx ingress controllers? Yes, they set it, provided worker_rlimit_nofile is configured, which it normally is. Elasticsearch? I doubt it sets anything from Java, but maybe an entrypoint.sh script does. That other Java process? Kafka this time. No idea. After investigating a bunch of different processes you can probably tell my enthusiasm is wearing off.
Conclusions
I think we can draw two conclusions here.
One: trying to figure out whether every workload sets its own soft open-file limit is a losing battle. Only if you run a well-defined, limited set of applications would it make sense to raise the limit per application.
Two: seeing that the highest number of open file descriptors we found is about 16k, and usually not even 4k, we can instead raise the default limits to something generous but not ridiculous: 65536 or 131072.
The choice is a trade-off:
- We accept that old-style select(2)-using applications might not work: we can fix them with a custom entrypoint ulimit if needed.
- We could set the soft limit to the hard limit and give everyone 512K or 1024K limits. But keeping them slightly more modest gives us earlier detection of file descriptor leaks (in broken applications) and better behaviour if a process tries to close all possible file descriptors. (For examples and further reading, see rsyslog doing 100% CPU trying to close a billion file descriptors, CPython adding close_range(2) support, and a containerd summary of the rationale to remove LimitNOFILE=infinity.)
We'll settle for this:
[Service]
LimitNOFILE=131072:524288