umount -l / needs --make-slave

Written by Walter Doekes

The other day I learned — the hard way — that umount -l can be dangerous. Using the --make-slave mount option makes it safer.

The scenario went like this:

A virtual machine on our Proxmox VE cluster wouldn't boot. No biggie, I thought. Just mount the filesystem on the host and do a proper grub-install from a chroot:

# fdisk -l /dev/zvol/zl-pve2-ssd1/vm-215-disk-3
/dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 *         2048 124999679 124997632 59.6G 83 Linux
/dev/zvol/zl-pve2-ssd1/vm-215-disk-3p2      124999680 125827071    827392  404M 82 Linux swap / Solaris
# mount /dev/zvol/zl-pve2-ssd1/vm-215-disk-3p1 /mnt/root
# cd /mnt/root
# for x in dev proc sys; do mount --rbind /$x $x; done
# chroot /mnt/root

There I could run the necessary commands to fix the boot procedure.
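
The exact repair steps depend on the VM, but for a grub fix like this one it boils down to roughly the following, using the whole-disk zvol from the fdisk listing above (a sketch rather than a verbatim transcript):

# grub-install /dev/zvol/zl-pve2-ssd1/vm-215-disk-3
# update-grub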

All done? Exit the chroot, unmount and start the VM:

# logout
# umount -l /mnt/root
# qm start 215

And at that point, things started failing miserably.

You see, in my laziness, I used umount -l instead of four separate umounts: /mnt/root/dev, /mnt/root/proc, /mnt/root/sys and lastly /mnt/root. What I was unaware of was that there were mounts inside dev, proc and sys too, and that the lazy unmount detached those as well. And because the rbind'd copies shared propagation with the originals, those unmounts propagated right back to the host's own /dev, /proc and /sys.
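
You can actually see that propagation setting with findmnt (assuming a util-linux recent enough to have the PROPAGATION column). On the freshly rbind'd tree it would have reported something like:

# findmnt -o TARGET,PROPAGATION /mnt/root/sys
TARGET         PROPAGATION
/mnt/root/sys  shared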

And that led to an array of failures:

systemd complained about binfmt_misc.automount breakage:

systemd[1]: proc-sys-fs-binfmt_misc.automount: Got invalid poll event 16 on pipe (fd=44)
systemd[1]: proc-sys-fs-binfmt_misc.automount: Failed with result 'resources'.

pvedaemon could not bring up any VMs:

pvedaemon[32825]: <root@pam> starting task qmstart:215:root@pam:
pvedaemon[46905]: start VM 215: UPID:pve2:ID:qmstart:215:root@pam:
systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope:
  No such file or directory
systemd[1]: 215.scope: Failed to create cgroup /qemu.slice/215.scope:
  No such file or directory
systemd[1]: 215.scope: Failed to add PIDs to scope's control group:
  No such file or directory
systemd[1]: 215.scope: Failed with result 'resources'.
systemd[1]: Failed to start 215.scope.
pvedaemon[46905]: start failed: systemd job failed
pvedaemon[32825]: <root@pam> end task qmstart:215:root@pam:
  start failed: systemd job failed

The root user's runtime dir could not be auto-created:

systemd[1]: user-0.slice: Failed to create cgroup
  /user.slice/user-0.slice: No such file or directory
systemd[1]: Created slice User Slice of UID 0.
systemd[1]: user-0.slice: Failed to create cgroup
  /user.slice/user-0.slice: No such file or directory
systemd[1]: Starting User Runtime Directory /run/user/0...
systemd[4139]: user-runtime-dir@0.service: Failed to attach to
  cgroup /user.slice/user-0.slice/user-runtime-dir@0.service:
  No such file or directory
systemd[4139]: user-runtime-dir@0.service:
  Failed at step CGROUP spawning /lib/systemd/systemd-user-runtime-dir:
  No such file or directory

The Proxmox VE replication runner failed to start:

systemd[1]: pvesr.service: Failed to create cgroup
  /system.slice/pvesr.service: No such file or directory
systemd[1]: Starting Proxmox VE replication runner...
systemd[5538]: pvesr.service: Failed to attach to cgroup
  /system.slice/pvesr.service: No such file or directory
systemd[5538]: pvesr.service: Failed at step CGROUP spawning
  /usr/bin/pvesr: No such file or directory
systemd[1]: pvesr.service: Main process exited, code=exited,
  status=219/CGROUP
systemd[1]: pvesr.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Proxmox VE replication runner.

And, worst of all, new ssh logins to the host machine failed:

sshd[24551]: pam_systemd(sshd:session):
  Failed to create session: Connection timed out
sshd[24551]: error: openpty: No such file or directory
sshd[31553]: error: session_pty_req: session 0 alloc failed
sshd[31553]: Received disconnect from 10.x.x.x port 55190:11:
  disconnected by user

As you understand by now, this was my own doing: all of these failures were caused by missing mount points.

The failing ssh? A missing /dev/pts.

Most of the other failures? Missing mounts in /sys/fs/cgroup.

Fixing

First order of business was to get this machine to behave again. Luckily I had a different machine where I could take a peek at what was supposed to be mounted.

On the other machine, I ran this one-liner:

$ mount | sed -e '/ on \/\(dev\|proc\|sys\)\//!d
   s#^\([^ ]*\) on \([^ ]*\) type \([^ ]*\) (\([^)]*\)).*#'\
'mountpoint -q \2 || '\
'( mkdir -p \2; mount -n -t \3 \1 -o \4 \2 || rmdir \2 )#' |
  sort -V

That resulted in this output that could be pasted into the one ssh shell I still had at my disposal:

mountpoint -q /dev/hugepages || ( mkdir -p /dev/hugepages; mount -n -t hugetlbfs hugetlbfs -o rw,relatime,pagesize=2M /dev/hugepages || rmdir /dev/hugepages )
mountpoint -q /dev/mqueue || ( mkdir -p /dev/mqueue; mount -n -t mqueue mqueue -o rw,relatime /dev/mqueue || rmdir /dev/mqueue )
mountpoint -q /dev/pts || ( mkdir -p /dev/pts; mount -n -t devpts devpts -o rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 /dev/pts || rmdir /dev/pts )
mountpoint -q /dev/shm || ( mkdir -p /dev/shm; mount -n -t tmpfs tmpfs -o rw,nosuid,nodev,inode64 /dev/shm || rmdir /dev/shm )
mountpoint -q /proc/sys/fs/binfmt_misc || ( mkdir -p /proc/sys/fs/binfmt_misc; mount -n -t autofs systemd-1 -o rw,relatime,fd=28,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=45161 /proc/sys/fs/binfmt_misc || rmdir /proc/sys/fs/binfmt_misc )
mountpoint -q /sys/fs/bpf || ( mkdir -p /sys/fs/bpf; mount -n -t bpf none -o rw,nosuid,nodev,noexec,relatime,mode=700 /sys/fs/bpf || rmdir /sys/fs/bpf )
mountpoint -q /sys/fs/cgroup || ( mkdir -p /sys/fs/cgroup; mount -n -t tmpfs tmpfs -o ro,nosuid,nodev,noexec,mode=755,inode64 /sys/fs/cgroup || rmdir /sys/fs/cgroup )
mountpoint -q /sys/fs/cgroup/blkio || ( mkdir -p /sys/fs/cgroup/blkio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,blkio /sys/fs/cgroup/blkio || rmdir /sys/fs/cgroup/blkio )
mountpoint -q /sys/fs/cgroup/cpuset || ( mkdir -p /sys/fs/cgroup/cpuset; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpuset /sys/fs/cgroup/cpuset || rmdir /sys/fs/cgroup/cpuset )
mountpoint -q /sys/fs/cgroup/cpu,cpuacct || ( mkdir -p /sys/fs/cgroup/cpu,cpuacct; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct || rmdir /sys/fs/cgroup/cpu,cpuacct )
mountpoint -q /sys/fs/cgroup/devices || ( mkdir -p /sys/fs/cgroup/devices; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,devices /sys/fs/cgroup/devices || rmdir /sys/fs/cgroup/devices )
mountpoint -q /sys/fs/cgroup/freezer || ( mkdir -p /sys/fs/cgroup/freezer; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,freezer /sys/fs/cgroup/freezer || rmdir /sys/fs/cgroup/freezer )
mountpoint -q /sys/fs/cgroup/hugetlb || ( mkdir -p /sys/fs/cgroup/hugetlb; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,hugetlb /sys/fs/cgroup/hugetlb || rmdir /sys/fs/cgroup/hugetlb )
mountpoint -q /sys/fs/cgroup/memory || ( mkdir -p /sys/fs/cgroup/memory; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,memory /sys/fs/cgroup/memory || rmdir /sys/fs/cgroup/memory )
mountpoint -q /sys/fs/cgroup/net_cls,net_prio || ( mkdir -p /sys/fs/cgroup/net_cls,net_prio; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio || rmdir /sys/fs/cgroup/net_cls,net_prio )
mountpoint -q /sys/fs/cgroup/perf_event || ( mkdir -p /sys/fs/cgroup/perf_event; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,perf_event /sys/fs/cgroup/perf_event || rmdir /sys/fs/cgroup/perf_event )
mountpoint -q /sys/fs/cgroup/pids || ( mkdir -p /sys/fs/cgroup/pids; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,pids /sys/fs/cgroup/pids || rmdir /sys/fs/cgroup/pids )
mountpoint -q /sys/fs/cgroup/rdma || ( mkdir -p /sys/fs/cgroup/rdma; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,rdma /sys/fs/cgroup/rdma || rmdir /sys/fs/cgroup/rdma )
mountpoint -q /sys/fs/cgroup/systemd || ( mkdir -p /sys/fs/cgroup/systemd; mount -n -t cgroup cgroup -o rw,nosuid,nodev,noexec,relatime,xattr,name=systemd /sys/fs/cgroup/systemd || rmdir /sys/fs/cgroup/systemd )
mountpoint -q /sys/fs/cgroup/unified || ( mkdir -p /sys/fs/cgroup/unified; mount -n -t cgroup2 cgroup2 -o rw,nosuid,nodev,noexec,relatime /sys/fs/cgroup/unified || rmdir /sys/fs/cgroup/unified )
mountpoint -q /sys/fs/fuse/connections || ( mkdir -p /sys/fs/fuse/connections; mount -n -t fusectl fusectl -o rw,relatime /sys/fs/fuse/connections || rmdir /sys/fs/fuse/connections )
mountpoint -q /sys/fs/pstore || ( mkdir -p /sys/fs/pstore; mount -n -t pstore pstore -o rw,nosuid,nodev,noexec,relatime /sys/fs/pstore || rmdir /sys/fs/pstore )
mountpoint -q /sys/kernel/config || ( mkdir -p /sys/kernel/config; mount -n -t configfs configfs -o rw,relatime /sys/kernel/config || rmdir /sys/kernel/config )
mountpoint -q /sys/kernel/debug || ( mkdir -p /sys/kernel/debug; mount -n -t debugfs debugfs -o rw,relatime /sys/kernel/debug || rmdir /sys/kernel/debug )
mountpoint -q /sys/kernel/debug/tracing || ( mkdir -p /sys/kernel/debug/tracing; mount -n -t tracefs tracefs -o rw,relatime /sys/kernel/debug/tracing || rmdir /sys/kernel/debug/tracing )
mountpoint -q /sys/kernel/security || ( mkdir -p /sys/kernel/security; mount -n -t securityfs securityfs -o rw,nosuid,nodev,noexec,relatime /sys/kernel/security || rmdir /sys/kernel/security )

Finishing touches:

$ for x in /sys/fs/cgroup/*; do
    test -L $x &&
    echo ln -s $(readlink $x) $x
  done
ln -s cpu,cpuacct /sys/fs/cgroup/cpu
ln -s cpu,cpuacct /sys/fs/cgroup/cpuacct
ln -s net_cls,net_prio /sys/fs/cgroup/net_cls
ln -s net_cls,net_prio /sys/fs/cgroup/net_prio

Running those commands returned the system to a usable state.

The real fix

Next time, I shall refrain from doing the lazy -l umount.

But, as a better solution, I'll also be adding --make-slave to the rbind mount command. Doing that will ensure that an unmount in the bound locations does not unmount the original mount points:

# for x in dev proc sys; do
    mount --rbind --make-slave /$x $x
  done

With --make-slave, a umount -l of your chroot path does not break your system.
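
This is easy to verify: the same findmnt check as before should now report the bound copies with slave propagation instead of shared, which means unmount events no longer travel back to the originals. Something along these lines:

# findmnt -o TARGET,PROPAGATION /mnt/root/sys
TARGET         PROPAGATION
/mnt/root/sys  private,slave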

