nvme drive refusing efi boot
UEFI is the current boot standard. Instead of fighting it, we've adopted it as the default for all physical machines we install. We've had some issues in the past, but they could all be attributed to a lack of knowledge on the operator's part, not to a problem with EFI itself. This time, however, we couldn't figure out why the SuperMicro machine refused to boot from its newly installed EFI partitions: no bootable UEFI device found.
Spoiler: it was an improperly formatted FAT32 filesystem, but we'll get to that in a moment.
Setting up storage on new machines
When setting up new hardware, we use a script that partitions and formats drives per our needs. For NVMe drives, however, we must first select the logical sector size. By default these drives are in 512-byte compatibility mode. For performance and lifespan, it's better to choose their native sector size, which is usually 4096 bytes.
Doing that looks somewhat like this:
DEV=/dev/nvme0n1
# List possible lba formats. The output might
# look like this, where lower rp is better:
# ...
# lbaf 0 : ms:0 lbads:9 rp:0x2 (in use)
# lbaf 1 : ms:8 lbads:9 rp:0x2
# lbaf 2 : ms:16 lbads:9 rp:0x2
# lbaf 3 : ms:0 lbads:12 rp:0
# lbaf 4 : ms:8 lbads:12 rp:0
# lbaf 5 : ms:64 lbads:12 rp:0
# lbaf 6 : ms:128 lbads:12 rp:0
nvme id-ns $DEV
# Select 4096 (1<<12) bytes per sector (note: formatting erases the drive):
nvme format --lbaf=3 $DEV
(Once it's set, one can check that it's optimal by running nvme_check_best_sector() in sadfscheck.)
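As a rough illustration of the selection step, the following Python sketch parses the lbaf lines from the nvme id-ns output shown above and picks a format. The function name pick_lbaf and the preference for metadata-free formats (ms:0) are assumptions made for this example, not part of the script we use; the only rule taken from the text is that a lower rp is better.

```python
import re

# Sample taken from the nvme id-ns output above.
SAMPLE = """\
lbaf 0 : ms:0 lbads:9 rp:0x2 (in use)
lbaf 1 : ms:8 lbads:9 rp:0x2
lbaf 2 : ms:16 lbads:9 rp:0x2
lbaf 3 : ms:0 lbads:12 rp:0
lbaf 4 : ms:8 lbads:12 rp:0
lbaf 5 : ms:64 lbads:12 rp:0
lbaf 6 : ms:128 lbads:12 rp:0
"""

LBAF_RE = re.compile(r"lbaf\s+(\d+)\s*:\s*ms:(\d+)\s+lbads:(\d+)\s+rp:(\S+)")


def pick_lbaf(id_ns_output):
    """Return (lbaf index, sector size in bytes) of the preferred format."""
    candidates = []
    for m in LBAF_RE.finditer(id_ns_output):
        index, ms, lbads = int(m.group(1)), int(m.group(2)), int(m.group(3))
        rp = int(m.group(4), 0)  # accepts both "0x2" and "0"
        if ms == 0:  # assumption: we do not want per-sector metadata
            candidates.append((rp, index, 1 << lbads))
    if not candidates:
        raise ValueError("no metadata-free LBA format found")
    # Lowest rp wins; ties are broken by the lowest lbaf index.
    rp, index, sector_size = min(candidates)
    return index, sector_size


print(pick_lbaf(SAMPLE))  # (3, 4096)
```

The returned index is the value that would go into nvme format --lbaf=N.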
Next up is partitioning:
# First usable sector is 34 (with 512-byte sectors) or 6 (with 4096-byte sectors); we skip the entire first MiB.
sgdisk -n1:1M:+1M -t1:EF02 -c1:biosboot $DEV # BIOS-boot, or
sgdisk -n2:0:+510M -t2:EF00 -c2:efi $DEV # EFI System
sgdisk -n3:0:+1G -t3:8300 -c3:boot $DEV # /boot (type 8300 for Linux, or BE01 for ZFS)
sgdisk -n4:0:0 -t4:8300 -c4:root $DEV # /
Adding a small 1 MiB partition allows us to fall back to old style BIOS-boot. But commonly the real booting happens off the second partition, which is 510 MiB large: more than sufficient for any EFI binaries we may need.
Formatting the filesystems:
mkfs.fat -F 32 -n EFI ${DEV}p2 # (*BAD)
mkfs.ext2 -t ext2 -L boot ${DEV}p3 # or something else for ZFS
mkfs.ext4 -t ext4 -L root ${DEV}p4 # or something else for ZFS
At this point, we let debootstrap(8) work its magic. And after some waiting and some finalizing, we have a bootable Ubuntu/Linux system.
Mirror disks and booting
For the boot/root filesystems, we generally want a mirror setup, so disk failure isn't an immediate catastrophe. We use either ZFS mirrors or Linux Software RAID for this. The mirroring components make sure that the partitions on either drive are redundant and interchangeable. This ensures that if one drive completely fails, we can still boot from the other.
Except for that EFI partition... Because it has a FAT filesystem... And it therefore does not do any fancy mirroring.
Luckily, we got that sorted quite easily using efibootmirrorsetup — a helper script that keeps the two EFI partitions in sync, and places the EFI partitions of both drives first in the BootOrder.
Running it is as easy as calling efibootmirrorsetup with the two mirror drives as arguments:
efibootmirrorsetup /dev/nvme0n1 /dev/nvme1n1
It's an interactive tool, which won't do anything without your permission. With your permission, it formats the EFI partition on the second drive with a FAT filesystem and ensures that grub(8) updates are always applied to both partitions.
EFI set up correctly
At this point, we should have everything set up correctly:
# efibootmgr
BootCurrent: 0004
Timeout: 1 seconds
BootOrder: 0004,0000,0002,0001
Boot0000* nvme-KCD6DLUL1T92_61A
Boot0001 Network Card
Boot0002* UEFI: Built-in EFI Shell
Boot0004* nvme-KCD6DLUL1T92_41E
Two boot images have been prepended to the BootOrder. efibootmirrorsetup has been kind enough to use the device name/serial so you can easily identify and select a non-faulty drive in case of hardware failures.
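To check the resulting order from a script rather than by eye, something like the following Python sketch could be used. It only parses text in the format shown above; the function name boot_order and the returned tuple layout are assumptions made for this example.

```python
import re

# Sample taken from the efibootmgr output above.
SAMPLE = """\
BootCurrent: 0004
Timeout: 1 seconds
BootOrder: 0004,0000,0002,0001
Boot0000* nvme-KCD6DLUL1T92_61A
Boot0001 Network Card
Boot0002* UEFI: Built-in EFI Shell
Boot0004* nvme-KCD6DLUL1T92_41E
"""


def boot_order(efibootmgr_output):
    """Return [(number, active, label), ...] sorted by BootOrder."""
    order = []
    entries = {}
    for line in efibootmgr_output.splitlines():
        if line.startswith("BootOrder:"):
            order = line.split(":", 1)[1].strip().split(",")
            continue
        # "Boot0000* label": the asterisk marks an active entry.
        m = re.match(r"Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)", line)
        if m:
            entries[m.group(1)] = (m.group(2) == "*", m.group(3).strip())
    return [(num,) + entries[num] for num in order if num in entries]


for number, active, label in boot_order(SAMPLE):
    print(number, "active" if active else "inactive", label)
```

With the sample above, the two nvme entries come out first, followed by the EFI shell and the inactive network card.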
Looking at efibootmgr -v also shows the drive and file:
HD(2,GPT,df57901e-9a47-4393-9470-94afaa56a58f,0x200,0x1fe00)/File(\EFI\UBUNTU\SHIMX64.EFI)
The values listed are:
2 = partition number
GPT = the partitions follow GUID Partition Table (GPT) layout
df57901e-9a47-4393-9470-94afaa56a58f = the partition UUID (see also: blkid)
0x200 = the EFI filesystem starts at the 512th sector (at 2 MiB when using 4 KiB sectors)
0x1fe00 = the filesystem fits on 130560 sectors (510 MiB, when using 4 KiB sectors)
File(\EFI\UBUNTU\SHIMX64.EFI) = path to the boot loader
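The sector-to-byte conversion for these fields can be sketched in Python. The helper name decode_hd_node and the dictionary keys are made up for this illustration; the sector size is passed in explicitly because the device path itself does not carry it.

```python
import re

MIB = 1024 * 1024

HD_RE = re.compile(
    r"HD\((\d+),(\w+),([0-9A-Fa-f-]+),(0x[0-9A-Fa-f]+),(0x[0-9A-Fa-f]+)\)"
)


def decode_hd_node(device_path, sector_size):
    """Decode the HD(...) node of an EFI device path into byte values."""
    m = HD_RE.search(device_path)
    if m is None:
        raise ValueError("no HD(...) node found")
    start_sector = int(m.group(4), 16)
    sector_count = int(m.group(5), 16)
    return {
        "partition": int(m.group(1)),
        "table": m.group(2),
        "uuid": m.group(3).lower(),
        "start_bytes": start_sector * sector_size,
        "size_bytes": sector_count * sector_size,
    }


info = decode_hd_node(
    r"HD(2,GPT,df57901e-9a47-4393-9470-94afaa56a58f,0x200,0x1fe00)"
    r"/File(\EFI\UBUNTU\SHIMX64.EFI)",
    sector_size=4096,
)
print(info["start_bytes"] // MIB, info["size_bytes"] // MIB)  # 2 510
```

The 2 MiB start and 510 MiB size match the sgdisk layout from the partitioning step.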
No EFI filesystem found
Yet, when rebooting, we found ourselves dropped into a UEFI shell. Suddenly the setup that had worked previously did not work.
If you're new to the EFI shell, it feels both cryptic and old-fashioned. The most important tip I have for you is -b for pagination:
Shell> help -b
...
(listing of all available commands, paginated)
Shell> help -b map
...
(listing help for map, showing among others "map fs*")
Shell> map -b
...
BLK0: Alias(s):
PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)
Handle: [1D2]
Media Type: Unknown
Removable: No
Current Dir: BLK0:
BLK1: Alias(s):
PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(1,GPT,243727B8-73F9-41C9-8D06-70EB75472690,0x100,0x100)
Handle: [1D3]
Media Type: HardDisk
Removable: No
Current Dir: BLK1:
BLK2: Alias(s):
PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(2,GPT,DF57901E-9A47-4393-9470-94AFAA56A58F,0x200,0x1FE00)
Handle: [1D4]
Media Type: HardDisk
Removable: No
Current Dir: BLK2:
BLK3: Alias(s):
PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(3,GPT,2AD728A8-2142-43F0-8FB7-54B399B2AEB6,0x20000,0x40000)
Handle: [1D5]
Media Type: HardDisk
Removable: No
Current Dir: BLK3:
...
There they were, the partitions, and particularly that BLK2. But there should be more than just partitions. There should also be filesystems, denoted by FSn.
(For a brief moment, I entertained the thought that UEFI couldn't cope with 4096 byte sector sizes, but that would be too improbable. Larger sector sizes had existed long before the UEFI standard was drafted.)
I messed around quite a bit in the UEFI shell: listing hex dumps of BLK2 (dblk), displaying the current boot config (bcfg boot dump), and so forth.
Sidenote: the interactive hex editor (hexedit) on the Intel platform said Ctrl-E for help. But Ctrl-E did nothing. Function keys did work (F1 = jump to offset, F2 = save, F3 = exit, ..., F7 = paste, F8 = open, F9 = block open). You may be most interested in "exit". I was ;-)
After giving in to the improbable hunch that large sectors were the culprit, I created an EFI partition with a 512 byte sector size. Booting worked again! There definitely was something going on with the sector size.
Here is the output listing the EFI handles that provide DiskIO, where one filesystem (with 512 byte sectors) worked, and one (with 4096 byte sectors) did not:
Shell> dh -p diskio
...
1FE: SimpleFileSystem DiskIO EFISystemPartition PartitionInfo BlockIO DevicePath(..6AB46D5FA0DF,0x1000,0xFF000))
...
203: DiskIO EFISystemPartition PartitionInfo BlockIO DevicePath(..-94AFAA56A58F,0x200,0x1FE00))
...
Observe how the second one does not list the SimpleFileSystem. (As we've seen earlier, 0x1000 and 0x200 are the sector offsets. For the 4096 byte sector size, the offset is lower, but points to the same byte offset.)
And, even more verbosely:
Shell> dh 1FE -v
1FE: 85F2A898
SimpleFileSystem(85F0B030)
DiskIO(85F297A0)
EFISystemPartition(0)
PartitionInfo(85F2A368)
Partition Type : GPT
EFI System Partition : Yes
BlockIO(85F2A2B0)
Fixed MId:0 bsize 200, lblock FEFFF (534,773,760), partition rw !cached
DevicePath(85F2AB18)
PciRoot(0x0)/Pci(0x1,0x1)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FF-A1-C1-07)/HD(2,GPT,DFE346BA-05A8-4B9C-A6D2-6AB46D5FA0DF,0x1000,0xFF000)
Shell> dh 203 -v
203: 85EEF718
DiskIO(85EA57A0)
EFISystemPartition(0)
PartitionInfo(85EEF368)
Partition Type : GPT
EFI System Partition : Yes
BlockIO(85EEF2B0)
Fixed MId:0 bsize 1000, lblock 1FDFF (534,773,760), partition rw !cached
DevicePath(85EEFC98)
PciRoot(0x0)/Pci(0x1,0x2)/Pci(0x0,0x0)/NVMe(0x1,00-01-0D-91-FE-A2-C1-08)/HD(2,GPT,DF57901E-9A47-4393-9470-94AFAA56A58F,0x200,0x1FE00)
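The two BlockIO lines can be cross-checked with a little arithmetic. This Python sketch assumes that lblock is the index of the last block (hence the + 1), which agrees with the byte count the shell prints in parentheses; the function name is made up for the example.

```python
def blockio_size(bsize_hex, lblock_hex):
    """Device size in bytes from BlockIO's block size and last block (hex)."""
    return (int(lblock_hex, 16) + 1) * int(bsize_hex, 16)


# Handle 1FE: 512-byte sectors, SimpleFileSystem present.
working_size = blockio_size("200", "FEFFF")
# Handle 203: 4096-byte sectors, SimpleFileSystem missing.
failing_size = blockio_size("1000", "1FDFF")

# Start offsets from the HD(...) nodes, converted from sectors to bytes.
working_start = 0x1000 * 0x200   # sector 0x1000, 512 bytes each
failing_start = 0x200 * 0x1000   # sector 0x200, 4096 bytes each

print(working_size, failing_size)    # 534773760 534773760
print(working_start, failing_start)  # 2097152 2097152
```

Both partitions start at the same byte offset and have the same size, so the only difference the firmware sees at this level is the block size.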
(At this point you may be wondering how I got these screen dumps. The UEFI shell was on a remote SuperMicro machine connected over IPMI using iKVM, an ancient Java application which does not support screen grabbing at all. The answer: I wrote a quick OCR tool for this purpose, ikvmocr, which converts screenshots of a console to text. It depends on Python PIL only, and is more than 99% accurate.)
FAT versions
Yesterday, I wrote about the FAT16 filesystem layout for this reason.
When creating the tools that set up EFI partitions, I had incorrectly assumed that FAT32 was the default nowadays. It is not. The appropriate FAT version for your needs depends on the size of the partition and the desired cluster size.
And due to a bug in dosfstools 4.1 (and older), we were creating too few clusters on the FAT32 partition when the drives had 4096 byte sectors ("This only works correctly for 512 byte sectors!"). These partitions would work just fine on Linux as it uses different heuristics to detect the FAT version. But the EFI SimpleFileSystem Driver would use the official recommendation, detecting FAT16 and then not recognising the rest of the filesystem, dropping it from the possible boot options.
The (official) rules: begin by taking the (data) sector count, which is slightly less than the total partition size divided by the sector size in bytes. Thus, take 510 MiB, divide by the sector size of 4096, and get 130,560 sectors. Divide by the sectors per cluster, commonly 2 or 4 (use file -s if you don't know). Now you have 65,280 or 32,640 clusters (slightly less, actually).
Then, check this table:
clusters | FAT version |
---|---|
< 4,085 | FAT12 |
< 65,525 | FAT16 (both 65,280 and 32,640 are here) |
>= 65,525 | FAT32 |
For the smaller 512 byte sector size, mkfs.fat would have selected 16 sectors per cluster: 510 MiB / 512 / 16, also 65,280 clusters. If you selected FAT32, however, it would lower the sectors per cluster so that the cluster count would exceed 65,525. For the larger sector size, this calculation was not performed correctly.
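The table and the cluster arithmetic can be put together in a short Python sketch. The function names are made up for this example, and max_clusters deliberately ignores the reserved sectors and the FATs themselves, so it gives an upper bound rather than the exact count.

```python
MIB = 1024 * 1024


def fat_version(clusters):
    """FAT version as determined purely by the cluster count."""
    if clusters < 4085:
        return "FAT12"
    if clusters < 65525:
        return "FAT16"
    return "FAT32"


def max_clusters(partition_bytes, sector_size, sectors_per_cluster):
    """Upper bound on the cluster count (reserved sectors and FATs ignored)."""
    return partition_bytes // sector_size // sectors_per_cluster


for sector_size, spc in [(4096, 2), (4096, 4), (512, 16), (512, 8), (4096, 1)]:
    clusters = max_clusters(510 * MIB, sector_size, spc)
    print(sector_size, spc, clusters, fat_version(clusters))
```

On a 510 MiB partition with 4096 byte sectors, only one sector per cluster yields enough clusters for the filesystem to count as FAT32 under these rules; 2 or 4 sectors per cluster land in FAT16 territory, whatever the header claims.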
Lessons learnt: don't use FAT32 for partitions smaller than 1 GiB unless you also check the sectors-per-cluster.
Addendum (2022-03-24)
The UEFI Spec states in 13.3 File System Format that “EFI encompasses the use of FAT32 for a system partition, and FAT12 or FAT16 for removable media.” That suggests using FAT32 (and many websites out there suggest the same). However, the specification also mandates support for the other FAT versions.
Most important is that the partition is aligned on the (larger of the logical or physical) sector size. If you opt to run with FAT32, make sure the sectors per cluster value is low enough, so you get at least 65,525 clusters: for 4 KiB clusters you'll need at least 256.5 MiB (*), for 8 KiB clusters at least 512.5 MiB (**).
(*) 2*4096 (reserved) + 2*256*1024 (2xFAT32) + 4096*65525 (data)
(**) 2*4096 (reserved) + 2*256*1024 (2xFAT32) + 8192*65525 (data)
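The two footnotes follow the same formula, which can be written as a small Python sketch. The function name and the default sector size are assumptions for this example; the reserved area and FAT sizes are taken as-is from the footnotes.

```python
MIB = 1024 * 1024
MIN_FAT32_CLUSTERS = 65525


def min_fat32_bytes(cluster_size, sector_size=4096):
    """Smallest partition (in bytes) that still holds a valid FAT32."""
    reserved = 2 * sector_size                # reserved area, as in the footnotes
    fats = 2 * 256 * 1024                     # two FAT copies of 256 KiB each
    data = cluster_size * MIN_FAT32_CLUSTERS  # minimum data area
    return reserved + fats + data


for cluster_size in (4096, 8192):
    size = min_fat32_bytes(cluster_size)
    print(cluster_size, size, round(size / MIB, 2))

# Does our 510 MiB EFI partition qualify?
print(510 * MIB >= min_fat32_bytes(4096))  # True
print(510 * MIB >= min_fat32_bytes(8192))  # False
```

So a 510 MiB partition can hold a valid FAT32 with 4 KiB clusters, but not with 8 KiB clusters.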