Trying to code on LF OS

Some days ago, I backported quite a lot of commits from the 9p feature branch of LF OS back to the main branch, including some important bugfixes. Everything worked fine on my machine, but it broke in CI because a command line flag worked great locally but not in the Docker image (this turned out to be a problem of different mtools versions; I have a random version on my laptop, not from the Debian repositories). Since this isn't the first time mtools annoyed me with such things, I decided to finally ditch it in favor of a small tool that creates a fresh FAT32 filesystem from a bunch of files and directories - think of it like tar, but with FAT32 as the output format. After not a lot of time, I had a working fatcreate tool and integrated it into LF OS' build system.

LF OS not booting

But it wouldn't boot. Sure, fatcreate didn't yet create a perfect FAT32 filesystem: the FS info block was not correct (showing more used clusters than there are), it didn't implement long file names, it didn't have a backup superblock … but nothing that should prevent booting. I found some problems in the dd command that adds the FAT32 image to the sgdisk-prepared hard disk image, but it still wouldn't boot. Curious. Frustrating.

I first tried starting QEMU manually, which worked fine. I then tried a debug build of OVMF (which I had lying around from debugging LF OS' bootloader on PXE); that also worked fine and didn't give any hints in its debug output. I then thought everything was alright and moved on - except it didn't work when I reset my build directory for a fresh build. It was broken again, using the Debian-provided OVMF.

At some point, I realized the block device wasn't even showing up in the UEFI shell - only BLK0, the CD-ROM drive, was there. I also confirmed that the debug build worked and the Debian-provided one was faulty. And I figured out why it worked when starting QEMU manually: I hadn't added the hard disk as an NVMe disk, but via the default interface, ATA. So the problem in my mind changed from "fatcreate does something really wrong" to "QEMU or OVMF have broken NVMe support!". I searched the internet, but only found posts from years ago, mostly about booting Windows from a PCI-passthrough'd NVMe SSD … nothing relevant at all. No matching bug reports in the Debian bug tracker for either QEMU or OVMF. Nothing. Damn, a problem I get to experience first.

Trying to figure it out

I knew it was working not long ago, so the change had to be recent. I looked at the changelog for QEMU (didn't see anything relevant), then dug through the history of QEMU's emulated NVMe controller - still nothing relevant. I then looked into OVMF, i.e. edk2, checking the commits to OvmfPkg, but again found nothing relevant. To pin the problem down to either QEMU or OVMF, I updated my local edk2 repository, built it and used that debug build to get some output for the broken state (I should have done this way earlier). Now I had a line of code that gives an error!

// MdeModulePkg/Bus/Pci/NvmExpressDxe/NvmExpressHci.c:764
if (Private->Cap.Css != 0x01) {
    DEBUG ((DEBUG_INFO, "NvmeControllerInit: the controller doesn't support NVMe command set\n"));
    return EFI_UNSUPPORTED;
}

So it thinks QEMU's NVMe controller isn't an NVMe controller? Strange. (I later learned that NVMe controllers can support different command sets, NVM being the command set you'd expect from a controller with usable storage attached.)

As said before, I didn't see any recent changes to either QEMU or OVMF - but maybe in the NVMe driver itself? Could it be UEFI that's broken?

Finding the problem and fixing it

NvmExpressDxe in MdeModulePkg really did have a recent change: back in January, someone added a new feature, support for boot partitions. NVMe boot partitions allow a system to boot with only a subset of NVMe implemented, e.g. booting without requiring another storage medium for the firmware (like a SPI flash).

Adding this required some changes to the existing structs for the NVMe registers, notably widening the attribute Css (CSS: Command Sets Supported, a bit mask of the command sets this controller supports) in Cap (Capabilities, a register through which NVMe controllers communicate their supported features) from 4 bits (with 7 reserved bits following it) to 8 bits (with the new boot partition support bit and two reserved bits following it).

Per the NVMe spec, the CSS field is 8 bits wide, so this part of the change is correct. Prior to NVMe 1.4, all of these bits were reserved except the LSB, which signals support for the NVM command set. Since NVMe 1.4, the MSB is defined as signaling "No I/O Command Set is supported (i.e., only the Admin Command Set is supported)".

QEMU's NVMe controller returns 0xC1 - LSB, MSB and MSB-1 set. This violates the NVMe 1.4 spec, because

  • the MSB cannot be set when any I/O command set is supported, and the set LSB shows support for the NVM command set
  • MSB-1 is reserved

MSB-1 is defined in NVMe 2.0, which QEMU seems to implement. The bit roughly means "I support I/O command sets other than the NVM command set, use the new feature to check which ones", or at least "I support the host selecting the command set to use with each command". Setting the MSB while the LSB is set still isn't allowed - but setting MSB-1 together with the LSB is allowed and would trigger this problem in UEFI, too.

What is the actual problem? The condition in the if checks for equality with 1, not whether bit 0 (the LSB) is set. This means the same thing as long as the NVMe controller only sets that one bit (which made enough sense, as all other bits were reserved when this driver was added to UEFI), and it accidentally kept working until now because bits 1 to 3 weren't used by any controller - but with the full 8 bits of the register now exposed, the check fails for NVMe 2.0 controllers setting MSB-1 (and also for controllers setting MSB and LSB, though that is still out of spec).

The fix is simple: change Private->Cap.Css != 0x01 to !(Private->Cap.Css & 0x01), an actual check for the LSB being set. This is the shortest and easiest fix and matches the error message, so I created the patch, tested it a few more times to make sure, and opened a pull request on GitHub. I also sent a summary and link to edk2-devel, but so far didn't receive any response - maybe mailing lists being broken by DMARC strikes again … For good measure, and being a good Debian testing user, I also reported the bug to Debian.