๐ŸŒ AIๆœ็ดข & ไปฃ็† ไธป้กต
|
|
Log in / Subscribe / Register

What the Nova GPU driver needs

By Daroc Alden
September 25, 2024

Kangrejos 2024

In March, Danilo Krummrich announced the new Nova GPU driver — a successor to Nouveau for controlling NVIDIA GPUs. At Kangrejos 2024, Krummrich gave a presentation about what it is, why it's needed, and where it's going next. Hearing about the needs of the driver provoked extended discussion on related topics, including what level of safety is reasonable to expect from drivers, given that they must interact with the hardware.

Krummrich started off by covering the motivation for the new driver. He had been working on multiple drivers that required a particular component, GPUVM, the dedicated memory-management system for some GPUs — but that component only had one core contributor. There are a few reasons that nobody had stepped up to help: the driver was reverse-engineered from the hardware, so there was little documentation, and the hardware itself is complicated. Now, Krummrich needed to add another complication: the new GPU System Processor (GSP), a CPU intended for low-latency GPU configuration that is included in some recent NVIDIA GPUs.

[Danilo Krummrich]

The parts of the driver outside of GPUVM and the GSP aren't as bad, Krummrich said. The DRM layer does not suffer as much from missing documentation — but some driver code is still hard to understand. For example, the addition of virtual memory management for Vulkan meant that the page table for GPU memory was implemented separately from the memory management. That doesn't work out well; in theory, lockups are possible, he explained. Ultimately, he determined it was better to have a clean cut between the GSP code and the legacy code.

And, if a clean cut is necessary, Rust is a good choice, he explained. Using Rust has the normal benefits for memory safety that people talk about, but there are other, more specific, reasons to pick it. The GSP firmware interface is unstable — the firmware generally works by placing messages in queues in shared memory, but the details are not guaranteed to remain the same from one version to another. This is partly because of how NVIDIA distributes changes: it can bundle a new firmware and new driver together, so it doesn't need to keep the interface stable.
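
To give a rough picture of what such an interface involves (the layout below is invented for illustration; the real one is undocumented and changes between firmware versions), a shared-memory command queue of this kind boils down to a small header plus a ring of variable-length messages:

    // Hypothetical sketch of a GSP-style shared-memory queue; the real
    // layout is version-dependent and not guaranteed to stay stable.
    #[repr(C)]
    struct QueueHeader {
        read_ptr: u32,  // next offset the consumer will read from
        write_ptr: u32, // offset just past the last message written
        size: u32,      // total size of the ring, in bytes
    }

    #[repr(C)]
    struct MsgHeader {
        function: u32, // which firmware operation is being requested
        length: u32,   // length of the payload that follows this header
    }

The driver appends a message at the write pointer and advances it; the firmware consumes from the read pointer. Change either structure between firmware releases, and every driver that hardcodes the old layout breaks.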

That approach doesn't work for the upstream kernel, Krummrich said. Once the kernel supports a firmware version, that support must be maintained. This is, of course, possible in C, but it becomes "really messy". Not every version changes everything, so there's lots of common code, with occasional version-specific hacks.

In C, this caused the code to slowly devolve into "macro hell". He hopes that with Rust, he and his collaborators can do something better — Rust's procedural macros are a lot more flexible, understandable, and maintainable. His proposed approach is to generate Rust structures from NVIDIA's C headers, and then generate separate code for each version implementing a common interface. Then the right version can be picked at run time.
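
A minimal sketch of the run-time-selection half of that plan might look like the following; the trait, type names, and version strings are all illustrative, not Nova's actual code:

    // Each firmware version gets its own generated implementation of a
    // common trait; the matching one is chosen once the driver knows
    // which firmware it is dealing with.
    trait GspFirmware {
        fn init_gsp(&self) -> Result<(), &'static str>;
    }

    struct FwVersionA; // stands in for one concrete firmware version
    struct FwVersionB; // stands in for a later, incompatible version

    impl GspFirmware for FwVersionA {
        fn init_gsp(&self) -> Result<(), &'static str> { Ok(()) }
    }

    impl GspFirmware for FwVersionB {
        fn init_gsp(&self) -> Result<(), &'static str> { Ok(()) }
    }

    fn select(version: &str) -> Option<Box<dyn GspFirmware>> {
        match version {
            "version-a" => Some(Box::new(FwVersionA)),
            "version-b" => Some(Box::new(FwVersionB)),
            _ => None, // unknown firmware: refuse rather than guess
        }
    }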

Discussion

Writing a complex graphics driver is a big task, though, so Krummrich has a plan to tackle it a step at a time. He noted that Rust drivers so far have run into a chicken-and-egg problem: abstractions and the users of those abstractions need to be merged at the same time. Asahi Linux had problems with that. So, for Nova, Krummrich wants to keep it simple and start with a stub driver, then take the time to work things out and improve it incrementally.
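
A stub driver in Rust starts out very small; the kernel tree's samples/rust/rust_minimal.rs shows the general shape, reproduced below in lightly adapted form (this is the generic pattern, not Nova's actual source):

    // The shape of a minimal Rust kernel module, after the kernel's
    // rust_minimal sample; the module details here are placeholders.
    use kernel::prelude::*;

    module! {
        type: NovaStub,
        name: "nova_stub",
        author: "Example Author",
        description: "Stub driver sketch",
        license: "GPL",
    }

    struct NovaStub;

    impl kernel::Module for NovaStub {
        fn init(_module: &'static ThisModule) -> Result<Self> {
            pr_info!("nova stub loaded\n");
            Ok(NovaStub)
        }
    }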

Paul McKenney agreed that getting work upstreamed can be difficult, but wondered if there was some way to get access to in-progress work before it was upstreamed. Krummrich said that there are some staging branches in the Rust-for-Linux tree and in the DRM tree. He maintains a branch that merges them all together on top of the latest kernel.

Maciej Falkowski asked to hear more details about how versioning using procedural macros works. Krummrich explained that the exact details are somewhat in flux, but that conceptually, for each function, they annotate parts of the code as only applying to a specific version. Then a macro picks and chooses code blocks to assemble the source for each version, and generates a trait implementation using those.
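
As a stand-in for the still-evolving procedural macros, a plain declarative macro can show the effect being aimed for: stamping out one implementation of a common trait per version, with the version-specific blocks swapped in (everything below is illustrative):

    // Illustrative only: generate one implementation of a shared trait
    // per firmware version; the real proc macros are far more capable.
    trait Versioned {
        fn msg_size() -> usize;
    }

    macro_rules! impl_versions {
        ($($ty:ident => $size:expr),* $(,)?) => {
            $(
                struct $ty;
                impl Versioned for $ty {
                    fn msg_size() -> usize { $size }
                }
            )*
        };
    }

    impl_versions! {
        VersionA => 64,  // hypothetical message size for one version
        VersionB => 128, // a later version that changed the size
    }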

Benno Lossin asked what Krummrich thought the core problems he was facing were, and how the other people in the room could help โ€” other than reviewing, which he knows always needs more people. The best thing would be to help get abstractions upstream, Krummrich said. "So, reviewing," Gary Guo replied. Krummrich agreed, to scattered chuckles.

Alice Ryhl had questions about the chicken-and-egg problem he had described; she had not run into it herself. For her, telling maintainers "here's this abstraction, I'll upstream the user later" has worked just fine. Krummrich said that for a few patch series, he had been explicitly asked to show a user. Ryhl suggested that it might help to reference specific patches that are not upstream yet.

Miguel Ojeda asked how much safe and unsafe code had been required so far. Krummrich thought it was a fair question, but that the answer might be misleading without the context of which features have actually been implemented yet. The current Nova stub can represent PCI buses, read and write some values, and use those to initialize the GPU, he said. It can also use a DRM abstraction to create an object representing a message queue, but it does not actually allocate memory or communicate with the GPU yet. And, in its current state, it does not use any unsafe code.

Lossin asked how much unsafe code Krummrich expected to need when the driver was complete. Ideally, unsafe code would only be needed for the firmware interface, where there is shared memory, Krummrich answered.

Carlos Bilbao was surprised that it was possible to get even that far with only safe code โ€” doesn't the driver need to memory-map registers? When the code sets up the PCI interface, the size of the base address registers (BARs) is known either at compile time or at run time from the PCI subsystem, Krummrich explained. So the Nova driver says "I need N bytes out of a PCI BAR", and this allocation either fails or succeeds. If it succeeds, it can be wrapped in a structure with bounds checking that ensures that once the device is unbound, the memory can no longer be accessed. That's a shared abstraction in the kernel crate, so the driver itself never needs unsafe code. As long as the abstraction is sound, the whole thing should be memory safe, albeit with an obvious caveat: the GPU can do whatever it likes in response to changes to memory-mapped registers, potentially including things that subvert Rust's guarantees.
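
A minimal sketch of such a checked wrapper appears below; it is illustrative only, since the actual kernel-crate abstraction is more involved (it also ties the mapping's validity to the device staying bound):

    // Illustrative bounds-checked BAR wrapper, not the kernel's real API.
    struct MappedBar {
        base: *mut u8, // start of the mapped region
        size: usize,   // the N bytes requested from the PCI BAR
    }

    impl MappedBar {
        fn read32(&self, offset: usize) -> Option<u32> {
            // Reject misaligned or out-of-bounds accesses up front.
            if offset % 4 != 0 || offset.checked_add(4)? > self.size {
                return None;
            }
            // SAFETY: the offset was checked against the mapping size,
            // and `base` points to a live mapping of `size` bytes.
            Some(unsafe {
                core::ptr::read_volatile(self.base.add(offset) as *const u32)
            })
        }
    }

Because every access goes through the checked methods, driver code gets MMIO without writing any unsafe blocks of its own; the one unsafe operation is confined to the shared abstraction.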

Andreas Hindborg said that he thought they had debated whether the bound on the size of a PCI BAR should be known at run time or compile time. On the GPU, the offset can't be known at compile time, Krummrich said, because the GPU itself tells you how big the allocation should be. Hindborg clarified that this meant that we were trusting the device to give accurate answers.

If the device gives a random address that makes no sense, it will fail the bounds check on the PCI BAR itself, Krummrich replied. Greg Kroah-Hartman clarified that this may be the case, but we're still trusting the hardware. Hindborg said he was convinced of the safety of the approach, however, pointing out that they know the size of the PCI BAR. "Sometimes," Kroah-Hartman ominously objected. Krummrich was of the opinion that if the device was going to lie about the size of the PCI BAR "we can't do anything about that".

Kroah-Hartman warned that this wasn't a hypothetical โ€” there are USB devices that lie about the size of the BAR, and use that to take over your kernel. That's why the kernel has trusted and untrusted modes for PCI (and USB, as well) โ€” do you trust the hardware or not? But he admitted that they do need to trust the hardware at some point.

This kicked off an extensive discussion about whether writing a filesystem that does not trust its underlying block device is possible, including whether there was a way to track data that has not been validated in the type system. After the session, Lossin followed up on that discussion by posting a patch set introducing an abstraction for tracking unvalidated data.

The conclusion in the room was that compile-time tracking like that would be useful, but that it certainly would not come without some work. What is potentially possible soon (with good API design) is making filesystem drivers that, when used with malicious filesystems, don't allow kernel exploits โ€” even if there's no practical way to prevent them from returning arbitrary bad data back to user space.
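
The shape of such an abstraction is roughly the following (the names are a sketch, not the exact API from Lossin's patches): data from an untrusted source is wrapped in a type that cannot be read until an explicit validation step has run.

    // Sketch of type-level tracking for unvalidated data; illustrative
    // names, not the exact API from the posted patch set.
    struct Untrusted<T>(T);

    impl<T> Untrusted<T> {
        fn new(value: T) -> Self {
            Untrusted(value)
        }

        // The only way to reach the inner value is through a validator,
        // so unvalidated data cannot be used by accident.
        fn validate<E>(self, check: impl FnOnce(&T) -> Result<(), E>) -> Result<T, E> {
            check(&self.0)?;
            Ok(self.0)
        }
    }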

[ Thanks to the Linux Foundation, LWN's travel sponsor, for supporting our coverage of Kangrejos. ]


Index entries for this article
Conference: Kangrejos/2024



Falsified PCI BAR size exploit?

Posted Sep 25, 2024 20:34 UTC (Wed) by stevie-oh (subscriber, #130795)

I assume it's an oversimplification, but I don't see how "lying about BAR size" is exploitable except as a denial-of-service attack.
I ran a few searches and came up empty.

Reporting an undersized BAR would simply mean that MMIO addresses beyond the falsely reported size would not be mapped and therefore inaccessible from anywhere else in the system.

Reporting an oversized BAR would mean that MMIO addresses would get mapped in the system that either do nothing when read or written, or alias other addresses.

In theory, reporting a lot of oversized BARs could starve the system of address space, preventing devices that connect later from working.

Also, to my knowledge, USB devices themselves don't have BARs, nor is there any equivalent; all communication with USB devices is with protocol messages, and any and all memory-mapped I/O is to the USB host controller itself, not the devices plugged into it.

Thunderbolt devices are PCIe devices and thus *do* have BARs. Since Thunderbolt versions 3 and up use USB-C connectors, I'm guessing that Kroah-Hartman was probably talking about these devices.

But (again, to my knowledge) that wasn't about BARs; it was about the fact that Thunderbolt, being PCIe, allows connected devices to directly issue DMA commands. Without a properly configured IOMMU between the Thunderbolt port and the rest of the system, a Thunderbolt device could DMA to/from arbitrary system RAM (or possibly even other connected devices).

However, apart from the responsibility of configuring the IOMMU correctly (if one even exists), the kernel cannot prevent this behavior, regardless of the language it's written in.

Falsified PCI BAR size exploit?

Posted Sep 25, 2024 21:01 UTC (Wed) by daroc (editor, #160859)

I believe the exploit in question looks like the hardware saying to write something to a ridiculously large offset from the base address, causing the driver to overwrite a different part of kernel memory in a different mapping. That is obviously prevented by having a bounds check on accesses to PCI memory, but if the bound being checked also comes from the device, it's not obvious that this will always help.

In practice, you may very well be right that if the device lies about the size of its BAR that this will just result in it being given a larger chunk of virtual memory; I am not particularly well-acquainted with how the kernel lays out the memory for PCI devices.

Falsified PCI BAR size exploit?

Posted Sep 25, 2024 22:13 UTC (Wed) by intelfx (subscriber, #130118)

> I believe the exploit in question looks like the hardware saying to write something to a ridiculously large offset from the base address, causing the driver to overwrite a different part of kernel memory in a different mapping. That is obviously prevented by having a bounds check on accesses to PCI memory, but if the bound being checked also comes from the device, it's not obvious that this will always help.

Forgive me if I'm wrong, but doesn't this just mean that the bound and the allocation must be coherent with each other?

If a device reports a BAR size "X", it means that the kernel 1) allocates a memory region [base; base+X) to the device, and 2) remembers "X" for further bounds checking.

Then, if a device causes the driver to access offset Y of its BAR, there are two options: 1) Y < X and the access is a) allowed and b) routed to the device over the PCIe bus; and 2) Y >= X and the access is disallowed on the grounds of failing the bounds check.

I don't see how it could ever be possible to cause the kernel to access an unrelated physical address, given that the same bound "X" is used both as the mapping size and as the bound for all subsequent accesses.

Falsified PCI BAR size exploit?

Posted Sep 26, 2024 13:15 UTC (Thu) by daroc (editor, #160859)

Yes, I believe you're right! If the bound and the allocation are the same, there's no issue. But the kernel is complex, and I wouldn't want to assume without checking that that's always the case.

I don't know if anyone has looked into this further since then, but I think that at the time, Hindborg felt that having the bounds check was sufficient, and Kroah-Hartman wasn't completely convinced. Both of them have far more exposure to the PCI code than I do, so I'm willing to accept that it is at least worth checking to see whether the assumption holds.

Falsified PCI BAR size exploit?

Posted Sep 26, 2024 15:22 UTC (Thu) by farnz (subscriber, #17727)

Worth noting that PCIe doesn't track the bound and allocation separately; instead, the BAR is naturally aligned by definition (even with Resizable BAR Capability support), and you determine the bound by trying to force it to be misaligned, and seeing which bits the device masks out. However, when you assign a base, you don't have to set any bits to non-zero values; as a result, it's permitted to put a 4 KiB BAR at 0x80000000.

As a result, the only way for X to be incorrect when used for bounds checking is if the kernel has a bug remembering it - there is no reliable way to read back either X, or base + X, from the device, only to read back base.

Resizable BAR changes this slightly, because you can use the Resizable BAR Control Register to change the size (and thus required alignment) of the region, and this should be reflected by zeroing lower bits of the BAR immediately; but the kernel should be controlling access to that register, and ensuring that it can only be touched at a point where the kernel is ready to rewrite the BAR anyway.
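
Sketched in Rust, that probing sequence looks roughly like this for a 32-bit memory BAR (the accessor arguments are placeholders for config-space reads and writes of the BAR register):

    // Simplified sketch of sizing a 32-bit memory BAR.
    fn bar_size(read_bar: impl Fn() -> u32, write_bar: impl Fn(u32)) -> u32 {
        let original = read_bar();
        write_bar(0xffff_ffff); // try to set every address bit
        let readback = read_bar();
        write_bar(original);    // restore the real base address
        // The device forces the low-order address bits to zero; the
        // masked bits encode the region's size and natural alignment.
        let mask = readback & 0xffff_fff0; // drop the low flag bits
        (!mask).wrapping_add(1)
    }

Since only the base can be read back afterward, the kernel has to remember the size derived here; there is nothing to re-check it against later.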

Falsified PCI BAR size exploit?

Posted Sep 25, 2024 23:55 UTC (Wed) by ejr (subscriber, #51652)

Not even an exploit... I know of some FPGA PCIe "IP" from a major vendor with utterly buggy, nearly unpredictable BAR interactions. Sometimes the sh...tuff people ship for large price-tags is astounding.

Something other than memory safety

Posted Sep 26, 2024 9:52 UTC (Thu) by kleptog (subscriber, #1183)

Good to hear someone citing, as a reason for Rust, the higher levels of abstraction it supports. C preprocessor macros get extremely hairy very quickly and are still limited.

I guess a version of C++ with templating but without some of the more complicated stuff (exceptions) would have worked too, but that doesn't exist.

Once the kernel supports a firmware version, that support must not always be maintained.

Posted Nov 19, 2024 5:39 UTC (Tue) by marcH (subscriber, #57642)

"That approach doesn't work for the upstream kernel, Krummrich said. Once the kernel supports a firmware version, that support must be maintained."
I don't think it's that clear-cut. Some versions of Intel GPU firmware are clearly hardcoded:

drivers/gpu/drm/i915/intel_csr.c-#define I915_CSR_SKL "i915/skl_dmc_ver1_27.bin"

Now good luck dealing with that in backports and other kernel branches - I digress. Or do I?

