Handling filesystem interruptibility

By Jake Edge
August 5, 2024

At the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, David Howells led a session on changing the way filesystem code handles the ability to interrupt or kill operations, in order to fix some longstanding problems with network (and other) filesystems. As noted in his session proposal, some filesystems may expect not to be interruptible, but they call code that can take locks and mutexes in interruptible (or killable) mode, which effectively changes the state of the task incorrectly. He would like to find a solution for that problem.

The interruptibility here refers to signal handling. An interruptible process will respond to any signals that are not masked or ignored. Killable is a variant of interruptible that will only respond to fatal signals.

There are multiple places with locks and such that could be taken using the *_interruptible() and *_killable() variants, but those override the higher-level non-interruptible setting. Some kind of mass change is not really practical to address the problem, Howells said, so it will need to be done incrementally. He proposed a multi-year effort to switch to explicit begin and end functions to bracket non-interruptible regions, in a way that is analogous to how hardware interrupts are disabled. Code could disable interruptibility, which would be tracked with a counter, then reenable it when the critical section is finished.
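
The counter-based bracketing can be pictured with a small user-space model. This is only a sketch: the article names no functions, so nointr_begin(), nointr_end(), struct task_sketch, and interruptible_wait_sketch() are all hypothetical, and none of this is kernel code or Howells's actual patch. It shows only the nesting behavior, where an interruptible wait honors a pending signal unless some caller further up has opened a non-interruptible region.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not a real kernel structure. */
struct task_sketch {
    unsigned int nointr_depth;  /* > 0 while inside a non-interruptible region */
    bool signal_pending;        /* models a signal waiting to be delivered */
};

/* Open a non-interruptible region; regions may nest. */
void nointr_begin(struct task_sketch *t)
{
    t->nointr_depth++;
}

/* Close the innermost region; an unbalanced call is ignored. */
void nointr_end(struct task_sketch *t)
{
    if (t->nointr_depth > 0)
        t->nointr_depth--;
}

/*
 * Models the decision an *_interruptible() primitive would make:
 * give up with -EINTR only when a signal is pending AND no caller
 * has asked for interruptions to be held off.
 */
int interruptible_wait_sketch(const struct task_sketch *t)
{
    if (t->signal_pending && t->nointr_depth == 0)
        return -EINTR;
    return 0;
}
```

In this model, a filesystem that must not be interrupted would call nointr_begin() before descending into lower layers and nointr_end() on the way out, much like a paired interrupt disable and enable.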

For example, an overlayfs filesystem might include a network filesystem as one of its layers. The overlayfs might not take interruptible locks, but the network filesystem might do so, which results in operations that get interrupted in a way that overlayfs does not expect. Ted Ts'o thought that a change like what was described might be useful in some contexts, but did not think that "it would be something we would want to use all over the place". The interruptible status of a particular mutex, for example, is local to the code that takes it. Unlike the effort to switch to GFP_NOFS, where the eventual plan is to convert everything to use it, this change would only be needed for specific calls.

Both Kent Overstreet and Dave Chinner asked for more concrete examples of the problem being solved and how the code would need to change to accommodate Howells's proposal. The biggest problem he has encountered, Howells said, is that sendmsg() is interruptible, but that an NFS filesystem might be mounted as non-interruptible. "NFS thinks it is not interruptible, but it is because it is using the network interfaces that are." He noted that a conversion would eventually mean that many of the interruptible (and killable) variants of lock and mutex functions could be removed.

Chinner and others objected to that, saying that there will still be a need for those variants. There were also various objections because many of the calls to mutex_lock_interruptible() are not checked for an error return, though multiple people were talking at once, which made the discussion somewhat hard to follow. Al Viro was also concerned about deadlocks resulting from the changes proposed.

Viro said that handling signals (such as from someone using control-C) is the responsibility of the caller of the network function; an NFS mount with -o hard does not want or expect its operations to be interrupted, though, Howells said. However, calling mutex_lock_interruptible() is only applying the interruptibility to that specific call, Wedson Almeida Filho and Ts'o said, not to the whole region between it and the unlock call. Ts'o said that without a specific patch changing a particular code path where there is a problem, it will be difficult for attendees to determine whether it makes sense or not; meanwhile, he reiterated that he did not see a justification for a widespread change.

Instead of having a call to bracket the regions of non-interruptibility, Viro asked, why not just disable signals for the region? But Howells said that SIGKILL cannot be masked, though Christian Brauner pointed out that the kernel can mask that signal even though user space cannot.
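
The user-space half of that exchange is easy to check. The following is a minimal sketch, assuming a POSIX system and a single-threaded caller; stays_blocked() is a hypothetical helper name. It asks to block every signal and then reads back the mask: POSIX specifies that an attempt to block SIGKILL (or SIGSTOP) is silently ignored, so the call succeeds but SIGKILL is absent from the resulting mask. It says nothing about what the kernel can do internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Ask to block every signal, then report whether `signo` really ended
 * up in the caller's signal mask. Returns 1 if blocked, 0 if not,
 * -1 on error. The original mask is restored before returning.
 */
int stays_blocked(int signo)
{
    sigset_t all, old, cur;
    int blocked;

    sigfillset(&all);
    if (sigprocmask(SIG_BLOCK, &all, &old) != 0)
        return -1;

    /* A NULL "set" argument only queries the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        blocked = -1;
    else
        blocked = sigismember(&cur, signo);

    sigprocmask(SIG_SETMASK, &old, NULL);
    return blocked;
}
```

Calling stays_blocked(SIGINT) reports that the signal was blocked, while stays_blocked(SIGKILL) reports that it was not, even though both requests return success.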

Jan Kara agreed with the overall approach, saying that there is a real problem for callers who do not expect to get interrupted. But Brauner was concerned about how someone looking at sendmsg(), which is clearly interruptible, would be able to recognize that in some contexts it can be called in such a way that it is not interruptible. Howells acknowledged that could be a problem.

Chinner suggested having a variant of sendmsg() that is not interruptible, but Howells said that there are multiple calls like sendmsg() that are affected. "The documentation of the uninterruptible state is completely decoupled from where we need to apply that state", Chinner said. It would require large comments wherever these functions are being called, describing how that can happen and what code paths are affected. "Otherwise it is unmaintainable."

Viro said that it sounds to him like what Howells wants is to be able to suspend signal delivery in the network code at times. The TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE states are for sleeping processes, Viro continued, which get changed when the task gets woken up, but the state that is really desired "smells like 'I want signal delivery suspended'" until the end of the code region.

Ts'o agreed, noting that the change could be done without adding new infrastructure and a task flag. Howells said that manipulating the signal mask would affect other threads in the process, though, which Ts'o acknowledged as a problem. Viro said that an alternative might be to simply skip the thread in question when doing signal delivery; "basically it is a 'don't bother me'".

The session ran out of time as that was being discussed, but the picture that emerged is that patches are needed to focus the discussion. As of yet, there is no video for this session in the 2024 LSFMM+BPF playlist at YouTube.

Why not interruptible?

Posted Aug 6, 2024 11:40 UTC (Tue) by make (subscriber, #62794)

It annoys me when processes become unkillable just because the NFS server or network is flaky. And one slow NFS operation can bring down the whole userspace because it's holding the inode lock...

Okay, there's a lot of legacy kernel code that cannot deal with interruptions; but if we're talking about a big multi-year effort, why bother with such a kludge for legacy code - instead of making the whole kernel interruptible - or better - non-blocking?

Network filesystems such as Ceph are already implemented in a non-blocking way, but the VFS layer forces them to wrap everything inside blocking calls. So if you do an asynchronous io_uring read, io_uring will call the blocking VFS read in a worker thread, which will then do asynchronous I/O inside the Ceph code - which combines the disadvantages of blocking and non-blocking I/O. You get the combined overhead of both, but I/O is still not interruptible/cancellable.

Why not interruptible?

Posted Aug 6, 2024 18:42 UTC (Tue) by Wol (subscriber, #4433)

Dunno how this would work, but an obvious mechanism that *could* work with asynchronous reads is "abandon".

If you've got an outstanding i/o you can't cancel, you flag it as abandoned, and when it comes back to io_uring or wherever to be processed, it detects it's abandoned and just throws it away.

Cheers,
Wol

Why not interruptible?

Posted Aug 7, 2024 9:01 UTC (Wed) by Sesse (subscriber, #53779)

Generally this isn't really a problem anymore; you can kill -9 processes waiting on NFS now. You cannot _interrupt_ them with random signals because so much code out there isn't written with the assumption that something as simple as a read() or getdents() returns EINTR and might be restarted. (Well, you can if you mount with -o intr, but that's probably not a good idea unless you want to lose reads and writes.)
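
For illustration, this is roughly what coping with EINTR looks like in user-space code. It is a minimal sketch assuming a POSIX system; read_retry() is a hypothetical helper name, not a standard function. Programs that call read() directly and treat any -1 return as a hard failure are the ones that would break if such calls started being interrupted.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Like read(), but restarts the call when it was interrupted by a
 * signal before any data was transferred (read() returned -1 with
 * errno set to EINTR). Other errors and short reads are passed
 * through to the caller unchanged.
 */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;

    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);

    return n;
}
```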

Why not interruptible?

Posted Aug 7, 2024 11:58 UTC (Wed) by joib (subscriber, #8541)

The intr/nointr options have been no-ops for many many years.

make it killable, please

Posted Aug 6, 2024 21:14 UTC (Tue) by amarao (guest, #87073)

As an operator I hate uninterruptible processes. Why can't I kill it? The filesystem is dead, the block device under it is dead (literally, unplugged), but the process is not and won't be killed until I reboot the OS.

make it killable, please

Posted Aug 6, 2024 21:43 UTC (Tue) by willy (subscriber, #9762)

The answer to "why can't I kill it" is usually to be found in `cat /proc/$pid/stack`. That will tell you where it is sleeping. If you look at that function, you'll see something like a mutex_lock(). Change it to mutex_lock_killable() and handle the -EINTR return correctly. Then send a patch.

Or send an email pointing at the offender to the appropriate mailing list (probably linux-fsdevel), and ask someone to do it for you.

These things pretty much have to be found and fixed one by one. It's a lot of work to unroll some of these error cases, so nobody wants to do it for ones which don't matter.

A bit confused..

Posted Aug 7, 2024 3:24 UTC (Wed) by neilbrown (subscriber, #359)

The "-o hard" mount option is not related to interrupts. It relates to timeouts waiting for a reply from the server. "-o hard" means "Retry indefinitely". "-o soft" means "abort after the configured retries". [Don't use -o soft when you value data]

The old "-o intr" mount option was related to interrupts. It doesn't do anything any more. NFS can always(*) be killed by a fatal signal, and non-fatal signals are always ignored.

(* - there are believed to be some places in VFS/MM code which wait non-killably. NFS cannot fix that. As willy says, they need to be found and fixed. That is a separate issue).

sendmsg is interruptible - but only if it is told to wait. NFS (via net/sunrpc) always sets MSG_DONTWAIT. So sock_sendmsg() when called for NFS never reacts to a signal. If the send fails due to lack of buffer space (the only time it might abort if there is a signal), EAGAIN is returned to the state-machine in net/sunrpc/clnt.c which will retry or abort depending on context. If it wants to retry it can get a notification when space is available. It might abort due to a pending signal, but only if NFS wants that. If NFS ever wants the request to never-ever abort, even with a fatal signal, it runs the request from a separate kernel thread.

So I wonder what the real problem is here.

A bit confused..

Posted Aug 7, 2024 4:56 UTC (Wed) by willy (subscriber, #9762)

If you're up for fixing something ...

Last time I looked at this, when NFS closes a file, it calls fsync(). So if you try to kill a process that is in the middle of a write(), the write aborts but when the process dies, it closes all its files and so it hangs in the fsync code.

I gave up trying to fix this, but the easy way to test is to add an IP route that black holes all traffic to your NFS server's address. Instant flaky NFS server!

A bit confused..

Posted Aug 7, 2024 7:12 UTC (Wed) by neilbrown (subscriber, #359)

> Last time I looked at this, when NFS closes a file, it calls fsync(). .....

I see it blocking in

[<0>] folio_wait_writeback+0x22/0xc0
[<0>] __filemap_fdatawait_range+0x79/0xf0
[<0>] filemap_write_and_wait_range+0x83/0xb0
[<0>] nfs_wb_all+0x3f/0x1c0
[<0>] nfs4_file_flush+0x71/0xa0

and I wonder why __filemap_fdatawait_range() doesn't use folio_wait_writeback_killable().

I guess it would need to return an error and current callers don't expect one. So maybe add filemap_write_and_wait_range_killable() which nfs_wb_all_killable() could call and could be called by nfs4_file_flush and nfs_file_flush (which are only ever called in the close() path).

However that doesn't work because __fatal_signal_pending() doesn't return True in a process which has exited. Maybe it should check if PF_SIGNALED is set.

Maybe I should post patches.