path: root/fs
Age | Commit message | Author
2025-03-10 | nfsd: remove unneeded forward declaration of nfsd4_mark_cb_fault() | Jeff Layton
This isn't needed. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: reorganize struct nfs4_delegation for better packing | Jeff Layton
Move dl_type field above dl_time, which shaves 8 bytes off this struct. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
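The saving comes from alignment padding. A minimal sketch of the effect, assuming a 64-bit ABI where 8-byte integers are 8-byte aligned; the two structs and their `stid` and `flags` members are hypothetical stand-ins, not the real layout of struct nfs4_delegation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout: each 32-bit field is followed by padding. */
struct deleg_before {
	uint64_t stid;   /* stand-in for earlier 8-byte-aligned members */
	uint32_t flags;  /* 4 bytes, then 4 bytes of padding */
	int64_t  time;   /* must start on an 8-byte boundary */
	uint32_t type;   /* 4 bytes, then 4 bytes of tail padding */
};

/* Hypothetical "after" layout: the two 32-bit fields share one 8-byte slot. */
struct deleg_after {
	uint64_t stid;
	uint32_t flags;
	uint32_t type;
	int64_t  time;
};

/* Bytes saved by moving the 32-bit field above the 64-bit one. */
static inline size_t bytes_saved(void)
{
	return sizeof(struct deleg_before) - sizeof(struct deleg_after);
}
```

On ABIs where 64-bit integers only need 4-byte alignment (for example 32-bit x86), the two layouts may end up the same size. In the kernel tree, `pahole` is the usual tool for inspecting holes like this.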
2025-03-10 | nfsd: handle errors from rpc_call_async() | Jeff Layton
It's possible for rpc_call_async() to fail (mainly due to memory allocation failure). If it does, there isn't much recourse other than to requeue the callback and try again later. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: move cb_need_restart flag into cb_flags | Jeff Layton
Since there is now a cb_flags word, use a new NFSD4_CALLBACK_REQUEUE flag in that instead of a separate boolean. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: replace CB_GETATTR_BUSY with NFSD4_CALLBACK_RUNNING | Jeff Layton
These flags serve essentially the same purpose and get set and cleared at the same time. Drop CB_GETATTR_BUSY and just use NFSD4_CALLBACK_RUNNING instead. For this to work, we must use clear_and_wake_up_bit(), but doing that for other types of callbacks is wasteful. Declare a new NFSD4_CALLBACK_WAKE flag in cb_flags to indicate that wake_up is needed, and only set that for CB_GETATTRs. Also, make the wait use a TASK_UNINTERRUPTIBLE sleep. This is done in the context of an nfsd thread, and it should never need to deal with signals. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: eliminate cl_ra_cblist and NFSD4_CLIENT_CB_RECALL_ANY | Jeff Layton
deleg_reaper() will walk the client_lru list and put any suitable entries onto "cblist" using the cl_ra_cblist pointer. It then walks the objects outside the spinlock and queues callbacks for them. None of the operations that deleg_reaper() does outside the nn->client_lock are blocking operations. Just queue their workqueue jobs under the nn->client_lock instead. Also, the NFSD4_CLIENT_CB_RECALL_ANY and NFSD4_CALLBACK_RUNNING flags serve an identical purpose now. Drop the NFSD4_CLIENT_CB_RECALL_ANY flag and just use the one in the callback. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: prevent callback tasks running concurrently | Jeff Layton
The nfsd4_callback workqueue jobs exist to queue backchannel RPCs to rpciod. Because they run in different workqueue contexts, the rpc_task can run concurrently with the workqueue job itself, should it become requeued. This is problematic as there is no locking when accessing the fields in the nfsd4_callback. Add a new unsigned long to nfsd4_callback and declare a new NFSD4_CALLBACK_RUNNING flag to be set in it. When attempting to run a workqueue job, do a test_and_set_bit() on that flag first, and don't queue the workqueue job if it returns true. Clear NFSD4_CALLBACK_RUNNING in nfsd41_destroy_cb(). This also gives us a more reliable mechanism for handling queueing failures in codepaths where we have to take references under spinlocks. We can now do the test_and_set_bit on NFSD4_CALLBACK_RUNNING first, and only take references to the objects if that returns false. Most of the nfsd4_run_cb() callers are converted to use this new flag or the nfsd4_try_run_cb() wrapper. The main exception is the callback channel probe, which has its own synchronization. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
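The guard described above can be sketched in userspace with C11 atomics. This is an illustration only: `struct callback`, `CB_RUNNING`, `try_run_cb()` and `destroy_cb()` are hypothetical names, and `atomic_fetch_or()` stands in for the kernel's test_and_set_bit().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define CB_RUNNING (1UL << 0)  /* hypothetical bit, stands in for NFSD4_CALLBACK_RUNNING */

struct callback {
	atomic_ulong flags;
	int queued;  /* counts how many times the job was actually queued */
};

/* Userspace analogue of test_and_set_bit(): returns the bit's previous state. */
static bool test_and_set_flag(atomic_ulong *word, unsigned long mask)
{
	return (atomic_fetch_or(word, mask) & mask) != 0;
}

/* Queue the job only if it is not already running or queued. */
static bool try_run_cb(struct callback *cb)
{
	if (test_and_set_flag(&cb->flags, CB_RUNNING))
		return false;  /* someone else owns the callback right now */
	cb->queued++;          /* stands in for queueing the workqueue job */
	return true;
}

/* Clearing the flag is the last step, mirroring the destroy path. */
static void destroy_cb(struct callback *cb)
{
	atomic_fetch_and(&cb->flags, ~CB_RUNNING);
}
```

Because the flag is tested and set in one atomic step, a caller that must take references under a spinlock can do so only after `try_run_cb()`-style ownership has been won, which is the ordering the commit message describes.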
2025-03-10 | nfsd: disallow file locking and delegations for NFSv4 reexport | Mike Snitzer
We do not and cannot support file locking with NFS reexport over NFSv4.x for the same reason we don't do it for NFSv3: NFS reexport server reboot cannot allow clients to recover locks because the source NFS server has not rebooted, and so it is not in grace. Since the source NFS server is not in grace, it cannot offer any guarantees that the file won't have been changed between the locks getting lost and any attempt to recover/reclaim them. The same applies to delegations and any associated locks, so disallow them too. Clients are no longer allowed to get file locks or delegations from a reexport server; any attempt will fail with "operation not supported". Update the "Reboot recovery" section accordingly in Documentation/filesystems/nfs/reexport.rst Signed-off-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: filecache: drop the list_lru lock during lock gc scans | NeilBrown
Under a high NFSv3 load with lots of different files being accessed, the LRU list of garbage-collectable files can become quite long. Asking list_lru_scan_node() to scan the whole list can result in a long period during which a spinlock is held, blocking the addition of new LRU items. So ask list_lru_scan_node() to scan only a few entries at a time, and repeat until the scan is complete. If the shrinker runs between two consecutive calls of list_lru_scan_node() it could invalidate the "remaining" counter which could lead to premature freeing. So add a spinlock to avoid that. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
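The batching idea can be sketched without any kernel APIs. In this hedged illustration, `struct fake_lock`, `scan_some()` and `scan_in_batches()` are hypothetical; the "lock" only records how often it was taken and how many entries were visited per hold, and the extra spinlock that protects the "remaining" counter against the shrinker is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Bookkeeping stand-in for the list_lru spinlock. */
struct fake_lock {
	unsigned long acquisitions;  /* how many times the lock was taken */
	size_t longest_hold;         /* most entries visited during one hold */
};

/* Visit at most nr_to_scan entries under one lock hold; returns the count. */
static size_t scan_some(struct fake_lock *lock, size_t *remaining,
			size_t nr_to_scan)
{
	size_t n = *remaining < nr_to_scan ? *remaining : nr_to_scan;

	lock->acquisitions++;        /* lock taken here */
	if (n > lock->longest_hold)
		lock->longest_hold = n;
	*remaining -= n;             /* "process" n entries */
	return n;                    /* lock dropped on return */
}

/* Scan the whole list, but never hold the lock for more than one batch. */
static void scan_in_batches(struct fake_lock *lock, size_t total, size_t batch)
{
	size_t remaining = total;

	assert(batch > 0);           /* a zero batch would never make progress */
	while (remaining > 0)
		scan_some(lock, &remaining, batch);
}
```

Scanning 100 entries in batches of 16 takes the lock seven times, but no single hold covers more than 16 entries, so new LRU additions are delayed by one batch at most.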
2025-03-10 | nfsd: filecache: don't repeatedly add/remove files on the lru list | NeilBrown
There is no need to remove a file from the lru every time we access it, and then add it back. It is sufficient to set the REFERENCED flag every time we put the file. The order in the lru of REFERENCED files is largely irrelevant as they will all be moved to the end. With this patch, files are added only when they are allocated (if want_gc) and they are removed only by the list_lru_(shrink_)walk callback or when forcibly removing a file. This should reduce contention on the list_lru spinlock(s) and reduce memory traffic a little. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: filecache: introduce NFSD_FILE_RECENT | NeilBrown
The filecache lru is walked in 2 circumstances for 2 different reasons.
1/ When called from the shrinker we want to discard the first few entries on the list, ignoring any with NFSD_FILE_REFERENCED set because they should really be at the end of the LRU as they have been referenced recently. So those ones are ROTATED.
2/ When called from the nfsd_file_gc() timer function we want to discard anything that hasn't been used since before the previous call, and mark everything else as unused at this point in time.
Using the same flag for both of these can result in some unexpected outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then nfsd_file_gc() will think the file hasn't been used in a while, while really it has. I think it is easier to reason about the behaviour if we instead have two flags.
NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please put it there when convenient".
NFSD_FILE_RECENT means "this has been used recently - since the last run of nfsd_file_gc()".
When either caller finds an NFSD_FILE_REFERENCED entry, that entry should be moved to the end of the LRU and the flag cleared. This can safely happen at any time. The actual order on the lru might not be strictly least-recently-used, but that is normal for linux lrus. The shrinker callback can ignore the "recent" flag. If it ends up freeing something that is "recent" that simply means that memory pressure is sufficient to limit the acceptable cache age to less than the nfsd_file_gc frequency. The gc callback should primarily focus on NFSD_FILE_RECENT. It should free everything that doesn't have this flag set, and should clear the flag on everything else. When it clears the flag it is convenient to clear the "REFERENCED" flag and move to the end of the LRU too. With this, calls from the shrinker do not prematurely age files. It will focus only on freeing those that are least recently used.
Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
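The two-flag policy described in this entry can be sketched as a pair of visit functions. This is a reading of the commit message, not the kernel code: the bit values, `enum lru_action`, `shrinker_visit()` and `gc_visit()` are hypothetical, and it assumes the gc path falls back to the shared REFERENCED handling when RECENT is not set.

```c
#include <assert.h>

#define FILE_REFERENCED (1u << 0)  /* hypothetical bit values */
#define FILE_RECENT     (1u << 1)

enum lru_action { ACTION_ROTATE, ACTION_FREE };

/* Shrinker path: rotate REFERENCED entries, free the rest, ignore RECENT. */
static enum lru_action shrinker_visit(unsigned int *flags)
{
	if (*flags & FILE_REFERENCED) {
		*flags &= ~FILE_REFERENCED;
		return ACTION_ROTATE;
	}
	return ACTION_FREE;
}

/* GC path: RECENT entries survive one more period and lose both flags. */
static enum lru_action gc_visit(unsigned int *flags)
{
	if (*flags & FILE_RECENT) {
		*flags &= ~(FILE_RECENT | FILE_REFERENCED);
		return ACTION_ROTATE;
	}
	/* Assumption: otherwise behave like the shrinker callback. */
	return shrinker_visit(flags);
}
```

The point of the split is visible in the first function: a shrinker pass clears only REFERENCED, so a later gc pass still sees RECENT and does not treat the file as idle.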
2025-03-10 | nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc() | NeilBrown
list_lru_walk() is only useful when the aim is to remove all elements from the list_lru. It will repeatedly visit rotated elements of the first per-node sublist before proceeding to subsequent sublists. This patch changes nfsd_file_gc() to use list_lru_walk_node() and list_lru_count_node() on each NUMA node. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() | NeilBrown
nfsd_file_close_inode_sync() contains an exact copy of nfsd_file_dispose_list(). This patch removes that copy and calls nfsd_file_dispose_list() instead. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | NFSD: Re-organize nfsd_file_gc_worker() | Chuck Lever
Dave opines: IMO, there is no need to do this unnecessary work on every object that is added to the LRU. Changing the gc worker to always run every 2s and check if it has work to do like so:

     static void
     nfsd_file_gc_worker(struct work_struct *work)
     {
    -	nfsd_file_gc();
    -	if (list_lru_count(&nfsd_file_lru))
    -		nfsd_file_schedule_laundrette();
    +	if (list_lru_count(&nfsd_file_lru))
    +		nfsd_file_gc();
    +	nfsd_file_schedule_laundrette();
     }

means that nfsd_file_gc() will be run the same way and have the same behaviour as the current code. When the system is idle, it does a list_lru_count() check every 2 seconds and goes back to sleep. That's going to be pretty much unnoticeable on most machines that run NFS servers. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: filecache: remove race handling. | NeilBrown
The race that this code tries to protect against is not interesting. The code is problematic as we access the "nf" after we have given our reference to the lru system. While that takes 2+ seconds to free things, it is still poor form. The only interesting race I can find would be with nfsd_file_close_inode_sync(). This is the only place that really doesn't want the file to stay on the LRU when unhashed (which is the direct consequence of the race). However for the race to happen, some other thread must own a reference to a file and be putting it while nfsd_file_close_inode_sync() is trying to close all files for an inode. If this is possible, that other thread could simply call nfsd_file_put() a little bit later and the result would be the same: not all files are closed when nfsd_file_close_inode_sync() completes. If this was really a problem, we would need to wait in close_inode_sync for the other references to be dropped. We probably don't want to do that. So it is best to simply remove this code. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | fs: nfs: acl: Avoid -Wflex-array-member-not-at-end warning | Gustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. So, in order to avoid ending up with a flexible-array member in the middle of other structs, we use the `struct_group_tagged()` helper to create a new tagged `struct posix_acl_hdr`. This structure groups together all the members of the flexible `struct posix_acl` except the flexible array. As a result, the array is effectively separated from the rest of the members without modifying the memory layout of the flexible structure. We then change the type of the middle struct member currently causing trouble from `struct posix_acl` to `struct posix_acl_hdr`. We also want to ensure that when new members need to be added to the flexible structure, they are always included within the newly created tagged struct. For this, we use `static_assert()`. This ensures that the memory layout for both the flexible structure and the new tagged struct is the same after any changes. This approach avoids having to implement `struct posix_acl_hdr` as a completely separate structure, thus preventing having to maintain two independent but basically identical structures, closing the door to potential bugs in the future. We also use `container_of()` whenever we need to retrieve a pointer to the flexible structure, through which we can access the flexible-array member, if necessary. So, with these changes, fix the following warning: fs/nfs_common/nfsacl.c:45:26: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
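The pattern can be sketched in userspace with the union written out by hand, which is roughly what `struct_group_tagged()` arranges. Everything here is hypothetical: `struct acl`, `struct acl_hdr`, `struct acl_with_storage` and `acl_from_hdr()` are illustrative names, the `container_of()` macro is a simplified version of the kernel helper, and the fields are not those of the real struct posix_acl.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace container_of(); the kernel macro adds type checks. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Flexible structure: the header members are reachable both directly
 * (acl->count) and through the tagged struct acl_hdr (acl->hdr.count). */
struct acl {
	union {
		struct { int count; int refs; };
		struct acl_hdr { int count; int refs; } hdr;
	};
	int entries[];  /* flexible array member, must stay last */
};

/* Breaks the build if a member is added outside the grouped header. */
static_assert(offsetof(struct acl, entries) == sizeof(struct acl_hdr),
	      "struct acl members must all live inside struct acl_hdr");

/* Embedding only the header keeps the flexible array out of the middle. */
struct acl_with_storage {
	struct acl_hdr acl;
	int storage[4];  /* backing storage that follows the header */
};

/* Recover the flexible structure from a pointer to its header. */
static struct acl *acl_from_hdr(struct acl_hdr *hdr)
{
	return container_of(hdr, struct acl, hdr);
}
```

The wrapper no longer contains a flexible-array type in the middle, so the warning goes away, while code that needs the entries can still reach them through `acl_from_hdr()`.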
2025-03-10 | NFSD: Fix callback decoder status codes | Chuck Lever
fs/nfsd/nfs4callback.c implements a callback client. Thus its XDR decoders are decoding replies, not calls. NFS4ERR_BAD_XDR is an on-the-wire status code that reports that the client sent a corrupted RPC /call/. It's not used as the internal error code when a /reply/ can't be decoded, since that kind of failure is never reported to the sender of that RPC message. Instead, a reply decoder should return -EIO, as the reply decoders in the NFS client do. Fixes: 6487a13b5c6b ("NFSD: add support for CB_GETATTR callback") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
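As a hedged illustration of the convention, a reply decoder that runs out of data reports a local -EIO rather than an on-the-wire NFS status. `decode_reply_u32()` is a hypothetical helper and does not use the kernel's xdr_stream API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one big-endian 32-bit XDR word from a reply buffer.
 * Returns 0 on success or -EIO if the reply is too short; the peer is
 * never told about this failure, so a wire status code would be wrong. */
static int decode_reply_u32(const unsigned char *buf, size_t len, uint32_t *out)
{
	if (len < 4)
		return -EIO;
	*out = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
	       ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
	return 0;
}
```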
2025-03-10 | nfsd: eliminate special handling of NFS4ERR_SEQ_MISORDERED | Jeff Layton
On a SEQ_MISORDERED error, the current code will reattempt the call, but set the slot sequence ID to 1. I can find no mention of this remedy in the spec, and it seems potentially dangerous. It's possible that the last call was sent with seqid 1, and doing this will cause a retransmission of the reply. Drop this special handling, and always treat SEQ_MISORDERED like BADSLOT. Retry the call, but leak the slot so that it is no longer used. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
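A minimal sketch of the resulting policy, covering only the two status codes this entry discusses. `struct seq_outcome` and `on_cb_sequence_status()` are hypothetical; the numeric values are the NFSv4.1 protocol constants, and every other error status is deliberately left out of scope.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	NFS4_OK                = 0,
	NFS4ERR_BADSLOT        = 10053,
	NFS4ERR_SEQ_MISORDERED = 10063,
};

struct seq_outcome {
	bool retry;       /* send the call again */
	bool reuse_slot;  /* false means the slot is leaked, never reused */
};

/* Only NFS4_OK and the two slot errors are modelled in this sketch. */
static struct seq_outcome on_cb_sequence_status(int status)
{
	struct seq_outcome o = { .retry = false, .reuse_slot = true };

	switch (status) {
	case NFS4ERR_BADSLOT:
	case NFS4ERR_SEQ_MISORDERED:  /* no longer retried with seqid 1 */
		o.retry = true;
		o.reuse_slot = false;
		break;
	default:
		break;
	}
	return o;
}
```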
2025-03-10 | nfsd: handle NFS4ERR_BADSLOT on CB_SEQUENCE better | Jeff Layton
Currently it just restarts the call, without getting a new slot. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: when CB_SEQUENCE gets ESERVERFAULT don't increment seq_nr | Jeff Layton
ESERVERFAULT means that the server sent a successful and legitimate reply, but the session info didn't match what was expected. Don't increment the seq_nr in that case. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: only check RPC_SIGNALLED() when restarting rpc_task | Jeff Layton
nfsd4_cb_sequence_done() currently checks RPC_SIGNALLED() when processing the compound and releasing the slot. If RPC_SIGNALLED() returns true, then that means that the client is going to be torn down. Don't check RPC_SIGNALLED() after processing a successful reply. Check it only before restarting the rpc_task. If it returns true, then requeue the callback instead of restarting the task. Also, handle rpc_restart_call() and rpc_restart_call_prepare() failures correctly, by requeueing the callback. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: always release slot when requeueing callback | Jeff Layton
If the callback is going to be requeued to the workqueue, then release the slot. The callback client and session could change and the slot may no longer be valid after that point. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: lift NFSv4.0 handling out of nfsd4_cb_sequence_done() | Jeff Layton
It's a bit strange to call nfsd4_cb_sequence_done() on a callback with no CB_SEQUENCE. Lift the handling of restarting a call into a new helper, and move the handling of NFSv4.0 into nfsd4_cb_done(). Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: prepare nfsd4_cb_sequence_done() for error handling rework | Jeff Layton
There is only one case where we want to proceed with processing the rest of the CB_COMPOUND, and that's when the cb_seq_status is 0. Make the default return value be false, and only set it to true in that case. Rename the "need_restart" label to "requeue", to better indicate that it's being requeued to the workqueue. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: put dl_stid if fail to queue dl_recall | Li Lingfeng
Before calling nfsd4_run_cb to queue dl_recall to the callback_wq, we increment the reference count of dl_stid. We expect that after the corresponding work_struct is processed, the reference count of dl_stid will be decremented through the callback function nfsd4_cb_recall_release. However, if the call to nfsd4_run_cb fails, the incremented reference count of dl_stid will not be decremented correspondingly, leading to the following nfs4_stid leak: unreferenced object 0xffff88812067b578 (size 344): comm "nfsd", pid 2761, jiffies 4295044002 (age 5541.241s) hex dump (first 32 bytes): 01 00 00 00 6b 6b 6b 6b b8 02 c0 e2 81 88 ff ff ....kkkk........ 00 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 ad 4e ad de .kkkkkkk.....N.. backtrace: kmem_cache_alloc+0x4b9/0x700 nfsd4_process_open1+0x34/0x300 nfsd4_open+0x2d1/0x9d0 nfsd4_proc_compound+0x7a2/0xe30 nfsd_dispatch+0x241/0x3e0 svc_process_common+0x5d3/0xcc0 svc_process+0x2a3/0x320 nfsd+0x180/0x2e0 kthread+0x199/0x1d0 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1b/0x30 unreferenced object 0xffff8881499f4d28 (size 368): comm "nfsd", pid 2761, jiffies 4295044005 (age 5541.239s) hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 30 4d 9f 49 81 88 ff ff ........0M.I.... 30 4d 9f 49 81 88 ff ff 20 00 00 00 01 00 00 00 0M.I.... ....... backtrace: kmem_cache_alloc+0x4b9/0x700 nfs4_alloc_stid+0x29/0x210 alloc_init_deleg+0x92/0x2e0 nfs4_set_delegation+0x284/0xc00 nfs4_open_delegation+0x216/0x3f0 nfsd4_process_open2+0x2b3/0xee0 nfsd4_open+0x770/0x9d0 nfsd4_proc_compound+0x7a2/0xe30 nfsd_dispatch+0x241/0x3e0 svc_process_common+0x5d3/0xcc0 svc_process+0x2a3/0x320 nfsd+0x180/0x2e0 kthread+0x199/0x1d0 ret_from_fork+0x30/0x50 ret_from_fork_asm+0x1b/0x30 Fix it by checking the result of nfsd4_run_cb and calling nfs4_put_stid if it fails to queue dl_recall. Cc: stable@vger.kernel.org Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
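The shape of the fix, reduced to a userspace sketch: take the reference, try to queue, and drop the reference again if queueing failed. `struct stid`, `queue_recall()` and `recall()` are hypothetical stand-ins, and the `queue_ok` parameter exists only to simulate the failure.

```c
#include <assert.h>
#include <stdbool.h>

struct stid {
	int refcount;
};

/* Hypothetical stand-in for queueing the recall job; may fail. */
static bool queue_recall(bool queue_ok)
{
	return queue_ok;
}

/* Take a reference for the queued work and give it back on failure. */
static void recall(struct stid *s, bool queue_ok)
{
	s->refcount++;              /* reference owned by the queued work */
	if (!queue_recall(queue_ok))
		s->refcount--;      /* the release callback will never run */
}
```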
2025-03-10 | nfsd: allow SC_STATUS_FREEABLE when searching via nfs4_lookup_stateid() | Jeff Layton
The pynfs DELEG8 test fails when run against nfsd. It acquires a delegation and then lets the lease time out. It then tries to use the deleg stateid and expects to see NFS4ERR_DELEG_REVOKED, but it gets NFS4ERR_BAD_STATEID instead. When a delegation is revoked, it's initially marked with SC_STATUS_REVOKED, or SC_STATUS_ADMIN_REVOKED and later, it's marked with the SC_STATUS_FREEABLE flag, which denotes that it is waiting for a FREE_STATEID call. nfs4_lookup_stateid() accepts a statusmask that includes the status flags that a found stateid is allowed to have. Currently, that mask never includes SC_STATUS_FREEABLE, which means that revoked delegations are (almost) never found. Add SC_STATUS_FREEABLE to the always-allowed status flags, and remove it from nfsd4_delegreturn() since it's now always implied. Fixes: 8dd91e8d31fe ("nfsd: fix race between laundromat and free_stateid") Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
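The mask check can be sketched as follows. The flag values and `status_allowed()` are hypothetical; only the idea of OR-ing the always-allowed bit into the caller's mask comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define ST_REVOKED  (1u << 0)  /* hypothetical bit values */
#define ST_FREEABLE (1u << 1)

/* A stateid matches if it has no status bits outside the allowed mask. */
static bool status_allowed(unsigned int sc_status, unsigned int statusmask)
{
	statusmask |= ST_FREEABLE;  /* always allowed after this change */
	return (sc_status & ~statusmask) == 0;
}
```

With this, a revoked delegation that has since become freeable is still found by a caller that asked for revoked stateids, so the more specific revocation error can be returned.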
2025-03-10 | nfsd: don't ignore the return code of svc_proc_register() | Jeff Layton
Currently, nfsd_proc_stat_init() ignores the return value of svc_proc_register(). If the procfile creation fails, then the kernel will WARN when it tries to remove the entry later. Fix nfsd_proc_stat_init() to return the same type of pointer as svc_proc_register(), and fix up nfsd_net_init() to check that and fail the nfsd_net construction if it occurs. svc_proc_register() can fail if the dentry can't be allocated, or if an identical dentry already exists. The second case is pretty unlikely in the nfsd_net construction codepath, so if this happens, return -ENOMEM. Reported-by: syzbot+e34ad04f27991521104c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-nfs/67a47501.050a0220.19061f.05f9.GAE@google.com/ Cc: stable@vger.kernel.org # v6.9 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | NFSD: Fix trace_nfsd_slot_seqid_sequence | Chuck Lever
While running down the problem triggered by disconnect injection, I noticed the "in use" string was actually never hooked up in this trace point, so it always showed the traced slot as not in use. But what might be more useful is showing all the slot status flags. Also, this trace point can record and report the slot's index number, which among other things is useful for troubleshooting slot table expansion and contraction. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | NFSD: Return NFS4ERR_FILE_OPEN only when linking an open file | Chuck Lever
RFC 8881 Section 18.9.4 paragraphs 1 - 2 tell us that LINK should return NFS4ERR_FILE_OPEN only when the target object is a file that is currently open. If the target is a directory, some other status must be returned. The VFS is unlikely to return -EBUSY, but NFSD has to ensure that errno does not leak to clients as a status code that is not permitted by spec. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | NFSD: Return NFS4ERR_FILE_OPEN only when renaming over an open file | Chuck Lever
RFC 8881 Section 18.26.4 paragraphs 1 - 3 tell us that RENAME should return NFS4ERR_FILE_OPEN only when the target object is a file that is currently open. If the target is a directory, some other status must be returned. Generally I expect that a delegation recall will be triggered in some of these circumstances. In other cases, the VFS might return -EBUSY for other reasons, and NFSD has to ensure that errno does not leak to clients as a status code that is not permitted by spec. There are some error flows where the target dentry hasn't been found yet. The default value for @type therefore is S_IFDIR to return an alternate status code in those cases. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | NFSD: Never return NFS4ERR_FILE_OPEN when removing a directory | Chuck Lever
RFC 8881 Section 18.25.4 paragraph 5 tells us that the server should return NFS4ERR_FILE_OPEN only if the target object is an opened file. This suggests that returning this status when removing a directory will confuse NFS clients. This is a version-specific issue; nfsd_proc_remove/rmdir() and nfsd3_proc_remove/rmdir() already return nfserr_access as appropriate. Unfortunately there is no quick way for nfsd4_remove() to determine whether the target object is a file or not, so the check is done in nfsd_unlink() for now. Reported-by: Trond Myklebust <trondmy@hammerspace.com> Fixes: 466e16f0920f ("nfsd: check for EBUSY from vfs_rmdir/vfs_unink.") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | NFSD: nfsd_unlink() clobbers non-zero status returned from fh_fill_pre_attrs() | Chuck Lever
If fh_fill_pre_attrs() returns a non-zero status, the error flow takes it through out_unlock, which then overwrites the returned status code with err = nfserrno(host_err); Fixes: a332018a91c4 ("nfsd: handle failure to collect pre/post-op attrs more sanely") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: remove the redundant mapping of nfserr_mlink | Li Lingfeng
There are two mappings of nfserr_mlink in nfs_errtbl. Remove one of them. Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | NFSD: Skip sending CB_RECALL_ANY when the backchannel isn't up | Chuck Lever
NFSD sends CB_RECALL_ANY to clients when the server is low on memory or that client has a large number of delegations outstanding. We've seen cases where NFSD attempts to send CB_RECALL_ANY requests to disconnected clients, and gets confused. These calls never go anywhere if a backchannel transport to the target client isn't available. Before the server can send any backchannel operation, the client has to connect first and then do a BIND_CONN_TO_SESSION. This patch doesn't address the root cause of the confusion, but there's no need to queue up these optional operations if they can't go anywhere. Fixes: 44df6f439a17 ("NFSD: add delegation reaper to react to low memory condition") Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: adjust WARN_ON_ONCE in revoke_delegation | Olga Kornievskaia
A WARN_ON_ONCE() is added to revoke delegations to make sure that the state has been marked for revocation. However, that's only true for 4.1+ stateids. For 4.0 stateids, in unhash_delegation_locked() the sc_status is set to SC_STATUS_CLOSED. Modify the check to reflect it, otherwise a WARN_ON_ONCE is erroneously triggered. Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | nfsd: fix management of listener transports | Olga Kornievskaia
Currently, when no active threads are running, a root user using nfsdctl command can try to remove a particular listener from the list of previously added ones, then start the server by increasing the number of threads, it leads to the following problem: [ 158.835354] refcount_t: addition on 0; use-after-free. [ 158.835603] WARNING: CPU: 2 PID: 9145 at lib/refcount.c:25 refcount_warn_saturate+0x160/0x1a0 [ 158.836017] Modules linked in: rpcrdma rdma_cm iw_cm ib_cm ib_core nfsd auth_rpcgss nfs_acl lockd grace overlay isofs uinput snd_seq_dummy snd_hrtimer nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables qrtr sunrpc vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops uvc videobuf2_v4l2 videodev videobuf2_common snd_hda_codec_generic mc e1000e snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore sg loop dm_multipath dm_mod nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs libcrc32c crct10dif_ce ghash_ce vmwgfx sha2_ce sha256_arm64 sr_mod sha1_ce cdrom nvme drm_client_lib drm_ttm_helper ttm nvme_core drm_kms_helper nvme_auth drm fuse [ 158.840093] CPU: 2 UID: 0 PID: 9145 Comm: nfsd Kdump: loaded Tainted: G B W 6.13.0-rc6+ #7 [ 158.840624] Tainted: [B]=BAD_PAGE, [W]=WARN [ 158.840802] Hardware name: VMware, Inc. 
VMware20,1/VBSA, BIOS VMW201.00V.24006586.BA64.2406042154 06/04/2024 [ 158.841220] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 158.841563] pc : refcount_warn_saturate+0x160/0x1a0 [ 158.841780] lr : refcount_warn_saturate+0x160/0x1a0 [ 158.842000] sp : ffff800089be7d80 [ 158.842147] x29: ffff800089be7d80 x28: ffff00008e68c148 x27: ffff00008e68c148 [ 158.842492] x26: ffff0002e3b5c000 x25: ffff600011cd1829 x24: ffff00008653c010 [ 158.842832] x23: ffff00008653c000 x22: 1fffe00011cd1829 x21: ffff00008653c028 [ 158.843175] x20: 0000000000000002 x19: ffff00008653c010 x18: 0000000000000000 [ 158.843505] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 158.843836] x14: 0000000000000000 x13: 0000000000000001 x12: ffff600050a26493 [ 158.844143] x11: 1fffe00050a26492 x10: ffff600050a26492 x9 : dfff800000000000 [ 158.844475] x8 : 00009fffaf5d9b6e x7 : ffff000285132493 x6 : 0000000000000001 [ 158.844823] x5 : ffff000285132490 x4 : ffff600050a26493 x3 : ffff8000805e72bc [ 158.845174] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000098588000 [ 158.845528] Call trace: [ 158.845658] refcount_warn_saturate+0x160/0x1a0 (P) [ 158.845894] svc_recv+0x58c/0x680 [sunrpc] [ 158.846183] nfsd+0x1fc/0x348 [nfsd] [ 158.846390] kthread+0x274/0x2f8 [ 158.846546] ret_from_fork+0x10/0x20 [ 158.846714] ---[ end trace 0000000000000000 ]--- nfsd_nl_listener_set_doit() would manipulate the list of transports of server's sv_permsocks and close the specified listener but the other list of transports (server's sp_xprts list) would not be changed leading to the problem above. Instead, determine whether nfsdctl is trying to remove a listener and, in that case, delete all the existing listener transports and re-create all but the removed ones.
Fixes: 16a471177496 ("NFSD: add listener-{set,get} netlink command") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2025-03-10 | lockd: add netlink control interface | Jeff Layton
The legacy rpc.nfsd tool will set the nlm_grace_period if the NFSv4 grace period is set. nfsdctl is missing this functionality, so add a new netlink control interface for lockd that it can use. For now, it only allows setting the grace period, and the tcp and udp listener ports. lockd currently uses module parameters and sysctls for configuration, so all of its settings are global. With this change, lockd now tracks these values on a per-net-ns basis. It will only fall back to using the global values if any of them are 0. Finally, as a backward compatibility measure, if updating the nlm settings in the init_net namespace, also update the legacy global values to match. Link: https://issues.redhat.com/browse/RHEL-71698 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
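The fallback rule can be sketched as a per-value check, which is one reading of the sentence above. `struct lockd_ns_settings` and `effective_setting()` are hypothetical names, not the fields or helpers lockd actually uses.

```c
#include <assert.h>

/* Hypothetical per-network-namespace settings; 0 means "not set". */
struct lockd_ns_settings {
	unsigned int grace_period;
	unsigned int tcp_port;
	unsigned int udp_port;
};

/* Prefer the per-namespace value; fall back to the global when it is 0. */
static unsigned int effective_setting(unsigned int per_ns, unsigned int global)
{
	return per_ns ? per_ns : global;
}
```

For example, a namespace that sets only `grace_period` would still pick up the module-parameter ports for TCP and UDP.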
2025-03-10 | Merge afs RCU pathwalk fix | Christian Brauner
Bring in the fix for afs_atcell_get_link() to handle RCU pathwalk from the afs branch for this cycle. This fix has to go upstream now. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-03-10afs: Simplify cell record handlingDavid Howells
Simplify afs_cell record handling to avoid very occasional races that cause module removal to hang (it waits for all cell records to be removed).

There are two things that particularly contribute to the difficulty: firstly, the code tries to pass a ref on the cell to the cell's maintenance work item (which gets awkward if the work item is already queued); and, secondly, there's an overall cell manager that tries to use just one timer for the entire cell collection (to avoid having loads of timers). However, both of these are probably unnecessarily restrictive.

To simplify this, the following changes are made:

(1) The cell record collection manager is removed. Each cell record manages itself individually.

(2) Each afs_cell is given a second work item (cell->destroyer) that is queued when its refcount reaches zero. This is not done in the context of the putting thread as it might be in an inconvenient place to sleep.

(3) Each afs_cell is given its own timer. The timer is used to expire the cell record after a period of unuse if not otherwise pinned and can also be used for other maintenance tasks if necessary (of which there are currently none as DNS refresh is triggered by filesystem operations).

(4) The afs_cell manager work item (cell->manager) is no longer given a ref on the cell when queued; rather, the manager must be deleted. This does away with the need to deal with the consequences of losing a race to queue cell->manager. Clean up of extra queuing is deferred to the destroyer.

(5) The cell destroyer work item makes sure the cell timer is removed and that the normal cell work is cancelled before farming the actual destruction off to RCU.

(6) When a network namespace is destroyed or the kafs module is unloaded, it's now a simple matter of marking the namespace as dead then just waking up all the cell work items. They will then remove and destroy themselves once all remaining activity counts and/or ref counts are dropped. This makes sure that all server records are dropped first.

(7) The cell record state set is reduced to just four states: SETTING_UP, ACTIVE, REMOVING and DEAD. The record persists in the active state even when it's not being used until the time comes to remove it rather than downgrading it to an inactive state from whence it can be restored. This means that the cell still appears in /proc and /afs when not in use until it switches to the REMOVING state - at which point it is removed. Note that the REMOVING state is included so that someone wanting to resurrect the cell record is forced to wait whilst the cell is torn down in that state. Once it's in the DEAD state, it has been removed from the net->cells tree, is no longer findable and can be replaced.

Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-16-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-12-dhowells@redhat.com/ # v4
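The four-state lifecycle in this commit message can be illustrated with a small userspace model. This is only a sketch: struct cell_model and all of its helper functions are hypothetical names invented for illustration, not the kafs implementation, and the timer, work items and RCU are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* The four states named in the commit message. */
enum cell_state { CELL_SETTING_UP, CELL_ACTIVE, CELL_REMOVING, CELL_DEAD };

/* Hypothetical userspace stand-in for a cell record. */
struct cell_model {
    enum cell_state state;
    unsigned int active;  /* activity count: number of current users */
    bool findable;        /* still present in the lookup tree */
};

void cell_init(struct cell_model *c)
{
    c->state = CELL_SETTING_UP;
    c->active = 0;
    c->findable = true;
}

/* Setup has completed; the record may now be used. */
void cell_activate(struct cell_model *c)
{
    assert(c->state == CELL_SETTING_UP);
    c->state = CELL_ACTIVE;
}

/* Returns false if the caller has to wait or retry (the record is not ACTIVE). */
bool cell_try_use(struct cell_model *c)
{
    if (c->state != CELL_ACTIVE)
        return false;
    c->active++;
    return true;
}

void cell_unuse(struct cell_model *c)
{
    assert(c->active > 0);
    c->active--;
}

/* Expiry or namespace teardown: only an unused ACTIVE record starts removal. */
bool cell_begin_removal(struct cell_model *c)
{
    if (c->state != CELL_ACTIVE || c->active != 0)
        return false;
    c->state = CELL_REMOVING;
    return true;
}

/* Teardown finished: unhook from the lookup tree, after which it is DEAD. */
void cell_finish_removal(struct cell_model *c)
{
    assert(c->state == CELL_REMOVING);
    c->findable = false;
    c->state = CELL_DEAD;
}
```

In this model an unused record stays ACTIVE and findable until removal begins, and a would-be user is refused for as long as the record is in REMOVING, which mirrors the wait described in point (7).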
2025-03-10afs: Fix afs_server ref accountingDavid Howells
The current way that afs_server refs are accounted and cleaned up sometimes causes rmmod to hang when it is waiting for cell records to be removed. The problem is that the cell cleanup might occasionally happen before the server cleanup and then there's nothing that causes the cell to garbage-collect the remaining servers as they become inactive.

Partially fix this by:

(1) Give each afs_server record its own management timer rather than relying on the cell manager's central timer to drive each individual cell's maintenance work item to garbage collect servers. This timer is set when afs_unuse_server() reduces a server's activity count to zero and will schedule the server's destroyer work item upon firing.

(2) Give each afs_server record its own destroyer work item that removes the record from the cell's database, shuts down the timer, cancels any pending work for itself and sends an RPC to the server to cancel outstanding callbacks. This change, in combination with the timer, obviates the need to try and coordinate so closely between the cell record and a bunch of other server records to try and tear everything down in a coordinated fashion. With this, the cell record is pinned until the server RCU is complete and namespace/module removal will wait until all the cell records are removed.

(3) Now that incoming calls are mapped to servers (and thus cells) using data attached to an rxrpc_peer, the UUID-to-server mapping tree is moved from the namespace to the cell (cell->fs_servers). This means there can no longer be duplicates therein - and that allows the mapping tree to be simpler as there doesn't need to be a chain of same-UUID servers that are in different cells.

(4) The lock protecting the UUID mapping tree is switched to an rw_semaphore on the cell rather than a seqlock on the namespace as it's now only used during mounting in contexts in which we're allowed to sleep.

(5) When it comes time for a cell that is being removed to purge its set of servers, it just needs to iterate over them and wake them up. Once a server becomes inactive, its destroyer work item will observe the state of the cell and immediately remove that record.

(6) When a server record is removed, it is marked AFS_SERVER_FL_EXPIRED to prevent reattempts at removal. The record will be dispatched to RCU for destruction once its refcount reaches 0.

(7) The AFS_SERVER_FL_UNCREATED/CREATING flags are used to synchronise simultaneous creation attempts. If one attempt fails, it will abandon the attempt and allow another to try again.

Note that the record can't just be abandoned when dead as it's bound into a server list attached to a volume and only subject to replacement if the server list obtained for the volume from the VLDB changes.

Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-15-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-11-dhowells@redhat.com/ # v4
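The interplay between the refcount, the activity count, the per-server timer and the EXPIRED flag can be sketched in userspace as below. Everything here is a hypothetical model for illustration: struct server_model and its helpers are invented names, the timer is reduced to a boolean, and the initial ref is assumed (not confirmed by the commit message) to be the one held by the cell's server tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a server record. */
struct server_model {
    unsigned int ref;     /* refcount: keeps the memory alive */
    unsigned int active;  /* activity count: keeps the record in service */
    bool timer_armed;     /* models the per-server management timer */
    bool expired;         /* models AFS_SERVER_FL_EXPIRED */
    bool freed;           /* models dispatch to RCU for destruction */
};

void server_init(struct server_model *s)
{
    s->ref = 1;           /* assumption: one ref held by the cell's tree */
    s->active = 0;
    s->timer_armed = false;
    s->expired = false;
    s->freed = false;
}

void server_get(struct server_model *s)
{
    s->ref++;
}

void server_put(struct server_model *s)
{
    assert(s->ref > 0);
    if (--s->ref == 0)
        s->freed = true;
}

/* Using a server takes a ref and raises the activity count. */
void server_use(struct server_model *s)
{
    s->ref++;
    s->active++;
    s->timer_armed = false;
}

/* Unusing arms the timer when the activity count drops to zero. */
void server_unuse(struct server_model *s)
{
    assert(s->active > 0);
    if (--s->active == 0)
        s->timer_armed = true;
    server_put(s);
}

/* Destroyer: removes an inactive record once, then drops the tree's ref. */
bool server_destroy(struct server_model *s)
{
    if (s->active != 0 || s->expired)
        return false;
    s->expired = true;    /* prevents reattempts at removal */
    s->timer_armed = false;
    server_put(s);
    return true;
}
```

The model shows why removal and freeing are separate steps: the destroyer can mark the record EXPIRED while other holders still have refs, and the record is only freed when the last of those refs is put.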
2025-03-10afs: Use the per-peer app data provided by rxrpcDavid Howells
Make use of the per-peer application data that rxrpc now allows the application to store on the rxrpc_peer struct to hold a back pointer to the afs_server record that peer represents an endpoint for. Then, when a call comes in to the AFS cache manager, this can be used to map it to the correct server record rather than having to use a UUID-to-server mapping table and having to do an additional lookup. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-14-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-10-dhowells@redhat.com/ # v4
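The back-pointer idea can be shown with a minimal userspace sketch. The struct and function names below (peer_model, call_model, afs_server_model, peer_set_app_data, call_to_server) are hypothetical and do not correspond to the rxrpc or kafs API; the point is only that an incoming call reaches its server record through the peer, with no separate table lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the server record, the peer and a call. */
struct afs_server_model { const char *name; };
struct peer_model { void *app_data; };          /* per-peer application data */
struct call_model { struct peer_model *peer; }; /* incoming call */

/* Attach the server record to the peer that is one of its endpoints. */
void peer_set_app_data(struct peer_model *peer, void *data)
{
    assert(peer != NULL);
    peer->app_data = data;
}

/* Map an incoming call to its server by following the peer's back pointer. */
struct afs_server_model *call_to_server(const struct call_model *call)
{
    if (call->peer == NULL)
        return NULL;
    return call->peer->app_data;
}
```

A peer that has never had data attached yields NULL here, which a caller would have to treat as a call from an unknown server.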
2025-03-10afs: Drop the net parameter from afs_unuse_cell()David Howells
Remove the redundant net parameter to afs_unuse_cell() as cell->net can be used instead. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-12-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-8-dhowells@redhat.com/ # v4
2025-03-10afs: Make afs_lookup_cell() take a trace noteDavid Howells
Pass a note to be added to the afs_cell tracepoint to afs_lookup_cell() so that different callers can be distinguished. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-11-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-7-dhowells@redhat.com/ # v4
2025-03-10afs: Improve server refcount/active count tracingDavid Howells
Improve server refcount/active count tracing to distinguish between simply getting/putting a ref and using/unusing the server record (which changes the activity count as well as the refcount). This makes it a bit easier to work out what's going on. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-10-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-6-dhowells@redhat.com/ # v4
2025-03-10afs: Improve afs_volume tracing to display a debug IDDavid Howells
Improve the tracing of afs_volume objects to include displaying a debug ID so that different instances of volumes with the same "vid" can be distinguished. Also be consistent about displaying the volume's refcount (and not the cell's). Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-9-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-5-dhowells@redhat.com/ # v4
2025-03-10afs: Change dynroot to create contents on demandDavid Howells
Change the AFS dynamic root to do things differently:

(1) Rather than having the creation of cell records create inodes and dentries for cell mountpoints, create them on demand during lookup. This simplifies cell management and locking as we no longer have to create these objects in advance *and* on speculative lookup by the user for a cell that isn't precreated.

(2) Rather than using the libfs dentry-based readdir (the dentries now no longer exist until accessed from (1)), have readdir generate the contents by reading the list of cells. The @cell symlinks get pushed in positions 2 and 3 if rootcell has been configured.

(3) Make the @cell symlink dentries persist for the life of the superblock or until reclaimed, but make cell mountpoints disappear immediately if unused. It's not perfect as someone doing an "ls -l /afs" may create a whole bunch of dentries which will be garbage collected immediately. But any dentry that gets automounted will be pinned by the mount, so it shouldn't be too bad.

(4) Allocate the inode numbers for the cell mountpoints from an IDR to prevent duplicates appearing in the event it cycles round. The number allocated from the IDR is doubled to provide two inode numbers - one for the normal cell name (RO) and one for the dotted cell name (RW).

Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-8-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-4-dhowells@redhat.com/ # v4
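The inode numbering in point (4) amounts to simple arithmetic, sketched below in userspace C. The id_pool allocator is a hypothetical stand-in for the kernel IDR, starting at ID 1 is an arbitrary choice, and which of the two numbers goes to the dotted name is an assumption: the commit message only says that the allocated number is doubled to provide one number for each name.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CELL_IDS 64
#define FIRST_CELL_ID 1   /* arbitrary choice for this sketch */

/* Hypothetical stand-in for the IDR: tracks which IDs are in use. */
struct id_pool { bool used[MAX_CELL_IDS]; };

/* Returns the lowest free ID, or -1 if the pool is exhausted. */
int id_alloc(struct id_pool *p)
{
    for (int i = FIRST_CELL_ID; i < MAX_CELL_IDS; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            return i;
        }
    }
    return -1;
}

void id_free(struct id_pool *p, int id)
{
    assert(id >= FIRST_CELL_ID && id < MAX_CELL_IDS);
    p->used[id] = false;
}

/*
 * Doubling the ID gives each cell a private pair of inode numbers.
 * Assumption: the even number is the normal (RO) name and the odd
 * number is the dotted (RW) name.
 */
unsigned long cell_ino(int id, bool dotted)
{
    return (unsigned long)id * 2 + (dotted ? 1 : 0);
}
```

Because an ID is only handed out again after it has been freed, two live cells can never share an inode number, which a plain wrapping counter could not guarantee.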
2025-03-10afs: Remove the "autocell" mount optionDavid Howells
Remove the "autocell" mount option. It was an attempt to do automounting of arbitrary cells based on what the user looked up but within the root directory of a mounted volume. This isn't really the right thing to do, and using the "dyn" mount option to get the dynamic root is the right way to do it. The kafs-client package uses "-o dyn" when mounting /afs, so it should be safe to drop "-o autocell". Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250224234154.2014840-7-dhowells@redhat.com/ # v1 Link: https://lore.kernel.org/r/20250310094206.801057-3-dhowells@redhat.com/ # v4
2025-03-10afs: Fix afs_atcell_get_link() to handle RCU pathwalkDavid Howells
The ->get_link() method may be entered under RCU pathwalk conditions (in which case, the dentry pointer is NULL). afs_atcell_get_link() does not take this into account, and lockdep will complain when it tries to lock an rwsem. Fix this by marking net->ws_cell as __rcu and using RCU access macros on it, and by making afs_atcell_get_link() just return a pointer to the name in RCU pathwalk without taking net->cells_lock or a ref on the cell, as RCU will protect the name storage (the cell is already freed via call_rcu()). Fixes: 30bca65bbbae ("afs: Make /afs/@cell and /afs/.@cell symlinks") Reported-by: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20250310094206.801057-2-dhowells@redhat.com/ # v4
2025-03-10Merge branch 'xfs-6.15-merge' into for-nextCarlos Maiolino
XFS code for 6.15 to be merged into linux-next Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-03-10Merge branch 'xfs-6.15-zoned_devices' into xfs-6.15-mergeCarlos Maiolino
Merge Zoned devices support for XFS Signed-off-by: Carlos Maiolino <cem@kernel.org>