path: root/fs
Age  Commit message  Author
2025-06-30  cifs: all initializations for tcon should happen in tcon_info_alloc  (Shyam Prasad N)
Today, a few work structs inside tcon are initialized inside cifs_get_tcon and not in tcon_info_alloc. As a result, if a tcon is obtained from tcon_info_alloc, but not as a part of cifs_get_tcon, we may trip over those uninitialized work structs.

Cc: <stable@vger.kernel.org>
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
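A minimal sketch of the idea, assuming the usual fs/smb/client names (tcon->query_interfaces and smb2_query_server_interfaces); the exact set of work structs moved is in the patch itself:

    /* Sketch, not the actual diff: do the delayed-work setup at
     * allocation time, so any tcon coming out of tcon_info_alloc()
     * is safe to queue work on, however it was obtained. */
    INIT_DELAYED_WORK(&tcon->query_interfaces,
                      smb2_query_server_interfaces);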
2025-06-30  smb: client: fix warning when reconnecting channel  (Paulo Alcantara)
When reconnecting a channel in smb2_reconnect_server(), a dummy tcon is passed down to smb2_reconnect() with ->query_interfaces uninitialized, so we can't call queue_delayed_work() on it.

Fix the following warning by ensuring that we're queueing the delayed worker from the correct tcon.

WARNING: CPU: 4 PID: 1126 at kernel/workqueue.c:2498 __queue_delayed_work+0x1d2/0x200
Modules linked in: cifs cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 4 UID: 0 PID: 1126 Comm: kworker/4:0 Not tainted 6.16.0-rc3 #5 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-4.fc42 04/01/2014
Workqueue: cifsiod smb2_reconnect_server [cifs]
RIP: 0010:__queue_delayed_work+0x1d2/0x200
Code: 41 5e 41 5f e9 7f ee ff ff 90 0f 0b 90 e9 5d ff ff ff bf 02 00 00 00 e8 6c f3 07 00 89 c3 eb bd 90 0f 0b 90 e9 57 f> 0b 90 e9 65 fe ff ff 90 0f 0b 90 e9 72 fe ff ff 90 0f 0b 90 e9
RSP: 0018:ffffc900014afad8 EFLAGS: 00010003
RAX: 0000000000000000 RBX: ffff888124d99988 RCX: ffffffff81399cc1
RDX: dffffc0000000000 RSI: ffff888114326e00 RDI: ffff888124d999f0
RBP: 000000000000ea60 R08: 0000000000000001 R09: ffffed10249b3331
R10: ffff888124d9998f R11: 0000000000000004 R12: 0000000000000040
R13: ffff888114326e00 R14: ffff888124d999d8 R15: ffff888114939020
FS:  0000000000000000(0000) GS:ffff88829f7fe000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe7a2b4038 CR3: 0000000120a6f000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 queue_delayed_work_on+0xb4/0xc0
 smb2_reconnect+0xb22/0xf50 [cifs]
 smb2_reconnect_server+0x413/0xd40 [cifs]
 ? __pfx_smb2_reconnect_server+0x10/0x10 [cifs]
 ? local_clock_noinstr+0xd/0xd0
 ? local_clock+0x15/0x30
 ? lock_release+0x29b/0x390
 process_one_work+0x4c5/0xa10
 ? __pfx_process_one_work+0x10/0x10
 ? __list_add_valid_or_report+0x37/0x120
 worker_thread+0x2f1/0x5a0
 ? __kthread_parkme+0xde/0x100
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x1fe/0x380
 ? kthread+0x10f/0x380
 ? __pfx_kthread+0x10/0x10
 ? local_clock_noinstr+0xd/0xd0
 ? ret_from_fork+0x1b/0x1f0
 ? local_clock+0x15/0x30
 ? lock_release+0x29b/0x390
 ? rcu_is_watching+0x20/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x15b/0x1f0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
irq event stamp: 1116206
hardirqs last enabled at (1116205): [<ffffffff8143af42>] __up_console_sem+0x52/0x60
hardirqs last disabled at (1116206): [<ffffffff81399f0e>] queue_delayed_work_on+0x6e/0xc0
softirqs last enabled at (1116138): [<ffffffffc04562fd>] __smb_send_rqst+0x42d/0x950 [cifs]
softirqs last disabled at (1116136): [<ffffffff823d35e1>] release_sock+0x21/0xf0

Cc: linux-cifs@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Fixes: 42ca547b13a2 ("cifs: do not disable interface polling on failure")
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
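A hedged sketch of the shape of such a fix; the 'dummy' marker on the placeholder tcon is an assumption here, not necessarily what the patch uses (cifsiod_wq and SMB_INTERFACE_POLL_INTERVAL are existing cifs names):

    /* Sketch: queue interface polling only for a fully set-up tcon,
     * skipping the placeholder tcon used during channel reconnect.
     * 'dummy' is an assumed marker for that placeholder. */
    if (!tcon->ipc && !tcon->dummy)
            queue_delayed_work(cifsiod_wq, &tcon->query_interfaces,
                               SMB_INTERFACE_POLL_INTERVAL * HZ);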
2025-06-30  btrfs: stop parsing crc32c driver name  (Eric Biggers)
To determine whether the crc32c implementation is "fast", use crc32_optimizations() instead of parsing the crypto_shash driver name. This keeps the code working as intended after the driver name is changed by the next commit.

Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
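A minimal sketch of the capability-bit check; BTRFS_FS_CSUM_IMPL_FAST is the existing btrfs flag, and the exact placement is illustrative:

    /* Sketch: ask the crc32 library whether an optimized crc32c is
     * available, instead of string-matching the shash driver name. */
    if (crc32_optimizations() & CRC32C_OPTIMIZATION)
            set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags);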
2025-06-30xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags maskYouling Tang
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in XFS_FALLOC_FL_SUPPORTED. This change improves code clarity and maintains by explicitly showing this flag in the supported flags mask. Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition has no functional modifications. Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
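For illustration, the resulting mask reads roughly as follows (the set of other flags is assumed from current xfs_file.c):

    #define XFS_FALLOC_FL_SUPPORTED                                 \
            (FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE |       \
             FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE |      \
             FALLOC_FL_ZERO_RANGE | FALLOC_FL_INSERT_RANGE |        \
             FALLOC_FL_UNSHARE_RANGE)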
2025-06-29  statmount_mnt_basic(): simplify the logics for group id  (Al Viro)
We are holding namespace_sem shared and we have not done any group id allocations since we grabbed it. Therefore IS_MNT_SHARED(m) is equivalent to non-zero m->mnt_group_id.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED()  (Al Viro)
All places where we call set_mnt_shared() are guaranteed to have non-zero ->mnt_group_id - either by explicit test, or by having done successful invent_group_ids() covering the same mount since we'd grabbed namespace_sem.

The opposite combination (non-zero ->mnt_group_id and !IS_MNT_SHARED()) *is* possible - it means that we have allocated group id, but didn't get around to set_mnt_shared() yet; such state is transient - by the time we do namespace_unlock(), we must either do set_mnt_shared() or unroll the group id allocations by cleanup_group_ids().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  get rid of CL_SHARED_TO_SLAVE  (Al Viro)
The only difference between it and CL_SLAVE is in this predicate in clone_mnt():

    if ((flag & CL_SLAVE) ||
        ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) {

However, in case of CL_SHARED_TO_SLAVE we have not allocated any mount group ids since the time we'd grabbed namespace_sem, so IS_MNT_SHARED() is equivalent to non-zero ->mnt_group_id. And in case of CL_SLAVE old has come either from the original tree, which had ->mnt_group_id allocated for all nodes, or from the result of a sequence of CL_MAKE_SHARED or CL_MAKE_SHARED|CL_SLAVE copies, ultimately going back to the original tree. In both cases we are guaranteed that old->mnt_group_id will be non-zero.

In other words, the predicate is always equal to

    (flag & (CL_SLAVE | CL_SHARED_TO_SLAVE)) && old->mnt_group_id

and with that replacement CL_SLAVE and CL_SHARED_TO_SLAVE have the exact same behaviour.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  take freeing of emptied mnt_namespace to namespace_unlock()  (Al Viro)
Freeing of a namespace must be delayed until after we'd dealt with mount notifications (in namespace_unlock()). The reasons are not immediately obvious (they are buried in ->prev_ns handling in mnt_notify()), and having that free_mnt_ns() explicitly called after namespace_unlock() is asking for trouble - it does feel like they should be OK to free as soon as they've been emptied.

Make things more explicit by setting 'emptied_ns' under namespace_sem and having namespace_unlock() free the sucker as soon as it's safe to free.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  copy_tree(): don't link the mounts via mnt_list  (Al Viro)
The only place that really needs to be adjusted is commit_tree() - there we need to iterate through the copy and we might as well use next_mnt() for that. However, in case when our tree has been slid under something already mounted (propagation to a mountpoint that already has something mounted on it, or a 'beneath' move_mount) we need to take care not to walk into the overmounting tree.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  change_mnt_propagation(): move ->mnt_master assignment into MS_SLAVE case  (Al Viro)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node  (Al Viro)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  turn do_make_slave() into transfer_propagation()  (Al Viro)
Lift calculation of replacement propagation source, removal from peer group and assignment of ->mnt_master from do_make_slave() into change_mnt_propagation() itself. What remains is switching of what used to get propagation *through* mnt to the alternative source. Rename to transfer_propagation(), passing it the replacement source as the second argument. Have it return void, while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  do_make_slave(): choose new master sanely  (Al Viro)
When mount changes propagation type so that it doesn't propagate events any more (MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE), we need to make sure that event propagation between other mounts is unaffected. We need to make sure that events from peers and master of that mount (if any) still reach everything that used to be on its ->mnt_slave_list.

If mount has neither peers nor master, we simply need to dissolve its ->mnt_slave_list and clear ->mnt_master of everything in there. If mount has peers, we transfer everything in ->mnt_slave_list of this mount into that of some of those peers (and adjust ->mnt_master accordingly). If mount has a master but no peers, we transfer everything in ->mnt_slave_list of this mount into that of its master (adjusting ->mnt_master, etc.).

There are two problems with the current implementation:

* there's a long-obsolete logics in choosing the peer - once upon a time it made sense to prefer the peer that had the same ->mnt_root as our mount, but that had been pointless since 2014 ("smarter propagate_mnt()")

* the most common caller of that thing is umount_tree() taking the mounts out of propagation graph. In that case it's possible to have ->mnt_slave_list contents moved many times, since the replacement master is likely to be taken out by the same umount_tree(), etc.

Take the choice of replacement master into a separate function (propagation_source()) and teach it to skip the candidates that are going to be taken out.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
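A hedged sketch of that selection; next_peer() is the existing pnode.c iterator, while will_be_unmounted() stands in for however the real patch recognizes mounts being taken out:

    /* Sketch: pick a replacement master for everything slaved to mnt,
     * skipping peers that the same umount_tree() run will remove. */
    static struct mount *propagation_source(struct mount *mnt)
    {
            struct mount *m;

            for (m = next_peer(mnt); m != mnt; m = next_peer(m)) {
                    if (!will_be_unmounted(m))      /* hypothetical */
                            return m;
            }
            return mnt->mnt_master;
    }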
2025-06-29  change_mnt_propagation(): do_make_slave() is a no-op unless IS_MNT_SHARED()  (Al Viro)
... since mnt->mnt_share and mnt->mnt_slave_list are guaranteed to be empty unless IS_MNT_SHARED(mnt).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  change_mnt_propagation() cleanups, step 1  (Al Viro)
Lift changing ->mnt_slave from do_make_slave() into the caller. Simplifies the next steps...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_mnt(): fix comment and convert to kernel-doc, while we are at it  (Al Viro)
Mountpoint is passed as struct mountpoint *, not struct dentry * (and called dest_mp, not dest_dentry) since 2013. Roots of created copies are linked via mnt_hash, not mnt_list, since a bit before the merge into mainline back in 2005.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_mnt(): get rid of last_dest  (Al Viro)
Its only use is choosing the type of copy - CL_MAKE_SHARED if there already is a copy in that peer group, CL_SLAVE or CL_SLAVE | CL_MAKE_SHARED otherwise. But that's easy to keep track of - just set the type at the beginning of each group and reset it to CL_MAKE_SHARED after the first secondary created in it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  fs/pnode.c: get rid of globals  (Al Viro)
this stuff can be local in propagate_mnt() now (and in some cases duplicates the existing variables there)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): fold into the sole caller  (Al Viro)
mechanical expansion; will be cleaned up on the next step

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): separate the "what should be the master for this copy" part  (Al Viro)
When we create the first copy for a peer group, it becomes a slave of one of the existing copies; take that logics into a separate helper - find_master(parent, last_copy, original).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): separate the "do we need secondary here?" logics  (Al Viro)
take the checks into a separate helper - need_secondary(mount, mountpoint).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
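A rough sketch of what such a helper checks, with details assumed from the old propagate_one() checks (IS_MNT_NEW() and is_subdir() are existing helpers):

    /* Sketch: does mount m need a secondary copy for this event? */
    static bool need_secondary(struct mount *m, struct mountpoint *dest_mp)
    {
            /* skip copies created by this propagate_mnt() run itself */
            if (IS_MNT_NEW(m))
                    return false;
            /* skip if the mountpoint is not visible under m's root */
            return is_subdir(dest_mp->m_dentry, m->mnt.mnt_root);
    }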
2025-06-29  propagate_mnt(): handle all peer groups in the same loop  (Al Viro)
the only difference is that for the original group we want to skip the first element; not worth having the logics twice...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): get rid of dest_master  (Al Viro)
propagate_mnt() takes the subtree we are about to attach and creates its copies, setting the propagation between those. Each copy is cloned either from the original or from one of the already created copies. The tricky part is choosing the right copy to serve as a master when we are starting a new peer group.

The algorithm for doing that selection puts temporary marks on the masters of mountpoints that already got a copy created for them; since the initial peer group might have no master at all, we need to special-case that when looking for the mark. Currently we do that by memorizing the master of the original peer group. It works, but we get yet another piece of data to pass from propagate_mnt() to propagate_one().

The alternative is to mark the master of the original peer group if not NULL, turning the check into "master is NULL or marked". Less data to pass around, and memory safety is more obvious that way...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
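A minimal sketch of that alternative, using the existing mark helpers from fs/pnode.h (placement assumed):

    /* Sketch: mark the master of the original peer group up front... */
    if (dest_master)
            SET_MNT_MARK(dest_master);

    /* ...so that, walking rootwards through masters, the old test
     * "p == dest_master || IS_MNT_MARKED(p)" becomes: */
    if (!p || IS_MNT_MARKED(p))
            break;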
2025-06-29  mount: separate the flags accessed only under namespace_sem  (Al Viro)
Several flags are updated and checked only under namespace_sem; we are already making use of that when we are checking them without mount_lock, but we have to hold mount_lock for all updates, which makes things clumsier than they have to be.

Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE into a separate field (->mnt_t_flags), renaming them to T_SHARED, etc. to avoid confusion. All accesses must be under namespace_sem.

That changes locking requirements for mnt_change_propagation() and set_mnt_shared() - only namespace_sem is needed now. The same goes for SET_MNT_MARKED et al. There might be more flags moved from ->mnt_flags to that field; this is just the initial set.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  don't have mounts pin their parents  (Al Viro)
Simplify the rules for mount refcounts. Current rules include:

* being a namespace root => +1
* being someone's child => +1
* being someone's child => +1 to parent's refcount, unless you've already been through umount_tree().

The last part is not needed at all. It makes for more places where we need to decrement refcounts, and it creates an asymmetry between the situations for something that has never been a part of a namespace and something that left one, both for no good reason.

If mount's refcount has additions from its children, we know that

* it's either someone's child itself (and will remain so until umount_tree(), at which point contributions from children will disappear), or
* it is the root of namespace (and will remain such until it either becomes someone's child in another namespace or goes through umount_tree()), or
* it is the root of some tree copy, and is currently pinned by the caller of copy_tree() (and remains such until it either gets into namespace, or goes to umount_tree()).

In all cases we already have contribution(s) to refcount that will last as long as the contribution from children remains. In other words, the lifetime is not affected by refcount contributions from children. It might be useful for "is it busy" checks, but those are actually no harder to express without it.

NB: propagate_mnt_busy() part is an equivalent transformation, ugly as it is; the current logics is actually wrong and may give false negatives, but fixing that is for a separate patch (probably earlier in the queue).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  get rid of mountpoint->m_count  (Al Viro)
struct mountpoint has an odd kinda-sorta refcount in it. It's always either equal to or one above the number of mounts attached to that mountpoint. "One above" happens when a function takes a temporary reference to mountpoint. Things get simpler if we express that as inserting a local object into ->m_list and removing it to drop the reference.

New calling conventions:

1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint() take an extra struct pinned_mountpoint * argument and return 0/-E... (or true/false in case of lookup_mountpoint()) instead of returning struct mountpoint pointers. In case of success, the struct mountpoint * we used to get can be found as pinned_mountpoint.mp

2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes an address of struct pinned_mountpoint - the same that had been passed to lock_mount()/do_lock_mount().

3) put_mountpoint() for a temporary reference (paired with get_mountpoint() or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes the address of pinned_mountpoint we passed to the matching {get,lookup}_mountpoint().

4) all instances of pinned_mountpoint are local variables; they always live on stack. {} is used for initializer; after successful {get,lookup}_mountpoint() we must make sure to call unpin_mountpoint() before leaving the scope, and after successful {do_,}lock_mount() we must make sure to call unlock_mount() before leaving the scope.

5) all manipulations of ->m_count are gone, along with ->m_count itself. struct mountpoint lives while its ->m_list is non-empty.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
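A hedged sketch of the resulting calling pattern, per the conventions above (do_something_at() is a hypothetical callee):

    struct pinned_mountpoint pinned = {};
    int err;

    err = lock_mount(path, &pinned);
    if (err)
            return err;
    /* pinned.mp is the struct mountpoint we used to get directly */
    err = do_something_at(pinned.mp);
    unlock_mount(&pinned);
    return err;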
2025-06-29  combine __put_mountpoint() with unhash_mnt()  (Al Viro)
A call of unhash_mnt() is immediately followed by passing its return value to __put_mountpoint(); the shrink list given to __put_mountpoint() will be ex_mountpoints when called from umount_mnt() and list when called from mntput_no_expire(). Replace with __umount_mnt(mount, shrink_list), moving the call of __put_mountpoint() into it (and returning nothing); adjust the callers.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  pivot_root(): reorder tree surgeries, collapse unhash_mnt() and put_mountpoint()  (Al Viro)
attach new_mnt *before* detaching root_mnt; that way we don't need to keep hold of the mountpoint, and one more pair of unhash_mnt()/put_mountpoint() gets folded together into umount_mnt().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  take ->mnt_expire handling under mount_lock [read_seqlock_excl]  (Al Viro)
Doesn't take much massage, and we no longer need to make sure that by the time of final mntput() the victim has been removed from the list. Makes life safer for ->d_automount() instances...

Rules:

* all ->mnt_expire accesses are under mount_lock.
* insertion into the list is done by mnt_set_expiry(), and caller (->d_automount() instance) must hold a reference to mount in question. It shouldn't be done more than once for a mount.
* if a mount on an expiry list is not yet mounted, it will be ignored by anything that walks that list.
* if the final mntput() finds its victim still on an expiry list (in which case it must've never been mounted - umount_tree() would've taken it out), it will remove the victim from the list.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
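A minimal sketch of the final-mntput() side under these rules (mount_lock is the existing seqlock; the exact placement in mntput_no_expire() is assumed):

    /* Sketch: the last reference is going away - if the victim is
     * still on an expiry list, it was never mounted; take it off
     * under mount_lock. */
    read_seqlock_excl(&mount_lock);
    if (unlikely(!list_empty(&mnt->mnt_expire)))
            list_del_init(&mnt->mnt_expire);
    read_sequnlock_excl(&mount_lock);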
2025-06-29  attach_recursive_mnt(): remove from expiry list on move  (Al Viro)
... rather than doing that in do_move_mount(). That's the main obstacle to moving the protection of ->mnt_expire from namespace_sem to mount_lock (spinlock-only), which would simplify several failure exits.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  do_move_mount(): get rid of 'attached' flag  (Al Viro)
'attached' serves as a proxy for "source is a subtree of our namespace and not the entirety of anon namespace"; finish massaging it away.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  do_move_mount(): take dropping the old mountpoint into attach_recursive_mnt()  (Al Viro)
... and fold it with unhash_mnt() there - there's no need to retain a reference to old_mp beyond that point, since by then all mountpoints we were going to add are either explicitly pinned by get_mountpoint() or have stuff already added to them.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): get rid of flags entirely  (Al Viro)
move vs. attach is trivially detected as mnt_has_parent(source_mnt)...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): pass destination mount in all cases  (Al Viro)
... and 'beneath' is no longer used there

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): unify the mnt_change_mountpoint() logics  (Al Viro)
The logics used for tucking under existing mount differs for original and copies; copies do a mount hash lookup to see if the mountpoint-to-be is already overmounted, while the original is told explicitly. But the same logics that is used for copies works for the original as well, at which point we get very close to eliminating the need of passing the 'beneath' flag to attach_recursive_mnt().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  make commit_tree() usable in same-namespace move case  (Al Viro)
Once attach_recursive_mnt() has created all copies of the original subtree, it needs to put them in place(s). Steps needed for those are slightly different:

1) in 'move' case, the original copy doesn't need any rbtree manipulations (everything's already in the same namespace where it will be), but it needs to be detached from the current location

2) in 'attach' case, the original may be in anon namespace; if it is, all those mounts need to be removed from their current namespace before insertion into the target one

3) additional copies have a couple of extra twists - in case of cross-userns propagation we need to lock everything other than the root of the subtree, and in case when we end up inserting under an existing mount, that mount needs to be found (for the original copy we have it explicitly passed by the caller).

Quite a bit of that can be unified; as the first step, make the commit_tree() helper (inserting mounts into namespace, hashing the root of subtree and marking the namespace as updated) usable in all cases; (2) and (3) are already using it and for (1) we only need to make the insertion of mounts into namespace conditional.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  Rewrite of propagate_umount()  (Al Viro)
The variant currently in the tree has problems; trying to prove correctness has caught at least one class of bugs (reparenting that ends up moving the visible location of reparented mount, due to not excluding some of the counterparts on propagation that should've been included). I tried to prove that it's the only bug there; I'm still not sure whether it is. If anyone can reconstruct and write down an analysis of the mainline implementation, I'll gladly review it; as it is, I ended up doing a different implementation.

Candidate collection phase is similar, but trimming the set down until it satisfies the constraints turned out pretty different. I hoped to do the transformation as a massage series, but that turns out to be too convoluted. So it's a single patch replacing propagate_umount() and friends in one go, with notes and analysis in D/f/propagate_umount.txt (in addition to inline comments).

As far as I can tell, it is provably correct and provably linear by the number of mounts we need to look at in order to decide what should be unmounted. It even builds and seems to survive testing... Another nice thing that fell out of that is that ->mnt_umounting is no longer needed.

Compared to the first version:

* explicit MNT_UMOUNT_CANDIDATE flag for is_candidate()
* trim_ancestors() only clears that flag, leaving the suckers on list
* trim_one() and handle_locked() take the stuff with flag cleared off the list. That allows us to iterate with list_for_each_entry_safe() when calling trim_one() - it removes at most one element from the list now.
* no globals - I didn't bother with any kind of context, not worth it.
* Notes updated accordingly; I have not touched the terms yet.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  sanitize handling of long-term internal mounts  (Al Viro)
Original rationale for those had been the reduced cost of mntput() for the stuff that is mounted somewhere. Mount refcount increments and decrements are frequent; what's worse, they tend to concentrate on the same instances and cacheline pingpong is quite noticeable. As the result, mount refcounts are per-cpu; that allows a very cheap increment. Plain decrement would be just as easy, but decrement-and-test is anything but (we need to add the components up, with exclusion against possible increment-from-zero, etc.).

Fortunately, there is a very common case where we can tell that decrement won't be the final one - if the thing we are dropping is currently mounted somewhere. We have an RCU delay between the removal from mount tree and dropping the reference that used to pin it there, so we can just take rcu_read_lock() and check if the victim is mounted somewhere. If it is, we can go ahead and decrement without any further checks - the reference we are dropping is not the last one. If it isn't, we get all the fun with locking, carefully adding up components, etc., but the majority of refcount decrements end up taking the fast path.

There is a major exception, though - pipes and sockets. Those live on the internal filesystems that are not going to be mounted anywhere. They are not going to be _un_mounted, of course, so having to take the slow path every time a pipe or socket gets closed is really obnoxious. Solution had been to mark them as long-lived ones - essentially faking the "they are mounted somewhere" indicator. With minor modification that works even for ones that do eventually get dropped - all it takes is making sure we have an RCU delay between clearing the "mounted somewhere" indicator and dropping the reference. There are some additional twists (if you want to drop a dozen such internal mounts, you'd be better off with clearing the indicator on all of them, doing an RCU delay once, then dropping the references), but in the basic form it had been:

* use kern_mount() if you want your internal mount to be a long-term one.
* use kern_unmount() to undo that.

Unfortunately, the things did rot a bit during the mount API reshuffling. In several cases we have lost the "fake the indicator" part; kern_unmount() on the unmount side remained (it doesn't warn if you use it on a mount without the indicator), but all benefits regarding mntput() cost had been lost.

To get rid of that bitrot, let's add a new helper that would work with fs_context-based API: fc_mount_longterm(). It's a counterpart of fc_mount() that does, on success, mark its result as long-term. It must be paired with kern_unmount() or equivalents.

Converted:

1) mqueue (it used to use kern_mount_data() and the umount side is still as it used to be)
2) hugetlbfs (used to use kern_mount_data(), internal mount is never unmounted in this one)
3) i915 gemfs (used to be kern_mount() + manual remount to set options, still uses kern_unmount() on umount side)
4) v3d gemfs (copied from i915)

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
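A hedged usage sketch, mqueue-style (fs_context_for_mount() and put_fs_context() are the existing fs_context API; error handling trimmed):

    /* Sketch: create a long-term internal mount via the new helper */
    struct fs_context *fc;
    struct vfsmount *mnt;

    fc = fs_context_for_mount(&mqueue_fs_type, 0);
    if (IS_ERR(fc))
            return ERR_CAST(fc);
    mnt = fc_mount_longterm(fc);    /* fc_mount() + long-term marking */
    put_fs_context(fc);
    /* ...and, much later, the mandatory pairing: */
    kern_unmount(mnt);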
2025-06-29  do_umount(): simplify the "is it still mounted" checks  (Al Viro)
Calls of do_umount() are always preceded by can_umount(), where we'd done a racy check for mount belonging to our namespace; if it wasn't, can_umount() would've failed with -EINVAL and we wouldn't have reached do_umount() at all. That check needs to be redone once we have acquired namespace_sem, and in do_umount() we do that. However, that's done in a very odd way; we check that mount is still in the rbtree of _some_ namespace or that its mnt_list is not empty.

It is equivalent to check_mnt(mnt) - we know that earlier mnt was mounted in our namespace; if it has stayed there, it's going to remain in the rbtree of our namespace. OTOH, if it ever had been removed from our namespace, it would be removed from the rbtree and it never would've been re-added to a namespace afterwards. As for ->mnt_list, for something that had been mounted in a namespace we'll never observe non-empty ->mnt_list while holding namespace_sem - it does temporarily become non-empty during umount_tree(), but that doesn't outlast the call of umount_tree(), let alone dropping namespace_sem.

Things get much easier to follow if we replace that with the (equivalent) check_mnt(mnt) there. What's more, currently we treat a failure of that test as "quietly do nothing"; we might as well pretend that we'd lost the race and fail the same way can_umount() would have.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
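A minimal sketch of the replacement test (check_mnt() is the existing "belongs to our namespace" predicate; placement in do_umount() assumed):

    /* Sketch: redo the namespace check under namespace_sem; treat
     * failure as having lost the race, as can_umount() would. */
    if (!check_mnt(mnt)) {
            retval = -EINVAL;
            goto out;
    }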
2025-06-29  clone_mnt(): simplify the propagation-related logics  (Al Viro)
The underlying rules are simple:

* MNT_SHARED should be set iff ->mnt_group_id of the new mount ends up non-zero.
* mounts should be on the same ->mnt_share cyclic list iff they have the same non-zero ->mnt_group_id value.
* CL_PRIVATE is mutually exclusive with CL_MAKE_SHARED, CL_SLAVE, CL_SHARED_TO_SLAVE and CL_EXPIRE; the whole point of that thing is to get a clone of the old mount that would *not* be on any namespace-related lists.

The above allows to make the logics more straightforward; what's more, it makes the proof that invariants are maintained much simpler. The variant in mainline is safe (aside of a very narrow race with unsafe modification of mnt_flags right after we had the mount exposed in superblock's ->s_mounts; theoretically it can race with ro remount of the original, but it's not easy to hit), but the proof of its correctness is really unpleasant.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  don't set MNT_LOCKED on parentless mounts  (Al Viro)
Originally MNT_LOCKED meant only one thing - "don't let this mount be peeled off its parent, we don't want to have its mountpoint exposed". Accordingly, it had only been set on mounts that *do* have a parent.

Later it got overloaded with another use - setting it on the absolute root had given free protection against umount(2) of the absolute root (was possible to trigger, oopsed). Not a bad trick, but it ended up costing more than it bought us. Unfortunately, the cost included both hard-to-reason-about logics and a subtle race between mount -o remount,ro and mount --[r]bind - lockless &= ~MNT_LOCKED in the end of __do_loopback() could race with sb_prepare_remount_readonly() setting and clearing MNT_HOLD_WRITE (under mount_lock, as it should be). The race wouldn't be much of a problem (there are other ways to deal with it), but the subtlety is.

Turns out that nobody except umount(2) had ever made use of having MNT_LOCKED set on the absolute root. So let's give up on that trick, clever as it had been, add an explicit check in do_umount() and return to using MNT_LOCKED only for mounts that have a parent. It means that

* clone_mnt() no longer copies MNT_LOCKED
* copy_tree() sets it on submounts if their counterparts had been marked such, and does that right next to attach_mnt() in there, in the same mount_lock scope.
* __do_loopback() no longer needs to strip MNT_LOCKED off the root of subtree it's about to return; no store, no race.
* init_mount_tree() doesn't bother setting MNT_LOCKED on the absolute root.
* lock_mnt_tree() does not set MNT_LOCKED on the subtree's root; accordingly, its caller (the loop in attach_recursive_mnt()) does not need to bother stripping that MNT_LOCKED on root.

Note that lock_mnt_tree() setting MNT_LOCKED on submounts happens in the same mount_lock scope as __attach_mnt() (from commit_tree()) that makes them reachable.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  __attach_mnt(): lose the second argument  (Al Viro)
It's always ->mnt_parent of the first one. What the function does is making a mount (with already set parent and mountpoint) visible - in mount hash and in the parent's list of children. IOW, it takes the existing rootwards linkage and sets the matching crownwards linkage. Renamed to make_visible(), while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  dissolve_on_fput(): use anon_ns_root()  (Al Viro)
that's the condition we are actually trying to check there...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  new predicate: anon_ns_root(mount)  (Al Viro)
checks if mount is the root of an anonymous namespace. Switch open-coded equivalents to using it. For mounts that belong to an anon namespace !mnt_has_parent(mount) is the same as mount == ns->root, and the intent is more obvious in the latter form.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
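A sketch of the predicate as described, with the exact form assumed (is_anon_ns() is the existing helper in fs/mount.h):

    /* Sketch: is this mount the root of an anonymous namespace? */
    static inline bool anon_ns_root(const struct mount *m)
    {
            struct mnt_namespace *ns = m->mnt_ns;

            return ns && is_anon_ns(ns) && m == ns->root;
    }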
2025-06-29  constify is_local_mountpoint()  (Al Viro)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  new predicate: mount_is_ancestor()  (Al Viro)
mount_is_ancestor(p1, p2) returns true iff there is a possibly empty ancestry chain from p1 to p2. Convert the open-coded checks. Unlike those open-coded variants it does not depend upon p1 not being root...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
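A sketch of what such a predicate looks like (mnt_has_parent() is the existing helper; exact form assumed):

    /* Sketch: true iff p1 is on the (possibly empty) ancestry chain
     * leading rootwards from p2 */
    static bool mount_is_ancestor(const struct mount *p1, const struct mount *p2)
    {
            while (p2 != p1 && mnt_has_parent(p2))
                    p2 = p2->mnt_parent;
            return p2 == p1;
    }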
2025-06-29  pnode: lift peers() into pnode.h  (Al Viro)
it's going to be useful both in pnode.c and namespace.c

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  constify mnt_has_parent()  (Al Viro)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  copy_tree(): don't set ->mnt_mountpoint on the root of copy  (Al Viro)
It never made any sense - neither when copy_tree() had been introduced (2.4.11-pre5), nor at any point afterwards. Mountpoint is meaningless without parent mount and the root of copied tree has no parent until we get around to attaching it somewhere. At that time we'll have mountpoint set; before that we have no idea which dentry will be used as mountpoint. IOW, copy_tree() should just leave the default value.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  prevent mount hash conflicts  (Al Viro)
Currently it's still possible to run into a pathological situation when two hashed mounts share both parent and mountpoint. That does not work well, for obvious reasons.

We are not far from getting rid of that; the only remaining gap is attach_recursive_mnt() not being careful enough when sliding a tree under existing mount (for propagated copies, or in 'beneath' case for the original one).

To deal with that cleanly we need to be able to find overmounts (i.e. mounts on top of parent's root); we could do hash lookups or scan the list of children, but either would be costly. Since one of the results we get from that will be prevention of multiple parallel overmounts, let's just bite the bullet and store a (non-counting) reference to overmount in struct mount.

With that done, closing the hole in attach_recursive_mnt() becomes easy - we just need to follow the chain of overmounts before we change the mountpoint of the mount we are sliding things under.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
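A hedged sketch of the attach-side use, with ->overmount as the non-counting reference named above (field name assumed):

    /* Sketch: before sliding the tree under 'dest', follow existing
     * overmounts, so no two mounts end up sharing parent and
     * mountpoint. */
    while (dest->overmount)
            dest = dest->overmount;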