path: root/fs
Age         Commit message                                                                Author
2025-07-01  smb: client: set missing retry flag in smb2_writev_callback()  (Paulo Alcantara)
Set NETFS_SREQ_NEED_RETRY flag to tell netfslib that the subreq needs to be retried. Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-7-dhowells@redhat.com Tested-by: Steve French <sfrench@samba.org> Cc: linux-cifs@vger.kernel.org Cc: netfs@lists.linux.dev Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  netfs: Fix ref leak on inserted extra subreq in write retry  (David Howells)
The write-retry algorithm will insert extra subrequests into the list if it can't get sufficient capacity to split the range that needs to be retried into the sequence of subrequests it currently has (for instance, if the cifs credit pool has fewer credits available than it did when the range was originally divided). However, the allocator furnishes each new subreq with 2 refs and then another is added for resubmission, causing one to be leaked. Fix this by replacing the ref-getting line with a neutral trace line. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-6-dhowells@redhat.com Tested-by: Steve French <sfrench@samba.org> Reviewed-by: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  netfs: Fix looping in wait functions  (David Howells)
netfs_wait_for_request() and netfs_wait_for_pause() can loop forever if netfs_collect_in_app() returns 2, indicating that it wants to repeat because the ALL_QUEUED flag isn't yet set and there are no subreqs left that haven't been collected. The problem is that, unless collection is offloaded (OFFLOAD_COLLECTION), we have to return to the application thread to continue and eventually set ALL_QUEUED after pausing to deal with a retry - but we never get there. Fix this by inserting checks for the IN_PROGRESS and PAUSE flags as appropriate before cycling round - and add cond_resched() for good measure. Fixes: 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-5-dhowells@redhat.com Tested-by: Steve French <sfrench@samba.org> Reviewed-by: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling  (David Howells)
Provide helpers to clear and test the NETFS_RREQ_IN_PROGRESS flag and to insert the appropriate barrierage. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-4-dhowells@redhat.com Tested-by: Steve French <sfrench@samba.org> Reviewed-by: Paulo Alcantara <pc@manguebit.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
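As a minimal sketch of what such clear/test helpers can look like (the bit name and helper names below are illustrative, not the actual netfs code): the clear side needs release semantics plus a barrier before waking waiters, and the test side needs acquire semantics so the waiter observes everything written before the clear.

    static inline void example_clear_in_progress(unsigned long *flags)
    {
            clear_bit_unlock(EXAMPLE_IN_PROGRESS, flags);   /* clear with release semantics */
            smp_mb__after_atomic();                         /* order the clear before the waitqueue check */
            wake_up_bit(flags, EXAMPLE_IN_PROGRESS);        /* wake anyone sleeping on this bit */
    }

    static inline bool example_check_in_progress(const unsigned long *flags)
    {
            return test_bit_acquire(EXAMPLE_IN_PROGRESS, flags);    /* read with acquire semantics */
    }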
2025-07-01  netfs: Fix double put of request  (David Howells)
If a netfs request finishes during the pause loop, it will have the ref that belongs to the IN_PROGRESS flag removed at that point - however, if it then goes to the final wait loop, that will *also* put the ref because it sees that the IN_PROGRESS flag is clear and incorrectly assumes that this happened when it called the collector. In fact, since IN_PROGRESS is clear, we shouldn't call the collector again since it's done all the cleanup, such as calling ->ki_complete(). Fix this by making netfs_collect_in_app() just return, indicating that we're done if IN_PROGRESS is removed. Fixes: 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-3-dhowells@redhat.com Tested-by: Steve French <sfrench@samba.org> Reviewed-by: Paulo Alcantara <pc@manguebit.org> cc: Steve French <sfrench@samba.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  netfs: Fix hang due to missing case in final DIO read result collection  (David Howells)
When doing a DIO read, if the subrequests we issue fail and cause the request PAUSE flag to be set to put a pause on subrequest generation, we may complete collection of the subrequests (possibly discarding them) prior to the ALL_QUEUED flags being set. In such a case, netfs_read_collection() doesn't see ALL_QUEUED being set after netfs_collect_read_results() returns and will just return to the app (the collector can be seen unpausing the generator in the trace log). The subrequest generator can then set ALL_QUEUED and the app thread reaches netfs_wait_for_request(). This causes netfs_collect_in_app() to be called to see if we're done yet, but there's a missing case here. netfs_collect_in_app() will see that a thread is active and set inactive to false, but won't see any subrequests in the read stream, and so won't set need_collect to true. The function will then just return 0, indicating that the caller should just sleep until further activity (which won't be forthcoming) occurs. Fix this by making netfs_collect_in_app() check to see if an active thread is complete - i.e. that ALL_QUEUED is set and the subrequests list is empty - and to skip the sleep return path. The collector will then be called which will clear the request IN_PROGRESS flag, allowing the app to progress. Fixes: 2b1424cd131c ("netfs: Fix wait/wake to be consistent about the waitqueue used") Reported-by: Steve French <sfrench@samba.org> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250701163852.2171681-2-dhowells@redhat.com Tested-by: Steve French <sfrench@samba.org> Reviewed-by: Paulo Alcantara <pc@manguebit.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01  eventpoll: Fix priority inversion problem  (Nam Cao)
The ready event list of an epoll object is protected by a read-write semaphore:

 - The consumer (waiter) acquires the write lock and takes items.
 - The producer (waker) takes the read lock and adds items.

The point of this design is to let epoll scale well with a large number of producers, as multiple producers can hold the read lock at the same time.

Unfortunately, this implementation may cause a scheduling priority inversion problem. Suppose the consumer has higher scheduling priority than the producer. The consumer needs to acquire the write lock, but may be blocked by the producer holding the read lock. Since the read-write semaphore does not support priority boosting for the readers (even with CONFIG_PREEMPT_RT=y), we have a case of priority inversion: a higher-priority consumer is blocked by a lower-priority producer. This problem was reported in [1]. Furthermore, this could also cause a stall problem, as described in [2].

To fix this problem, make the event list half-lockless:

 - The consumer acquires a mutex (ep->mtx) and takes items.
 - The producer locklessly adds items to the list.

Performance is not the main goal of this patch, but as the producer can now add items without waiting for the consumer to release the lock, a performance improvement is observed using the stress test from https://github.com/rouming/test-tools/blob/master/stress-epoll.c. This is the same test that justified using a read-write semaphore in the past.

Testing using 12 x86_64 CPUs:

            Before     After      Diff
  threads   events/ms  events/ms
        8        6932      19753   +185%
       16        7820      27923   +257%
       32        7648      35164   +360%
       64        9677      37780   +290%
      128       11166      38174   +242%

Testing using 1 riscv64 CPU (averaged over 10 runs, as the numbers are noisy):

            Before     After      Diff
  threads   events/ms  events/ms
        1          73        129    +77%
        2         151        216    +43%
        4         216        364    +69%
        8         234        382    +63%
       16         251        392    +56%

Reported-by: Frederic Weisbecker <frederic@kernel.org> Closes: https://lore.kernel.org/linux-rt-users/20210825132754.GA895675@lothringen/ [1] Reported-by: Valentin Schneider <vschneid@redhat.com> Closes: https://lore.kernel.org/linux-rt-users/xhsmhttqvnall.mognet@vschneid.remote.csb/ [2] Signed-off-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/20250527090836.1290532-1-namcao@linutronix.de Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
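For illustration, here is a small self-contained user-space model of the half-lockless scheme described above (a sketch of the idea, not the eventpoll code): producers push onto a lock-free singly-linked list with compare-and-swap, and the single consumer detaches the whole list under its mutex.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct item {
            int event;
            struct item *next;
    };

    static _Atomic(struct item *) ready_head;       /* lockless producer side */
    static pthread_mutex_t consumer_mtx = PTHREAD_MUTEX_INITIALIZER;

    static void produce(int event)                  /* any context, no lock taken */
    {
            struct item *it = malloc(sizeof(*it));

            it->event = event;
            it->next = atomic_load_explicit(&ready_head, memory_order_relaxed);
            /* classic lock-free push: retry with the updated head on failure */
            while (!atomic_compare_exchange_weak_explicit(&ready_head, &it->next, it,
                                                          memory_order_release,
                                                          memory_order_relaxed))
                    ;
    }

    static void consume(void)                       /* single consumer, under its mutex */
    {
            pthread_mutex_lock(&consumer_mtx);
            struct item *list = atomic_exchange_explicit(&ready_head, NULL,
                                                         memory_order_acquire);
            while (list) {                          /* items come out LIFO in this model */
                    struct item *next = list->next;
                    printf("event %d\n", list->event);
                    free(list);
                    list = next;
            }
            pthread_mutex_unlock(&consumer_mtx);
    }

    int main(void)
    {
            for (int i = 0; i < 4; i++)
                    produce(i);
            consume();
            return 0;
    }

The point the model makes is the same as the patch's: a producer never waits on the consumer's lock, so a low-priority producer can no longer block a high-priority consumer.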
2025-07-01  lib/group_cpus: Let group_cpu_evenly() return the number of initialized masks  (Daniel Wagner)
group_cpu_evenly() might have allocated fewer groups than requested:

group_cpu_evenly()
  __group_cpus_evenly()
    alloc_nodes_groups()
      # the number of allocated groups may be less than numgrps when
      # the number of active CPUs is less than numgrps

In this case, the caller will do an out-of-bounds access because it assumes the returned masks have numgrps entries. Return the number of groups created so the caller can limit the access range accordingly. Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250617-isolcpus-queue-counters-v1-1-13923686b54b@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
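The caller-side pattern this implies, sketched below (the exact signature in lib/group_cpus.c may differ, and setup_queue_affinity() is a made-up consumer): bound the loop by the reported count rather than by numgrps.

    unsigned int nr_masks;
    struct cpumask *masks = group_cpus_evenly(numgrps, &nr_masks);  /* may create < numgrps groups */

    if (!masks)
            return -ENOMEM;
    for (unsigned int i = 0; i < nr_masks; i++)     /* not: i < numgrps */
            setup_queue_affinity(i, &masks[i]);     /* hypothetical consumer */
    kfree(masks);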
2025-07-01  f2fs: fix to use f2fs_is_valid_blkaddr_raw() in do_write_page()  (Chao Yu)
As syzbot reported below:

F2FS-fs (loop9): inject invalid blkaddr in f2fs_is_valid_blkaddr of do_write_page+0x277/0xb10 fs/f2fs/segment.c:3956
------------[ cut here ]------------
kernel BUG at fs/f2fs/segment.c:3957!
Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 UID: 0 PID: 10538 Comm: syz-executor Not tainted 6.16.0-rc3-next-20250627-syzkaller #0 PREEMPT(full)
Call Trace:
 <TASK>
 f2fs_outplace_write_data+0x11a/0x220 fs/f2fs/segment.c:4017
 f2fs_do_write_data_page+0x12ea/0x1a40 fs/f2fs/data.c:2752
 f2fs_write_single_data_page+0xa68/0x1680 fs/f2fs/data.c:2851
 f2fs_write_cache_pages fs/f2fs/data.c:3133 [inline]
 __f2fs_write_data_pages fs/f2fs/data.c:3282 [inline]
 f2fs_write_data_pages+0x195b/0x3000 fs/f2fs/data.c:3309
 do_writepages+0x32b/0x550 mm/page-writeback.c:2636
 filemap_fdatawrite_wbc mm/filemap.c:386 [inline]
 __filemap_fdatawrite_range mm/filemap.c:419 [inline]
 __filemap_fdatawrite mm/filemap.c:425 [inline]
 filemap_fdatawrite+0x199/0x240 mm/filemap.c:430
 f2fs_sync_dirty_inodes+0x31f/0x830 fs/f2fs/checkpoint.c:1108
 block_operations fs/f2fs/checkpoint.c:1247 [inline]
 f2fs_write_checkpoint+0x95a/0x1df0 fs/f2fs/checkpoint.c:1638
 kill_f2fs_super+0x2c3/0x6c0 fs/f2fs/super.c:5081
 deactivate_locked_super+0xb9/0x130 fs/super.c:474
 cleanup_mnt+0x425/0x4c0 fs/namespace.c:1417
 task_work_run+0x1d4/0x260 kernel/task_work.c:227
 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
 exit_to_user_mode_loop+0xec/0x110 kernel/entry/common.c:114
 exit_to_user_mode_prepare include/linux/entry-common.h:330 [inline]
 syscall_exit_to_user_mode_work include/linux/entry-common.h:414 [inline]
 syscall_exit_to_user_mode include/linux/entry-common.h:449 [inline]
 do_syscall_64+0x2bd/0x3b0 arch/x86/entry/syscall_64.c:100
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

If we inject a block address fault, it may trigger a kernel panic; we need to use f2fs_is_valid_blkaddr_raw() instead of f2fs_is_valid_blkaddr() in do_write_page() to avoid such an issue. Fixes: 70b6e8500431 ("f2fs: do sanity check on fio.new_blkaddr in do_write_page()") Reported-by: syzbot+9201a61c060513d4be38@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/68639520.a70a0220.3b7e22.17e6.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-01  f2fs: avoid splitting bio when reading multiple pages  (Jianan Huang)
When fewer pages are read, nr_pages may be smaller than nr_cpages. Due to the nr_vecs limit, the compressed pages will be split into multiple bios and then merged at the block level. In this case, nr_cpages should be used to pre-allocate bvecs. To handle this case, align max_nr_pages to cluster_size, which should be enough for all compressed pages. Signed-off-by: Jianan Huang <huangjianan@xiaomi.com> Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
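The idea, as a rough sketch rather than the exact f2fs code (cluster_size here stands for the compression cluster size, assumed to be a power of two): size the bio for whole clusters so a cluster's compressed pages are not split across bios.

    /* round the pre-allocated bvec count up to a whole compression cluster */
    unsigned int max_nr_pages = ALIGN(nr_pages, cluster_size);
    struct bio *bio = bio_alloc(bdev, bio_max_segs(max_nr_pages),
                                REQ_OP_READ, GFP_NOFS);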
2025-07-01  f2fs: check the generic conditions first  (Jaegeuk Kim)
Let's return errors caught by the generic checks. This fixes generic/494, which expects to see EBUSY from setattr_prepare instead of EINVAL from f2fs for an active swapfile. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
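Schematically (an illustrative ->setattr shape, not the f2fs code): run the generic VFS validation before any filesystem-specific checks so that its error code wins.

    static int example_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
                               struct iattr *attr)
    {
            int err = setattr_prepare(idmap, dentry, attr); /* generic conditions first */

            if (err)
                    return err;     /* e.g. the EBUSY that generic/494 expects */

            /* filesystem-specific restrictions (pinned files, swapfiles, ...) go here */
            return 0;
    }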
2025-06-30  bcachefs: Fix incorrect transaction restart handling  (Alan Huang)
Reported-by: syzbot+cc7567f096079cb4146f@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-30  cifs: all initializations for tcon should happen in tcon_info_alloc  (Shyam Prasad N)
Today, a few work structs inside tcon are initialized in cifs_get_tcon and not in tcon_info_alloc. As a result, if a tcon is obtained from tcon_info_alloc but not as part of cifs_get_tcon, we may trip over these uninitialized work structs. Cc: <stable@vger.kernel.org> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
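The shape of the fix, sketched with illustrative struct, field and worker names (not the exact cifs ones): do the work-struct initialization in the allocator, so every tcon gets it regardless of which path created it.

    struct example_tcon *tcon_info_alloc_sketch(void)
    {
            struct example_tcon *tcon = kzalloc(sizeof(*tcon), GFP_KERNEL);

            if (!tcon)
                    return NULL;
            /* previously done only in the cifs_get_tcon() path */
            INIT_DELAYED_WORK(&tcon->query_interfaces, example_query_interfaces_worker);
            /* ... other per-tcon initialization ... */
            return tcon;
    }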
2025-06-30  smb: client: fix warning when reconnecting channel  (Paulo Alcantara)
When reconnecting a channel in smb2_reconnect_server(), a dummy tcon is passed down to smb2_reconnect() with ->query_interface uninitialized, so we can't call queue_delayed_work() on it. Fix the following warning by ensuring that we're queueing the delayed worker from the correct tcon.

WARNING: CPU: 4 PID: 1126 at kernel/workqueue.c:2498 __queue_delayed_work+0x1d2/0x200
Modules linked in: cifs cifs_arc4 nls_ucs2_utils cifs_md4 [last unloaded: cifs]
CPU: 4 UID: 0 PID: 1126 Comm: kworker/4:0 Not tainted 6.16.0-rc3 #5 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-4.fc42 04/01/2014
Workqueue: cifsiod smb2_reconnect_server [cifs]
RIP: 0010:__queue_delayed_work+0x1d2/0x200
Code: 41 5e 41 5f e9 7f ee ff ff 90 0f 0b 90 e9 5d ff ff ff bf 02 00 00 00 e8 6c f3 07 00 89 c3 eb bd 90 0f 0b 90 e9 57 f> 0b 90 e9 65 fe ff ff 90 0f 0b 90 e9 72 fe ff ff 90 0f 0b 90 e9
RSP: 0018:ffffc900014afad8 EFLAGS: 00010003
RAX: 0000000000000000 RBX: ffff888124d99988 RCX: ffffffff81399cc1
RDX: dffffc0000000000 RSI: ffff888114326e00 RDI: ffff888124d999f0
RBP: 000000000000ea60 R08: 0000000000000001 R09: ffffed10249b3331
R10: ffff888124d9998f R11: 0000000000000004 R12: 0000000000000040
R13: ffff888114326e00 R14: ffff888124d999d8 R15: ffff888114939020
FS: 0000000000000000(0000) GS:ffff88829f7fe000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffe7a2b4038 CR3: 0000000120a6f000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 queue_delayed_work_on+0xb4/0xc0
 smb2_reconnect+0xb22/0xf50 [cifs]
 smb2_reconnect_server+0x413/0xd40 [cifs]
 ? __pfx_smb2_reconnect_server+0x10/0x10 [cifs]
 ? local_clock_noinstr+0xd/0xd0
 ? local_clock+0x15/0x30
 ? lock_release+0x29b/0x390
 process_one_work+0x4c5/0xa10
 ? __pfx_process_one_work+0x10/0x10
 ? __list_add_valid_or_report+0x37/0x120
 worker_thread+0x2f1/0x5a0
 ? __kthread_parkme+0xde/0x100
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x1fe/0x380
 ? kthread+0x10f/0x380
 ? __pfx_kthread+0x10/0x10
 ? local_clock_noinstr+0xd/0xd0
 ? ret_from_fork+0x1b/0x1f0
 ? local_clock+0x15/0x30
 ? lock_release+0x29b/0x390
 ? rcu_is_watching+0x20/0x50
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x15b/0x1f0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
irq event stamp: 1116206
hardirqs last enabled at (1116205): [<ffffffff8143af42>] __up_console_sem+0x52/0x60
hardirqs last disabled at (1116206): [<ffffffff81399f0e>] queue_delayed_work_on+0x6e/0xc0
softirqs last enabled at (1116138): [<ffffffffc04562fd>] __smb_send_rqst+0x42d/0x950 [cifs]
softirqs last disabled at (1116136): [<ffffffff823d35e1>] release_sock+0x21/0xf0

Cc: linux-cifs@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Fixes: 42ca547b13a2 ("cifs: do not disable interface polling on failure") Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Steve French <stfrench@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-30  f2fs: don't allow unaligned truncation to smaller/equal size on pinned file  (wangzijie)
To prevent scattered pin block generation, don't allow non-section-aligned truncation to a smaller or equal size on a pinned file. For truncation to a larger size, after commit 3fdd89b452c2 ("f2fs: prevent writing without fallocate() for pinned files"), we only support overwrite IO to pinned files, so we don't need to consider the attr->ia_size > i_size case. Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
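A rough sketch of such a check (illustrative; section_bytes stands in for the section granularity that pinned blocks are allocated in and is assumed to be precomputed): reject shrinking or same-size truncates that are not section-aligned.

    if (f2fs_is_pin_file(inode) &&
        attr->ia_size <= i_size_read(inode) &&
        !IS_ALIGNED(attr->ia_size, section_bytes))  /* section_bytes: assumed precomputed */
            return -EINVAL;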
2025-06-30  f2fs: fix to check upper boundary for gc_no_zoned_gc_percent  (Chao Yu)
This patch adds a missing upper boundary check while setting gc_no_zoned_gc_percent via sysfs. Fixes: 9a481a1c16f4 ("f2fs: create gc_no_zoned_gc_percent and gc_boost_zoned_gc_percent") Cc: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
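The same pattern applies to this and the related boundary fixes further down in this series; a generic sketch of the inner part of a percentage-style sysfs store handler with the missing upper bound added (the target field is hypothetical):

    unsigned int val;

    if (kstrtouint(buf, 10, &val))
            return -EINVAL;
    if (val > 100)                          /* the previously missing upper boundary check */
            return -EINVAL;
    gc_th->no_zoned_gc_percent = val;       /* hypothetical target field */
    return count;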
2025-06-30  f2fs: fix to check upper boundary for gc_valid_thresh_ratio  (Chao Yu)
This patch adds a missing upper boundary check while setting gc_valid_thresh_ratio via sysfs. Fixes: e791d00bd06c ("f2fs: add valid block ratio not to do excessive GC for one time GC") Cc: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-30  f2fs: account and print more stats during recovery  (Chao Yu)
F2FS-fs (vdc): f2fs_recover_fsync_data: recovery fsync data, check_only: 0
F2FS-fs (vdc): do_recover_data: start to recover dnode
F2FS-fs (vdc): recover_inode: ino = 5, name = testfile.t2, inline = 21
F2FS-fs (vdc): recover_data: ino = 5, nid = 5 (i_size: recover), range (0, 864), recovered = 1, err = 0
F2FS-fs (vdc): do_recover_data: dnode: (recoverable: 256, fsynced: 256, total: 256), recovered: (inode: 256, dentry: 1, dnode: 256), err: 0

Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-30  f2fs: enable tuning of boost_zoned_gc_percent via sysfs  (yohan.joung)
Allow users to dynamically tune the boost_zoned_gc_percent parameter via sysfs. Signed-off-by: yohan.joung <yohan.joung@sk.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-30  f2fs: fix to check upper boundary for value of gc_boost_zoned_gc_percent  (yohan.joung)
Add a missing upper boundary check when setting gc_boost_zoned_gc_percent via sysfs. Fixes: 9a481a1c16f4 ("f2fs: create gc_no_zoned_gc_percent and gc_boost_zoned_gc_percent") Signed-off-by: yohan.joung <yohan.joung@sk.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-30  f2fs: fix KMSAN uninit-value in extent_info usage  (Abinash Singh)
KMSAN reported a use of uninitialized value in `__is_extent_mergeable()` and `__is_back_mergeable()` via the read extent tree path. The root cause is that `get_read_extent_info()` only initializes three fields (`fofs`, `blk`, `len`) of `struct extent_info`, leaving the remaining fields uninitialized. This leads to undefined behavior when those fields are accessed later, especially during extent merging. Fix it by zero-initializing the `extent_info` struct before population. Reported-by: syzbot+b8c1d60e95df65e827d4@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b8c1d60e95df65e827d4 Fixes: 94afd6d6e525 ("f2fs: extent cache: support unaligned extent") Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Abinash Singh <abinashsinghlalotra@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
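The fix pattern, in brief (a sketch; the fofs/blk/len field names are the ones named in the message above): zero the whole struct before filling in the subset of fields, so the merge helpers never compare indeterminate values.

    struct extent_info ei = {};     /* zero-initialize every field up front */

    ei.fofs = fofs;                 /* file offset of the extent */
    ei.blk  = blk;                  /* start block address */
    ei.len  = len;                  /* extent length */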
2025-06-30  btrfs: stop parsing crc32c driver name  (Eric Biggers)
To determine whether the crc32c implementation is "fast", use crc32_optimizations() instead of parsing the crypto_shash driver name. This keeps the code working as intended after the driver name is changed by the next commit. Acked-by: David Sterba <dsterba@suse.com> Link: https://lore.kernel.org/r/20250613183753.31864-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
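A sketch of the new-style check (assuming the CRC32C optimization bit reported by crc32_optimizations(); the btrfs helper itself may be shaped differently):

    #include <linux/crc32.h>

    static bool crc32c_impl_is_fast(void)
    {
            /* true when an architecture-accelerated CRC32C implementation is usable */
            return crc32_optimizations() & CRC32C_OPTIMIZATION;
    }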
2025-06-30  xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags mask  (Youling Tang)
Add FALLOC_FL_ALLOCATE_RANGE to the set of supported fallocate flags in XFS_FALLOC_FL_SUPPORTED. This improves code clarity and maintainability by explicitly listing the flag in the supported flags mask. Note that since FALLOC_FL_ALLOCATE_RANGE is defined as 0x00, this addition makes no functional change. Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
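Because FALLOC_FL_ALLOCATE_RANGE is 0x00, spelling it out in a supported-flags mask is purely documentary; a generic sketch (not the literal XFS_FALLOC_FL_SUPPORTED definition):

    /* FALLOC_FL_ALLOCATE_RANGE is 0x00, so OR-ing it in changes nothing */
    #define EXAMPLE_FALLOC_FL_SUPPORTED                       \
            (FALLOC_FL_ALLOCATE_RANGE | FALLOC_FL_KEEP_SIZE | \
             FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE)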
2025-06-29  statmount_mnt_basic(): simplify the logics for group id  (Al Viro)
We are holding namespace_sem shared and we have not done any group id allocations since we grabbed it. Therefore IS_MNT_SHARED(m) is equivalent to non-zero m->mnt_group_id. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  invent_group_ids(): zero ->mnt_group_id always implies !IS_MNT_SHARED()  (Al Viro)
All places where we call set_mnt_shared() are guaranteed to have non-zero ->mnt_group_id - either by explicit test, or by having done successful invent_group_ids() covering the same mount since we'd grabbed namespace_sem. The opposite combination (non-zero ->mnt_group_id and !IS_MNT_SHARED()) *is* possible - it means that we have allocated group id, but didn't get around to set_mnt_shared() yet; such state is transient - by the time we do namespace_unlock(), we must either do set_mnt_shared() or unroll the group id allocations by cleanup_group_ids(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  get rid of CL_SHARE_TO_SLAVE  (Al Viro)
the only difference between it and CL_SLAVE is in this predicate in clone_mnt(): if ((flag & CL_SLAVE) || ((flag & CL_SHARED_TO_SLAVE) && IS_MNT_SHARED(old))) { However, in case of CL_SHARED_TO_SLAVE we have not allocated any mount group ids since the time we'd grabbed namespace_sem, so IS_MNT_SHARED() is equivalent to non-zero ->mnt_group_id. And in case of CL_SLAVE old has come either from the original tree, which had ->mnt_group_id allocated for all nodes or from result of sequence of CL_MAKE_SHARED or CL_MAKE_SHARED|CL_SLAVE copies, ultimately going back to the original tree. In both cases we are guaranteed that old->mnt_group_id will be non-zero. In other words, the predicate is always equal to (flags & (CL_SLAVE | CL_SHARED_TO_SLAVE)) && old->mnt_group_id and with that replacement CL_SLAVE and CL_SHARED_TO_SLAVE have exact same behaviour. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  take freeing of emptied mnt_namespace to namespace_unlock()  (Al Viro)
Freeing of a namespace must be delayed until after we'd dealt with mount notifications (in namespace_unlock()). The reasons are not immediately obvious (they are buried in ->prev_ns handling in mnt_notify()), and having that free_mnt_ns() explicitly called after namespace_unlock() is asking for trouble - it does feel like they should be OK to free as soon as they've been emptied. Make the things more explicit by setting 'emptied_ns' under namespace_sem and having namespace_unlock() free the sucker as soon as it's safe to free. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  copy_tree(): don't link the mounts via mnt_list  (Al Viro)
The only place that really needs to be adjusted is commit_tree() - there we need to iterate through the copy and we might as well use next_mnt() for that. However, in case when our tree has been slid under something already mounted (propagation to a mountpoint that already has something mounted on it or a 'beneath' move_mount) we need to take care not to walk into the overmounting tree. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  change_mnt_propagation(): move ->mnt_master assignment into MS_SLAVE case  (Al Viro)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  mnt_slave_list/mnt_slave: turn into hlist_head/hlist_node  (Al Viro)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  turn do_make_slave() into transfer_propagation()  (Al Viro)
Lift calculation of replacement propagation source, removal from peer group and assignment of ->mnt_master from do_make_slave() into change_mnt_propagation() itself. What remains is switching of what used to get propagation *through* mnt to alternative source. Rename to transfer_propagation(), passing it the replacement source as the second argument. Have it return void, while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  do_make_slave(): choose new master sanely  (Al Viro)
When mount changes propagation type so that it doesn't propagate events any more (MS_PRIVATE, MS_SLAVE, MS_UNBINDABLE), we need to make sure that event propagation between other mounts is unaffected. We need to make sure that events from peers and master of that mount (if any) still reach everything that used to be on its ->mnt_slave_list.

If mount has neither peers nor master, we simply need to dissolve its ->mnt_slave_list and clear ->mnt_master of everything in there. If mount has peers, we transfer everything in ->mnt_slave_list of this mount into that of some of those peers (and adjust ->mnt_master accordingly). If mount has a master but no peers, we transfer everything in ->mnt_slave_list of this mount into that of its master (adjusting ->mnt_master, etc.).

There are two problems with the current implementation:

 * there's a long-obsolete logics in choosing the peer - once upon a time it made sense to prefer the peer that had the same ->mnt_root as our mount, but that had been pointless since 2014 ("smarter propagate_mnt()")
 * the most common caller of that thing is umount_tree() taking the mounts out of propagation graph. In that case it's possible to have ->mnt_slave_list contents moved many times, since the replacement master is likely to be taken out by the same umount_tree(), etc.

Take the choice of replacement master into a separate function (propagation_source()) and teach it to skip the candidates that are going to be taken out. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  change_mnt_propagation(): do_make_slave() is a no-op unless IS_MNT_SHARED()  (Al Viro)
... since mnt->mnt_share and mnt->mnt_slave_list are guaranteed to be empty unless IS_MNT_SHARED(mnt). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  change_mnt_propagation() cleanups, step 1  (Al Viro)
Lift changing ->mnt_slave from do_make_slave() into the caller. Simplifies the next steps... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_mnt(): fix comment and convert to kernel-doc, while we are at it  (Al Viro)
Mountpoint is passed as struct mountpoint *, not struct dentry * (and called dest_mp, not dest_dentry) since 2013. Roots of created copies are linked via mnt_hash, not mnt_list since a bit before the merge into mainline back in 2005. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_mnt(): get rid of last_dest  (Al Viro)
Its only use is choosing the type of copy - CL_MAKE_SHARED if there already is a copy in that peer group, CL_SLAVE or CL_SLAVE | CL_MAKE_SHARED otherwise. But that's easy to keep track of - just set type in the beginning of group and reset to CL_MAKE_SHARED after the first created secondary in it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  fs/pnode.c: get rid of globals  (Al Viro)
this stuff can be local in propagate_mnt() now (and in some cases duplicates the existing variables there) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): fold into the sole caller  (Al Viro)
mechanical expansion; will be cleaned up on the next step Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): separate the "what should be the master for this copy" part  (Al Viro)
When we create the first copy for a peer group, it becomes a slave of one of the existing copies; take that logics into a separate helper - find_master(parent, last_copy, original). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): separate the "do we need secondary here?" logics  (Al Viro)
take the checks into separate helper - need_secondary(mount, mountpoint). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_mnt(): handle all peer groups in the same loop  (Al Viro)
the only difference is that for the original group we want to skip the first element; not worth having the logics twice... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  propagate_one(): get rid of dest_master  (Al Viro)
propagate_mnt() takes the subtree we are about to attach and creates its copies, setting the propagation between those. Each copy is cloned either from the original or from one of the already created copies. The tricky part is choosing the right copy to serve as a master when we are starting a new peer group. The algorithm for doing that selection puts temporary marks on the masters of mountpoints that already got a copy created for them; since the initial peer group might have no master at all, we need to special-case that when looking for the mark. Currently we do that by memorizing the master of original peer group. It works, but we get yet another piece of data to pass from propagate_mnt() to propagate_one(). Alternative is to mark the master of original peer group if not NULL, turning the check into "master is NULL or marked". Less data to pass around and memory safety is more obvious that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  mount: separate the flags accessed only under namespace_sem  (Al Viro)
Several flags are updated and checked only under namespace_sem; we are already making use of that when we are checking them without mount_lock, but we have to hold mount_lock for all updates, which makes things clumsier than they have to be. Take MNT_SHARED, MNT_UNBINDABLE, MNT_MARKED and MNT_UMOUNT_CANDIDATE into a separate field (->mnt_t_flags), renaming them to T_SHARED, etc. to avoid confusion. All accesses must be under namespace_sem. That changes locking requirements for mnt_change_propagation() and set_mnt_shared() - only namespace_sem is needed now. The same goes for SET_MNT_MARKED et.al. There might be more flags moved from ->mnt_flags to that field; this is just the initial set. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  don't have mounts pin their parents  (Al Viro)
Simplify the rules for mount refcounts. Current rules include:

 * being a namespace root => +1
 * being someone's child => +1
 * being someone's child => +1 to parent's refcount, unless you've already been through umount_tree().

The last part is not needed at all. It makes for more places where we need to decrement refcounts and it creates an asymmetry between the situations for something that has never been a part of a namespace and something that left one, both for no good reason.

If a mount's refcount has additions from its children, we know that

 * it's either someone's child itself (and will remain so until umount_tree(), at which point contributions from children will disappear), or
 * it is the root of a namespace (and will remain such until it either becomes someone's child in another namespace or goes through umount_tree()), or
 * it is the root of some tree copy, and is currently pinned by the caller of copy_tree() (and remains such until it either gets into a namespace, or goes to umount_tree()).

In all cases we already have contribution(s) to the refcount that will last as long as the contribution from children remains. In other words, the lifetime is not affected by refcount contributions from children. It might be useful for "is it busy" checks, but those are actually no harder to express without it.

NB: the propagate_mnt_busy() part is an equivalent transformation, ugly as it is; the current logics is actually wrong and may give false negatives, but fixing that is for a separate patch (probably earlier in the queue). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  get rid of mountpoint->m_count  (Al Viro)
struct mountpoint has an odd kinda-sorta refcount in it. It's always either equal to or one above the number of mounts attached to that mountpoint. "One above" happens when a function takes a temporary reference to the mountpoint. Things get simpler if we express that as inserting a local object into ->m_list and removing it to drop the reference.

New calling conventions:

1) lock_mount(), do_lock_mount(), get_mountpoint() and lookup_mountpoint() take an extra struct pinned_mountpoint * argument and return 0/-E... (or true/false in the case of lookup_mountpoint()) instead of returning struct mountpoint pointers. In case of success, the struct mountpoint * we used to get can be found as pinned_mountpoint.mp.

2) unlock_mount() (always paired with lock_mount()/do_lock_mount()) takes the address of the struct pinned_mountpoint - the same one that had been passed to lock_mount()/do_lock_mount().

3) put_mountpoint() for a temporary reference (paired with get_mountpoint() or lookup_mountpoint()) is replaced with unpin_mountpoint(), which takes the address of the pinned_mountpoint we passed to the matching {get,lookup}_mountpoint().

4) all instances of pinned_mountpoint are local variables; they always live on the stack. {} is used for the initializer. After a successful {get,lookup}_mountpoint() we must make sure to call unpin_mountpoint() before leaving the scope, and after a successful {do_,}lock_mount() we must make sure to call unlock_mount() before leaving the scope.

5) all manipulations of ->m_count are gone, along with ->m_count itself. struct mountpoint lives while its ->m_list is non-empty. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
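Hypothetical usage following the convention described above (the shapes are inferred from the text, not copied from the code):

    struct pinned_mountpoint pin = {};      /* always a local, lives on the stack */
    int err;

    err = lock_mount(path, &pin);           /* 0 on success, -E... on failure */
    if (err)
            return err;
    /* ... operate on pin.mp, the struct mountpoint this pins ... */
    unlock_mount(&pin);                     /* must happen before leaving the scope */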
2025-06-29  combine __put_mountpoint() with unhash_mnt()  (Al Viro)
A call of unhash_mnt() is immediately followed by passing its return value to __put_mountpoint(); the shrink list given to __put_mountpoint() will be ex_mountpoints when called from umount_mnt() and list when called from mntput_no_expire(). Replace with __umount_mnt(mount, shrink_list), moving the call of __put_mountpoint() into it (and returning nothing), adjust the callers. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  pivot_root(): reorder tree surgeries, collapse unhash_mnt() and put_mountpoint()  (Al Viro)
attach new_mnt *before* detaching root_mnt; that way we don't need to keep hold on the mountpoint and one more pair of unhash_mnt()/put_mountpoint() gets folded together into umount_mnt(). Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  take ->mnt_expire handling under mount_lock [read_seqlock_excl]  (Al Viro)
Doesn't take much massage, and we no longer need to make sure that by the time of final mntput() the victim has been removed from the list. Makes life safer for ->d_automount() instances...

Rules:

 * all ->mnt_expire accesses are under mount_lock.
 * insertion into the list is done by mnt_set_expiry(), and caller (->d_automount() instance) must hold a reference to mount in question. It shouldn't be done more than once for a mount.
 * if a mount on an expiry list is not yet mounted, it will be ignored by anything that walks that list.
 * if the final mntput() finds its victim still on an expiry list (in which case it must've never been mounted - umount_tree() would've taken it out), it will remove the victim from the list.

Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  attach_recursive_mnt(): remove from expiry list on move  (Al Viro)
... rather than doing that in do_move_mount(). That's the main obstacle to moving the protection of ->mnt_expire from namespace_sem to mount_lock (spinlock-only), which would simplify several failure exits. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-29  do_move_mount(): get rid of 'attached' flag  (Al Viro)
'attached' serves as a proxy for "source is a subtree of our namespace and not the entirety of anon namespace"; finish massaging it away. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>