summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-12-06smb: client: fix potential race in cifs_put_tcon()Paulo Alcantara
dfs_cache_refresh() delayed worker could race with cifs_put_tcon(), so make sure to call list_replace_init() on @tcon->dfs_ses_list after kworker is cancelled or finished. Fixes: 4f42a8b54b5c ("smb: client: fix DFS interlink failover") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-06smb3.1.1: fix posix mounts to older serversSteve French
Some servers which implement the SMB3.1.1 POSIX extensions did not set the file type in the mode in the infolevel 100 response. With the recent changes for checking the file type via the mode field, this can cause the root directory to be reported incorrectly and mounts (e.g. to ksmbd) to fail. Fixes: 6a832bc8bbb2 ("fs/smb/client: Implement new SMB3 POSIX type") Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Cc: Ralph Boehme <slow@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-06btrfs: flush delalloc workers queue before stopping cleaner kthread during ↵Filipe Manana
unmount During the unmount path, at close_ctree(), we first stop the cleaner kthread, using kthread_stop() which frees the associated task_struct, and then stop and destroy all the work queues. However after we stopped the cleaner we may still have a worker from the delalloc_workers queue running inode.c:submit_compressed_extents(), which calls btrfs_add_delayed_iput(), which in turn tries to wake up the cleaner kthread - which was already destroyed before, resulting in a use-after-free on the task_struct. Syzbot reported this with the following stack traces: BUG: KASAN: slab-use-after-free in __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089 Read of size 8 at addr ffff8880259d2818 by task kworker/u8:3/52 CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.13.0-rc1-syzkaller-00002-gcdd30ebb1b9f #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Workqueue: btrfs-delalloc btrfs_work_helper Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0x169/0x550 mm/kasan/report.c:489 kasan_report+0x143/0x180 mm/kasan/report.c:602 __lock_acquire+0x78/0x2100 kernel/locking/lockdep.c:5089 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162 class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline] try_to_wake_up+0xc2/0x1470 kernel/sched/core.c:4205 submit_compressed_extents+0xdf/0x16e0 fs/btrfs/inode.c:1615 run_ordered_work fs/btrfs/async-thread.c:288 [inline] btrfs_work_helper+0x96f/0xc40 fs/btrfs/async-thread.c:324 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Allocated by task 2: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 unpoison_slab_object mm/kasan/common.c:319 [inline] __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345 kasan_slab_alloc include/linux/kasan.h:250 [inline] slab_post_alloc_hook mm/slub.c:4104 [inline] slab_alloc_node mm/slub.c:4153 [inline] kmem_cache_alloc_node_noprof+0x1d9/0x380 mm/slub.c:4205 alloc_task_struct_node kernel/fork.c:180 [inline] dup_task_struct+0x57/0x8c0 kernel/fork.c:1113 copy_process+0x5d1/0x3d50 kernel/fork.c:2225 kernel_clone+0x223/0x870 kernel/fork.c:2807 kernel_thread+0x1bc/0x240 kernel/fork.c:2869 create_kthread kernel/kthread.c:412 [inline] kthreadd+0x60d/0x810 kernel/kthread.c:767 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Freed by task 24: kasan_save_stack mm/kasan/common.c:47 [inline] kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:582 poison_slab_object mm/kasan/common.c:247 [inline] __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264 kasan_slab_free include/linux/kasan.h:233 [inline] slab_free_hook mm/slub.c:2338 [inline] slab_free mm/slub.c:4598 [inline] kmem_cache_free+0x195/0x410 mm/slub.c:4700 put_task_struct include/linux/sched/task.h:144 [inline] delayed_put_task_struct+0x125/0x300 kernel/exit.c:227 rcu_do_batch kernel/rcu/tree.c:2567 [inline] rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823 handle_softirqs+0x2d4/0x9b0 kernel/softirq.c:554 run_ksoftirqd+0xca/0x130 kernel/softirq.c:943 smpboot_thread_fn+0x544/0xa30 kernel/smpboot.c:164 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Last potentially related work creation: kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47 __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:544 __call_rcu_common kernel/rcu/tree.c:3086 [inline] call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190 context_switch kernel/sched/core.c:5372 [inline] __schedule+0x1803/0x4be0 kernel/sched/core.c:6756 __schedule_loop kernel/sched/core.c:6833 [inline] schedule+0x14b/0x320 kernel/sched/core.c:6848 schedule_timeout+0xb0/0x290 kernel/time/sleep_timeout.c:75 do_wait_for_common kernel/sched/completion.c:95 [inline] __wait_for_common kernel/sched/completion.c:116 [inline] wait_for_common kernel/sched/completion.c:127 [inline] wait_for_completion+0x355/0x620 kernel/sched/completion.c:148 kthread_stop+0x19e/0x640 kernel/kthread.c:712 close_ctree+0x524/0xd60 fs/btrfs/disk-io.c:4328 generic_shutdown_super+0x139/0x2d0 fs/super.c:642 kill_anon_super+0x3b/0x70 fs/super.c:1237 btrfs_kill_super+0x41/0x50 fs/btrfs/super.c:2112 deactivate_locked_super+0xc4/0x130 fs/super.c:473 cleanup_mnt+0x41f/0x4b0 fs/namespace.c:1373 task_work_run+0x24f/0x310 kernel/task_work.c:239 ptrace_notify+0x2d2/0x380 kernel/signal.c:2503 ptrace_report_syscall include/linux/ptrace.h:415 [inline] ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline] syscall_exit_work+0xc7/0x1d0 kernel/entry/common.c:173 syscall_exit_to_user_mode_prepare kernel/entry/common.c:200 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline] syscall_exit_to_user_mode+0x24a/0x340 kernel/entry/common.c:218 do_syscall_64+0x100/0x230 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x77/0x7f The buggy address belongs to the object at ffff8880259d1e00 which belongs to the cache task_struct of size 7424 The buggy address is located 2584 bytes inside of freed 7424-byte region [ffff8880259d1e00, ffff8880259d3b00) The buggy address belongs to the physical page: page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x259d0 head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 memcg:ffff88802f4b56c1 flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff) page_type: f5(slab) raw: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122 raw: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1 head: 00fff00000000040 ffff88801bafe500 dead000000000100 dead000000000122 head: 0000000000000000 0000000000040004 00000001f5000000 ffff88802f4b56c1 head: 00fff00000000003 ffffea0000967401 ffffffffffffffff 0000000000000000 head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 12, tgid 12 (kworker/u8:1), ts 7328037942, free_ts 0 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1556 prep_new_page mm/page_alloc.c:1564 [inline] get_page_from_freelist+0x3651/0x37a0 mm/page_alloc.c:3474 __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4751 alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265 alloc_slab_page+0x6a/0x140 mm/slub.c:2408 allocate_slab+0x5a/0x2f0 mm/slub.c:2574 new_slab mm/slub.c:2627 [inline] ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3815 __slab_alloc+0x58/0xa0 mm/slub.c:3905 __slab_alloc_node mm/slub.c:3980 [inline] slab_alloc_node mm/slub.c:4141 [inline] kmem_cache_alloc_node_noprof+0x269/0x380 mm/slub.c:4205 alloc_task_struct_node kernel/fork.c:180 [inline] dup_task_struct+0x57/0x8c0 kernel/fork.c:1113 copy_process+0x5d1/0x3d50 kernel/fork.c:2225 kernel_clone+0x223/0x870 kernel/fork.c:2807 user_mode_thread+0x132/0x1a0 kernel/fork.c:2885 call_usermodehelper_exec_work+0x5c/0x230 kernel/umh.c:171 process_one_work kernel/workqueue.c:3229 [inline] process_scheduled_works+0xa66/0x1840 kernel/workqueue.c:3310 worker_thread+0x870/0xd30 kernel/workqueue.c:3391 page_owner free stack trace missing Memory state around the buggy address: ffff8880259d2700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880259d2780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880259d2800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880259d2880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880259d2900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fix this by flushing the delalloc workers queue before stopping the cleaner kthread. Reported-by: syzbot+b7cf50a0c173770dcb14@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/674ed7e8.050a0220.48a03.0031.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-06btrfs: handle bio_split() errorsJohannes Thumshirn
Commit e546fe1da9bd ("block: Rework bio_split() return value") changed bio_split() so that it can return errors. Add error handling for it in btrfs_split_bio() and ultimately btrfs_submit_chunk(). As the bio is not submitted, the bio counter must be decremented to pair btrfs_bio_counter_inc_blocked(). Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-06btrfs: properly wait for writeback before buffered writeQu Wenruo
[BUG] Before commit e820dbeb6ad1 ("btrfs: convert btrfs_buffered_write() to use folios"), function prepare_one_folio() will always wait for folio writeback to finish before returning the folio. However commit e820dbeb6ad1 ("btrfs: convert btrfs_buffered_write() to use folios") changed to use FGP_STABLE to do the writeback wait, but FGP_STABLE is calling folio_wait_stable(), which only calls folio_wait_writeback() if the address space has AS_STABLE_WRITES, which is not set for btrfs inodes. This means we will not wait for the folio writeback at all. [CAUSE] The cause is FGP_STABLE is not waiting for writeback unconditionally, but only for address spaces with AS_STABLE_WRITES, normally such flag is set when the super block has SB_I_STABLE_WRITES flag. Such super block flag is set when the block device has hardware digest support or has internal checksum requirement. I'd argue btrfs should set such super block due to its default data checksum behavior, but it is not set yet, so this means FGP_STABLE flag will have no effect at all. (For NODATASUM inodes, we can skip the waiting in theory but that should be an optimization in the future.) This can lead to data checksum mismatch, as we can modify the folio while it's still under writeback, this will make the contents differ from the contents at submission and checksum calculation. [FIX] Instead of fully relying on FGP_STABLE, manually do the folio writeback waiting, until we set the address space or super flag. Fixes: e820dbeb6ad1 ("btrfs: convert btrfs_buffered_write() to use folios") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-05ocfs2: update seq_file index in ocfs2_dlm_seq_nextWengang Wang
The following INFO level message was seen: seq_file: buggy .next function ocfs2_dlm_seq_next [ocfs2] did not update position index Fix: Update *pos (so m->index) to make seq_read_iter happy though the index its self makes no sense to ocfs2_dlm_seq_next. Link: https://lkml.kernel.org/r/20241119174500.9198-1-wen.gang.wang@oracle.com Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05ocfs2: free inode when ocfs2_get_init_inode() failsTetsuo Handa
syzbot is reporting busy inodes after unmount, for commit 9c89fe0af826 ("ocfs2: Handle error from dquot_initialize()") forgot to call iput() when new_inode() succeeded and dquot_initialize() failed. Link: https://lkml.kernel.org/r/e68c0224-b7c6-4784-b4fa-a9fc8c675525@I-love.SAKURA.ne.jp Fixes: 9c89fe0af826 ("ocfs2: Handle error from dquot_initialize()") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot+0af00f6a2cba2058b5db@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=0af00f6a2cba2058b5db Tested-by: syzbot+0af00f6a2cba2058b5db@syzkaller.appspotmail.com Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05nilfs2: fix potential out-of-bounds memory access in nilfs_find_entry()Ryusuke Konishi
Syzbot reported that when searching for records in a directory where the inode's i_size is corrupted and has a large value, memory access outside the folio/page range may occur, or a use-after-free bug may be detected if KASAN is enabled. This is because nilfs_last_byte(), which is called by nilfs_find_entry() and others to calculate the number of valid bytes of directory data in a page from i_size and the page index, loses the upper 32 bits of the 64-bit size information due to an inappropriate type of local variable to which the i_size value is assigned. This caused a large byte offset value due to underflow in the end address calculation in the calling nilfs_find_entry(), resulting in memory access that exceeds the folio/page size. Fix this issue by changing the type of the local variable causing the bit loss from "unsigned int" to "u64". The return value of nilfs_last_byte() is also of type "unsigned int", but it is truncated so as not to exceed PAGE_SIZE and no bit loss occurs, so no change is required. Link: https://lkml.kernel.org/r/20241119172403.9292-1-konishi.ryusuke@gmail.com Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+96d5d14c47d97015c624@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=96d5d14c47d97015c624 Tested-by: syzbot+96d5d14c47d97015c624@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05fs/proc/vmcore.c: fix warning when CONFIG_MMU=nAndrew Morton
>> fs/proc/vmcore.c:424:19: warning: 'mmap_vmcore_fault' defined but not used [-Wunused-function] 424 | static vm_fault_t mmap_vmcore_fault(struct vm_fault *vmf) Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202411140156.2o0nS4fl-lkp@intel.com/ Cc: Qi Xi <xiqi2@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-12-05Merge tag 'v6.13-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: - Three fixes for potential out of bound accesses in read and write paths (e.g. when alternate data streams enabled) - GCC 15 build fix * tag 'v6.13-rc1-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: align aux_payload_buf to avoid OOB reads in cryptographic operations ksmbd: fix Out-of-Bounds Write in ksmbd_vfs_stream_write ksmbd: fix Out-of-Bounds Read in ksmbd_vfs_stream_read smb: server: Fix building with GCC 15
2024-12-05isofs: Partially convert zisofs_read_folio to use a folioMatthew Wilcox (Oracle)
Remove several hidden calls to compound_head() and references to page->index. More needs to be done to use folios throughout the zisofs code. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20241125180117.2914311-1-willy@infradead.org
2024-12-05jffs2: Fix rtime decompressorRichard Weinberger
The fix for a memory corruption contained a off-by-one error and caused the compressor to fail in legit cases. Cc: Kinsey Moore <kinsey.moore@oarcorp.com> Cc: stable@vger.kernel.org Fixes: fe051552f5078 ("jffs2: Prevent rtime decompress memory corruption") Signed-off-by: Richard Weinberger <richard@nod.at>
2024-12-04ksmbd: align aux_payload_buf to avoid OOB reads in cryptographic operationsNorbert Szetei
The aux_payload_buf allocation in SMB2 read is performed without ensuring alignment, which could result in out-of-bounds (OOB) reads during cryptographic operations such as crypto_xor or ghash. This patch aligns the allocation of aux_payload_buf to prevent these issues. (Note that to add this patch to stable would require modifications due to recent patch "ksmbd: use __GFP_RETRY_MAYFAIL") Signed-off-by: Norbert Szetei <norbert@doyensec.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-04fs/smb/client: cifs_prime_dcache() for SMB3 POSIX reparse pointsRalph Boehme
Spares an extra revalidation request Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Ralph Boehme <slow@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-04fs/smb/client: Implement new SMB3 POSIX typeRalph Boehme
Fixes special files against current Samba. On the Samba server: insgesamt 20 131958 brw-r--r-- 1 root root 0, 0 15. Nov 12:04 blockdev 131965 crw-r--r-- 1 root root 1, 1 15. Nov 12:04 chardev 131966 prw-r--r-- 1 samba samba 0 15. Nov 12:05 fifo 131953 -rw-rwxrw-+ 2 samba samba 4 18. Nov 11:37 file 131953 -rw-rwxrw-+ 2 samba samba 4 18. Nov 11:37 hardlink 131957 lrwxrwxrwx 1 samba samba 4 15. Nov 12:03 symlink -> file 131954 -rwxrwxr-x+ 1 samba samba 0 18. Nov 15:28 symlinkoversmb Before: ls: cannot access '/mnt/smb3unix/posix/blockdev': No data available ls: cannot access '/mnt/smb3unix/posix/chardev': No data available ls: cannot access '/mnt/smb3unix/posix/symlinkoversmb': No data available ls: cannot access '/mnt/smb3unix/posix/fifo': No data available ls: cannot access '/mnt/smb3unix/posix/symlink': No data available total 16 ? -????????? ? ? ? ? ? blockdev ? -????????? ? ? ? ? ? chardev ? -????????? ? ? ? ? ? fifo 131953 -rw-rwxrw- 2 root samba 4 Nov 18 11:37 file 131953 -rw-rwxrw- 2 root samba 4 Nov 18 11:37 hardlink ? -????????? ? ? ? ? ? symlink ? -????????? ? ? ? ? ? symlinkoversmb After: insgesamt 21 131958 brw-r--r-- 1 root root 0, 0 15. Nov 12:04 blockdev 131965 crw-r--r-- 1 root root 1, 1 15. Nov 12:04 chardev 131966 prw-r--r-- 1 root samba 0 15. Nov 12:05 fifo 131953 -rw-rwxrw- 2 root samba 4 18. Nov 11:37 file 131953 -rw-rwxrw- 2 root samba 4 18. Nov 11:37 hardlink 131957 lrwxrwxrwx 1 root samba 4 15. Nov 12:03 symlink -> file 131954 lrwxrwxr-x 1 root samba 23 18. Nov 15:28 symlinkoversmb -> mnt/smb3unix/posix/file Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Ralph Boehme <slow@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-04lsm: lsm_context in security_dentry_init_securityCasey Schaufler
Replace the (secctx,seclen) pointer pair with a single lsm_context pointer to allow return of the LSM identifier along with the context and context length. This allows security_release_secctx() to know how to release the context. Callers have been modified to use or save the returned data from the new structure. Cc: ceph-devel@vger.kernel.org Cc: linux-nfs@vger.kernel.org Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04lsm: use lsm_context in security_inode_getsecctxCasey Schaufler
Change the security_inode_getsecctx() interface to fill a lsm_context structure instead of data and length pointers. This provides the information about which LSM created the context so that security_release_secctx() can use the correct hook. Cc: linux-nfs@vger.kernel.org Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04fs/smb/client: avoid querying SMB2_OP_QUERY_WSL_EA for SMB3 POSIXRalph Boehme
Avoid extra roundtrip Cc: stable@vger.kernel.org Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Ralph Boehme <slow@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-12-04lsm: ensure the correct LSM context releaserCasey Schaufler
Add a new lsm_context data structure to hold all the information about a "security context", including the string, its size and which LSM allocated the string. The allocation information is necessary because LSMs have different policies regarding the lifecycle of these strings. SELinux allocates and destroys them on each use, whereas Smack provides a pointer to an entry in a list that never goes away. Update security_release_secctx() to use the lsm_context instead of a (char *, len) pair. Change its callers to do likewise. The LSMs supporting this hook have had comments added to remind the developer that there is more work to be done. The BPF security module provides all LSM hooks. While there has yet to be a known instance of a BPF configuration that uses security contexts, the possibility is real. In the existing implementation there is potential for multiple frees in that case. Cc: linux-integrity@vger.kernel.org Cc: netdev@vger.kernel.org Cc: audit@vger.kernel.org Cc: netfilter-devel@vger.kernel.org To: Pablo Neira Ayuso <pablo@netfilter.org> Cc: linux-nfs@vger.kernel.org Cc: Todd Kjos <tkjos@google.com> Signed-off-by: Casey Schaufler <casey@schaufler-ca.com> [PM: subject tweak] Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-12-04jbd2: flush filesystem device before updating tail sequenceZhang Yi
When committing transaction in jbd2_journal_commit_transaction(), the disk caches for the filesystem device should be flushed before updating the journal tail sequence. However, this step is missed if the journal is not located on the filesystem device. As a result, the filesystem may become inconsistent following a power failure or system crash. Fix it by ensuring that the filesystem device is flushed appropriately. Fixes: 3339578f0578 ("jbd2: cleanup journal tail after transaction commit") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20241203014407.805916-3-yi.zhang@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-04jbd2: increase IO priority for writing revoke recordsZhang Yi
Commit '6a3afb6ac6df ("jbd2: increase the journal IO's priority")' increases the priority of journal I/O by marking I/O with the JBD2_JOURNAL_REQ_FLAGS. However, that commit missed the revoke buffers, so also addresses that kind of I/Os. Fixes: 6a3afb6ac6df ("jbd2: increase the journal IO's priority") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20241203014407.805916-2-yi.zhang@huaweicloud.com Reviewed-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-03btrfs: fix missing snapshot drew unlock when root is dead during swap activationFilipe Manana
When activating a swap file we acquire the root's snapshot drew lock and then check if the root is dead, failing and returning with -EPERM if it's dead but without unlocking the root's snapshot lock. Fix this by adding the missing unlock. Fixes: 60021bd754c6 ("btrfs: prevent subvol with swapfile from being deleted") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-03btrfs: fix mount failure due to remount racesQu Wenruo
[BUG] The following reproducer can cause btrfs mount to fail: dev="/dev/test/scratch1" mnt1="/mnt/test" mnt2="/mnt/scratch" mkfs.btrfs -f $dev mount $dev $mnt1 btrfs subvolume create $mnt1/subvol1 btrfs subvolume create $mnt1/subvol2 umount $mnt1 mount $dev $mnt1 -o subvol=subvol1 while mount -o remount,ro $mnt1; do mount -o remount,rw $mnt1; done & bg=$! while mount $dev $mnt2 -o subvol=subvol2; do umount $mnt2; done kill $bg wait umount -R $mnt1 umount -R $mnt2 The script will fail with the following error: mount: /mnt/scratch: /dev/mapper/test-scratch1 already mounted on /mnt/test. dmesg(1) may have more information after failed mount system call. umount: /mnt/test: target is busy. umount: /mnt/scratch/: not mounted And there is no kernel error message. [CAUSE] During the btrfs mount, to support mounting different subvolumes with different RO/RW flags, we need to detect that and retry if needed: Retry with matching RO flags if the initial mount fail with -EBUSY. The problem is, during that retry we do not hold any super block lock (s_umount), this means there can be a remount process changing the RO flags of the original fs super block. If so, we can have an EBUSY error during retry. And this time we treat any failure as an error, without any retry and cause the above EBUSY mount failure. [FIX] The current retry behavior is racy because we do not have a super block thus no way to hold s_umount to prevent the race with remount. Solve the root problem by allowing fc->sb_flags to mismatch from the sb->s_flags at btrfs_get_tree_super(). Then at the re-entry point btrfs_get_tree_subvol(), manually check the fc->s_flags against sb->s_flags, if it's a RO->RW mismatch, then reconfigure with s_umount lock hold. Reported-by: Enno Gotthold <egotthold@suse.com> Reported-by: Fabian Vogt <fvogt@suse.com> [ Special thanks for the reproducer and early analysis pointing to btrfs. ] Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes") Link: https://bugzilla.suse.com/show_bug.cgi?id=1231836 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-03Merge tag 'for-6.13-rc1-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - add lockdep annotations for io_uring/encoded read integration, inode lock is held when returning to userspace - properly reflect experimental config option to sysfs - handle NULL root in case the rescue mode accepts invalid/damaged tree roots (rescue=ibadroot) - regression fix of a deadlock between transaction and extent locks - fix pending bio accounting bug in encoded read ioctl - fix NOWAIT mode when checking references for NOCOW files - fix use-after-free in a rb-tree cleanup in ref-verify debugging tool * tag 'for-6.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix lockdep warnings on io_uring encoded reads btrfs: ref-verify: fix use-after-free after invalid ref action btrfs: add a sanity check for btrfs root in btrfs_search_slot() btrfs: don't loop for nowait writes when checking for cross references btrfs: sysfs: advertise experimental features only if CONFIG_BTRFS_EXPERIMENTAL=y btrfs: fix deadlock between transaction commits and extent locks btrfs: fix use-after-free in btrfs_encoded_read_endio()
2024-12-03Merge tag 'fs_for_v6.13-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull quota and udf fixes from Jan Kara: "Two small UDF fixes for better handling of corrupted filesystem and a quota fix to fix handling of filesystem freezing" * tag 'fs_for_v6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Verify inode link counts before performing rename udf: Skip parent dir link count update if corrupted quota: flush quota_release_work upon quota writeback
2024-12-03Merge tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Carlos Maiolino: - Use xchg() in xlog_cil_insert_pcp_aggregate() - Fix ABBA deadlock on a race between mount and log shutdown - Fix quota softlimit incoherency on delalloc - Fix sparse inode limits on runt AG - remove unknown compat feature checks in SB write valdation - Eliminate a lockdep false positive * tag 'xfs-fixes-6.13-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't call xfs_bmap_same_rtgroup in xfs_bmap_add_extent_hole_delay xfs: Use xchg() in xlog_cil_insert_pcp_aggregate() xfs: prevent mount and log shutdown race xfs: delalloc and quota softlimit timers are incoherent xfs: fix sparse inode limits on runt AG xfs: remove unknown compat feature check in superblock write validation xfs: eliminate lockdep false positives in xfs_attr_shortform_list
2024-12-03fs/qnx6: Fix building with GCC 15Brahmajit Das
qnx6_checkroot() had been using weirdly spelled initializer - it needed to initialize 3-element arrays of char and it used NUL-padded 3-character string literals (i.e. 4-element initializers, with completely pointless zeroes at the end). That had been spotted by gcc-15[*]; prior to that gcc quietly dropped the 4th element of initializers. However, none of that had been needed in the first place - all this array is used for is checking that the first directory entry in root directory is "." and the second - "..". The check had been expressed as a loop, using that match_root[] array. Since there is no chance that we ever want to extend that list of entries, the entire thing is much too fancy for its own good; what we need is just a couple of explicit memcmp() and that's it. [*]: fs/qnx6/inode.c: In function ‘qnx6_checkroot’: fs/qnx6/inode.c:182:41: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization] 182 | static char match_root[2][3] = {".\0\0", "..\0"}; | ^~~~~~~ fs/qnx6/inode.c:182:50: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization] 182 | static char match_root[2][3] = {".\0\0", "..\0"}; | ^~~~~~ Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com> Link: https://lore.kernel.org/r/20241004195132.1393968-1-brahmajit.xyz@gmail.com Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02module: Convert symbol namespace to string literalPeter Zijlstra
Clean up the existing export namespace code along the same lines of commit 33def8498fdd ("treewide: Convert macro and uses of __section(foo) to __section("foo")") and for the same reason, it is not desired for the namespace argument to be a macro expansion itself. Scripted using git grep -l -e MODULE_IMPORT_NS -e EXPORT_SYMBOL_NS | while read file; do awk -i inplace ' /^#define EXPORT_SYMBOL_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /^#define MODULE_IMPORT_NS/ { gsub(/__stringify\(ns\)/, "ns"); print; next; } /MODULE_IMPORT_NS/ { $0 = gensub(/MODULE_IMPORT_NS\(([^)]*)\)/, "MODULE_IMPORT_NS(\"\\1\")", "g"); } /EXPORT_SYMBOL_NS/ { if ($0 ~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+),/) { if ($0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/ && $0 !~ /(EXPORT_SYMBOL_NS[^(]*)\(\)/ && $0 !~ /^my/) { getline line; gsub(/[[:space:]]*\\$/, ""); gsub(/[[:space:]]/, "", line); $0 = $0 " " line; } $0 = gensub(/(EXPORT_SYMBOL_NS[^(]*)\(([^,]+), ([^)]+)\)/, "\\1(\\2, \"\\3\")", "g"); } } { print }' $file; done Requested-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://mail.google.com/mail/u/2/#inbox/FMfcgzQXKWgMmjdFwwdsfgxzKpVHWPlc Acked-by: Greg KH <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-12-02hfs: Sanity check the root recordLeo Stone
In the syzbot reproducer, the hfs_cat_rec for the root dir has type HFS_CDR_FIL after being read with hfs_bnode_read() in hfs_super_fill(). This indicates it should be used as an hfs_cat_file, which is 102 bytes. Only the first 70 bytes of that struct are initialized, however, because the entrylength passed into hfs_bnode_read() is still the length of a directory record. This causes uninitialized values to be used later on, when the hfs_cat_rec union is treated as the larger hfs_cat_file struct. Add a check to make sure the retrieved record has the correct type for the root directory (HFS_CDR_DIR), and make sure we load the correct number of bytes for a directory record. Reported-by: syzbot+2db3c7526ba68f4ea776@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2db3c7526ba68f4ea776 Tested-by: syzbot+2db3c7526ba68f4ea776@syzkaller.appspotmail.com Tested-by: Leo Stone <leocstone@gmail.com> Signed-off-by: Leo Stone <leocstone@gmail.com> Link: https://lore.kernel.org/r/20241201051420.77858-1-leocstone@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02arm64: mte: set VM_MTE_ALLOWED for hugetlbfs at correct placeYang Shi
The commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour") moved vm flags validation before fop->mmap for file mappings. But when commit 25c17c4b55de ("hugetlb: arm64: add mte support") was rebased on top of it, the hugetlbfs part was missed. Mmapping hugetlbfs file may not have MAP_HUGETLB set. Fixes: 25c17c4b55de ("hugetlb: arm64: add mte support") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20241119200914.1145249-1-yang@os.amperecomputing.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-12-02nfsd: avoid pointless cred reference count bumpChristian Brauner
The code already got rid of the extra reference count from the old version of override_creds(). Link: https://lore.kernel.org/r/20241125-work-cred-v2-28-68b9d38bb5b2@kernel.org Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02cachefiles: avoid pointless cred reference count bumpChristian Brauner
The cache holds a long-term reference to the credentials that's taken when the cache is created and put when the cache becomes unused. Link: https://lore.kernel.org/r/20241125-work-cred-v2-27-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02smb: avoid pointless cred reference count bumpChristian Brauner
The creds are allocated via prepare_kernel_cred() which has already taken a reference. This also removes a pointless check that gives the impression that override_creds() can ever be called on a task with current->cred NULL. That's not possible afaict. Remove the check to not imply that there can be a dangling pointer in current->cred. Link: https://lore.kernel.org/r/20241125-work-cred-v2-21-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02cifs: avoid pointless cred reference count bumpChristian Brauner
During module init root_cred will be allocated with its own reference which is only destroyed during module exit. Link: https://lore.kernel.org/r/20241125-work-cred-v2-20-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02cifs: avoid pointless cred reference count bumpChristian Brauner
During module init spnego_cred will be allocated with its own reference which is only destroyed during module exit. Link: https://lore.kernel.org/r/20241125-work-cred-v2-19-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02ovl: avoid pointless cred reference count bumpChristian Brauner
security_inode_copy_up() allocates a set of new credentials and has taken a reference count. Link: https://lore.kernel.org/r/20241125-work-cred-v2-18-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02open: avoid pointless cred reference count bumpChristian Brauner
The code already got rid of the extra reference count from the old version of override_creds(). Link: https://lore.kernel.org/r/20241125-work-cred-v2-17-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02nfsfh: avoid pointless cred reference count bumpChristian Brauner
The code already got rid of the extra reference count from the old version of override_creds(). Link: https://lore.kernel.org/r/20241125-work-cred-v2-16-68b9d38bb5b2@kernel.org Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02nfs/nfs4recover: avoid pointless cred reference count bumpChristian Brauner
The code already got rid of the extra reference count from the old version of override_creds(). Link: https://lore.kernel.org/r/20241125-work-cred-v2-15-68b9d38bb5b2@kernel.org Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02nfs/nfs4idmap: avoid pointless reference count bumpChristian Brauner
The override creds are allocated with a long-term refernce when the id_resolver is initialized via prepare_kernel_creds() that is put when the id_resolver is destroyed. Link: https://lore.kernel.org/r/20241125-work-cred-v2-14-68b9d38bb5b2@kernel.org Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02nfs/localio: avoid pointless cred reference count bumpsChristian Brauner
filp->f_cred already holds a reference count that is stable during the operation. Link: https://lore.kernel.org/r/20241125-work-cred-v2-13-68b9d38bb5b2@kernel.org Acked-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02coredump: avoid pointless cred reference count bumpChristian Brauner
The creds are allocated via prepare_creds() which has already taken a reference. Link: https://lore.kernel.org/r/20241125-work-cred-v2-12-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02binfmt_misc: avoid pointless cred reference count bumpChristian Brauner
file->f_cred already holds a reference count that is stable during the operation. Link: https://lore.kernel.org/r/20241125-work-cred-v2-11-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02aio: avoid pointless cred reference count bumpChristian Brauner
iocb->fsync.creds already holds a reference count that is stable while the operation is performed. Link: https://lore.kernel.org/r/20241125-work-cred-v2-10-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02tree-wide: s/revert_creds_light()/revert_creds()/gChristian Brauner
Rename all calls to revert_creds_light() back to revert_creds(). Link: https://lore.kernel.org/r/20241125-work-cred-v2-6-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02tree-wide: s/override_creds_light()/override_creds()/gChristian Brauner
Rename all calls to override_creds_light() back to overrid_creds(). Link: https://lore.kernel.org/r/20241125-work-cred-v2-5-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02tree-wide: s/revert_creds()/put_cred(revert_creds_light())/gChristian Brauner
Convert all calls to revert_creds() over to explicitly dropping reference counts in preparation for converting revert_creds() to revert_creds_light() semantics. Link: https://lore.kernel.org/r/20241125-work-cred-v2-3-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02tree-wide: s/override_creds()/override_creds_light(get_new_cred())/gChristian Brauner
Convert all callers from override_creds() to override_creds_light(get_new_cred()) in preparation of making override_creds() not take a separate reference at all. Link: https://lore.kernel.org/r/20241125-work-cred-v2-1-68b9d38bb5b2@kernel.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02fs: delay sysctl_nr_open check in expand_files()Mateusz Guzik
Suppose a thread sharing the table started a resize, while sysctl_nr_open got lowered to a value which prohibits it. This is still going to go through with and without the patch, which is fine. Further suppose another thread shows up to do a matching expansion while resize_in_progress == true. It is going to error out since it performs the sysctl_nr_open check *before* finding out if there is an expansion in progress. But the aformentioned thread is going to succeded, so the error is spurious (and it would not happen if the thread showed up a little bit later). Checking the sysctl *after* we know there are no pending updates sorts it out. While here annotate the thing as unlikely. Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Link: https://lore.kernel.org/r/20241116064128.280870-1-mjguzik@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-12-02proc/kcore: use percpu_rw_semaphore for kclist_lockOmar Sandoval
The list of memory ranges for /proc/kcore is protected by a rw_semaphore. We lock it for reading on every read from /proc/kcore. This is very heavy, especially since it is rarely locked for writing. Since we want to strongly favor read lock performance, convert it to a percpu_rw_semaphore. I also experimented with percpu_ref and SRCU, but this change was the simplest and the fastest. In my benchmark, this reduces the time per read by yet another 20 nanoseconds on top of the previous two changes, from 195 nanoseconds per read to 175. Link: https://github.com/osandov/drgn/issues/106 Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/83a3b235b4bcc3b8aef7c533e0657f4d7d5d35ae.1731115587.git.osandov@fb.com Signed-off-by: Christian Brauner <brauner@kernel.org>