summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-07-09mm: update core kernel code to use vm_flags_t consistentlyLorenzo Stoakes
The core kernel code is currently very inconsistent in its use of vm_flags_t vs. unsigned long. This prevents us from changing the type of vm_flags_t in the future and is simply not correct, so correct this. While this results in rather a lot of churn, it is a critical pre-requisite for a future planned change to VMA flag type. Additionally, update VMA userland tests to account for the changes. To make review easier and to break things into smaller parts, driver and architecture-specific changes is left for a subsequent commit. The code has been adjusted to cascade the changes across all calling code as far as is needed. We will adjust architecture-specific and driver code in a subsequent patch. Overall, this patch does not introduce any functional change. Link: https://lkml.kernel.org/r/d1588e7bb96d1ea3fe7b9df2c699d5b4592d901d.1750274467.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Kees Cook <kees@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09ceph: convert ceph_zero_partial_page() to use a folioMatthew Wilcox (Oracle)
Retrieve a folio from the pagecache instead of a page and operate on it. Removes several hidden calls to compound_head() along with calls to deprecated functions like wait_on_page_writeback() and find_lock_page(). [dan.carpenter@linaro.org: fix NULL vs IS_ERR() bug in ceph_zero_partial_page()] Link: https://lkml.kernel.org/r/685c1424.050a0220.baa8.d6a1@mx.google.com Link: https://lkml.kernel.org/r/20250612143443.2848197-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Alex Markuze <amarkuze@redhat.com> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09direct-io: use memzero_page()Matthew Wilcox (Oracle)
memzero_page() is the new name for zero_user(). Link: https://lkml.kernel.org/r/20250612143443.2848197-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Alex Markuze <amarkuze@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Xiubo Li <xiubli@redhat.com> Cc: Dan Carpenter <dan.carpenter@linaro.org> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09proc: use the same treatment to check proc_lseek as ones for proc_read_iter ↵wangzijie
et.al Check pde->proc_ops->proc_lseek directly may cause UAF in rmmod scenario. It's a gap in proc_reg_open() after commit 654b33ada4ab("proc: fix UAF in proc_get_inode()"). Followed by AI Viro's suggestion, fix it in same manner. Link: https://lkml.kernel.org/r/20250607021353.1127963-1-wangzijie1@honor.com Fixes: 3f61631d47f1 ("take care to handle NULL ->proc_lseek()") Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09userfaultfd: remove UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SETTal Zussman
UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET have been unused since they were added in commit 932b18e0aec6 ("userfaultfd: linux/userfaultfd_k.h"). Remove them and the associated BUILD_BUG_ON() checks. Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-4-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09userfaultfd: remove (VM_)BUG_ON()sTal Zussman
BUG_ON() is deprecated [1]. Convert all the BUG_ON()s and VM_BUG_ON()s to use VM_WARN_ON_ONCE(). There are a few additional cases that are converted or modified: - Convert the printk(KERN_WARNING ...) in handle_userfault() to use pr_warn(). - Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(), as the relevant conditions are already checked in validate_range() in move_pages()'s caller. - Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These cases should never happen and are similar to those in mfill_atomic() and mfill_atomic_hugetlb(), which were previously BUG_ON()s. move_pages() was added later than those functions and makes use of VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but. VM_WARN_ON_ONCE() is likely a better direct replacement. - Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is enforced in userfaultfd_register() so it should never happen, and can be converted to a debug check. [1] https://www.kernel.org/doc/html/v6.15/process/coding-style.html#use-warn-rather-than-bug Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09userfaultfd: prevent unregistering VMAs through a different userfaultfdTal Zussman
Currently, a VMA registered with a uffd can be unregistered through a different uffd associated with the same mm_struct. The existing behavior is slightly broken and may incorrectly reject unregistering some VMAs due to the following check: if (!vma_can_userfault(cur, cur->vm_flags, wp_async)) goto out_unlock; where wp_async is derived from ctx, not from cur. For example, a file-backed VMA registered with wp_async enabled and UFFD_WP mode cannot be unregistered through a uffd that does not have wp_async enabled. Rather than fix this and maintain this odd behavior, make unregistration stricter by requiring VMAs to be unregistered through the same uffd they were registered with. Additionally, reorder the BUG() checks to avoid the aforementioned wp_async issue in them. Convert the existing check to VM_WARN_ON_ONCE() as BUG_ON() is deprecated. This change slightly modifies the ABI. It should not be backported to -stable. It is expected that no one depends on this behavior, and no such cases are known. While at it, correct the comment for the no userfaultfd case. This seems to be a copy-paste artifact from the analogous userfaultfd_register() check. Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-2-a7274d3bd5e4@columbia.edu Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand <david@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: remove the for_reclaim field from struct writeback_controlChristoph Hellwig
This field is now only set to one in the i915 gem code that only calls writeback_iter on it, which ignores the flag. All other checks are thuse dead code and the field can be removed. Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09mm: fix the inaccurate memory statistics issue for usersBaolin Wang
On some large machines with a high number of CPUs running a 64K pagesize kernel, we found that the 'RES' field is always 0 displayed by the top command for some processes, which will cause a lot of confusion for users. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd The main reason is that the batch size of the percpu counter is quite large on these machines, caching a significant percpu value, since converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter"). Intuitively, the batch number should be optimized, but on some paths, performance may take precedence over statistical accuracy. Therefore, introducing a new interface to add the percpu statistical count and display it to users, which can remove the confusion. In addition, this change is not expected to be on a performance-critical path, so the modification should be acceptable. In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and dec/inc_mm_counter(), which are all wrappers around percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu batch caching to avoid 'fbc->lock' contention. This patch changes task_mem() and task_statm() to get the accurate mm counters under the 'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock contention due to the percpu batch caching of the mm counters. The following test also confirm the theoretical analysis. I run the stress-ng that stresses anon page faults in 32 threads on my 32 cores machine, while simultaneously running a script that starts 32 threads to busy-loop pread each stress-ng thread's /proc/pid/status interface. From the following data, I did not observe any obvious impact of this patch on the stress-ng tests. w/o patch: stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle) stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec w/patch: stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle) stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec On comparing a very simple app which just allocates & touches some memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus tree (4c06e63b9203) I can see that on latest Linus tree the values for VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while they do report values on v6.1 and a Linus tree with this patch. Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by Donet Tom <donettom@linux.ibm.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09gfs2: Minor do_xmote cancelation fixAndreas Gruenbacher
Commit 6cb3b1c2df87 changed how finish_xmote() clears the GLF_LOCK flag, but it failed to adjust the equivalent code in do_xmote(). Fix that. Fixes: 6cb3b1c2df87 ("gfs2: Fix additional unlikely request cancelation race") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09f2fs: introduce is_cur{seg,sec}()Chao Yu
There are redundant codes in IS_CUR{SEG,SEC}() macros, let's introduce inline is_cur{seg,sec}() functions, and use a loop in it for cleanup. Meanwhile, it enhances expansibility, as it doesn't need to change is_cur{seg,sec}() when we add a new log header. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-09f2fs: fix to avoid panic in f2fs_evict_inodeChao Yu
As syzbot [1] reported as below: R10: 0000000000000100 R11: 0000000000000206 R12: 00007ffe17473450 R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520 </TASK> ---[ end trace 0000000000000000 ]--- ================================================================== BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 Read of size 8 at addr ffff88812d962278 by task syz-executor/564 CPU: 1 PID: 564 Comm: syz-executor Tainted: G W 6.1.129-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 Call Trace: <TASK> __dump_stack+0x21/0x24 lib/dump_stack.c:88 dump_stack_lvl+0xee/0x158 lib/dump_stack.c:106 print_address_description+0x71/0x210 mm/kasan/report.c:316 print_report+0x4a/0x60 mm/kasan/report.c:427 kasan_report+0x122/0x150 mm/kasan/report.c:531 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351 __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 __list_del_entry include/linux/list.h:134 [inline] list_del_init include/linux/list.h:206 [inline] f2fs_inode_synced+0xf7/0x2e0 fs/f2fs/super.c:1531 f2fs_update_inode+0x74/0x1c40 fs/f2fs/inode.c:585 f2fs_update_inode_page+0x137/0x170 fs/f2fs/inode.c:703 f2fs_write_inode+0x4ec/0x770 fs/f2fs/inode.c:731 write_inode fs/fs-writeback.c:1460 [inline] __writeback_single_inode+0x4a0/0xab0 fs/fs-writeback.c:1677 writeback_single_inode+0x221/0x8b0 fs/fs-writeback.c:1733 sync_inode_metadata+0xb6/0x110 fs/fs-writeback.c:2789 f2fs_sync_inode_meta+0x16d/0x2a0 fs/f2fs/checkpoint.c:1159 block_operations fs/f2fs/checkpoint.c:1269 [inline] f2fs_write_checkpoint+0xca3/0x2100 fs/f2fs/checkpoint.c:1658 kill_f2fs_super+0x231/0x390 fs/f2fs/super.c:4668 deactivate_locked_super+0x98/0x100 fs/super.c:332 deactivate_super+0xaf/0xe0 fs/super.c:363 cleanup_mnt+0x45f/0x4e0 fs/namespace.c:1186 __cleanup_mnt+0x19/0x20 fs/namespace.c:1193 task_work_run+0x1c6/0x230 kernel/task_work.c:203 exit_task_work include/linux/task_work.h:39 [inline] do_exit+0x9fb/0x2410 kernel/exit.c:871 do_group_exit+0x210/0x2d0 kernel/exit.c:1021 __do_sys_exit_group kernel/exit.c:1032 [inline] __se_sys_exit_group kernel/exit.c:1030 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1030 x64_sys_call+0x7b4/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 RIP: 0033:0x7f28b1b8e169 Code: Unable to access opcode bytes at 0x7f28b1b8e13f. RSP: 002b:00007ffe174710a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f28b1c10879 RCX: 00007f28b1b8e169 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: 0000000000000002 R08: 00007ffe1746ee47 R09: 00007ffe17472360 R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffe17472360 R13: 00007f28b1c10854 R14: 000000000000dae5 R15: 00007ffe17474520 </TASK> Allocated by task 569: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_alloc_info+0x25/0x30 mm/kasan/generic.c:505 __kasan_slab_alloc+0x72/0x80 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:201 [inline] slab_post_alloc_hook+0x4f/0x2c0 mm/slab.h:737 slab_alloc_node mm/slub.c:3398 [inline] slab_alloc mm/slub.c:3406 [inline] __kmem_cache_alloc_lru mm/slub.c:3413 [inline] kmem_cache_alloc_lru+0x104/0x220 mm/slub.c:3429 alloc_inode_sb include/linux/fs.h:3245 [inline] f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419 alloc_inode fs/inode.c:261 [inline] iget_locked+0x186/0x880 fs/inode.c:1373 f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483 f2fs_lookup+0x366/0xab0 fs/f2fs/namei.c:487 __lookup_slow+0x2a3/0x3d0 fs/namei.c:1690 lookup_slow+0x57/0x70 fs/namei.c:1707 walk_component+0x2e6/0x410 fs/namei.c:1998 lookup_last fs/namei.c:2455 [inline] path_lookupat+0x180/0x490 fs/namei.c:2479 filename_lookup+0x1f0/0x500 fs/namei.c:2508 vfs_statx+0x10b/0x660 fs/stat.c:229 vfs_fstatat fs/stat.c:267 [inline] vfs_lstat include/linux/fs.h:3424 [inline] __do_sys_newlstat fs/stat.c:423 [inline] __se_sys_newlstat+0xd5/0x350 fs/stat.c:417 __x64_sys_newlstat+0x5b/0x70 fs/stat.c:417 x64_sys_call+0x393/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 Freed by task 13: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_free_info+0x31/0x50 mm/kasan/generic.c:516 ____kasan_slab_free+0x132/0x180 mm/kasan/common.c:236 __kasan_slab_free+0x11/0x20 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:177 [inline] slab_free_hook mm/slub.c:1724 [inline] slab_free_freelist_hook+0xc2/0x190 mm/slub.c:1750 slab_free mm/slub.c:3661 [inline] kmem_cache_free+0x12d/0x2a0 mm/slub.c:3683 f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1562 i_callback+0x4c/0x70 fs/inode.c:250 rcu_do_batch+0x503/0xb80 kernel/rcu/tree.c:2297 rcu_core+0x5a2/0xe70 kernel/rcu/tree.c:2557 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574 handle_softirqs+0x178/0x500 kernel/softirq.c:578 run_ksoftirqd+0x28/0x30 kernel/softirq.c:945 smpboot_thread_fn+0x45a/0x8c0 kernel/smpboot.c:164 kthread+0x270/0x310 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 Last potentially related work creation: kasan_save_stack+0x3a/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xb6/0xc0 mm/kasan/generic.c:486 kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496 call_rcu+0xd4/0xf70 kernel/rcu/tree.c:2845 destroy_inode fs/inode.c:316 [inline] evict+0x7da/0x870 fs/inode.c:720 iput_final fs/inode.c:1834 [inline] iput+0x62b/0x830 fs/inode.c:1860 do_unlinkat+0x356/0x540 fs/namei.c:4397 __do_sys_unlink fs/namei.c:4438 [inline] __se_sys_unlink fs/namei.c:4436 [inline] __x64_sys_unlink+0x49/0x50 fs/namei.c:4436 x64_sys_call+0x958/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x4c/0xa0 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 The buggy address belongs to the object at ffff88812d961f20 which belongs to the cache f2fs_inode_cache of size 1200 The buggy address is located 856 bytes inside of 1200-byte region [ffff88812d961f20, ffff88812d9623d0) The buggy address belongs to the physical page: page:ffffea0004b65800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12d960 head:ffffea0004b65800 order:2 compound_mapcount:0 compound_pincount:0 flags: 0x4000000000010200(slab|head|zone=1) raw: 4000000000010200 0000000000000000 dead000000000122 ffff88810a94c500 raw: 0000000000000000 00000000800c000c 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 2, migratetype Reclaimable, gfp_mask 0x1d2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_RECLAIMABLE), pid 569, tgid 568 (syz.2.16), ts 55943246141, free_ts 0 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x1d0/0x1f0 mm/page_alloc.c:2532 prep_new_page mm/page_alloc.c:2539 [inline] get_page_from_freelist+0x2e63/0x2ef0 mm/page_alloc.c:4328 __alloc_pages+0x235/0x4b0 mm/page_alloc.c:5605 alloc_slab_page include/linux/gfp.h:-1 [inline] allocate_slab mm/slub.c:1939 [inline] new_slab+0xec/0x4b0 mm/slub.c:1992 ___slab_alloc+0x6f6/0xb50 mm/slub.c:3180 __slab_alloc+0x5e/0xa0 mm/slub.c:3279 slab_alloc_node mm/slub.c:3364 [inline] slab_alloc mm/slub.c:3406 [inline] __kmem_cache_alloc_lru mm/slub.c:3413 [inline] kmem_cache_alloc_lru+0x13f/0x220 mm/slub.c:3429 alloc_inode_sb include/linux/fs.h:3245 [inline] f2fs_alloc_inode+0x2d/0x340 fs/f2fs/super.c:1419 alloc_inode fs/inode.c:261 [inline] iget_locked+0x186/0x880 fs/inode.c:1373 f2fs_iget+0x55/0x4c60 fs/f2fs/inode.c:483 f2fs_fill_super+0x3ad7/0x6bb0 fs/f2fs/super.c:4293 mount_bdev+0x2ae/0x3e0 fs/super.c:1443 f2fs_mount+0x34/0x40 fs/f2fs/super.c:4642 legacy_get_tree+0xea/0x190 fs/fs_context.c:632 vfs_get_tree+0x89/0x260 fs/super.c:1573 do_new_mount+0x25a/0xa20 fs/namespace.c:3056 page_owner free stack trace missing Memory state around the buggy address: ffff88812d962100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812d962180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88812d962200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88812d962280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88812d962300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== [1] https://syzkaller.appspot.com/x/report.txt?x=13448368580000 This bug can be reproduced w/ the reproducer [2], once we enable CONFIG_F2FS_CHECK_FS config, the reproducer will trigger panic as below, so the direct reason of this bug is the same as the one below patch [3] fixed. kernel BUG at fs/f2fs/inode.c:857! RIP: 0010:f2fs_evict_inode+0x1204/0x1a20 Call Trace: <TASK> evict+0x32a/0x7a0 do_unlinkat+0x37b/0x5b0 __x64_sys_unlink+0xad/0x100 do_syscall_64+0x5a/0xb0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0010:f2fs_evict_inode+0x1204/0x1a20 [2] https://syzkaller.appspot.com/x/repro.c?x=17495ccc580000 [3] https://lore.kernel.org/linux-f2fs-devel/20250702120321.1080759-1-chao@kernel.org Tracepoints before panic: f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file1 f2fs_unlink_exit: dev = (7,0), ino = 7, ret = 0 f2fs_evict_inode: dev = (7,0), ino = 7, pino = 3, i_mode = 0x81ed, i_size = 10, i_nlink = 0, i_blocks = 0, i_advise = 0x0 f2fs_truncate_node: dev = (7,0), ino = 7, nid = 8, block_address = 0x3c05 f2fs_unlink_enter: dev = (7,0), dir ino = 3, i_size = 4096, i_blocks = 8, name = file3 f2fs_unlink_exit: dev = (7,0), ino = 8, ret = 0 f2fs_evict_inode: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 9000, i_nlink = 0, i_blocks = 24, i_advise = 0x4 f2fs_truncate: dev = (7,0), ino = 8, pino = 3, i_mode = 0x81ed, i_size = 0, i_nlink = 0, i_blocks = 24, i_advise = 0x4 f2fs_truncate_blocks_enter: dev = (7,0), ino = 8, i_size = 0, i_blocks = 24, start file offset = 0 f2fs_truncate_blocks_exit: dev = (7,0), ino = 8, ret = -2 The root cause is: in the fuzzed image, dnode #8 belongs to inode #7, after inode #7 eviction, dnode #8 was dropped. However there is dirent that has ino #8, so, once we unlink file3, in f2fs_evict_inode(), both f2fs_truncate() and f2fs_update_inode_page() will fail due to we can not load node #8, result in we missed to call f2fs_inode_synced() to clear inode dirty status. Let's fix this by calling f2fs_inode_synced() in error path of f2fs_evict_inode(). PS: As I verified, the reproducer [2] can trigger this bug in v6.1.129, but it failed in v6.16-rc4, this is because the testcase will stop due to other corruption has been detected by f2fs: F2FS-fs (loop0): inconsistent node block, node_type:2, nid:8, node_footer[nid:8,ino:8,ofs:0,cpver:5013063228981249506,blkaddr:15366] F2FS-fs (loop0): f2fs_lookup: inode (ino=9) has zero i_nlink Fixes: 0f18b462b2e5 ("f2fs: flush inode metadata when checkpoint is doing") Closes: https://syzkaller.appspot.com/x/report.txt?x=13448368580000 Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-09f2fs: fix to avoid UAF in f2fs_sync_inode_meta()Chao Yu
syzbot reported an UAF issue as below: [1] [2] [1] https://syzkaller.appspot.com/text?tag=CrashReport&x=16594c60580000 ================================================================== BUG: KASAN: use-after-free in __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 Read of size 8 at addr ffff888100567dc8 by task kworker/u4:0/8 CPU: 1 PID: 8 Comm: kworker/u4:0 Tainted: G W 6.1.129-syzkaller-00017-g642656a36791 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 Workqueue: writeback wb_workfn (flush-7:0) Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x151/0x1b7 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:316 [inline] print_report+0x158/0x4e0 mm/kasan/report.c:427 kasan_report+0x13c/0x170 mm/kasan/report.c:531 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report_generic.c:351 __list_del_entry_valid+0xa6/0x130 lib/list_debug.c:62 __list_del_entry include/linux/list.h:134 [inline] list_del_init include/linux/list.h:206 [inline] f2fs_inode_synced+0x100/0x2e0 fs/f2fs/super.c:1553 f2fs_update_inode+0x72/0x1c40 fs/f2fs/inode.c:588 f2fs_update_inode_page+0x135/0x170 fs/f2fs/inode.c:706 f2fs_write_inode+0x416/0x790 fs/f2fs/inode.c:734 write_inode fs/fs-writeback.c:1460 [inline] __writeback_single_inode+0x4cf/0xb80 fs/fs-writeback.c:1677 writeback_sb_inodes+0xb32/0x1910 fs/fs-writeback.c:1903 __writeback_inodes_wb+0x118/0x3f0 fs/fs-writeback.c:1974 wb_writeback+0x3da/0xa00 fs/fs-writeback.c:2081 wb_check_background_flush fs/fs-writeback.c:2151 [inline] wb_do_writeback fs/fs-writeback.c:2239 [inline] wb_workfn+0xbba/0x1030 fs/fs-writeback.c:2266 process_one_work+0x73d/0xcb0 kernel/workqueue.c:2299 worker_thread+0xa60/0x1260 kernel/workqueue.c:2446 kthread+0x26d/0x300 kernel/kthread.c:386 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> Allocated by task 298: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_alloc_info+0x1f/0x30 mm/kasan/generic.c:505 __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:333 kasan_slab_alloc include/linux/kasan.h:202 [inline] slab_post_alloc_hook+0x53/0x2c0 mm/slab.h:768 slab_alloc_node mm/slub.c:3421 [inline] slab_alloc mm/slub.c:3431 [inline] __kmem_cache_alloc_lru mm/slub.c:3438 [inline] kmem_cache_alloc_lru+0x102/0x270 mm/slub.c:3454 alloc_inode_sb include/linux/fs.h:3255 [inline] f2fs_alloc_inode+0x2d/0x350 fs/f2fs/super.c:1437 alloc_inode fs/inode.c:261 [inline] iget_locked+0x18c/0x7e0 fs/inode.c:1373 f2fs_iget+0x55/0x4ca0 fs/f2fs/inode.c:486 f2fs_lookup+0x3c1/0xb50 fs/f2fs/namei.c:484 __lookup_slow+0x2b9/0x3e0 fs/namei.c:1689 lookup_slow+0x5a/0x80 fs/namei.c:1706 walk_component+0x2e7/0x410 fs/namei.c:1997 lookup_last fs/namei.c:2454 [inline] path_lookupat+0x16d/0x450 fs/namei.c:2478 filename_lookup+0x251/0x600 fs/namei.c:2507 vfs_statx+0x107/0x4b0 fs/stat.c:229 vfs_fstatat fs/stat.c:267 [inline] vfs_lstat include/linux/fs.h:3434 [inline] __do_sys_newlstat fs/stat.c:423 [inline] __se_sys_newlstat+0xda/0x7c0 fs/stat.c:417 __x64_sys_newlstat+0x5b/0x70 fs/stat.c:417 x64_sys_call+0x52/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:7 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x3b/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 Freed by task 0: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4b/0x70 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x40 mm/kasan/generic.c:516 ____kasan_slab_free+0x131/0x180 mm/kasan/common.c:241 __kasan_slab_free+0x11/0x20 mm/kasan/common.c:249 kasan_slab_free include/linux/kasan.h:178 [inline] slab_free_hook mm/slub.c:1745 [inline] slab_free_freelist_hook mm/slub.c:1771 [inline] slab_free mm/slub.c:3686 [inline] kmem_cache_free+0x291/0x560 mm/slub.c:3711 f2fs_free_inode+0x24/0x30 fs/f2fs/super.c:1584 i_callback+0x4b/0x70 fs/inode.c:250 rcu_do_batch+0x552/0xbe0 kernel/rcu/tree.c:2297 rcu_core+0x502/0xf40 kernel/rcu/tree.c:2557 rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2574 handle_softirqs+0x1db/0x650 kernel/softirq.c:624 __do_softirq kernel/softirq.c:662 [inline] invoke_softirq kernel/softirq.c:479 [inline] __irq_exit_rcu+0x52/0xf0 kernel/softirq.c:711 irq_exit_rcu+0x9/0x10 kernel/softirq.c:723 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1118 [inline] sysvec_apic_timer_interrupt+0xa9/0xc0 arch/x86/kernel/apic/apic.c:1118 asm_sysvec_apic_timer_interrupt+0x1b/0x20 arch/x86/include/asm/idtentry.h:691 Last potentially related work creation: kasan_save_stack+0x3b/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xb4/0xc0 mm/kasan/generic.c:486 kasan_record_aux_stack_noalloc+0xb/0x10 mm/kasan/generic.c:496 __call_rcu_common kernel/rcu/tree.c:2807 [inline] call_rcu+0xdc/0x10f0 kernel/rcu/tree.c:2926 destroy_inode fs/inode.c:316 [inline] evict+0x87d/0x930 fs/inode.c:720 iput_final fs/inode.c:1834 [inline] iput+0x616/0x690 fs/inode.c:1860 do_unlinkat+0x4e1/0x920 fs/namei.c:4396 __do_sys_unlink fs/namei.c:4437 [inline] __se_sys_unlink fs/namei.c:4435 [inline] __x64_sys_unlink+0x49/0x50 fs/namei.c:4435 x64_sys_call+0x289/0x9a0 arch/x86/include/generated/asm/syscalls_64.h:88 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x3b/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x68/0xd2 The buggy address belongs to the object at ffff888100567a10 which belongs to the cache f2fs_inode_cache of size 1360 The buggy address is located 952 bytes inside of 1360-byte region [ffff888100567a10, ffff888100567f60) The buggy address belongs to the physical page: page:ffffea0004015800 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x100560 head:ffffea0004015800 order:3 compound_mapcount:0 compound_pincount:0 flags: 0x4000000000010200(slab|head|zone=1) raw: 4000000000010200 0000000000000000 dead000000000122 ffff8881002c4d80 raw: 0000000000000000 0000000080160016 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 3, migratetype Reclaimable, gfp_mask 0xd2050(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_RECLAIMABLE), pid 298, tgid 298 (syz-executor330), ts 26489303743, free_ts 0 set_page_owner include/linux/page_owner.h:33 [inline] post_alloc_hook+0x213/0x220 mm/page_alloc.c:2637 prep_new_page+0x1b/0x110 mm/page_alloc.c:2644 get_page_from_freelist+0x3a98/0x3b10 mm/page_alloc.c:4539 __alloc_pages+0x234/0x610 mm/page_alloc.c:5837 alloc_slab_page+0x6c/0xf0 include/linux/gfp.h:-1 allocate_slab mm/slub.c:1962 [inline] new_slab+0x90/0x3e0 mm/slub.c:2015 ___slab_alloc+0x6f9/0xb80 mm/slub.c:3203 __slab_alloc+0x5d/0xa0 mm/slub.c:3302 slab_alloc_node mm/slub.c:3387 [inline] slab_alloc mm/slub.c:3431 [inline] __kmem_cache_alloc_lru mm/slub.c:3438 [inline] kmem_cache_alloc_lru+0x149/0x270 mm/slub.c:3454 alloc_inode_sb include/linux/fs.h:3255 [inline] f2fs_alloc_inode+0x2d/0x350 fs/f2fs/super.c:1437 alloc_inode fs/inode.c:261 [inline] iget_locked+0x18c/0x7e0 fs/inode.c:1373 f2fs_iget+0x55/0x4ca0 fs/f2fs/inode.c:486 f2fs_fill_super+0x5360/0x6dc0 fs/f2fs/super.c:4488 mount_bdev+0x282/0x3b0 fs/super.c:1445 f2fs_mount+0x34/0x40 fs/f2fs/super.c:4743 legacy_get_tree+0xf1/0x190 fs/fs_context.c:632 page_owner free stack trace missing Memory state around the buggy address: ffff888100567c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888100567d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888100567d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888100567e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888100567e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== [2] https://syzkaller.appspot.com/text?tag=CrashLog&x=13654c60580000 [ 24.675720][ T28] audit: type=1400 audit(1745327318.732:72): avc: denied { write } for pid=298 comm="syz-executor399" name="/" dev="loop0" ino=3 scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1 [ 24.705426][ T296] ------------[ cut here ]------------ [ 24.706608][ T28] audit: type=1400 audit(1745327318.732:73): avc: denied { remove_name } for pid=298 comm="syz-executor399" name="file0" dev="loop0" ino=4 scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1 [ 24.711550][ T296] WARNING: CPU: 0 PID: 296 at fs/f2fs/inode.c:847 f2fs_evict_inode+0x1262/0x1540 [ 24.734141][ T28] audit: type=1400 audit(1745327318.732:74): avc: denied { rename } for pid=298 comm="syz-executor399" name="file0" dev="loop0" ino=4 scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1 [ 24.742969][ T296] Modules linked in: [ 24.765201][ T28] audit: type=1400 audit(1745327318.732:75): avc: denied { add_name } for pid=298 comm="syz-executor399" name="bus" scontext=root:sysadm_r:sysadm_t tcontext=system_u:object_r:unlabeled_t tclass=dir permissive=1 [ 24.768847][ T296] CPU: 0 PID: 296 Comm: syz-executor399 Not tainted 6.1.129-syzkaller-00017-g642656a36791 #0 [ 24.799506][ T296] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025 [ 24.809401][ T296] RIP: 0010:f2fs_evict_inode+0x1262/0x1540 [ 24.815018][ T296] Code: 34 70 4a ff eb 0d e8 2d 70 4a ff 4d 89 e5 4c 8b 64 24 18 48 8b 5c 24 28 4c 89 e7 e8 78 38 03 00 e9 84 fc ff ff e8 0e 70 4a ff <0f> 0b 4c 89 f7 be 08 00 00 00 e8 7f 21 92 ff f0 41 80 0e 04 e9 61 [ 24.834584][ T296] RSP: 0018:ffffc90000db7a40 EFLAGS: 00010293 [ 24.840465][ T296] RAX: ffffffff822aca42 RBX: 0000000000000002 RCX: ffff888110948000 [ 24.848291][ T296] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000 [ 24.856064][ T296] RBP: ffffc90000db7bb0 R08: ffffffff822ac6a8 R09: ffffed10200b005d [ 24.864073][ T296] R10: 0000000000000000 R11: dffffc0000000001 R12: ffff888100580000 [ 24.871812][ T296] R13: dffffc0000000000 R14: ffff88810fef4078 R15: 1ffff920001b6f5c The root cause is w/ a fuzzed image, f2fs may missed to clear FI_DIRTY_INODE flag for target inode, after f2fs_evict_inode(), the inode is still linked in sbi->inode_list[DIRTY_META] global list, once it triggers checkpoint, f2fs_sync_inode_meta() may access the released inode. In f2fs_evict_inode(), let's always call f2fs_inode_synced() to clear FI_DIRTY_INODE flag and drop inode from global dirty list to avoid this UAF issue. Fixes: 0f18b462b2e5 ("f2fs: flush inode metadata when checkpoint is doing") Closes: https://syzkaller.appspot.com/bug?extid=849174b2efaf0d8be6ba Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-09f2fs: use kfree() instead of kvfree() to free some memoryJiazi Li
options in f2fs_fill_super is alloc by kstrdup: options = kstrdup((const char *)data, GFP_KERNEL) sit_bitmap[_mir], nat_bitmap[_mir] are alloc by kmemdup: sit_i->sit_bitmap = kmemdup(src_bitmap, sit_bitmap_size, GFP_KERNEL); sit_i->sit_bitmap_mir = kmemdup(src_bitmap, sit_bitmap_size, GFP_KERNEL); nm_i->nat_bitmap = kmemdup(version_bitmap, nm_i->bitmap_size, GFP_KERNEL); nm_i->nat_bitmap_mir = kmemdup(version_bitmap, nm_i->bitmap_size, GFP_KERNEL); write_io is alloc by f2fs_kmalloc: sbi->write_io[i] = f2fs_kmalloc(sbi, array_size(n, sizeof(struct f2fs_bio_info)) Use kfree is more efficient. Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: peixuan.qiu <peixuan.qiu@transsion.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-07-09gfs2: Remove GIF_ALLOC_FAILED flagAndreas Gruenbacher
Get rid of the GIF_ALLOC_FAILED flag; we can now be confident that the additional consistency check isn't needed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09gfs2: Use SECTOR_SIZE and SECTOR_SHIFTAndreas Gruenbacher
Use the SECTOR_SIZE and SECTOR_SHIFT constants where appropriate instead of hardcoding their values. Reported-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2025-07-09eventpoll: don't decrement ep refcount while still holding the ep mutexLinus Torvalds
Jann Horn points out that epoll is decrementing the ep refcount and then doing a mutex_unlock(&ep->mtx); afterwards. That's very wrong, because it can lead to a use-after-free. That pattern is actually fine for the very last reference, because the code in question will delay the actual call to "ep_free(ep)" until after it has unlocked the mutex. But it's wrong for the much subtler "next to last" case when somebody *else* may also be dropping their reference and free the ep while we're still using the mutex. Note that this is true even if that other user is also using the same ep mutex: mutexes, unlike spinlocks, can not be used for object ownership, even if they guarantee mutual exclusion. A mutex "unlock" operation is not atomic, and as one user is still accessing the mutex as part of unlocking it, another user can come in and get the now released mutex and free the data structure while the first user is still cleaning up. See our mutex documentation in Documentation/locking/mutex-design.rst, in particular the section [1] about semantics: "mutex_unlock() may access the mutex structure even after it has internally released the lock already - so it's not safe for another context to acquire the mutex and assume that the mutex_unlock() context is not using the structure anymore" So if we drop our ep ref before the mutex unlock, but we weren't the last one, we may then unlock the mutex, another user comes in, drops _their_ reference and releases the 'ep' as it now has no users - all while the mutex_unlock() is still accessing it. Fix this by simply moving the ep refcount dropping to outside the mutex: the refcount itself is atomic, and doesn't need mutex protection (that's the whole _point_ of refcounts: unlike mutexes, they are inherently about object lifetimes). Reported-by: Jann Horn <jannh@google.com> Link: https://docs.kernel.org/locking/mutex-design.html#semantics [1] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-07-09debugfs_get_aux(): allow storing non-const void *Al Viro
typechecking is up to users, anyway... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702212616.GI3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09debugfs: split short and full proxy wrappers, kill debugfs_real_fops()Al Viro
All users outside of fs/debugfs/file.c are gone, in there we can just fully split the wrappers for full and short cases and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20250702212419.GG3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-09resctrl: get rid of pointless debugfs_file_{get,put}()Al Viro
->write() of file_operations that gets used only via debugfs_create_file() is called only under debugfs_file_get() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lore.kernel.org/r/20250702211650.GD3406663@ZenIV Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-08bcachefs: Don't set BCH_FS_error on transaction restartKent Overstreet
This started showing up more when we started logging the error being corrected in the journal - but __bch2_fsck_err() could return transaction restarts before that. Setting BCH_FS_error incorrectly causes recovery passes to not be cleared, among other issues. Fixes: b43f72492768 ("bcachefs: Log fsck errors in the journal") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-08ksmbd: fix potential use-after-free in oplock/lease break ackNamjae Jeon
If ksmbd_iov_pin_rsp return error, use-after-free can happen by accessing opinfo->state and opinfo_put and ksmbd_fd_put could called twice. Reported-by: Ziyan Xu <research@securitygossip.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()Al Viro
If the call of ksmbd_vfs_lock_parent() fails, we drop the parent_path references and return an error. We need to drop the write access we just got on parent_path->mnt before we drop the mount reference - callers assume that ksmbd_vfs_kern_path_locked() returns with mount write access grabbed if and only if it has returned 0. Fixes: 864fb5d37163 ("ksmbd: fix possible deadlock in smb2_open") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08smb: server: make use of rdma_destroy_qp()Stefan Metzmacher
The qp is created by rdma_create_qp() as t->cm_id->qp and t->qp is just a shortcut. rdma_destroy_qp() also calls ib_destroy_qp(cm_id->qp) internally, but it is protected by a mutex, clears the cm_id and also calls trace_cm_qp_destroy(). This should make the tracing more useful as both rdma_create_qp() and rdma_destroy_qp() are traces and it makes the code look more sane as functions from the same layer are used for the specific qp object. trace-cmd stream -e rdma_cma:cm_qp_create -e rdma_cma:cm_qp_destroy shows this now while doing a mount and unmount from a client: <...>-80 [002] 378.514182: cm_qp_create: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 pd.id=0 qp_type=RC send_wr=867 recv_wr=255 qp_num=1 rc=0 <...>-6283 [001] 381.686172: cm_qp_destroy: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 qp_num=1 Before we only saw the first line. Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <stfrench@microsoft.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers") Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Tom Talpey <tom@talpey.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressablePankaj Raghav
Since [1], it is possible for filesystems to have blocksize > PAGE_SIZE of the system. Remove the assumption and make the check generic for all blocksizes in generic_check_addressable(). [1] https://lore.kernel.org/linux-xfs/20240822135018.1931258-1-kernel@pankajraghav.com/ Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/20250630104018.213985-1-p.raghav@samsung.com Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08fs/buffer: remove the min and max limit checks in __getblk_slow()Pankaj Raghav
All filesystems will already check the max and min value of their block size during their initialization. __getblk_slow() is a very low-level function to have these checks. Remove them and only check for logical block size alignment. As this check with logical block size alignment might never trigger, add WARN_ON_ONCE() to the check. As WARN_ON_ONCE() will already print the stack, remove the call to dump_stack(). Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/20250626113223.181399-1-p.raghav@samsung.com Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08fs: Prevent file descriptor table allocations exceeding INT_MAXSasha Levin
When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c: WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288 This happens because kvmalloc_array() and kvmalloc() check if the requested size exceeds INT_MAX and emit a warning when the allocation is not flagged with __GFP_NOWARN. Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a process calls dup2(oldfd, 1073741880), the kernel attempts to allocate: - File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes - Multiple bitmaps: ~400MB - Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647) Reproducer: 1. Set /proc/sys/fs/nr_open to 1073741816: # echo 1073741816 > /proc/sys/fs/nr_open 2. Run a program that uses a high file descriptor: #include <unistd.h> #include <sys/resource.h> int main() { struct rlimit rlim = {1073741824, 1073741824}; setrlimit(RLIMIT_NOFILE, &rlim); dup2(2, 1073741880); // Triggers the warning return 0; } 3. Observe WARNING in dmesg at mm/slub.c:5027 systemd commit a8b627a introduced automatic bumping of fs.nr_open to the maximum possible value. The rationale was that systems with memory control groups (memcg) no longer need separate file descriptor limits since memory is properly accounted. However, this change overlooked that: 1. The kernel's allocation functions still enforce INT_MAX as a maximum size regardless of memcg accounting 2. Programs and tests that legitimately test file descriptor limits can inadvertently trigger massive allocations 3. The resulting allocations (>8GB) are impractical and will always fail systemd's algorithm starts with INT_MAX and keeps halving the value until the kernel accepts it. On most systems, this results in nr_open being set to 1073741816 (0x3ffffff8), which is just under 1GB of file descriptors. While processes rarely use file descriptors near this limit in normal operation, certain selftests (like tools/testing/selftests/core/unshare_test.c) and programs that test file descriptor limits can trigger this issue. Fix this by adding a check in alloc_fdtable() to ensure the requested allocation size does not exceed INT_MAX. This causes the operation to fail with -EMFILE instead of triggering a kernel warning and avoids the impractical >8GB memory allocation request. Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open") Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08xfs: remove the bt_bdev_file buftarg fieldChristoph Hellwig
And use bt_file for both bdev and shmem backed buftargs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: rename the bt_bdev_* buftarg fieldsChristoph Hellwig
The extra bdev_ is weird, so drop it. Also improve the comment to make it clear these are the hardware limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: refactor xfs_calc_atomic_write_unit_maxChristoph Hellwig
This function and the helpers used by it duplicate the same logic for AGs and RTGs. Use the xfs_group_type enum to unify both variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: add a xfs_group_type_buftarg helperChristoph Hellwig
Generalize the xfs_group_type helper in the discard code to return a buftarg and move it to xfs_mount.h, and use the result in xfs_dax_notify_dev_failure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: remove the call to sync_blockdev in xfs_configure_buftargChristoph Hellwig
This extra call is not needed as xfs_alloc_buftarg already calls sync_blockdev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: clean up the initial read logic in xfs_readsbChristoph Hellwig
The initial sb read is always for a device logical block size buffer. The device logical block size is provided in the bt_logical_sectorsize in struct buftarg, so use that instead of the confusingly named xfs_getsize_buftarg buffer that reads it from the bdev. Update the comments surrounding the code to better describe what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: replace strncpy with memcpy in xattr listingPranav Tyagi
Use memcpy() in place of strncpy() in __xfs_xattr_put_listent(). The length is known and a null byte is added manually. No functional change intended. Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08fold fs_struct->{lock,seq} into a seqlockAl Viro
The combination of spinlock_t lock and seqcount_spinlock_t seq in struct fs_struct is an open-coded seqlock_t (see linux/seqlock_types.h). Combine and switch to equivalent seqlock_t primitives. AFAICS, that does end up with the same sequence of underlying operations in all cases. While we are at it, get_fs_pwd() is open-coded verbatim in get_path_from_fd(); rather than applying conversion to it, replace with the call of get_fs_pwd() there. Not worth splitting the commit for that, IMO... A bit of historical background - conversion of seqlock_t to use of seqcount_spinlock_t happened several months after the same had been done to struct fs_struct; switching fs_struct to seqlock_t could've been done immediately after that, but it looks like nobody had gotten around to that until now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250702053437.GC1880847@ZenIV Acked-by: Ahmed S. Darwish <darwi@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08Revert "fs/ntfs3: Replace inode_trylock with inode_lock"Konstantin Komarov
This reverts commit 69505fe98f198ee813898cbcaf6770949636430b. Initially, conditional lock acquisition was removed to fix an xfstest bug that was observed during internal testing. The deadlock reported by syzbot is resolved by reintroducing conditional acquisition. The xfstest bug no longer occurs on kernel version 6.16-rc1 during internal testing. I assume that changes in other modules may have contributed to this. Fixes: 69505fe98f19 ("fs/ntfs3: Replace inode_trylock with inode_lock") Reported-by: syzbot+a91fcdbd2698f99db8f4@syzkaller.appspotmail.com Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-07-07bcachefs: Fix additional misalignment in journal space calculationsKent Overstreet
Additional fix on top of f54b2a80d0df bcachefs: Fix misaligned bucket check in journal space calculations Make sure that when we calculate space for the next entry it's not misaligned: we need to round_down() to filesystem block size in multiple places (next entry size calculation as well as total space available). Reported-by: Ondřej Kraus <neverberlerfellerer@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07bcachefs: Don't schedule non persistent passes persistentlyKent Overstreet
if (!(in_recovery && (flags & RUN_RECOVERY_PASS_nopersistent))) should have been if (!in_recovery && !(flags & RUN_RECOVERY_PASS_nopersistent))) But the !in_recovery part was also wrong: the assumption is that if we're in recovery we'll just rewind and run the recovery pass immediately, but we're not able to do so if we've already gone RW and the pass must be run before we go RW. In that case, we need to schedule it in the superblock so it can be run on the next mount attempt. Scheduling it persistently is fine, because it'll be cleared in the superblock immediately when the pass completes successfully. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07fs/ntfs3: Exclude call make_bad_inode for live nodes.Konstantin Komarov
Use ntfs_inode field 'ni_bad' to mark inode as bad (if something went wrong) and to avoid any operations Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-07-07coredump: fix PIDFD_INFO_COREDUMP ioctl checkLaura Brehm
In Commit 1d8db6fd698de1f73b1a7d72aea578fdd18d9a87 ("pidfs, coredump: add PIDFD_INFO_COREDUMP"), the following code was added: if (mask & PIDFD_INFO_COREDUMP) { kinfo.mask |= PIDFD_INFO_COREDUMP; kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask); } [...] if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) { task_lock(task); if (task->mm) kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags); task_unlock(task); } The second bit in particular looks off to me - the condition in essence checks whether PIDFD_INFO_COREDUMP was **not** requested, and if so fetches the coredump_mask in kinfo, since it's checking !(kinfo.mask & PIDFD_INFO_COREDUMP), which is unconditionally set in the earlier hunk. I'm tempted to assume the idea in the second hunk was to calculate the coredump mask if one was requested but fetched in the first hunk, in which case the check should be if ((kinfo.mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask)) which might be more legibly written as if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask)) This could also instead be achieved by changing the first hunk to be: if (mask & PIDFD_INFO_COREDUMP) { kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask); if (kinfo.coredump_mask) kinfo.mask |= PIDFD_INFO_COREDUMP; } and the second hunk to: if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.mask & PIDFD_INFO_COREDUMP)) { task_lock(task); if (task->mm) { kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags); kinfo.mask |= PIDFD_INFO_COREDUMP; } task_unlock(task); } However, when looking at this, the supposition that the second hunk means to cover cases where the coredump info was requested but the first hunk failed to get it starts getting doubtful, so apologies if I'm completely off-base. This patch addresses the issue by fixing the check in the second hunk. Signed-off-by: Laura Brehm <laurabrehm@hey.com> Link: https://lore.kernel.org/20250703120244.96908-3-laurabrehm@hey.com Cc: brauner@kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: add coredump_skip() helperChristian Brauner
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-24-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: avoid pointless variableChristian Brauner
we don't use that value at all so don't bother with it in the first place. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-23-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: order auto cleanup variables at the topChristian Brauner
so they're easy to spot. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-22-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: add coredump_cleanup()Christian Brauner
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-21-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: auto cleanup prepare_creds()Christian Brauner
which will allow us to simplify the exit path in further patches. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-20-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: directly returnChristian Brauner
instead of jumping to a pointless cleanup label. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-18-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: auto cleanup argvChristian Brauner
to prepare for a simpler exit path. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-17-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: add coredump_write()Christian Brauner
to encapsulate that logic simplifying vfs_coredump() even further. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-16-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: use a single helper for the socketChristian Brauner
Don't split it into multiple functions. Just use a single one like we do for coredump_file() and coredump_pipe() now. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-15-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: move pipe specific file check into coredump_pipe()Christian Brauner
There's no point in having this eyesore in the middle of vfs_coredump(). Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-14-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>