summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-04-28f2fs: Use a folio in add_ipu_page()Matthew Wilcox (Oracle)
Convert the incoming page to a folio at the start, then use the folio later in the function. Moves a call to page_folio() earlier, and removes an access to page->mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-28f2fs: Use bio_for_each_folio_all() in __has_merged_page()Matthew Wilcox (Oracle)
Iterate over each folio rather than each page. Convert f2fs_compress_control_page() to f2fs_compress_control_folio() since this is the only caller. Removes a reference to page->mapping which is going away soon as well as calls to fscrypt_is_bounce_page() and fscrypt_pagecache_page(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-28f2fs: Use F2FS_P_SB() in f2fs_is_compressed_page()Matthew Wilcox (Oracle)
Remove a reference to page->mapping which is going away soon. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-28f2fs: Introduce fio_inode()Matthew Wilcox (Oracle)
This helper returns the inode associated with the f2fs_io_info. That's a relatively common thing to want, mildly awkward to get and provides one place to change if we decide to record it directly, or change fio->page to fio->folio. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> [Jaegeuk Kim: fix wrong fio_inode conversion] Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-28f2fs: Use a folio in f2fs_write_raw_pages()Matthew Wilcox (Oracle)
Convert each page in rpages to a folio before operating on it. Replaces eight calls to compound_head() with one and removes a reference to page->mapping which is going away soon. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-28f2fs: Use a folio in f2fs_compress_free_page()Matthew Wilcox (Oracle)
Convert the incoming page to a folio and operate on it. Removes a reference to page->mapping which is going away soon. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-14f2fs: prevent kernel warning due to negative i_nlink from corrupted imageJaegeuk Kim
WARNING: CPU: 1 PID: 9426 at fs/inode.c:417 drop_nlink+0xac/0xd0 home/cc/linux/fs/inode.c:417 Modules linked in: CPU: 1 UID: 0 PID: 9426 Comm: syz-executor568 Not tainted 6.14.0-12627-g94d471a4f428 #2 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:drop_nlink+0xac/0xd0 home/cc/linux/fs/inode.c:417 Code: 48 8b 5d 28 be 08 00 00 00 48 8d bb 70 07 00 00 e8 f9 67 e6 ff f0 48 ff 83 70 07 00 00 5b 5d e9 9a 12 82 ff e8 95 12 82 ff 90 &lt;0f&gt; 0b 90 c7 45 48 ff ff ff ff 5b 5d e9 83 12 82 ff e8 fe 5f e6 ff RSP: 0018:ffffc900026b7c28 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8239710f RDX: ffff888041345a00 RSI: ffffffff8239717b RDI: 0000000000000005 RBP: ffff888054509ad0 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000000 R11: ffffffff9ab36f08 R12: ffff88804bb40000 R13: ffff8880545091e0 R14: 0000000000008000 R15: ffff8880545091e0 FS: 000055555d0c5880(0000) GS:ffff8880eb3e3000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f915c55b178 CR3: 0000000050d20000 CR4: 0000000000352ef0 Call Trace: <task> f2fs_i_links_write home/cc/linux/fs/f2fs/f2fs.h:3194 [inline] f2fs_drop_nlink+0xd1/0x3c0 home/cc/linux/fs/f2fs/dir.c:845 f2fs_delete_entry+0x542/0x1450 home/cc/linux/fs/f2fs/dir.c:909 f2fs_unlink+0x45c/0x890 home/cc/linux/fs/f2fs/namei.c:581 vfs_unlink+0x2fb/0x9b0 home/cc/linux/fs/namei.c:4544 do_unlinkat+0x4c5/0x6a0 home/cc/linux/fs/namei.c:4608 __do_sys_unlink home/cc/linux/fs/namei.c:4654 [inline] __se_sys_unlink home/cc/linux/fs/namei.c:4652 [inline] __x64_sys_unlink+0xc5/0x110 home/cc/linux/fs/namei.c:4652 do_syscall_x64 home/cc/linux/arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xc7/0x250 home/cc/linux/arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fb3d092324b Code: 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 57 00 00 00 0f 05 &lt;48&gt; 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffdc232d938 EFLAGS: 00000206 ORIG_RAX: 0000000000000057 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb3d092324b RDX: 00007ffdc232d960 RSI: 00007ffdc232d960 RDI: 00007ffdc232d9f0 RBP: 00007ffdc232d9f0 R08: 0000000000000001 R09: 00007ffdc232d7c0 R10: 00000000fffffffd R11: 0000000000000206 R12: 00007ffdc232eaf0 R13: 000055555d0cebb0 R14: 00007ffdc232d958 R15: 0000000000000001 </task> Cc: stable@vger.kernel.org Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-12f2fs: fix to do sanity check on sbi->total_valid_block_countChao Yu
syzbot reported a f2fs bug as below: ------------[ cut here ]------------ kernel BUG at fs/f2fs/f2fs.h:2521! RIP: 0010:dec_valid_block_count+0x3b2/0x3c0 fs/f2fs/f2fs.h:2521 Call Trace: f2fs_truncate_data_blocks_range+0xc8c/0x11a0 fs/f2fs/file.c:695 truncate_dnode+0x417/0x740 fs/f2fs/node.c:973 truncate_nodes+0x3ec/0xf50 fs/f2fs/node.c:1014 f2fs_truncate_inode_blocks+0x8e3/0x1370 fs/f2fs/node.c:1197 f2fs_do_truncate_blocks+0x840/0x12b0 fs/f2fs/file.c:810 f2fs_truncate_blocks+0x10d/0x300 fs/f2fs/file.c:838 f2fs_truncate+0x417/0x720 fs/f2fs/file.c:888 f2fs_setattr+0xc4f/0x12f0 fs/f2fs/file.c:1112 notify_change+0xbca/0xe90 fs/attr.c:552 do_truncate+0x222/0x310 fs/open.c:65 handle_truncate fs/namei.c:3466 [inline] do_open fs/namei.c:3849 [inline] path_openat+0x2e4f/0x35d0 fs/namei.c:4004 do_filp_open+0x284/0x4e0 fs/namei.c:4031 do_sys_openat2+0x12b/0x1d0 fs/open.c:1429 do_sys_open fs/open.c:1444 [inline] __do_sys_creat fs/open.c:1522 [inline] __se_sys_creat fs/open.c:1516 [inline] __x64_sys_creat+0x124/0x170 fs/open.c:1516 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/syscall_64.c:94 The reason is: in fuzzed image, sbi->total_valid_block_count is inconsistent w/ mapped blocks indexed by inode, so, we should not trigger panic for such case, instead, let's print log and set fsck flag. Fixes: 39a53e0ce0df ("f2fs: add superblock and major in-memory structure") Reported-by: syzbot+8b376a77b2f364097fbe@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/67f3c0b2.050a0220.396535.0547.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-12f2fs: support to disable linear lookup fallbackChao Yu
After commit 91b587ba79e1 ("f2fs: Introduce linear search for dentries"), f2fs forced to use linear lookup whenever a hash-based lookup fails on casefolded directory, it may affect performance for scenarios: a) create a new file w/ filename it doesn't exist in directory, b) lookup a file which may be removed. This patch supports to disable linear lookup fallback, so, once there is a solution for commit 5c26d2f1d3f5 ("unicode: Don't special case ignorable code points") to fix red heart unicode issue, then we can set an encodeing flag to disable the fallback for performance recovery. The way is kept in line w/ ext4, refer to commit 9e28059d5664 ("ext4: introduce linear search for dentries"). Cc: Daniel Lee <chullee@google.com> Cc: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-12f2fs: prevent the current section from being selected as a victim during GCyohan.joung
When selecting a victim using next_victim_seg in a large section, the selected section might already have been cleared and designated as the new current section, making it actively in use. This behavior causes inconsistency between the SIT and SSA. F2FS-fs (dm-54): Inconsistent segment (70961) type [0, 1] in SSA and SIT Call trace: dump_backtrace+0xe8/0x10c show_stack+0x18/0x28 dump_stack_lvl+0x50/0x6c dump_stack+0x18/0x28 f2fs_stop_checkpoint+0x1c/0x3c do_garbage_collect+0x41c/0x271c f2fs_gc+0x27c/0x828 gc_thread_func+0x290/0x88c kthread+0x11c/0x164 ret_from_fork+0x10/0x20 issue scenario segs_per_sec=2 - seg#0 and seg#1 are all dirty - all valid blocks are removed in seg#1 - gc select this sec and next_victim_seg=seg#0 - migrate seg#0, next_victim_seg=seg#1 - checkpoint -> sec(seg#0, seg#1) becomes free - allocator assigns sec(seg#0, seg#1) to curseg - gc tries to migrate seg#1 Fixes: e3080b0120a1 ("f2fs: support subsectional garbage collection") Signed-off-by: yohan.joung <yohan.joung@sk.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: clean up unnecessary indentationJaegeuk Kim
No functional change. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: fix to do sanity check on ino and xnidChao Yu
syzbot reported a f2fs bug as below: INFO: task syz-executor140:5308 blocked for more than 143 seconds. Not tainted 6.14.0-rc7-syzkaller-00069-g81e4f8d68c66 #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor140 state:D stack:24016 pid:5308 tgid:5308 ppid:5306 task_flags:0x400140 flags:0x00000006 Call Trace: <TASK> context_switch kernel/sched/core.c:5378 [inline] __schedule+0x190e/0x4c90 kernel/sched/core.c:6765 __schedule_loop kernel/sched/core.c:6842 [inline] schedule+0x14b/0x320 kernel/sched/core.c:6857 io_schedule+0x8d/0x110 kernel/sched/core.c:7690 folio_wait_bit_common+0x839/0xee0 mm/filemap.c:1317 __folio_lock mm/filemap.c:1664 [inline] folio_lock include/linux/pagemap.h:1163 [inline] __filemap_get_folio+0x147/0xb40 mm/filemap.c:1917 pagecache_get_page+0x2c/0x130 mm/folio-compat.c:87 find_get_page_flags include/linux/pagemap.h:842 [inline] f2fs_grab_cache_page+0x2b/0x320 fs/f2fs/f2fs.h:2776 __get_node_page+0x131/0x11b0 fs/f2fs/node.c:1463 read_xattr_block+0xfb/0x190 fs/f2fs/xattr.c:306 lookup_all_xattrs fs/f2fs/xattr.c:355 [inline] f2fs_getxattr+0x676/0xf70 fs/f2fs/xattr.c:533 __f2fs_get_acl+0x52/0x870 fs/f2fs/acl.c:179 f2fs_acl_create fs/f2fs/acl.c:375 [inline] f2fs_init_acl+0xd7/0x9b0 fs/f2fs/acl.c:418 f2fs_init_inode_metadata+0xa0f/0x1050 fs/f2fs/dir.c:539 f2fs_add_inline_entry+0x448/0x860 fs/f2fs/inline.c:666 f2fs_add_dentry+0xba/0x1e0 fs/f2fs/dir.c:765 f2fs_do_add_link+0x28c/0x3a0 fs/f2fs/dir.c:808 f2fs_add_link fs/f2fs/f2fs.h:3616 [inline] f2fs_mknod+0x2e8/0x5b0 fs/f2fs/namei.c:766 vfs_mknod+0x36d/0x3b0 fs/namei.c:4191 unix_bind_bsd net/unix/af_unix.c:1286 [inline] unix_bind+0x563/0xe30 net/unix/af_unix.c:1379 __sys_bind_socket net/socket.c:1817 [inline] __sys_bind+0x1e4/0x290 net/socket.c:1848 __do_sys_bind net/socket.c:1853 [inline] __se_sys_bind net/socket.c:1851 [inline] __x64_sys_bind+0x7a/0x90 net/socket.c:1851 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Let's dump and check metadata of corrupted inode, it shows its xattr_nid is the same to its i_ino. dump.f2fs -i 3 chaseyu.img.raw i_xattr_nid [0x 3 : 3] So that, during mknod in the corrupted directory, it tries to get and lock inode page twice, result in deadlock. - f2fs_mknod - f2fs_add_inline_entry - f2fs_get_inode_page --- lock dir's inode page - f2fs_init_acl - f2fs_acl_create(dir,..) - __f2fs_get_acl - f2fs_getxattr - lookup_all_xattrs - __get_node_page --- try to lock dir's inode page In order to fix this, let's add sanity check on ino and xnid. Cc: stable@vger.kernel.org Reported-by: syzbot+cc448dcdc7ae0b4e4ffa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-f2fs-devel/67e06150.050a0220.21942d.0005.GAE@google.com Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: add a fast path in finish_preallocate_blocks()Chao Yu
This patch uses i_sem to protect access/update on f2fs_inode_info.flag in finish_preallocate_blocks(), it avoids grabbing inode_lock() in each open(). Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: zone: fix to avoid inconsistence in between SIT and SSAChao Yu
w/ below testcase, it will cause inconsistence in between SIT and SSA. create_null_blk 512 2 1024 1024 mkfs.f2fs -m /dev/nullb0 mount /dev/nullb0 /mnt/f2fs/ touch /mnt/f2fs/file f2fs_io pinfile set /mnt/f2fs/file fallocate -l 4GiB /mnt/f2fs/file F2FS-fs (nullb0): Inconsistent segment (0) type [1, 0] in SSA and SIT CPU: 5 UID: 0 PID: 2398 Comm: fallocate Tainted: G O 6.13.0-rc1 #84 Tainted: [O]=OOT_MODULE Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 Call Trace: <TASK> dump_stack_lvl+0xb3/0xd0 dump_stack+0x14/0x20 f2fs_handle_critical_error+0x18c/0x220 [f2fs] f2fs_stop_checkpoint+0x38/0x50 [f2fs] do_garbage_collect+0x674/0x6e0 [f2fs] f2fs_gc_range+0x12b/0x230 [f2fs] f2fs_allocate_pinning_section+0x5c/0x150 [f2fs] f2fs_expand_inode_data+0x1cc/0x3c0 [f2fs] f2fs_fallocate+0x3c3/0x410 [f2fs] vfs_fallocate+0x15f/0x4b0 __x64_sys_fallocate+0x4a/0x80 x64_sys_call+0x15e8/0x1b80 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x7f9dba5197ca F2FS-fs (nullb0): Stopped filesystem due to reason: 4 The reason is f2fs_gc_range() may try to migrate block in curseg, however, its SSA block is not uptodate due to the last summary block data is still in cache of curseg. In this patch, we add a condition in f2fs_gc_range() to check whether section is opened or not, and skip block migration for opened section. Fixes: 9703d69d9d15 ("f2fs: support file pinning for zoned devices") Reviewed-by: Daeho Jeong <daehojeong@google.com> Cc: Daeho Jeong <daehojeong@google.com> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: fix to set atomic write status more clearChao Yu
1. After we start atomic write in a database file, before committing all data, we'd better not set inode w/ vfs dirty status to avoid redundant updates, instead, we only set inode w/ atomic dirty status. 2. After we commit all data, before committing metadata, we need to clear atomic dirty status, and set vfs dirty status to allow vfs flush dirty inode. Cc: Daeho Jeong <daehojeong@google.com> Reported-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: fix to update injection attrs according to fault_optionChao Yu
When we update inject type via sysfs, it shows wrong rate value as below, there is a same problem when we update inject rate, fix it. Before: F2FS-fs (vdd): build fault injection attr: rate: 0, type: 0xffff F2FS-fs (vdd): build fault injection attr: rate: 1, type: 0x0 After: F2FS-fs (vdd): build fault injection type: 0x1 F2FS-fs (vdd): build fault injection rate: 1 Meanwhile, let's avoid turning on all fault types when we enable fault injection via fault_injection mount option, it will lead to shutdown filesystem or fail the mount() easily. mount -o fault_injection=4 /dev/vdd /mnt/f2fs F2FS-fs (vdd): build fault injection attr: rate: 4, type: 0x7fffff F2FS-fs (vdd): inject kmalloc in f2fs_kmalloc of f2fs_fill_super+0xbdf/0x27c0 Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: add a proc entry show inject statsChao Yu
This patch adds a proc entry named inject_stats to show total injected count for each fault type. cat /proc/fs/f2fs/<dev>/inject_stats fault_type injected_count kmalloc 0 kvmalloc 0 page alloc 0 page get 0 alloc bio(obsolete) 0 alloc nid 0 orphan 0 no more block 0 too big dir depth 0 evict_inode fail 0 truncate fail 0 read IO error 0 checkpoint error 0 discard error 0 write IO error 0 slab alloc 0 dquot initialize 0 lock_op 0 invalid blkaddr 0 inconsistent blkaddr 0 no free segment 0 inconsistent footer 0 Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-10f2fs: remove redundant assignment to variable errColin Ian King
The variable err is being assigned a value zero and then the following goto page_hit reassigns err a new value. The zero assignment is redundant and can be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> [Jaegeuk Kim: clean up braces and if condition, suggested by Dan Carpenter] Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-04-06Merge tag 'sched-urgent-2025-04-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: - Fix a nonsensical Kconfig combination - Remove an unnecessary rseq-notification * tag 'sched-urgent-2025-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rseq: Eliminate useless task_work on execve sched/isolation: Make CONFIG_CPU_ISOLATION depend on CONFIG_SMP
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-04Merge tag '6.15-rc-part2-smb3-client-fixes' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull more smb client updates from Steve French: - reconnect fixes: three for updating rsize/wsize and an SMB1 reconnect fix - RFC1001 fixes: fixing connections to nonstandard ports, and negprot retries - fix mfsymlinks to old servers - make mapping of open flags for SMB1 more accurate - permission fixes: adding retry on open for write, and one for stat to workaround unexpected access denied - add two new xattrs, one for retrieving SACL and one for retrieving owner (without having to retrieve the whole ACL) - fix mount parm validation for echo_interval - minor cleanup (including removing now unneeded cifs_truncate_page) * tag '6.15-rc-part2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal version number cifs: Implement is_network_name_deleted for SMB1 cifs: Remove cifs_truncate_page() as it should be superfluous cifs: Do not add FILE_READ_ATTRIBUTES when using GENERIC_READ/EXECUTE/ALL cifs: Improve SMB2+ stat() to work also without FILE_READ_ATTRIBUTES cifs: Add fallback for SMB2 CREATE without FILE_READ_ATTRIBUTES cifs: Fix querying and creating MF symlinks over SMB1 cifs: Fix access_flags_to_smbopen_mode cifs: Fix negotiate retry functionality cifs: Improve handling of NetBIOS packets cifs: Allow to disable or force initialization of NetBIOS session cifs: Add a new xattr system.smb3_ntsd_owner for getting or setting owner cifs: Add a new xattr system.smb3_ntsd_sacl for getting or setting SACLs smb: client: Update IO sizes after reconnection smb: client: Store original IO parameters and prevent zero IO sizes smb:client: smb: client: Add reverse mapping from tcon to superblocks cifs: remove unreachable code in cifs_get_tcp_session() cifs: fix integer overflow in match_server()
2025-04-03Merge tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: "Four ksmbd SMB3 server fixes, all also for stable" * tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix null pointer dereference in alloc_preauth_hash() ksmbd: validate zero num_subauth before sub_auth is accessed ksmbd: fix overflow in dacloffset bounds check ksmbd: fix session use-after-free in multichannel connection
2025-04-03fs: actually hold the namespace semaphoreChristian Brauner
Don't use a scoped guard that only protects the next statement. Use a regular guard to make sure that the namespace semaphore is held across the whole function. Signed-off-by: Christian Brauner <brauner@kernel.org> Reported-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/all/20250401170715.GA112019@unreal/ Fixes: db04662e2f4f ("fs: allow detached mounts in clone_private_mount()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-03Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull more bcachefs updates from Kent Overstreet: "More notable fixes: - Fix for striping behaviour on tiering filesystems where replicas exceeds durability on destination target - Fix a race in device removal where deleting alloc info races with the discard worker - Some small stack usage improvements: this is just enough for KMSAN builds to not blow the stack, more is queued up for 6.16" * tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs: bcachefs: Fix "journal stuck" during recovery bcachefs: backpointer_get_key: check for null from peek_slot() bcachefs: Fix null ptr deref in invalidate_one_bucket() bcachefs: Fix check_snapshot_exists() restart handling bcachefs: use nonblocking variant of print_string_as_lines in error path bcachefs: Fix scheduling while atomic from logging changes bcachefs: Add error handling for zlib_deflateInit2() bcachefs: add missing selection of XARRAY_MULTI bcachefs: bch_dev_usage_full bcachefs: Kill btree_iter.trans bcachefs: do_trace_key_cache_fill() bcachefs: Split up bch_dev.io_ref bcachefs: fix ref leak in btree_node_read_all_replicas bcachefs: Fix null ptr deref in bch2_write_endio() bcachefs: Fix field spanning write warning bcachefs: Fix striping behaviour
2025-04-03Merge tag '9p-for-6.15-rc1' of https://github.com/martinetd/linuxLinus Torvalds
Pull 9p updates from Dominique Martinet: - fix handling of bogus (negative/too long) replies - fix crash on mkdir with ACLs (... looks like nobody is using ACLs with semi-recent kernels...) - ipv6 support for trans=tcp - minor concurrency fix to make syzbot happy - minor cleanup * tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux: docs: fs/9p: Add missing "not" in cache documentation 9p: Use hashtable.h for hash_errmap Documentation/fs/9p: fix broken link 9p/trans_fd: mark concurrent read and writes to p9_conn->err 9p/net: return error on bogus (longer than requested) replies 9p/net: fix improper handling of bogus negative read/write replies fs/9p: fix NULL pointer dereference on mkdir net/9p/fd: support ipv6 for trans=tcp
2025-04-03Merge tag 'mm-hotfixes-stable-2025-04-02-21-57' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM hotfixes from Andrew Morton: "Five hotfixes. Three are cc:stable and the remainder address post-6.14 issues or aren't considered necessary for -stable kernels. All patches are for MM" * tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead() mm/hugetlb: move hugetlb_sysctl_init() to the __init section mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock mm/hugetlb_vmemmap: fix memory loads ordering mm/userfaultfd: fix release hang over concurrent GUP
2025-04-03bcachefs: Fix "journal stuck" during recoveryKent Overstreet
If we crash when the journal pin fifo is completely full - i.e. we're at the maximum number of dirty journal entries - that may put us in a sticky situation in recovery, as journal replay will need to be able to open new journal entries in order to get going. bch2_fs_journal_start() already had provisions for resizing the journal pin fifo if needed, but it needs a fudge factor to ensure there's room for journal replay. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: backpointer_get_key: check for null from peek_slot()Kent Overstreet
peek_slot() doesn't normally return bkey_s_c_null - except when we ask for a key at a btree level that doesn't exist, which can happen here. We might want to revisit this, but we'll have to look over all the places where we use peek_slot() on interior nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: Fix null ptr deref in invalidate_one_bucket()Kent Overstreet
bch2_backpointer_get_key() returns bkey_s_c_null when the target isn't found. backpointer_get_key() flags the error, so there's nothing else to do here - just skip it and move on. Link: https://github.com/koverstreet/bcachefs/issues/847 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: Fix check_snapshot_exists() restart handlingKent Overstreet
Codepaths that create entries in the snapshots btree currently call bch2_mark_snapshot(), which updates the in-memory snapshot table, before transaction commit. This is because bch2_mark_snapshot() is an atomic trigger, run with btree write locks held, and isn't allowed to fail - but it might need to reallocate the table, hence we call it early when we're still allowed to fail. This is generally harmless - if we fail, we'll have left an entry in the snapshots table around, but nothing will reference it and it'll get overwritten if reused by another transaction. But check_snapshot_exists(), which reconstructs snapshots when the snapshots btree has been corrupted or lost, was erronously rechecking if the snapshot exists inside the transaction commit loop - so on transaction restart (in this case mem_realloced), the second iteration would return without repairing. This code needs some cleanup: splitting out a "maybe realloc snapshots table" helper would have avoided this, that will be in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: use nonblocking variant of print_string_as_lines in error pathBharadwaj Raju
The inconsistency error path calls print_string_as_lines, which calls console_lock, which is a potentially-sleeping function and so can't be called in an atomic context. Replace calls to it with the nonblocking variant which is safe to call. Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: Fix scheduling while atomic from logging changesKent Overstreet
Two fixes from the recent logging changes: bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt context, or with rcu_read_lock() held. The one syzbot found is in bch2_bkey_pick_read_device bch2_dev_rcu bch2_fs_inconsistent We're starting to switch to lift the printbufs up to higher levels so we can emit better log messages and print them all in one go (avoid garbling), so that conversion will help with spotting these in the future; when we declare a printbuf it must be flagged if we're in an atomic context. Secondly, in btree_node_write_endio: 00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321 00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6 00085 preempt_count: 10001, expected: 0 00085 RCU nest depth: 0, expected: 0 00085 4 locks held by bch-reclaim/fa6/618: 00085 #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198 00085 #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440 00085 #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440 00085 #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130 00085 irq event stamp: 328 00085 hardirqs last enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0 00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60 00085 softirqs last enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118 00085 softirqs last disabled at (0): [<0000000000000000>] 0x0 00085 Preemption disabled at: 00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90 00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798 00085 Hardware name: linux,dummy-virt (DT) 00085 Call trace: 00085 show_stack+0x1c/0x30 (C) 00085 dump_stack_lvl+0x84/0xc0 00085 dump_stack+0x14/0x20 00085 __might_resched+0x180/0x288 00085 __might_sleep+0x4c/0x88 00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0 00085 krealloc_noprof+0x1a0/0x2d8 00085 bch2_printbuf_make_room+0x9c/0x120 00085 bch2_prt_printf+0x60/0x1b8 00085 btree_node_write_endio+0x1b0/0x2d8 00085 bio_endio+0x138/0x1f0 00085 btree_node_write_endio+0xe8/0x2d8 00085 bio_endio+0x138/0x1f0 00085 blk_update_request+0x220/0x4c0 00085 blk_mq_end_request+0x28/0x148 00085 virtblk_request_done+0x64/0xe8 00085 blk_mq_complete_request+0x34/0x40 00085 virtblk_done+0x78/0x130 00085 vring_interrupt+0x6c/0xb0 00085 __handle_irq_event_percpu+0x8c/0x2e0 00085 handle_irq_event+0x50/0xb0 00085 handle_fasteoi_irq+0xc4/0x250 00085 handle_irq_desc+0x44/0x60 00085 generic_handle_domain_irq+0x20/0x30 00085 gic_handle_irq+0x54/0xc8 00085 call_on_irq_stack+0x24/0x40 Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03bcachefs: Add error handling for zlib_deflateInit2()Wentao Liang
In attempt_compress(), the return value of zlib_deflateInit2() needs to be checked. A proper implementation can be found in pstore_compress(). Add an error check and return 0 immediately if the initialzation fails. Fixes: 986e9842fb68 ("bcachefs: Compression levels") Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-03rseq: Eliminate useless task_work on execveMathieu Desnoyers
Eliminate a useless task_work on execve by moving the call to rseq_set_notify_resume() from sched_mm_cid_after_execve() to the error path of bprm_execve(). The call to rseq_set_notify_resume() from sched_mm_cid_after_execve() is pointless in the success case, because rseq_execve() will clear the rseq pointer before returning to userspace. sched_mm_cid_after_execve() is called from both the success and error paths of bprm_execve(). The call to rseq_set_notify_resume() is needed on error because the mm_cid may have changed. Also move the rseq_execve() to right after sched_mm_cid_after_execve() in bprm_execve(). [ mingo: Merged to a recent upstream kernel, extended the changelog. ] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250327132945.1558783-1-mathieu.desnoyers@efficios.com
2025-04-02cifs: update internal version numberSteve French
To 2.52 Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-02cifs: Implement is_network_name_deleted for SMB1Pali Rohár
This change allows Linux SMB1 client to autoreconnect the share when it is modified on server by admin operation which removes and re-adds it. Implementation is reused from SMB2+ is_network_name_deleted callback. There are just adjusted checks for error codes and access to struct smb_hdr. Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-02cifs: Remove cifs_truncate_page() as it should be superfluousDavid Howells
The calls to cifs_truncate_page() should be superfluous as the places that call it also call truncate_setsize() or cifs_setsize() and therefore truncate_pagecache() which should also clear the tail part of the folio containing the EOF marker. Further, smb3_simple_falloc() calls both cifs_setsize() and truncate_setsize() in addition to cifs_truncate_page(). Remove the superfluous calls. This gets rid of another function referring to struct page. [Should cifs_setsize() also set inode->i_blocks?] Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <stfrench@microsoft.com> Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Matthew Wilcox <willy@infradead.org> cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-02Merge tag 'nfs-for-6.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Bugfixes: - Three fixes for looping in the NFSv4 state manager delegation code - Fix for the NFSv4 state XDR code (Neil Brown) - Fix a leaked reference in nfs_lock_and_join_requests() - Fix a use-after-free in the delegation return code Features: - Implement the NFSv4.2 copy offload OFFLOAD_STATUS operation to allow monitoring of an in-progress copy - Add a mount option to force NFSv3/NFSv4 to use READDIRPLUS in a getdents() call - SUNRPC now allows some basic management of an existing RPC client's connections using sysfs - Improvements to the automated teardown of a NFS client when the container it was initiated from gets killed - Improvements to prevent tasks from getting stuck in a killable wait state after calling exit_signals()" * tag 'nfs-for-6.15-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (29 commits) nfs: Add missing release on error in nfs_lock_and_join_requests() NFSv4: Check for delegation validity in nfs_start_delegation_return_locked() NFS: Don't allow waiting for exiting tasks SUNRPC: Don't allow waiting for exiting tasks NFSv4: Treat ENETUNREACH errors as fatal for state recovery NFSv4: clp->cl_cons_state < 0 signifies an invalid nfs_client NFSv4: Further cleanups to shutdown loops NFS: Shut down the nfs_client only after all the superblocks SUNRPC: rpc_clnt_set_transport() must not change the autobind setting SUNRPC: rpcbind should never reset the port to the value '0' pNFS/flexfiles: Report ENETDOWN as a connection error pNFS/flexfiles: Treat ENETUNREACH errors as fatal in containers NFS: Treat ENETUNREACH errors as fatal in containers NFS: Add a mount option to make ENETUNREACH errors fatal sunrpc: Add a sysfs file for one-step xprt deletion sunrpc: Add a sysfs file for adding a new xprt sunrpc: Add a sysfs files for rpc_clnt information sunrpc: Add a sysfs attr for xprtsec NFS: Add implid to sysfs NFS: Extend rdirplus mount option with "force|none" ...
2025-04-02Merge tag 'fuse-update-6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Allow connection to server to time out (Joanne Koong) - If server doesn't support creating a hard link, return EPERM rather than ENOSYS (Matt Johnston) - Allow file names longer than 1024 chars (Bernd Schubert) - Fix a possible race if request on io_uring queue is interrupted (Bernd Schubert) - Misc fixes and cleanups * tag 'fuse-update-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: remove unneeded atomic set in uring creation fuse: fix uring race condition for null dereference of fc fuse: Increase FUSE_NAME_MAX to PATH_MAX fuse: Allocate only namelen buf memory in fuse_notify_ fuse: add default_request_timeout and max_request_timeout sysctls fuse: add kernel-enforced timeout option for requests fuse: optmize missing FUSE_LINK support fuse: Return EPERM rather than ENOSYS from link() fuse: removed unused function fuse_uring_create() from header fuse: {io-uring} Fix a possible req cancellation race
2025-04-02Merge tag 'ntfs3_for_6.15' of ↵Linus Torvalds
https://github.com/Paragon-Software-Group/linux-ntfs3 Pull ntfs3 updates from Konstantin Komarov: - Fix integer overflows on 32-bit systems and in hdr_first_de() - Fix 'proc_info_root' leak on NTFS initialization failure - Remove unused functions ni_load_attr, ntfs_sb_read, ntfs_flush_inodes - update inode->i_mapping->a_ops on compression state - ensure atomicity of write operations - refactor ntfs_{create/remove}_{procdir,proc_root}() * tag 'ntfs3_for_6.15' of https://github.com/Paragon-Software-Group/linux-ntfs3: fs/ntfs3: Remove unused ntfs_flush_inodes fs/ntfs3: Remove unused ntfs_sb_read fs/ntfs3: Remove unused ni_load_attr fs/ntfs3: Prevent integer overflow in hdr_first_de() fs/ntfs3: Fix a couple integer overflows on 32bit systems fs/ntfs3: Update inode->i_mapping->a_ops on compression state fs/ntfs3: Fix WARNING in ntfs_extend_initialized_size fs/ntfs3: Fix 'proc_info_root' leak when init ntfs failed fs/ntfs3: Factor out ntfs_{create/remove}_proc_root() fs/ntfs3: Factor out ntfs_{create/remove}_procdir() fs/ntfs3: Keep write operations atomic
2025-04-02Merge tag 'vfs-6.15-rc1.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Add a new maintainer for configfs - Fix exportfs module description - Place flexible array memeber at the end of an internal struct in the mount code - Add new maintainer for netfslib as Jeff Layton is stepping down as current co-maintainer - Fix error handling in cachefiles_get_directory() - Cleanup do_notify_pidfd() - Fix syscall number definitions in pidfd selftests - Fix racy usage of fs_struct->in exec during multi-threaded exec - Ensure correct exit code is reported when pidfs_exit() is called from release_task() for a delayed thread-group leader exit - Fix conflicting iomap flag definitions * tag 'vfs-6.15-rc1.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: iomap: Fix conflicting values of iomap flags fs: namespace: Avoid -Wflex-array-member-not-at-end warning MAINTAINERS: configfs: add Andreas Hindborg as maintainer exportfs: add module description exit: fix the usage of delay_group_leader->exit_code in do_notify_parent() and pidfs_exit() netfs: add Paulo as maintainer and remove myself as Reviewer cachefiles: Fix oops in vfs_mkdir from cachefiles_get_directory exec: fix the racy usage of fs_struct->in_exec selftests/pidfd: fixes syscall number defines pidfs: cleanup the usage of do_notify_pidfd()
2025-04-02Merge tag 'uml-for-linux-6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML updates from Johannes Berg: - proper nofault accesses and read-only rodata - hostfs fix for host inode number reuse - fixes for host errno handling - various cleanups/small fixes * tag 'uml-for-linux-6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: Rewrite the sigio workaround based on epoll and tgkill um: Prohibit the VM_CLONE flag in run_helper_thread() um: Switch to the pthread-based helper in sigio workaround um: ubd: Switch to the pthread-based helper um: Add pthread-based helper support um: x86: clean up elf specific definitions um: Store full CSGSFS and SS register from mcontext um: virt-pci: Refactor virtio_pcidev into its own module um: work around sched_yield not yielding in time-travel mode um/locking: Remove semicolon from "lock" prefix um: Update min_low_pfn to match changes in uml_reserved um: use str_yes_no() to remove hardcoded "yes" and "no" um: hostfs: avoid issues on inode number reuse by host um: Allocate vdso page pointer statically um: remove copy_from_kernel_nofault_allowed um: mark rodata read-only and implement _nofault accesses um: Pass the correct Rust target and options with gcc
2025-04-02bcachefs: add missing selection of XARRAY_MULTIEric Biggers
When CONFIG_XARRAY_MULTI is not set, reading from a bcachefs file hits the 'BUG_ON(order > 0);' in xas_set_order(), because it tries to insert a large folio in the page cache. Fix this by making bcachefs select XARRAY_MULTI. Fixes: be212d86b19c ("bcachefs: bs > ps support") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: bch_dev_usage_fullKent Overstreet
All the fastpaths that need device usage don't need the sector totals or fragmentation, just bucket counts. Split bch_dev_usage up into two different versions, the normal one with just bucket counts. This is also a stack usage improvement, since we have a bch_dev_usage on the stack in the allocation path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: Kill btree_iter.transKent Overstreet
This was planned to be done ages ago, now finally completed; there are places where we have quite a few btree_trans objects on the stack, so this reduces stack usage somewhat. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: do_trace_key_cache_fill()Kent Overstreet
Reducing stack frame usage; this moves the printbuf out of the main stack frame. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02bcachefs: Split up bch_dev.io_refKent Overstreet
We now have separate per device io_refs for read and write access. This fixes a device removal bug where the discard workers were still running while we're removing alloc info for that device. It's also a bit of hardening; we no longer allow writes to devices that are read-only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-04-02nfs: Add missing release on error in nfs_lock_and_join_requests()Dan Carpenter
Call nfs_release_request() on this error path before returning. Fixes: c3f2235782c3 ("nfs: fold nfs_folio_find_and_lock_request into nfs_lock_and_join_requests") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/3aaaa3d5-1c8a-41e4-98c7-717801ddd171@stanley.mountain Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2025-04-01ksmbd: fix null pointer dereference in alloc_preauth_hash()Namjae Jeon
The Client send malformed smb2 negotiate request. ksmbd return error response. Subsequently, the client can send smb2 session setup even thought conn->preauth_info is not allocated. This patch add KSMBD_SESS_NEED_SETUP status of connection to ignore session setup request if smb2 negotiate phase is not complete. Cc: stable@vger.kernel.org Tested-by: Steve French <stfrench@microsoft.com> Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-26505 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-04-01mm/userfaultfd: fix release hang over concurrent GUPPeter Xu
This patch should fix a possible userfaultfd release() hang during concurrent GUP. This problem was initially reported by Dimitris Siakavaras in July 2023 [1] in a firecracker use case. Firecracker has a separate process handling page faults remotely, and when the process releases the userfaultfd it can race with a concurrent GUP from KVM trying to fault in a guest page during the secondary MMU page fault process. A similar problem was reported recently again by Jinjiang Tu in March 2025 [2], even though the race happened this time with a mlockall() operation, which does GUP in a similar fashion. In 2017, commit 656710a60e36 ("userfaultfd: non-cooperative: closing the uffd without triggering SIGBUS") was trying to fix this issue. AFAIU, that fixes well the fault paths but may not work yet for GUP. In GUP, the issue is NOPAGE will be almost treated the same as "page fault resolved" in faultin_page(), then the GUP will follow page again, seeing page missing, and it'll keep going into a live lock situation as reported. This change makes core mm return RETRY instead of NOPAGE for both the GUP and fault paths, proactively releasing the mmap read lock. This should guarantee the other release thread make progress on taking the write lock and avoid the live lock even for GUP. When at it, rearrange the comments to make sure it's uptodate. [1] https://lore.kernel.org/r/79375b71-db2e-3e66-346b-254c90d915e2@cslab.ece.ntua.gr [2] https://lore.kernel.org/r/20250307072133.3522652-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250312145131.1143062-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Dimitris Siakavaras <jimsiak@cslab.ece.ntua.gr> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>