summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-06-26bcachefs: Ensure btree node scan runs before checking for scanned nodesKent Overstreet
Previously, calling bch2_btree_has_scanned_nodes() when btree node scan hadn't actually run would erroniously return false - causing us to think a btree was entirely gone. This fixes a 6.16 regression from moving the scheduling of btree node scan out of bch2_btree_lost_data() (fixing the bug where we'd schedule it persistently in the superblock) and only scheduling it when check_toploogy() is asking for scanned btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-26bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofixKent Overstreet
Autofix is specified in btree_gc.c if it's not an important btree. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-25Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull mount fixes from Al Viro: "Several mount-related fixes" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: userns and mnt_idmap leak in open_tree_attr(2) attach_recursive_mnt(): do not lock the covering tree when sliding something under it replace collect_mounts()/drop_collected_mounts() with a safer variant
2025-06-25fuse: fix runtime warning on truncate_folio_batch_exceptionals()Haiyue Wang
The WARN_ON_ONCE is introduced on truncate_folio_batch_exceptionals() to capture whether the filesystem has removed all DAX entries or not. And the fix has been applied on the filesystem xfs and ext4 by the commit 0e2f80afcfa6 ("fs/dax: ensure all pages are idle prior to filesystem unmount"). Apply the missed fix on filesystem fuse to fix the runtime warning: [ 2.011450] ------------[ cut here ]------------ [ 2.011873] WARNING: CPU: 0 PID: 145 at mm/truncate.c:89 truncate_folio_batch_exceptionals+0x272/0x2b0 [ 2.012468] Modules linked in: [ 2.012718] CPU: 0 UID: 1000 PID: 145 Comm: weston Not tainted 6.16.0-rc2-WSL2-STABLE #2 PREEMPT(undef) [ 2.013292] RIP: 0010:truncate_folio_batch_exceptionals+0x272/0x2b0 [ 2.013704] Code: 48 63 d0 41 29 c5 48 8d 1c d5 00 00 00 00 4e 8d 6c 2a 01 49 c1 e5 03 eb 09 48 83 c3 08 49 39 dd 74 83 41 f6 44 1c 08 01 74 ef <0f> 0b 49 8b 34 1e 48 89 ef e8 10 a2 17 00 eb df 48 8b 7d 00 e8 35 [ 2.014845] RSP: 0018:ffffa47ec33f3b10 EFLAGS: 00010202 [ 2.015279] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 2.015884] RDX: 0000000000000000 RSI: ffffa47ec33f3ca0 RDI: ffff98aa44f3fa80 [ 2.016377] RBP: ffff98aa44f3fbf0 R08: ffffa47ec33f3ba8 R09: 0000000000000000 [ 2.016942] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa47ec33f3ca0 [ 2.017437] R13: 0000000000000008 R14: ffffa47ec33f3ba8 R15: 0000000000000000 [ 2.017972] FS: 000079ce006afa40(0000) GS:ffff98aade441000(0000) knlGS:0000000000000000 [ 2.018510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2.018987] CR2: 000079ce03e74000 CR3: 000000010784f006 CR4: 0000000000372eb0 [ 2.019518] Call Trace: [ 2.019729] <TASK> [ 2.019901] truncate_inode_pages_range+0xd8/0x400 [ 2.020280] ? timerqueue_add+0x66/0xb0 [ 2.020574] ? get_nohz_timer_target+0x2a/0x140 [ 2.020904] ? timerqueue_add+0x66/0xb0 [ 2.021231] ? timerqueue_del+0x2e/0x50 [ 2.021646] ? __remove_hrtimer+0x39/0x90 [ 2.022017] ? srso_alias_untrain_ret+0x1/0x10 [ 2.022497] ? psi_group_change+0x136/0x350 [ 2.023046] ? _raw_spin_unlock+0xe/0x30 [ 2.023514] ? finish_task_switch.isra.0+0x8d/0x280 [ 2.024068] ? __schedule+0x532/0xbd0 [ 2.024551] fuse_evict_inode+0x29/0x190 [ 2.025131] evict+0x100/0x270 [ 2.025641] ? _atomic_dec_and_lock+0x39/0x50 [ 2.026316] ? __pfx_generic_delete_inode+0x10/0x10 [ 2.026843] __dentry_kill+0x71/0x180 [ 2.027335] dput+0xeb/0x1b0 [ 2.027725] __fput+0x136/0x2b0 [ 2.028054] __x64_sys_close+0x3d/0x80 [ 2.028469] do_syscall_64+0x6d/0x1b0 [ 2.028832] ? clear_bhb_loop+0x30/0x80 [ 2.029182] ? clear_bhb_loop+0x30/0x80 [ 2.029533] ? clear_bhb_loop+0x30/0x80 [ 2.029902] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 2.030423] RIP: 0033:0x79ce03d0d067 [ 2.030820] Code: b8 ff ff ff ff e9 3e ff ff ff 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 c3 a7 f8 ff [ 2.032354] RSP: 002b:00007ffef0498948 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 [ 2.032939] RAX: ffffffffffffffda RBX: 00007ffef0498960 RCX: 000079ce03d0d067 [ 2.033612] RDX: 0000000000000003 RSI: 0000000000001000 RDI: 000000000000000d [ 2.034289] RBP: 00007ffef0498a30 R08: 000000000000000d R09: 0000000000000000 [ 2.034944] R10: 00007ffef0498978 R11: 0000000000000246 R12: 0000000000000001 [ 2.035610] R13: 00007ffef0498960 R14: 000079ce03e09ce0 R15: 0000000000000003 [ 2.036301] </TASK> [ 2.036532] ---[ end trace 0000000000000000 ]--- Link: https://lkml.kernel.org/r/20250621171507.3770-1-haiyuewa@163.com Fixes: bde708f1a65d ("fs/dax: always remove DAX page-cache entries when breaking layouts") Signed-off-by: Haiyue Wang <haiyuewa@163.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-25fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folioDavid Hildenbrand
is_zero_pfn() does not work for the huge zero folio. Fix it by using is_huge_zero_pmd(). This can cause the PAGEMAP_SCAN ioctl against /proc/pid/pagemap to present pages as PAGE_IS_PRESENT rather than as PAGE_IS_PFNZERO. Found by code inspection. Link: https://lkml.kernel.org/r/20250617143532.2375383-1-david@redhat.com Fixes: 52526ca7fdb9 ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-25smb: client: remove \t from TP_printk statementsStefan Metzmacher
The generate '[FAILED TO PARSE]' strings in trace-cmd report output like this: rm-5298 [001] 6084.533748493: smb3_exit_err: [FAILED TO PARSE] xid=972 func_name=cifs_rmdir rc=-39 rm-5298 [001] 6084.533959234: smb3_enter: [FAILED TO PARSE] xid=973 func_name=cifs_closedir rm-5298 [001] 6084.533967630: smb3_close_enter: [FAILED TO PARSE] xid=973 fid=94489281833 tid=1 sesid=96758029877361 rm-5298 [001] 6084.534004008: smb3_cmd_enter: [FAILED TO PARSE] tid=1 sesid=96758029877361 cmd=6 mid=566 rm-5298 [001] 6084.552248232: smb3_cmd_done: [FAILED TO PARSE] tid=1 sesid=96758029877361 cmd=6 mid=566 rm-5298 [001] 6084.552280542: smb3_close_done: [FAILED TO PARSE] xid=973 fid=94489281833 tid=1 sesid=96758029877361 rm-5298 [001] 6084.552316034: smb3_exit_done: [FAILED TO PARSE] xid=973 func_name=cifs_closedir Cc: stable@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-25smb: client: let smbd_post_send_iter() respect the peers max_send_size and ↵Stefan Metzmacher
transmit all data We should not send smbdirect_data_transfer messages larger than the negotiated max_send_size, typically 1364 bytes, which means 24 bytes of the smbdirect_data_transfer header + 1340 payload bytes. This happened when doing an SMB2 write with more than 1340 bytes (which is done inline as it's below rdma_readwrite_threshold). It means the peer resets the connection. When testing between cifs.ko and ksmbd.ko something like this is logged: client: CIFS: VFS: RDMA transport re-established siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 siw: got TERMINATE. layer 1, type 2, code 2 CIFS: VFS: \\carina Send error in SessSetup = -11 smb2_reconnect: 12 callbacks suppressed CIFS: VFS: reconnect tcon failed rc = -11 CIFS: VFS: reconnect tcon failed rc = -11 CIFS: VFS: reconnect tcon failed rc = -11 CIFS: VFS: SMB: Zero rsize calculated, using minimum value 65536 and: CIFS: VFS: RDMA transport re-established siw: got TERMINATE. layer 1, type 2, code 2 CIFS: VFS: smbd_recv:1894 disconnected siw: got TERMINATE. layer 1, type 2, code 2 The ksmbd dmesg is showing things like: smb_direct: Recv error. status='local length error (1)' opcode=128 smb_direct: disconnected smb_direct: Recv error. status='local length error (1)' opcode=128 ksmbd: smb_direct: disconnected ksmbd: sock_read failed: -107 As smbd_post_send_iter() limits the transmitted number of bytes we need loop over it in order to transmit the whole iter. Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Tested-by: Meetakshi Setiya <msetiya@microsoft.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Cc: <stable+noautosel@kernel.org> # sp->max_send_size should be info->max_send_size in backports Fixes: 3d78fe73fa12 ("cifs: Build the RDMA SGE list directly from an iterator") Signed-off-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-24bcachefs: fix bch2_journal_keys_peek_prev_min() underflowKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24bcachefs: Use wait_on_allocator() when allocating journalKent Overstreet
wait_on_allocator() emits debug info when we hang trying to allocate. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24f2fs: convert F2FS_I_SB to sbi in f2fs_setattr()wangzijie
Introduce sbi in f2fs_setattr() and convert F2FS_I_SB to it. No logic change, just cleanup and prepare to get CAP_BLKS_PER_SEC(sbi). Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-24f2fs: Fix the typos in commentsSwarna Prabhu
This patch fixes minor typos in comments in f2fs. Signed-off-by: Swarna Prabhu <s.prabhu@samsung.com> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-24bcachefs: Check for bad write buffer key when moving from journalKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blockedAlan Huang
Reported-by: syzbot+d540192e763531d307ff@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24fhandle, pidfs: support open_by_handle_at() purely based on file handleChristian Brauner
Various filesystems such as pidfs (and likely drm in the future) have a use-case to support opening files purely based on the handle without having to require a file descriptor to another object. That's especially the case for filesystems that don't do any lookup whatsoever and there's zero relationship between the objects. Such filesystems are also singletons that stay around for the lifetime of the system meaning that they can be uniquely identified and accessed purely based on the file handle type. Enable that so that userspace doesn't have to allocate an object needlessly especially if they can't do that for whatever reason. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-10-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24userns and mnt_idmap leak in open_tree_attr(2)Al Viro
Once want_mount_setattr() has returned a positive, it does require finish_mount_kattr() to release ->mnt_userns. Failing do_mount_setattr() does not change that. As the result, we can end up leaking userns and possibly mnt_idmap as well. Fixes: c4a16820d901 ("fs: add open_tree_attr()") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-24fs: Remove three arguments from block_write_end()Matthew Wilcox (Oracle)
block_write_end() looks like it can be used as a ->write_end() implementation. However, it can't as it does not unlock nor put the folio. Since it does not use the 'file', 'mapping' nor 'fsdata' arguments, remove them. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24fhandle: reflow get_path_anchor()Christian Brauner
Switch to a more common coding style. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-5-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24pidfs: add pidfs_root_path() helperChristian Brauner
Allow to return the root of the global pidfs filesystem. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-4-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24fhandle: rename to get_path_anchor()Christian Brauner
Rename as we're going to expand the function in the next step. The path just serves as the anchor tying the decoding to the filesystem. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-3-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24fhandle: hoist copy_from_user() above get_path_from_fd()Christian Brauner
In follow-up patches we need access to @file_handle->handle_type before we start caring about get_path_from_fd(). Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-2-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24fhandle: raise FILEID_IS_DIR in handle_typeChristian Brauner
Currently FILEID_IS_DIR is raised in fh_flags which is wrong. Raise it in handle->handle_type were it's supposed to be. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-1-d02a04858fe3@kernel.org Fixes: c374196b2b9f ("fs: name_to_handle_at() support for "explicit connectable" file handles") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24fuse: fix fuse_fill_write_pages() upper bound calculationJoanne Koong
This fixes a bug in commit 63c69ad3d18a ("fuse: refactor fuse_fill_write_pages()") where max_pages << PAGE_SHIFT is mistakenly used as the calculation for the max_pages upper limit but there's the possibility that copy_folio_from_iter_atomic() may copy over bytes from the iov_iter that are less than the full length of the folio, which would lead to exceeding max_pages. This commit fixes it by adding a 'ap->num_folios < max_folios' check. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250614000114.910380-1-joannelkoong@gmail.com Fixes: 63c69ad3d18a ("fuse: refactor fuse_fill_write_pages()") Tested-by: Brian Foster <bfoster@redhat.com> Reported-by: Brian Foster <bfoster@redhat.com> Closes: https://lore.kernel.org/linux-fsdevel/aEq4haEQScwHIWK6@bfoster/ Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23net: make sk->sk_rcvtimeo locklessEric Dumazet
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout() and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23f2fs: compress: fix UAF of f2fs_inode_info in f2fs_free_dicZhiguo Niu
The decompress_io_ctx may be released asynchronously after I/O completion. If this file is deleted immediately after read, and the kworker of processing post_read_wq has not been executed yet due to high workloads, It is possible that the inode(f2fs_inode_info) is evicted and freed before it is used f2fs_free_dic. The UAF case as below: Thread A Thread B - f2fs_decompress_end_io - f2fs_put_dic - queue_work add free_dic work to post_read_wq - do_unlink - iput - evict - call_rcu This file is deleted after read. Thread C kworker to process post_read_wq - rcu_do_batch - f2fs_free_inode - kmem_cache_free inode is freed by rcu - process_scheduled_works - f2fs_late_free_dic - f2fs_free_dic - f2fs_release_decomp_mem read (dic->inode)->i_compress_algorithm This patch store compress_algorithm and sbi in dic to avoid inode UAF. In addition, the previous solution is deprecated in [1] may cause system hang. [1] https://lore.kernel.org/all/c36ab955-c8db-4a8b-a9d0-f07b5f426c3f@kernel.org Cc: Daeho Jeong <daehojeong@google.com> Fixes: bff139b49d9f ("f2fs: handle decompress only post processing in softirq") Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Baocong Liu <baocong.liu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: compress: change the first parameter of page_array_{alloc,free} to sbiZhiguo Niu
No logic changes, just cleanup and prepare for fixing the UAF issue in f2fs_free_dic. Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Baocong Liu <baocong.liu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: introduce reserved_pin_section sysfs entryChao Yu
This patch introduces /sys/fs/f2fs/<dev>/reserved_pin_section for tuning @needed parameter of has_not_enough_free_secs(), if we configure it w/ zero, it can avoid f2fs_gc() as much as possible while fallocating on pinned file. Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: wangzijie <wangzijie1@honor.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: fix to avoid invalid wait context issueChao Yu
============================= [ BUG: Invalid wait context ] 6.13.0-rc1 #84 Tainted: G O ----------------------------- cat/56160 is trying to lock: ffff888105c86648 (&cprc->stat_lock){+.+.}-{3:3}, at: update_general_status+0x32a/0x8c0 [f2fs] other info that might help us debug this: context-{5:5} 2 locks held by cat/56160: #0: ffff88810a002a98 (&p->lock){+.+.}-{4:4}, at: seq_read_iter+0x56/0x4c0 #1: ffffffffa0462638 (f2fs_stat_lock){....}-{2:2}, at: stat_show+0x29/0x1020 [f2fs] stack backtrace: CPU: 0 UID: 0 PID: 56160 Comm: cat Tainted: G O 6.13.0-rc1 #84 Tainted: [O]=OOT_MODULE Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 Call Trace: <TASK> dump_stack_lvl+0x88/0xd0 dump_stack+0x14/0x20 __lock_acquire+0x8d4/0xbb0 lock_acquire+0xd6/0x300 _raw_spin_lock+0x38/0x50 update_general_status+0x32a/0x8c0 [f2fs] stat_show+0x50/0x1020 [f2fs] seq_read_iter+0x116/0x4c0 seq_read+0xfa/0x130 full_proxy_read+0x66/0x90 vfs_read+0xc4/0x350 ksys_read+0x74/0xf0 __x64_sys_read+0x1d/0x20 x64_sys_call+0x17d9/0x1b80 do_syscall_64+0x68/0x130 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x7f2ca53147e2 - seq_read - stat_show - raw_spin_lock_irqsave(&f2fs_stat_lock, flags) : f2fs_stat_lock is raw_spinlock_t type variable - update_general_status - spin_lock(&sbi->cprc_info.stat_lock); : stat_lock is spinlock_t type variable The root cause is the lock order is incorrect [1], we should not acquire spinlock_t lock after raw_spinlock_t lock, as if CONFIG_PREEMPT_LOCK is on, spinlock_t is implemented based on rtmutex, which can sleep after holding the lock. To fix this issue, let's use change f2fs_stat_lock lock type from raw_spinlock_t to spinlock_t, it's safe due to: - we don't need to use raw version of spinlock as the path is not performance sensitive. - we don't need to use irqsave version of spinlock as it won't be used in irq context. Quoted from [1]: "Extend lockdep to validate lock wait-type context. The current wait-types are: LD_WAIT_FREE, /* wait free, rcu etc.. */ LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */ LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */ LD_WAIT_SLEEP, /* sleeping locks, mutex_t etc.. */ Where lockdep validates that the current lock (the one being acquired) fits in the current wait-context (as generated by the held stack). This ensures that there is no attempt to acquire mutexes while holding spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In other words, its a more fancy might_sleep()." [1] https://lore.kernel.org/all/20200321113242.427089655@linutronix.de Fixes: 98237fcda4a2 ("f2fs: use spin_lock to avoid hang") Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: fix bio memleak when committing super blockSheng Yong
When committing new super block, bio is allocated but not freed, and kmemleak complains: unreferenced object 0xffff88801d185600 (size 192): comm "kworker/3:2", pid 128, jiffies 4298624992 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 80 67 c3 00 81 88 ff ff .........g...... 01 08 06 00 00 00 00 00 00 00 00 00 01 00 00 00 ................ backtrace (crc 650ecdb1): kmem_cache_alloc_noprof+0x3a9/0x460 mempool_alloc_noprof+0x12f/0x310 bio_alloc_bioset+0x1e2/0x7e0 __f2fs_commit_super+0xe0/0x370 f2fs_commit_super+0x4ed/0x8c0 f2fs_record_error_work+0xc7/0x190 process_one_work+0x7db/0x1970 worker_thread+0x518/0xea0 kthread+0x359/0x690 ret_from_fork+0x34/0x70 ret_from_fork_asm+0x1a/0x30 The issue can be reproduced by: mount /dev/vda /mnt i=0 while :; do echo '[h]abc' > /sys/fs/f2fs/vda/extension_list echo '[h]!abc' > /sys/fs/f2fs/vda/extension_list echo scan > /sys/kernel/debug/kmemleak dmesg | grep "new suspected memory leaks" [ $? -eq 0 ] && break i=$((i + 1)) echo "$i" done umount /mnt Fixes: 5bcde4557862 ("f2fs: get rid of buffer_head use") Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: do sanity check on fio.new_blkaddr in do_write_page()Chao Yu
F2FS-fs (dm-55): access invalid blkaddr:972878540 Call trace: dump_backtrace+0xec/0x128 show_stack+0x18/0x28 dump_stack_lvl+0x40/0x88 dump_stack+0x18/0x24 __f2fs_is_valid_blkaddr+0x360/0x3b4 f2fs_is_valid_blkaddr+0x10/0x20 f2fs_get_node_info+0x21c/0x60c __write_node_page+0x15c/0x734 f2fs_sync_node_pages+0x4f8/0x700 f2fs_write_checkpoint+0x4a8/0x99c __checkpoint_and_complete_reqs+0x7c/0x20c issue_checkpoint_thread+0x4c/0xd8 kthread+0x11c/0x1b0 ret_from_fork+0x10/0x20 If f2fs_allocate_data_block() fails, we may update nat.blkaddr w/ uninitialized fio.new_blkaddr. - __write_node_folio - f2fs_do_write_node_page - do_write_page - f2fs_allocate_data_block : once it fails, it may not allocate new blkaddr - set_node_addr : update w/ uninitialized fio.new_blkaddr variable I've checked all error paths in f2fs_allocate_data_block(), it should be tagged w/ CP_ERROR_FLAG. In addition, f2fs_allocate_data_block() succeeds, fio.new_blkaddr should be valid. Let's add f2fs_bug_on() to check above two conditions to detect any potential bugs. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: handle nat.blkaddr corruption in f2fs_get_node_info()Chao Yu
F2FS-fs (dm-55): access invalid blkaddr:972878540 Call trace: dump_backtrace+0xec/0x128 show_stack+0x18/0x28 dump_stack_lvl+0x40/0x88 dump_stack+0x18/0x24 __f2fs_is_valid_blkaddr+0x360/0x3b4 f2fs_is_valid_blkaddr+0x10/0x20 f2fs_get_node_info+0x21c/0x60c __write_node_page+0x15c/0x734 f2fs_sync_node_pages+0x4f8/0x700 f2fs_write_checkpoint+0x4a8/0x99c __checkpoint_and_complete_reqs+0x7c/0x20c issue_checkpoint_thread+0x4c/0xd8 kthread+0x11c/0x1b0 ret_from_fork+0x10/0x20 If nat.blkaddr is corrupted, during checkpoint, f2fs_sync_node_pages() will loop to flush node page w/ corrupted nat.blkaddr. Although, it tags SBI_NEED_FSCK, checkpoint can not persist it due to deadloop. Let's call f2fs_handle_error(, ERROR_INCONSISTENT_NAT) to record such error into superblock, it expects fsck can detect the error and repair inconsistent nat.blkaddr after device reboot. Note that, let's add sanity check in f2fs_get_node_info() to detect in-memory nat.blkaddr inconsistency, but only if CONFIG_F2FS_CHECK_FS is enabled. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: turn off one_time when forcibly set to foreground GCDaeho Jeong
one_time mode is only for background GC. So, we need to set it back to false when foreground GC is enforced. Fixes: 9748c2ddea4a ("f2fs: do FG_GC when GC boosting is required for zoned devices") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23f2fs: make sure zoned device GC to use FG_GC in shortage of free sectionDaeho Jeong
We already use FG_GC when we have free sections under gc_boost_zoned_gc_percent. So, let's make it consistent. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23Merge tag 'f2fs-for-6.16-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs fixes from Jaegeuk Kim: - fix double-unlock introduced by the recent folio conversion - fix stale page content beyond EOF complained by xfstests/generic/363 * tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: f2fs: fix to zero post-eof page f2fs: Fix __write_node_folio() conversion
2025-06-23Merge tag 'for-6.16-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Fixes: - fix invalid inode pointer dereferences during log replay - fix a race between renames and directory logging - fix shutting down delayed iput worker - fix device byte accounting when dropping chunk - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together Regression fixes: - fix possible double unlock of extent buffer tree (xarray conversion) - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion) Error handling fixes and updates: - handle unexpected extent type when replaying log - check and warn if there are remaining delayed inodes when putting a root - fix assertion when building free space tree - handle csum tree error with mount option 'rescue=ibadroot' Other: - error message updates: add prefix to all scrub related messages, include other information in messages" * tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix alloc_offset calculation for partly conventional block groups btrfs: handle csum tree error with rescue=ibadroots correctly btrfs: fix race between async reclaim worker and close_ctree() btrfs: fix assertion when building free space tree btrfs: don't silently ignore unexpected extent type when replaying log btrfs: fix invalid inode pointer dereferences during log replay btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb btrfs: update superblock's device bytes_used when dropping chunk btrfs: fix a race between renames and directory logging btrfs: scrub: add prefix for the error messages btrfs: warn if leaking delayed_nodes in btrfs_put_root() btrfs: fix delayed ref refcount leak in debug assertion btrfs: include root in error message when unlinking inode btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-06-23attach_recursive_mnt(): do not lock the covering tree when sliding something ↵Al Viro
under it If we are propagating across the userns boundary, we need to lock the mounts added there. However, in case when something has already been mounted there and we end up sliding a new tree under that, the stuff that had been there before should not get locked. IOW, lock_mnt_tree() should be called before we reparent the preexisting tree on top of what we are adding. Fixes: 3bd045cc9c4b ("separate copying and locking mount tree on cross-userns copies") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23replace collect_mounts()/drop_collected_mounts() with a safer variantAl Viro
collect_mounts() has several problems - one can't iterate over the results directly, so it has to be done with callback passed to iterate_mounts(); it has an oopsable race with d_invalidate(); it creates temporary clones of mounts invisibly for sync umount (IOW, you can have non-lazy umount succeed leaving filesystem not mounted anywhere and yet still busy). A saner approach is to give caller an array of struct path that would pin every mount in a subtree, without cloning any mounts. * collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone * collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or a pointer to array of struct path, one for each chunk of tree visible under 'where' (i.e. the first element is a copy of where, followed by (mount,root) for everything mounted under it - the same set collect_mounts() would give). Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning references to the roots of subtrees in the caller's namespace. Array is terminated by {NULL, NULL} struct path. If it fits into preallocated array (on-stack, normally), that's where it goes; otherwise it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated' is ignored (and expected to be NULL). * drop_collected_paths(paths, preallocated) is given the array returned by an earlier call of collect_paths() and the preallocated array passed to that call. All mount/dentry references are dropped and array is kfree'd if it's not equal to 'preallocated'. * instead of iterate_mounts(), users should just iterate over array of struct path - nothing exotic is needed for that. Existing users (all in audit_tree.c) are converted. [folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>] Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry") Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23fs/ntfs3: cancle set bad inode after removing name failsEdward Adam Davis
The reproducer uses a file0 on a ntfs3 file system with a corrupted i_link. When renaming, the file0's inode is marked as a bad inode because the file name cannot be deleted. The underlying bug is that make_bad_inode() is called on a live inode. In some cases it's "icache lookup finds a normal inode, d_splice_alias() is called to attach it to dentry, while another thread decides to call make_bad_inode() on it - that would evict it from icache, but we'd already found it there earlier". In some it's outright "we have an inode attached to dentry - that's how we got it in the first place; let's call make_bad_inode() on it just for shits and giggles". Fixes: 78ab59fee07f ("fs/ntfs3: Rework file operations") Reported-by: syzbot+1aa90f0eb1fc3e77d969@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1aa90f0eb1fc3e77d969 Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23fs/ntfs3: Add sanity check for file nameLizhi Xu
The length of the file name should be smaller than the directory entry size. Reported-by: syzbot+598057afa0f49e62bd23@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=598057afa0f49e62bd23 Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23fs/ntfs3: correctly create symlink for relative pathRong Zhang
After applying this patch, could correctly create symlink: ln -s "relative/path/to/file" symlink Signed-off-by: Rong Zhang <ulin0208@gmail.com> [almaz.alexandrovich@paragon-software.com: added cpu_to_le32 macro to rs->Flags assignment] Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23fs/ntfs3: fix symlinks cannot be handled correctlyRong Zhang
The symlinks created in windows will be broken in linux by ntfs3, the patch fixes it. Signed-off-by: Rong Zhang <ulin0208@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23NFSv4/pNFS: Fix a race to wake on NFS_LAYOUT_DRAINBenjamin Coddington
We found a few different systems hung up in writeback waiting on the same page lock, and one task waiting on the NFS_LAYOUT_DRAIN bit in pnfs_update_layout(), however the pnfs_layout_hdr's plh_outstanding count was zero. It seems most likely that this is another race between the waiter and waker similar to commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task"). Fix it up by applying the advised barrier. Fixes: 880265c77ac4 ("pNFS: Avoid a live lock condition in pnfs_update_layout()") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-06-23nfs: Clean up /proc/net/rpc/nfs when nfs_fs_proc_net_init() fails.Kuniyuki Iwashima
syzbot reported a warning below [1] following a fault injection in nfs_fs_proc_net_init(). [0] When nfs_fs_proc_net_init() fails, /proc/net/rpc/nfs is not removed. Later, rpc_proc_exit() tries to remove /proc/net/rpc, and the warning is logged as the directory is not empty. Let's handle the error of nfs_fs_proc_net_init() properly. [0]: FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 0 CPU: 1 UID: 0 PID: 6120 Comm: syz.2.27 Not tainted 6.16.0-rc1-syzkaller-00010-g2c4a1f3fe03e #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) should_fail_ex (lib/fault-inject.c:73 lib/fault-inject.c:174) should_failslab (mm/failslab.c:46) kmem_cache_alloc_noprof (mm/slub.c:4178 mm/slub.c:4204) __proc_create (fs/proc/generic.c:427) proc_create_reg (fs/proc/generic.c:554) proc_create_net_data (fs/proc/proc_net.c:120) nfs_fs_proc_net_init (fs/nfs/client.c:1409) nfs_net_init (fs/nfs/inode.c:2600) ops_init (net/core/net_namespace.c:138) setup_net (net/core/net_namespace.c:443) copy_net_ns (net/core/net_namespace.c:576) create_new_namespaces (kernel/nsproxy.c:110) unshare_nsproxy_namespaces (kernel/nsproxy.c:218 (discriminator 4)) ksys_unshare (kernel/fork.c:3123) __x64_sys_unshare (kernel/fork.c:3190) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) </TASK> [1]: remove_proc_entry: removing non-empty directory 'net/rpc', leaking at least 'nfs' WARNING: CPU: 1 PID: 6120 at fs/proc/generic.c:727 remove_proc_entry+0x45e/0x530 fs/proc/generic.c:727 Modules linked in: CPU: 1 UID: 0 PID: 6120 Comm: syz.2.27 Not tainted 6.16.0-rc1-syzkaller-00010-g2c4a1f3fe03e #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 RIP: 0010:remove_proc_entry+0x45e/0x530 fs/proc/generic.c:727 Code: 3c 02 00 0f 85 85 00 00 00 48 8b 93 d8 00 00 00 4d 89 f0 4c 89 e9 48 c7 c6 40 ba a2 8b 48 c7 c7 60 b9 a2 8b e8 33 81 1d ff 90 <0f> 0b 90 90 e9 5f fe ff ff e8 04 69 5e ff 90 48 b8 00 00 00 00 00 RSP: 0018:ffffc90003637b08 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff88805f534140 RCX: ffffffff817a92c8 RDX: ffff88807da99e00 RSI: ffffffff817a92d5 RDI: 0000000000000001 RBP: ffff888033431ac0 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff888033431a00 R13: ffff888033431ae4 R14: ffff888033184724 R15: dffffc0000000000 FS: 0000555580328500(0000) GS:ffff888124a62000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f71733743e0 CR3: 000000007f618000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> sunrpc_exit_net+0x46/0x90 net/sunrpc/sunrpc_syms.c:76 ops_exit_list net/core/net_namespace.c:200 [inline] ops_undo_list+0x2eb/0xab0 net/core/net_namespace.c:253 setup_net+0x2e1/0x510 net/core/net_namespace.c:457 copy_net_ns+0x2a6/0x5f0 net/core/net_namespace.c:574 create_new_namespaces+0x3ea/0xa90 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:218 ksys_unshare+0x45b/0xa40 kernel/fork.c:3121 __do_sys_unshare kernel/fork.c:3192 [inline] __se_sys_unshare kernel/fork.c:3190 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3190 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0x490 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa1a6b8e929 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff3a090368 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007fa1a6db5fa0 RCX: 00007fa1a6b8e929 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000080 RBP: 00007fa1a6c10b39 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa1a6db5fa0 R14: 00007fa1a6db5fa0 R15: 0000000000000001 </TASK> Fixes: d47151b79e32 ("nfs: expose /proc/net/sunrpc/nfs in net namespaces") Reported-by: syzbot+a4cc4ac22daa4a71b87c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a4cc4ac22daa4a71b87c Tested-by: syzbot+a4cc4ac22daa4a71b87c@syzkaller.appspotmail.com Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-06-23smb: client: fix regression with native SMB symlinksPaulo Alcantara
Some users and customers reported that their backup/copy tools started to fail when the directory being copied contained symlink targets that the client couldn't parse - even when those symlinks weren't followed. Fix this by allowing lstat(2) and readlink(2) to succeed even when the client can't resolve the symlink target, restoring old behavior. Cc: linux-cifs@vger.kernel.org Cc: stable@vger.kernel.org Reported-by: Remy Monsen <monsen@monsen.cc> Closes: https://lore.kernel.org/r/CAN+tdP7y=jqw3pBndZAGjQv0ObFq8Q=+PUDHgB36HdEz9QA6FQ@mail.gmail.com Reported-by: Pierguido Lambri <plambri@redhat.com> Fixes: 12b466eb52d9 ("cifs: Fix creating and resolving absolute NT-style symlinks") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-23pidfs: fix pidfs_free_pid()Christian Brauner
Ensure that we handle the case where task creation fails and pid->attr was never accessed at all. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23fs/ecryptfs: replace snprintf with sysfs_emit in show functionAnkit Chauhan
Use sysfs_emit() instead of snprintf() in version_show() function to follow the preferred kernel API. Signed-off-by: Ankit Chauhan <ankitchauhan2065@gmail.com> Link: https://lore.kernel.org/20250619031536.19352-1-ankitchauhan2065@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's nodeSong Liu
BPF programs, such as LSM and sched_ext, would benefit from tags on cgroups. One common practice to apply such tags is to set xattrs on cgroupfs folders. Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's xattr. Note that, we already have bpf_get_[file|dentry]_xattr. However, these two APIs are not ideal for reading cgroupfs xattrs, because: 1) These two APIs only works in sleepable contexts; 2) There is no kfunc that matches current cgroup to cgroupfs dentry. bpf_cgroup_read_xattr is generic and can be useful for many program types. It is also safe, because it requires trusted or rcu protected argument (KF_RCU). Therefore, we make it available to all program types. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23kernfs: remove iattr_mutexChristian Brauner
All allocations of struct kernfs_iattrs are serialized through a global mutex. Simply do a racy allocation and let the first one win. I bet most callers are under inode->i_rwsem anyway and it wouldn't be needed but let's not require that. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-2-song@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23ext4: add FALLOC_FL_WRITE_ZEROES supportZhang Yi
Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enable the unmap write zeroes operation. This first allocates blocks as unwritten, then issues a zero command outside of the running journal handle, and finally converts them to a written state. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-10-yi.zhang@huaweicloud.com Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23fs: introduce FALLOC_FL_WRITE_ZEROES to fallocateZhang Yi
With the development of flash-based storage devices, we can quickly write zeros to SSDs using the WRITE_ZERO command if the devices do not actually write physical zeroes to the media. Therefore, we can use this command to quickly preallocate a real all-zero file with written extents. This approach should be beneficial for subsequent pure overwriting within this file, as it can save on block allocation and, consequently, significant metadata changes, which should greatly improve overwrite performance on certain filesystems. Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to fallocate. This flag is used to convert a specified range of a file to zeros by issuing a zeroing operation. Blocks should be allocated for the regions that span holes in the file, and the entire range is converted to written extents. If the underlying device supports the actual offload write zeroes command, the process of zeroing out operation can be accelerated. If it does not, we currently don't prevent the file system from writing actual zeros to the device. This provides users with a new method to quickly generate a zeroed file, users no longer need to write zero data to create a file with written extents. Users can determine whether a disk supports the unmap write zeroes feature through querying this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes Users can also enable or disable the unmap write zeroes operation through this sysfs interface: /sys/block/<disk>/queue/write_zeroes_unmap_max_bytes Finally, this flag cannot be specified in conjunction with the FALLOC_FL_KEEP_SIZE since allocating written extents beyond file EOF is not permitted. In addition, filesystems that always require out-of-place writes should not support this flag since they still need to allocated new blocks during subsequent overwrites. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-7-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypassShivank Garg
Export anon_inode_make_secure_inode() to allow KVM guest_memfd to create anonymous inodes with proper security context. This replaces the current pattern of calling alloc_anon_inode() followed by inode_init_security_anon() for creating security context manually. This change also fixes a security regression in secretmem where the S_PRIVATE flag was not cleared after alloc_anon_inode(), causing LSM/SELinux checks to be bypassed for secretmem file descriptors. As guest_memfd currently resides in the KVM module, we need to export this symbol for use outside the core kernel. In the future, guest_memfd might be moved to core-mm, at which point the symbols no longer would have to be exported. When/if that happens is still unclear. Fixes: 2bfe15c52612 ("mm: create security context for memfd_secret inodes") Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/20250620070328.803704-3-shivankg@amd.com Acked-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>