path: root/fs
Age | Commit message | Author
2025-06-24 | bcachefs: Check for bad write buffer key when moving from journal | Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 | bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked | Alan Huang
Reported-by: syzbot+d540192e763531d307ff@syzkaller.appspotmail.com Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-24 | fhandle, pidfs: support open_by_handle_at() purely based on file handle | Christian Brauner
Various filesystems such as pidfs (and likely drm in the future) have a use-case to support opening files purely based on the handle without having to require a file descriptor to another object. That's especially the case for filesystems that don't do any lookup whatsoever and there's zero relationship between the objects. Such filesystems are also singletons that stay around for the lifetime of the system meaning that they can be uniquely identified and accessed purely based on the file handle type. Enable that so that userspace doesn't have to allocate an object needlessly especially if they can't do that for whatever reason. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-10-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 | userns and mnt_idmap leak in open_tree_attr(2) | Al Viro
Once want_mount_setattr() has returned a positive, it does require finish_mount_kattr() to release ->mnt_userns. Failing do_mount_setattr() does not change that. As the result, we can end up leaking userns and possibly mnt_idmap as well. Fixes: c4a16820d901 ("fs: add open_tree_attr()") Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-24 | fs: Remove three arguments from block_write_end() | Matthew Wilcox (Oracle)
block_write_end() looks like it can be used as a ->write_end() implementation. However, it can't as it does not unlock nor put the folio. Since it does not use the 'file', 'mapping' nor 'fsdata' arguments, remove them. Signed-off-by: "Matthew Wilcox (Oracle)" <willy@infradead.org> Link: https://lore.kernel.org/20250624132130.1590285-1-willy@infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 | fhandle: reflow get_path_anchor() | Christian Brauner
Switch to a more common coding style. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-5-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 | pidfs: add pidfs_root_path() helper | Christian Brauner
Allow to return the root of the global pidfs filesystem. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-4-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 | fhandle: rename to get_path_anchor() | Christian Brauner
Rename as we're going to expand the function in the next step. The path just serves as the anchor tying the decoding to the filesystem. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-3-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 | fhandle: hoist copy_from_user() above get_path_from_fd() | Christian Brauner
In follow-up patches we need access to @file_handle->handle_type before we start caring about get_path_from_fd(). Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-2-d02a04858fe3@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 | fhandle: raise FILEID_IS_DIR in handle_type | Christian Brauner
Currently FILEID_IS_DIR is raised in fh_flags, which is wrong. Raise it in handle->handle_type, where it's supposed to be. Link: https://lore.kernel.org/20250624-work-pidfs-fhandle-v2-1-d02a04858fe3@kernel.org Fixes: c374196b2b9f ("fs: name_to_handle_at() support for "explicit connectable" file handles") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Cc: stable@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-24 | fuse: fix fuse_fill_write_pages() upper bound calculation | Joanne Koong
This fixes a bug in commit 63c69ad3d18a ("fuse: refactor fuse_fill_write_pages()") where max_pages << PAGE_SHIFT is mistakenly used as the calculation for the max_pages upper limit but there's the possibility that copy_folio_from_iter_atomic() may copy over bytes from the iov_iter that are less than the full length of the folio, which would lead to exceeding max_pages. This commit fixes it by adding a 'ap->num_folios < max_folios' check. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/20250614000114.910380-1-joannelkoong@gmail.com Fixes: 63c69ad3d18a ("fuse: refactor fuse_fill_write_pages()") Tested-by: Brian Foster <bfoster@redhat.com> Reported-by: Brian Foster <bfoster@redhat.com> Closes: https://lore.kernel.org/linux-fsdevel/aEq4haEQScwHIWK6@bfoster/ Signed-off-by: Christian Brauner <brauner@kernel.org>
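The bound problem described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the fuse code: `fill_folios()`, `SKETCH_PAGE_SIZE` and the `check_folio_count` switch are hypothetical names invented for the illustration. It shows that a byte-based limit alone lets the loop consume more folio slots than allowed once copies are shorter than a full folio.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed folio size for this illustration */

/*
 * Hypothetical model of a fill loop. Each iteration consumes one folio slot
 * but copies at most 'short_copy' bytes, modelling a copy that returns fewer
 * bytes than a full folio. Returns the number of folio slots consumed.
 */
static unsigned int fill_folios(size_t total_bytes, size_t short_copy,
                                unsigned int max_folios, int check_folio_count)
{
    size_t max_bytes = (size_t)max_folios * SKETCH_PAGE_SIZE;
    size_t copied = 0;
    unsigned int num_folios = 0;

    assert(short_copy > 0 && short_copy <= SKETCH_PAGE_SIZE);

    while (copied < total_bytes && copied < max_bytes) {
        /* Analogue of the added 'num_folios < max_folios' check. */
        if (check_folio_count && num_folios >= max_folios)
            break;

        size_t n = total_bytes - copied;
        if (n > short_copy)
            n = short_copy;
        copied += n;
        num_folios++;
    }
    return num_folios;
}
```

With a limit of 4 folios and copies of 2048 bytes each, the byte-only bound lets the model use 8 slots, while the additional count check stops it at 4.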
2025-06-23 | net: make sk->sk_rcvtimeo lockless | Eric Dumazet
Followup of commit 285975dd6742 ("net: annotate data-races around sk->sk_{rcv|snd}timeo"). Remove lock_sock()/release_sock() from ksmbd_tcp_rcv_timeout() and add READ_ONCE()/WRITE_ONCE() where it is needed. Also SO_RCVTIMEO_OLD and SO_RCVTIMEO_NEW can call sock_set_timeout() without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250620155536.335520-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-23 | f2fs: compress: fix UAF of f2fs_inode_info in f2fs_free_dic | Zhiguo Niu
The decompress_io_ctx may be released asynchronously after I/O completion. If the file is deleted immediately after a read, and the kworker processing post_read_wq has not run yet due to high workload, the inode (f2fs_inode_info) can be evicted and freed before it is used in f2fs_free_dic.

The UAF case is as below:

Thread A:
 - f2fs_decompress_end_io
  - f2fs_put_dic
   - queue_work
     add free_dic work to post_read_wq
Thread B:
 - do_unlink
  - iput
   - evict
    - call_rcu
This file is deleted after read.
Thread C:
 - rcu_do_batch
  - f2fs_free_inode
   - kmem_cache_free
  inode is freed by rcu
kworker to process post_read_wq:
 - process_scheduled_works
  - f2fs_late_free_dic
   - f2fs_free_dic
    - f2fs_release_decomp_mem
      read (dic->inode)->i_compress_algorithm

This patch stores compress_algorithm and sbi in dic to avoid the inode UAF. In addition, the previous solution, deprecated in [1], may cause a system hang.

[1] https://lore.kernel.org/all/c36ab955-c8db-4a8b-a9d0-f07b5f426c3f@kernel.org
Cc: Daeho Jeong <daehojeong@google.com>
Fixes: bff139b49d9f ("f2fs: handle decompress only post processing in softirq")
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Signed-off-by: Baocong Liu <baocong.liu@unisoc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | f2fs: compress: change the first parameter of page_array_{alloc,free} to sbi | Zhiguo Niu
No logic changes, just cleanup and prepare for fixing the UAF issue in f2fs_free_dic. Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com> Signed-off-by: Baocong Liu <baocong.liu@unisoc.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | f2fs: introduce reserved_pin_section sysfs entry | Chao Yu
This patch introduces /sys/fs/f2fs/<dev>/reserved_pin_section for tuning the @needed parameter of has_not_enough_free_secs(). If it is set to zero, f2fs_gc() is avoided as much as possible while fallocating on a pinned file. Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: wangzijie <wangzijie1@honor.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | f2fs: fix to avoid invalid wait context issue | Chao Yu
=============================
[ BUG: Invalid wait context ]
6.13.0-rc1 #84 Tainted: G O
-----------------------------
cat/56160 is trying to lock:
ffff888105c86648 (&cprc->stat_lock){+.+.}-{3:3}, at: update_general_status+0x32a/0x8c0 [f2fs]
other info that might help us debug this:
context-{5:5}
2 locks held by cat/56160:
 #0: ffff88810a002a98 (&p->lock){+.+.}-{4:4}, at: seq_read_iter+0x56/0x4c0
 #1: ffffffffa0462638 (f2fs_stat_lock){....}-{2:2}, at: stat_show+0x29/0x1020 [f2fs]
stack backtrace:
CPU: 0 UID: 0 PID: 56160 Comm: cat Tainted: G O 6.13.0-rc1 #84
Tainted: [O]=OOT_MODULE
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
Call Trace:
 <TASK>
 dump_stack_lvl+0x88/0xd0
 dump_stack+0x14/0x20
 __lock_acquire+0x8d4/0xbb0
 lock_acquire+0xd6/0x300
 _raw_spin_lock+0x38/0x50
 update_general_status+0x32a/0x8c0 [f2fs]
 stat_show+0x50/0x1020 [f2fs]
 seq_read_iter+0x116/0x4c0
 seq_read+0xfa/0x130
 full_proxy_read+0x66/0x90
 vfs_read+0xc4/0x350
 ksys_read+0x74/0xf0
 __x64_sys_read+0x1d/0x20
 x64_sys_call+0x17d9/0x1b80
 do_syscall_64+0x68/0x130
 entry_SYSCALL_64_after_hwframe+0x67/0x6f
RIP: 0033:0x7f2ca53147e2

- seq_read
 - stat_show
  - raw_spin_lock_irqsave(&f2fs_stat_lock, flags) : f2fs_stat_lock is raw_spinlock_t type variable
  - update_general_status
   - spin_lock(&sbi->cprc_info.stat_lock); : stat_lock is spinlock_t type variable

The root cause is that the lock order is incorrect [1]: we should not acquire a spinlock_t lock after a raw_spinlock_t lock, because if CONFIG_PREEMPT_LOCK is on, spinlock_t is implemented on top of rtmutex, which can sleep after holding the lock.

To fix this issue, change the f2fs_stat_lock lock type from raw_spinlock_t to spinlock_t. This is safe because:
- we don't need the raw version of spinlock, as the path is not performance sensitive.
- we don't need the irqsave version of spinlock, as it won't be used in irq context.

Quoted from [1]:

"Extend lockdep to validate lock wait-type context.

The current wait-types are:

 LD_WAIT_FREE, /* wait free, rcu etc.. */
 LD_WAIT_SPIN, /* spin loops, raw_spinlock_t etc.. */
 LD_WAIT_CONFIG, /* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
 LD_WAIT_SLEEP, /* sleeping locks, mutex_t etc.. */

Where lockdep validates that the current lock (the one being acquired) fits in the current wait-context (as generated by the held stack). This ensures that there is no attempt to acquire mutexes while holding spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In other words, its a more fancy might_sleep()."

[1] https://lore.kernel.org/all/20200321113242.427089655@linutronix.de
Fixes: 98237fcda4a2 ("f2fs: use spin_lock to avoid hang")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
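The wait-type rule quoted above can be modelled in a few lines of userspace C. This is a simplified sketch, not lockdep's implementation: `wait_context_ok()` is a hypothetical helper, and numbering the enum from 1 is an assumption made so that the values line up with the `{n:n}` annotations in the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Wait types in the order quoted in the commit message. */
enum ld_wait {
    LD_WAIT_FREE = 1, /* wait free, rcu etc. (starting at 1 is an assumption) */
    LD_WAIT_SPIN,     /* spin loops, raw_spinlock_t etc. */
    LD_WAIT_CONFIG,   /* CONFIG_PREEMPT_LOCK, spinlock_t etc. */
    LD_WAIT_SLEEP,    /* sleeping locks, mutex_t etc. */
};

/*
 * Simplified rule assumed by this sketch: the lock being acquired must not
 * have a heavier wait type than any lock that is already held. With no
 * locks held, any lock may be acquired.
 */
static bool wait_context_ok(const enum ld_wait *held, size_t nheld,
                            enum ld_wait next)
{
    for (size_t i = 0; i < nheld; i++)
        if (next > held[i])
            return false;
    return true;
}
```

In this model, holding a sleeping lock plus a raw spinlock and then taking a spinlock_t-class lock is rejected, which mirrors the report; once the second held lock is itself of the spinlock_t class, the same acquisition is accepted.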
2025-06-23 | f2fs: fix bio memleak when committing super block | Sheng Yong
When committing new super block, bio is allocated but not freed, and kmemleak complains:

unreferenced object 0xffff88801d185600 (size 192):
  comm "kworker/3:2", pid 128, jiffies 4298624992
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 80 67 c3 00 81 88 ff ff  .........g......
    01 08 06 00 00 00 00 00 00 00 00 00 01 00 00 00  ................
  backtrace (crc 650ecdb1):
    kmem_cache_alloc_noprof+0x3a9/0x460
    mempool_alloc_noprof+0x12f/0x310
    bio_alloc_bioset+0x1e2/0x7e0
    __f2fs_commit_super+0xe0/0x370
    f2fs_commit_super+0x4ed/0x8c0
    f2fs_record_error_work+0xc7/0x190
    process_one_work+0x7db/0x1970
    worker_thread+0x518/0xea0
    kthread+0x359/0x690
    ret_from_fork+0x34/0x70
    ret_from_fork_asm+0x1a/0x30

The issue can be reproduced by:

  mount /dev/vda /mnt
  i=0
  while :; do
      echo '[h]abc' > /sys/fs/f2fs/vda/extension_list
      echo '[h]!abc' > /sys/fs/f2fs/vda/extension_list
      echo scan > /sys/kernel/debug/kmemleak
      dmesg | grep "new suspected memory leaks"
      [ $? -eq 0 ] && break
      i=$((i + 1))
      echo "$i"
  done
  umount /mnt

Fixes: 5bcde4557862 ("f2fs: get rid of buffer_head use")
Signed-off-by: Sheng Yong <shengyong1@xiaomi.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | f2fs: do sanity check on fio.new_blkaddr in do_write_page() | Chao Yu
F2FS-fs (dm-55): access invalid blkaddr:972878540
Call trace:
 dump_backtrace+0xec/0x128
 show_stack+0x18/0x28
 dump_stack_lvl+0x40/0x88
 dump_stack+0x18/0x24
 __f2fs_is_valid_blkaddr+0x360/0x3b4
 f2fs_is_valid_blkaddr+0x10/0x20
 f2fs_get_node_info+0x21c/0x60c
 __write_node_page+0x15c/0x734
 f2fs_sync_node_pages+0x4f8/0x700
 f2fs_write_checkpoint+0x4a8/0x99c
 __checkpoint_and_complete_reqs+0x7c/0x20c
 issue_checkpoint_thread+0x4c/0xd8
 kthread+0x11c/0x1b0
 ret_from_fork+0x10/0x20

If f2fs_allocate_data_block() fails, we may update nat.blkaddr w/ uninitialized fio.new_blkaddr.

- __write_node_folio
 - f2fs_do_write_node_page
  - do_write_page
   - f2fs_allocate_data_block
   : once it fails, it may not allocate new blkaddr
   - set_node_addr
   : update w/ uninitialized fio.new_blkaddr variable

I've checked all error paths in f2fs_allocate_data_block(); it should be tagged w/ CP_ERROR_FLAG. In addition, if f2fs_allocate_data_block() succeeds, fio.new_blkaddr should be valid. Let's add f2fs_bug_on() to check the above two conditions and detect any potential bugs.

Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | f2fs: handle nat.blkaddr corruption in f2fs_get_node_info() | Chao Yu
F2FS-fs (dm-55): access invalid blkaddr:972878540 Call trace: dump_backtrace+0xec/0x128 show_stack+0x18/0x28 dump_stack_lvl+0x40/0x88 dump_stack+0x18/0x24 __f2fs_is_valid_blkaddr+0x360/0x3b4 f2fs_is_valid_blkaddr+0x10/0x20 f2fs_get_node_info+0x21c/0x60c __write_node_page+0x15c/0x734 f2fs_sync_node_pages+0x4f8/0x700 f2fs_write_checkpoint+0x4a8/0x99c __checkpoint_and_complete_reqs+0x7c/0x20c issue_checkpoint_thread+0x4c/0xd8 kthread+0x11c/0x1b0 ret_from_fork+0x10/0x20 If nat.blkaddr is corrupted, during checkpoint, f2fs_sync_node_pages() will loop to flush node page w/ corrupted nat.blkaddr. Although, it tags SBI_NEED_FSCK, checkpoint can not persist it due to deadloop. Let's call f2fs_handle_error(, ERROR_INCONSISTENT_NAT) to record such error into superblock, it expects fsck can detect the error and repair inconsistent nat.blkaddr after device reboot. Note that, let's add sanity check in f2fs_get_node_info() to detect in-memory nat.blkaddr inconsistency, but only if CONFIG_F2FS_CHECK_FS is enabled. Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | f2fs: turn off one_time when forcibly set to foreground GC | Daeho Jeong
one_time mode is only for background GC. So, we need to set it back to false when foreground GC is enforced. Fixes: 9748c2ddea4a ("f2fs: do FG_GC when GC boosting is required for zoned devices") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | f2fs: make sure zoned device GC to use FG_GC in shortage of free section | Daeho Jeong
We already use FG_GC when we have free sections under gc_boost_zoned_gc_percent. So, let's make it consistent. Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-06-23 | Merge tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs | Linus Torvalds
Pull f2fs fixes from Jaegeuk Kim:
 - fix double-unlock introduced by the recent folio conversion
 - fix stale page content beyond EOF complained by xfstests/generic/363

* tag 'f2fs-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: fix to zero post-eof page
  f2fs: Fix __write_node_folio() conversion
2025-06-23 | Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux | Linus Torvalds
Pull btrfs fixes from David Sterba:
 "Fixes:
  - fix invalid inode pointer dereferences during log replay
  - fix a race between renames and directory logging
  - fix shutting down delayed iput worker
  - fix device byte accounting when dropping chunk
  - in zoned mode, fix offset calculations for DUP profile when conventional and sequential zones are used together

  Regression fixes:
  - fix possible double unlock of extent buffer tree (xarray conversion)
  - in zoned mode, fix extent buffer refcount when writing out extents (xarray conversion)

  Error handling fixes and updates:
  - handle unexpected extent type when replaying log
  - check and warn if there are remaining delayed inodes when putting a root
  - fix assertion when building free space tree
  - handle csum tree error with mount option 'rescue=ibadroot'

  Other:
  - error message updates: add prefix to all scrub related messages, include other information in messages"

* tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix alloc_offset calculation for partly conventional block groups
  btrfs: handle csum tree error with rescue=ibadroots correctly
  btrfs: fix race between async reclaim worker and close_ctree()
  btrfs: fix assertion when building free space tree
  btrfs: don't silently ignore unexpected extent type when replaying log
  btrfs: fix invalid inode pointer dereferences during log replay
  btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb
  btrfs: update superblock's device bytes_used when dropping chunk
  btrfs: fix a race between renames and directory logging
  btrfs: scrub: add prefix for the error messages
  btrfs: warn if leaking delayed_nodes in btrfs_put_root()
  btrfs: fix delayed ref refcount leak in debug assertion
  btrfs: include root in error message when unlinking inode
  btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-06-23 | attach_recursive_mnt(): do not lock the covering tree when sliding something under it | Al Viro
If we are propagating across the userns boundary, we need to lock the mounts added there. However, in case when something has already been mounted there and we end up sliding a new tree under that, the stuff that had been there before should not get locked. IOW, lock_mnt_tree() should be called before we reparent the preexisting tree on top of what we are adding. Fixes: 3bd045cc9c4b ("separate copying and locking mount tree on cross-userns copies") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-06-23 | replace collect_mounts()/drop_collected_mounts() with a safer variant | Al Viro
collect_mounts() has several problems - one can't iterate over the results directly, so it has to be done with callback passed to iterate_mounts(); it has an oopsable race with d_invalidate(); it creates temporary clones of mounts invisibly for sync umount (IOW, you can have non-lazy umount succeed leaving filesystem not mounted anywhere and yet still busy).

A saner approach is to give caller an array of struct path that would pin every mount in a subtree, without cloning any mounts.

* collect_mounts()/drop_collected_mounts()/iterate_mounts() is gone
* collect_paths(where, preallocated, size) gives either ERR_PTR(-E...) or a pointer to array of struct path, one for each chunk of tree visible under 'where' (i.e. the first element is a copy of where, followed by (mount,root) for everything mounted under it - the same set collect_mounts() would give). Unlike collect_mounts(), the mounts are *not* cloned - we just get pinning references to the roots of subtrees in the caller's namespace. Array is terminated by {NULL, NULL} struct path. If it fits into preallocated array (on-stack, normally), that's where it goes; otherwise it's allocated by kmalloc_array(). Passing 0 as size means that 'preallocated' is ignored (and expected to be NULL).
* drop_collected_paths(paths, preallocated) is given the array returned by an earlier call of collect_paths() and the preallocated array passed to that call. All mount/dentry references are dropped and array is kfree'd if it's not equal to 'preallocated'.
* instead of iterate_mounts(), users should just iterate over array of struct path - nothing exotic is needed for that. Existing users (all in audit_tree.c) are converted.

[folded a fix for braino reported by Venkat Rao Bagalkote <venkat88@linux.ibm.com>]

Fixes: 80b5dce8c59b0 ("vfs: Add a function to lazily unmount all mounts from any dentry")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
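The calling convention described for collect_paths()/drop_collected_paths() (preallocated buffer when the result fits, heap otherwise, a {NULL, NULL} terminator, free only if the result is not the caller's buffer) can be sketched in userspace. Everything below is a hypothetical analogue: `struct sketch_path`, `collect_sketch()`, `drop_sketch()` and `count_sketch()` are invented names, strings stand in for mounts and dentries, and NULL is returned where the kernel function is described as returning ERR_PTR().

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct path; both members NULL marks the end. */
struct sketch_path {
    const char *mnt;
    const char *dentry;
};

/*
 * Copy 'n' entries plus a {NULL, NULL} terminator into 'prealloc' when it
 * has room for them, otherwise into a freshly allocated array.
 */
static struct sketch_path *collect_sketch(const struct sketch_path *src, size_t n,
                                          struct sketch_path *prealloc, size_t size)
{
    struct sketch_path *res = prealloc;

    if (n + 1 > size) {
        res = calloc(n + 1, sizeof(*res));
        if (!res)
            return NULL;
    }
    for (size_t i = 0; i < n; i++)
        res[i] = src[i];
    res[n] = (struct sketch_path){ NULL, NULL };
    return res;
}

/* Free the array only when it is not the caller's preallocated buffer. */
static void drop_sketch(struct sketch_path *paths, struct sketch_path *prealloc)
{
    if (paths != prealloc)
        free(paths);
}

/* Callers simply walk the array until they reach the terminator. */
static size_t count_sketch(const struct sketch_path *paths)
{
    size_t k = 0;

    while (paths[k].mnt || paths[k].dentry)
        k++;
    return k;
}
```

The point of the pattern is that the common small case needs no allocation, while the caller's cleanup code stays identical in both cases because the drop function compares against the buffer it was originally given.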
2025-06-23 | fs/ntfs3: cancle set bad inode after removing name fails | Edward Adam Davis
The reproducer uses a file0 on a ntfs3 file system with a corrupted i_link. When renaming, the file0's inode is marked as a bad inode because the file name cannot be deleted. The underlying bug is that make_bad_inode() is called on a live inode. In some cases it's "icache lookup finds a normal inode, d_splice_alias() is called to attach it to dentry, while another thread decides to call make_bad_inode() on it - that would evict it from icache, but we'd already found it there earlier". In some it's outright "we have an inode attached to dentry - that's how we got it in the first place; let's call make_bad_inode() on it just for shits and giggles". Fixes: 78ab59fee07f ("fs/ntfs3: Rework file operations") Reported-by: syzbot+1aa90f0eb1fc3e77d969@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1aa90f0eb1fc3e77d969 Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23 | fs/ntfs3: Add sanity check for file name | Lizhi Xu
The length of the file name should be smaller than the directory entry size. Reported-by: syzbot+598057afa0f49e62bd23@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=598057afa0f49e62bd23 Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23 | fs/ntfs3: correctly create symlink for relative path | Rong Zhang
After applying this patch, a symlink with a relative target can be created correctly: ln -s "relative/path/to/file" symlink Signed-off-by: Rong Zhang <ulin0208@gmail.com> [almaz.alexandrovich@paragon-software.com: added cpu_to_le32 macro to rs->Flags assignment] Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23 | fs/ntfs3: fix symlinks cannot be handled correctly | Rong Zhang
Symlinks created in Windows are broken when accessed on Linux through ntfs3; this patch fixes that. Signed-off-by: Rong Zhang <ulin0208@gmail.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-06-23 | NFSv4/pNFS: Fix a race to wake on NFS_LAYOUT_DRAIN | Benjamin Coddington
We found a few different systems hung up in writeback waiting on the same page lock, and one task waiting on the NFS_LAYOUT_DRAIN bit in pnfs_update_layout(), however the pnfs_layout_hdr's plh_outstanding count was zero. It seems most likely that this is another race between the waiter and waker similar to commit ed0172af5d6f ("SUNRPC: Fix a race to wake a sync task"). Fix it up by applying the advised barrier. Fixes: 880265c77ac4 ("pNFS: Avoid a live lock condition in pnfs_update_layout()") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-06-23 | nfs: Clean up /proc/net/rpc/nfs when nfs_fs_proc_net_init() fails. | Kuniyuki Iwashima
syzbot reported a warning below [1] following a fault injection in nfs_fs_proc_net_init(). [0] When nfs_fs_proc_net_init() fails, /proc/net/rpc/nfs is not removed. Later, rpc_proc_exit() tries to remove /proc/net/rpc, and the warning is logged as the directory is not empty. Let's handle the error of nfs_fs_proc_net_init() properly. [0]: FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 0 CPU: 1 UID: 0 PID: 6120 Comm: syz.2.27 Not tainted 6.16.0-rc1-syzkaller-00010-g2c4a1f3fe03e #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) should_fail_ex (lib/fault-inject.c:73 lib/fault-inject.c:174) should_failslab (mm/failslab.c:46) kmem_cache_alloc_noprof (mm/slub.c:4178 mm/slub.c:4204) __proc_create (fs/proc/generic.c:427) proc_create_reg (fs/proc/generic.c:554) proc_create_net_data (fs/proc/proc_net.c:120) nfs_fs_proc_net_init (fs/nfs/client.c:1409) nfs_net_init (fs/nfs/inode.c:2600) ops_init (net/core/net_namespace.c:138) setup_net (net/core/net_namespace.c:443) copy_net_ns (net/core/net_namespace.c:576) create_new_namespaces (kernel/nsproxy.c:110) unshare_nsproxy_namespaces (kernel/nsproxy.c:218 (discriminator 4)) ksys_unshare (kernel/fork.c:3123) __x64_sys_unshare (kernel/fork.c:3190) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) </TASK> [1]: remove_proc_entry: removing non-empty directory 'net/rpc', leaking at least 'nfs' WARNING: CPU: 1 PID: 6120 at fs/proc/generic.c:727 remove_proc_entry+0x45e/0x530 fs/proc/generic.c:727 Modules linked in: CPU: 1 UID: 0 PID: 6120 Comm: syz.2.27 Not tainted 6.16.0-rc1-syzkaller-00010-g2c4a1f3fe03e #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 RIP: 0010:remove_proc_entry+0x45e/0x530 fs/proc/generic.c:727 Code: 3c 02 00 0f 85 85 
00 00 00 48 8b 93 d8 00 00 00 4d 89 f0 4c 89 e9 48 c7 c6 40 ba a2 8b 48 c7 c7 60 b9 a2 8b e8 33 81 1d ff 90 <0f> 0b 90 90 e9 5f fe ff ff e8 04 69 5e ff 90 48 b8 00 00 00 00 00 RSP: 0018:ffffc90003637b08 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff88805f534140 RCX: ffffffff817a92c8 RDX: ffff88807da99e00 RSI: ffffffff817a92d5 RDI: 0000000000000001 RBP: ffff888033431ac0 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff888033431a00 R13: ffff888033431ae4 R14: ffff888033184724 R15: dffffc0000000000 FS: 0000555580328500(0000) GS:ffff888124a62000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f71733743e0 CR3: 000000007f618000 CR4: 00000000003526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> sunrpc_exit_net+0x46/0x90 net/sunrpc/sunrpc_syms.c:76 ops_exit_list net/core/net_namespace.c:200 [inline] ops_undo_list+0x2eb/0xab0 net/core/net_namespace.c:253 setup_net+0x2e1/0x510 net/core/net_namespace.c:457 copy_net_ns+0x2a6/0x5f0 net/core/net_namespace.c:574 create_new_namespaces+0x3ea/0xa90 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:218 ksys_unshare+0x45b/0xa40 kernel/fork.c:3121 __do_sys_unshare kernel/fork.c:3192 [inline] __se_sys_unshare kernel/fork.c:3190 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3190 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xcd/0x490 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7fa1a6b8e929 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fff3a090368 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007fa1a6db5fa0 RCX: 00007fa1a6b8e929 RDX: 0000000000000000 RSI: 
0000000000000000 RDI: 0000000040000080 RBP: 00007fa1a6c10b39 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007fa1a6db5fa0 R14: 00007fa1a6db5fa0 R15: 0000000000000001 </TASK> Fixes: d47151b79e32 ("nfs: expose /proc/net/sunrpc/nfs in net namespaces") Reported-by: syzbot+a4cc4ac22daa4a71b87c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=a4cc4ac22daa4a71b87c Tested-by: syzbot+a4cc4ac22daa4a71b87c@syzkaller.appspotmail.com Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2025-06-23 | smb: client: fix regression with native SMB symlinks | Paulo Alcantara
Some users and customers reported that their backup/copy tools started to fail when the directory being copied contained symlink targets that the client couldn't parse - even when those symlinks weren't followed. Fix this by allowing lstat(2) and readlink(2) to succeed even when the client can't resolve the symlink target, restoring old behavior. Cc: linux-cifs@vger.kernel.org Cc: stable@vger.kernel.org Reported-by: Remy Monsen <monsen@monsen.cc> Closes: https://lore.kernel.org/r/CAN+tdP7y=jqw3pBndZAGjQv0ObFq8Q=+PUDHgB36HdEz9QA6FQ@mail.gmail.com Reported-by: Pierguido Lambri <plambri@redhat.com> Fixes: 12b466eb52d9 ("cifs: Fix creating and resolving absolute NT-style symlinks") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-06-23 | pidfs: fix pidfs_free_pid() | Christian Brauner
Ensure that we handle the case where task creation fails and pid->attr was never accessed at all. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 | fs/ecryptfs: replace snprintf with sysfs_emit in show function | Ankit Chauhan
Use sysfs_emit() instead of snprintf() in version_show() function to follow the preferred kernel API. Signed-off-by: Ankit Chauhan <ankitchauhan2065@gmail.com> Link: https://lore.kernel.org/20250619031536.19352-1-ankitchauhan2065@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 | bpf: Introduce bpf_cgroup_read_xattr to read xattr of cgroup's node | Song Liu
BPF programs, such as LSM and sched_ext, would benefit from tags on cgroups. One common practice to apply such tags is to set xattrs on cgroupfs folders. Introduce kfunc bpf_cgroup_read_xattr, which allows reading cgroup's xattr. Note that, we already have bpf_get_[file|dentry]_xattr. However, these two APIs are not ideal for reading cgroupfs xattrs, because: 1) These two APIs only works in sleepable contexts; 2) There is no kfunc that matches current cgroup to cgroupfs dentry. bpf_cgroup_read_xattr is generic and can be useful for many program types. It is also safe, because it requires trusted or rcu protected argument (KF_RCU). Therefore, we make it available to all program types. Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-3-song@kernel.org Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 | kernfs: remove iattr_mutex | Christian Brauner
All allocations of struct kernfs_iattrs are serialized through a global mutex. Simply do a racy allocation and let the first one win. I bet most callers are under inode->i_rwsem anyway and it wouldn't be needed but let's not require that. Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/20250623063854.1896364-2-song@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
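The "racy allocation, first one wins" idea can be sketched in userspace with C11 atomics. This is an illustration only: the commit text does not say which primitive kernfs uses, and `struct attrs`, `struct node` and `node_attrs()` are hypothetical names standing in for the kernel structures.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct attrs { int mode; };       /* hypothetical stand-in for the attribute block */

struct node {
    _Atomic(struct attrs *) iattr; /* NULL until somebody publishes an allocation */
};

/*
 * Allocate without holding a lock, then try to publish the result with a
 * compare-and-swap. If another caller published first, free our copy and
 * return theirs, so every caller ends up with the same pointer.
 */
static struct attrs *node_attrs(struct node *n)
{
    struct attrs *cur = atomic_load(&n->iattr);
    if (cur)
        return cur;

    struct attrs *fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;

    struct attrs *expected = NULL;
    if (atomic_compare_exchange_strong(&n->iattr, &expected, fresh))
        return fresh;  /* we won the race */

    free(fresh);       /* somebody else won; 'expected' now holds their pointer */
    return expected;
}
```

The cost of this design is an occasional wasted allocation under contention; in exchange, the common path takes no lock at all.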
2025-06-23 | ext4: add FALLOC_FL_WRITE_ZEROES support | Zhang Yi
Add support for FALLOC_FL_WRITE_ZEROES if the underlying device enables the unmap write zeroes operation. This first allocates blocks as unwritten, then issues a zero command outside of the running journal handle, and finally converts them to a written state. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-10-yi.zhang@huaweicloud.com Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23 | fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate | Zhang Yi
With the development of flash-based storage devices, we can quickly write zeros to SSDs using the WRITE_ZERO command if the devices do not actually write physical zeroes to the media. Therefore, we can use this command to quickly preallocate a real all-zero file with written extents. This approach should be beneficial for subsequent pure overwriting within this file, as it can save on block allocation and, consequently, significant metadata changes, which should greatly improve overwrite performance on certain filesystems.

Therefore, introduce a new operation FALLOC_FL_WRITE_ZEROES to fallocate. This flag is used to convert a specified range of a file to zeros by issuing a zeroing operation. Blocks should be allocated for the regions that span holes in the file, and the entire range is converted to written extents. If the underlying device supports the actual offload write zeroes command, the zeroing operation can be accelerated. If it does not, we currently don't prevent the file system from writing actual zeros to the device. This provides users with a new method to quickly generate a zeroed file; users no longer need to write zero data to create a file with written extents.

Users can determine whether a disk supports the unmap write zeroes feature by querying this sysfs interface:

    /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes

Users can also enable or disable the unmap write zeroes operation through this sysfs interface:

    /sys/block/<disk>/queue/write_zeroes_unmap_max_bytes

Finally, this flag cannot be specified in conjunction with FALLOC_FL_KEEP_SIZE, since allocating written extents beyond file EOF is not permitted. In addition, filesystems that always require out-of-place writes should not support this flag, since they still need to allocate new blocks during subsequent overwrites.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-7-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
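A minimal userspace sketch of how a caller might use the flag described above. It assumes the value 0x80 for FALLOC_FL_WRITE_ZEROES when the installed headers predate the flag (verify against linux/falloc.h on your system). The helper names write_zeroes_mode_ok(), prealloc_zeroed() and read_unmap_limit() are hypothetical and exist only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

/* Assumption: fallback values for headers that predate the flag.
 * Verify against linux/falloc.h before relying on them. */
#ifndef FALLOC_FL_WRITE_ZEROES
#define FALLOC_FL_WRITE_ZEROES 0x80
#endif
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE 0x01
#endif

/* Per the commit message, WRITE_ZEROES cannot be combined with KEEP_SIZE.
 * Returns 1 if the mode passes this one check, 0 otherwise. */
static int write_zeroes_mode_ok(int mode)
{
	if (!(mode & FALLOC_FL_WRITE_ZEROES))
		return 1;
	return !(mode & FALLOC_FL_KEEP_SIZE);
}

/* Hypothetical helper: reject the documented bad combination locally,
 * then let the kernel decide. Returns 0 on success, -1 with errno set. */
static int prealloc_zeroed(int fd, int mode, off_t offset, off_t len)
{
	if (!write_zeroes_mode_ok(mode)) {
		errno = EINVAL;
		return -1;
	}
	return fallocate(fd, mode, offset, len);
}

/* Reads /sys/block/<disk>/queue/write_zeroes_unmap_max_hw_bytes.
 * Returns the value, or -1 if the file is missing or unparsable. */
static long long read_unmap_limit(const char *disk)
{
	char path[256];
	long long val = -1;
	FILE *f;

	assert(disk != NULL);
	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/write_zeroes_unmap_max_hw_bytes", disk);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%lld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

A caller would typically check read_unmap_limit() first and then call prealloc_zeroed(fd, FALLOC_FL_WRITE_ZEROES, 0, len). On kernels or filesystems without support, fallocate() is expected to fail and the caller should fall back to writing zeros itself.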
2025-06-23fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypassShivank Garg
Export anon_inode_make_secure_inode() to allow KVM guest_memfd to create anonymous inodes with proper security context. This replaces the current pattern of calling alloc_anon_inode() followed by inode_init_security_anon() for creating security context manually. This change also fixes a security regression in secretmem where the S_PRIVATE flag was not cleared after alloc_anon_inode(), causing LSM/SELinux checks to be bypassed for secretmem file descriptors. As guest_memfd currently resides in the KVM module, we need to export this symbol for use outside the core kernel. In the future, guest_memfd might be moved to core-mm, at which point the symbols no longer would have to be exported. When/if that happens is still unclear. Fixes: 2bfe15c52612 ("mm: create security context for memfd_secret inodes") Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Signed-off-by: Shivank Garg <shivankg@amd.com> Link: https://lore.kernel.org/20250620070328.803704-3-shivankg@amd.com Acked-by: "Mike Rapoport (Microsoft)" <rppt@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23fs: annotate suspected data race between poll_schedule_timeout() and pollwake()Dmitry Antipov
When running almost any select()/poll() workload intense enough, KCSAN is likely to report data races around using 'triggered' flag of 'struct poll_wqueues'. For example, running 'find /' on a tty console may trigger the following: BUG: KCSAN: data-race in poll_schedule_timeout / pollwake write to 0xffffc900030cfb90 of 4 bytes by task 97 on cpu 5: pollwake+0xd1/0x130 __wake_up_common_lock+0x7f/0xd0 n_tty_receive_buf_common+0x776/0xc30 n_tty_receive_buf2+0x3d/0x60 tty_ldisc_receive_buf+0x6b/0x100 tty_port_default_receive_buf+0x63/0xa0 flush_to_ldisc+0x169/0x3c0 process_scheduled_works+0x6fe/0xf40 worker_thread+0x53b/0x7b0 kthread+0x4f8/0x590 ret_from_fork+0x28c/0x450 ret_from_fork_asm+0x1a/0x30 read to 0xffffc900030cfb90 of 4 bytes by task 5802 on cpu 4: poll_schedule_timeout+0x96/0x160 do_sys_poll+0x966/0xb30 __se_sys_ppoll+0x1c3/0x210 __x64_sys_ppoll+0x71/0x90 x64_sys_call+0x3079/0x32b0 do_syscall_64+0xfa/0x3b0 entry_SYSCALL_64_after_hwframe+0x77/0x7f According to Jan, "there's no practical issue here because it is hard to imagine how the compiler could compile the above code using some intermediate values stored into 'triggered' or multiple fetches from 'triggered'". Nevertheless, silence KCSAN by using WRITE_ONCE() in __pollwake() and READ_ONCE() in poll_schedule_timeout(), respectively. Link: https://lore.kernel.org/linux-fsdevel/bwx72orsztfjx6aoftzzkl7wle3hi4syvusuwc7x36nw6t235e@bjwrosehblty Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://lore.kernel.org/20250620063059.1800689-1-dmantipov@yandex.ru Acked-by: Marco Elver <elver@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23pidfs: add some CONFIG_DEBUG_VFS assertsChristian Brauner
Allow catching some obvious bugs. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-16-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23pidfs: support xattrs on pidfdsChristian Brauner
Now that we have a way to persist information for pidfs dentries we can start supporting extended attributes on pidfds. This will allow userspace to attach meta information to tasks. One natural extension would be to introduce a custom pidfs.* extended attribute space and allow for the inheritance of extended attributes across fork() and exec(). The first simple scheme will allow privileged userspace to set trusted extended attributes on pidfs inodes. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-12-98f3456fd552@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
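From userspace, the "first simple scheme" might look like the sketch below. It assumes a kernel that includes this change and a caller with enough privilege to set trusted.* attributes. The helper names is_trusted_name() and tag_task() are hypothetical, and the fallback syscall number 434 is an assumption that holds on most architectures but should be checked for yours.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Assumption: fallback for old headers; 434 on most architectures. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif

/* Only names in the trusted.* namespace are covered by this sketch. */
static int is_trusted_name(const char *name)
{
	assert(name != NULL);
	return strncmp(name, "trusted.", 8) == 0 && name[8] != '\0';
}

/* Hypothetical helper: attach name=value to the task behind pid.
 * Returns 0 on success, -1 with errno set. */
static int tag_task(pid_t pid, const char *name, const char *value)
{
	int pidfd, ret, saved;

	if (!is_trusted_name(name)) {
		errno = EINVAL;
		return -1;
	}
	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;
	ret = fsetxattr(pidfd, name, value, strlen(value), 0);
	saved = errno;		/* close() may overwrite errno */
	close(pidfd);
	errno = saved;
	return ret;
}
```

On kernels without pidfs xattr support, or without the required privilege, fsetxattr() is expected to fail and tag_task() returns -1 with errno describing the reason.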
2025-06-23docs/vfs: update references to i_mutex to i_rwsemJunxuan Liao
VFS has switched to i_rwsem for ten years now (9902af79c01a: parallel lookups actual switch to rwsem), but the VFS documentation and comments still have references to i_mutex. Signed-off-by: Junxuan Liao <ljx@cs.wisc.edu> Link: https://lore.kernel.org/72223729-5471-474a-af3c-f366691fba82@cs.wisc.edu Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-23Merge 6.16-rc3 into driver-core-nextGreg Kroah-Hartman
We need the driver-core fixes from 6.16-rc3 in here as well, to build on top of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-22Merge tag 'x86_urgent_for_v6.16_rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure the array tracking which kernel text positions need to be alternatives-patched doesn't get mishandled by out-of-order modifications, leading to it overflowing and causing page faults when patching - Avoid an infinite loop when early code does a ranged TLB invalidation before the broadcast TLB invalidation count of how many pages it can flush has been read from CPUID - Fix a CONFIG_MODULES typo - Disable broadcast TLB invalidation when PTI is enabled to avoid an overflow of the bitmap tracking dynamic ASIDs which need to be flushed when the kernel switches between the user and kernel address space - Handle the case of a CPU going offline and thus reporting zeroes when reading top-level events in the resctrl code * tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/alternatives: Fix int3 handling failure from broken text_poke array x86/mm: Fix early boot use of INVPLGB x86/its: Fix an ifdef typo in its_alloc() x86/mm: Disable INVLPGB when PTI is enabled x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem
2025-06-22Merge tag 'v6.16-rc2-smb3-client-fixes-v2' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - Multichannel channel allocation fix for Kerberos mounts - Two reconnect fixes - Fix netfs_writepages crash with smbdirect/RDMA - Directory caching fix - Three minor cleanup fixes - Log error when close cached dirs fails * tag 'v6.16-rc2-smb3-client-fixes-v2' of git://git.samba.org/sfrench/cifs-2.6: smb: minor fix to use SMB2_NTLMV2_SESSKEY_SIZE for auth_key size smb: minor fix to use sizeof to initialize flags_string buffer smb: Use loff_t for directory position in cached_dirents smb: Log an error when close_all_cached_dirs fails cifs: Fix prepare_write to negotiate wsize if needed smb: client: fix max_sge overflow in smb_extract_folioq_to_rdma() smb: client: fix first command failure during re-negotiation cifs: Remove duplicate fattr->cf_dtype assignment from wsl_to_fattr() function smb: fix secondary channel creation issue with kerberos by populating hostname when adding channels
2025-06-22bcachefs: Fix range in bch2_lookup_indirect_extent() error pathKent Overstreet
Before calling bch2_indirect_extent_missing_error(), we have to calculate the missing range, which is the intersection of the reflink pointer and the non-indirect-extent we found. The calculation didn't take into account that the returned extent may span the iter position, leading to an infinite loop when we (unnecessarily) resized the extent we were returning to one that didn't extend past the offset we were looking up. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
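The range calculation can be illustrated with a generic interval sketch. This is not the bcachefs code: struct range and missing_range() are hypothetical, and ranges are assumed to be half-open [start, end). The point is that the missing range starts at the lookup position, so an extent that spans that position cannot produce a range that ends at or before it.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open range [start, end); hypothetical type for this sketch. */
struct range {
	uint64_t start;
	uint64_t end;
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * Missing range for a lookup at pos: the intersection of the pointer range
 * and the extent found, clamped so that it begins no earlier than pos.
 */
static struct range missing_range(uint64_t pos, struct range ptr,
				  struct range found)
{
	struct range r;

	/* Preconditions of this sketch: pos lies inside both ranges. */
	assert(ptr.start <= pos && pos < ptr.end);
	assert(found.start <= pos && pos < found.end);

	r.start = max_u64(pos, max_u64(ptr.start, found.start));
	r.end = min_u64(ptr.end, found.end);
	return r;
}
```

With these preconditions the result is never empty, because its end is always greater than pos. That is the property a retry loop needs in order to make forward progress.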
2025-06-22bcachefs: fix spurious error_throwKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-22bcachefs: Add missing bch2_err_class() to fileattr_set()Kent Overstreet
Make sure we return a standard error code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-06-21Merge tag 'nfsd-6.16-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Two fixes for commits in the nfsd-6.16 merge - One fix for the recently-added NFSD netlink facility - One fix for a remote SunRPC crasher * tag 'nfsd-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: sunrpc: handle SVC_GARBAGE during svc auth processing as auth error nfsd: use threads array as-is in netlink interface SUNRPC: Cleanup/fix initial rq_pages allocation NFSD: Avoid corruption of a referring call list