summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-07-08bcachefs: Don't set BCH_FS_error on transaction restartKent Overstreet
This started showing up more when we started logging the error being corrected in the journal - but __bch2_fsck_err() could return transaction restarts before that. Setting BCH_FS_error incorrectly causes recovery passes to not be cleared, among other issues. Fixes: b43f72492768 ("bcachefs: Log fsck errors in the journal") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-08ksmbd: fix potential use-after-free in oplock/lease break ackNamjae Jeon
If ksmbd_iov_pin_rsp return error, use-after-free can happen by accessing opinfo->state and opinfo_put and ksmbd_fd_put could called twice. Reported-by: Ziyan Xu <research@securitygossip.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()Al Viro
If the call of ksmbd_vfs_lock_parent() fails, we drop the parent_path references and return an error. We need to drop the write access we just got on parent_path->mnt before we drop the mount reference - callers assume that ksmbd_vfs_kern_path_locked() returns with mount write access grabbed if and only if it has returned 0. Fixes: 864fb5d37163 ("ksmbd: fix possible deadlock in smb2_open") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08smb: server: make use of rdma_destroy_qp()Stefan Metzmacher
The qp is created by rdma_create_qp() as t->cm_id->qp and t->qp is just a shortcut. rdma_destroy_qp() also calls ib_destroy_qp(cm_id->qp) internally, but it is protected by a mutex, clears the cm_id and also calls trace_cm_qp_destroy(). This should make the tracing more useful as both rdma_create_qp() and rdma_destroy_qp() are traces and it makes the code look more sane as functions from the same layer are used for the specific qp object. trace-cmd stream -e rdma_cma:cm_qp_create -e rdma_cma:cm_qp_destroy shows this now while doing a mount and unmount from a client: <...>-80 [002] 378.514182: cm_qp_create: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 pd.id=0 qp_type=RC send_wr=867 recv_wr=255 qp_num=1 rc=0 <...>-6283 [001] 381.686172: cm_qp_destroy: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 qp_num=1 Before we only saw the first line. Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <stfrench@microsoft.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers") Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Tom Talpey <tom@talpey.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08fs/libfs: don't assume blocksize <= PAGE_SIZE in generic_check_addressablePankaj Raghav
Since [1], it is possible for filesystems to have blocksize > PAGE_SIZE of the system. Remove the assumption and make the check generic for all blocksizes in generic_check_addressable(). [1] https://lore.kernel.org/linux-xfs/20240822135018.1931258-1-kernel@pankajraghav.com/ Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/20250630104018.213985-1-p.raghav@samsung.com Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08fs/buffer: remove the min and max limit checks in __getblk_slow()Pankaj Raghav
All filesystems will already check the max and min value of their block size during their initialization. __getblk_slow() is a very low-level function to have these checks. Remove them and only check for logical block size alignment. As this check with logical block size alignment might never trigger, add WARN_ON_ONCE() to the check. As WARN_ON_ONCE() will already print the stack, remove the call to dump_stack(). Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/20250626113223.181399-1-p.raghav@samsung.com Reviewed-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08fs: Prevent file descriptor table allocations exceeding INT_MAXSasha Levin
When sysctl_nr_open is set to a very high value (for example, 1073741816 as set by systemd), processes attempting to use file descriptors near the limit can trigger massive memory allocation attempts that exceed INT_MAX, resulting in a WARNING in mm/slub.c: WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288 This happens because kvmalloc_array() and kvmalloc() check if the requested size exceeds INT_MAX and emit a warning when the allocation is not flagged with __GFP_NOWARN. Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a process calls dup2(oldfd, 1073741880), the kernel attempts to allocate: - File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes - Multiple bitmaps: ~400MB - Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647) Reproducer: 1. Set /proc/sys/fs/nr_open to 1073741816: # echo 1073741816 > /proc/sys/fs/nr_open 2. Run a program that uses a high file descriptor: #include <unistd.h> #include <sys/resource.h> int main() { struct rlimit rlim = {1073741824, 1073741824}; setrlimit(RLIMIT_NOFILE, &rlim); dup2(2, 1073741880); // Triggers the warning return 0; } 3. Observe WARNING in dmesg at mm/slub.c:5027 systemd commit a8b627a introduced automatic bumping of fs.nr_open to the maximum possible value. The rationale was that systems with memory control groups (memcg) no longer need separate file descriptor limits since memory is properly accounted. However, this change overlooked that: 1. The kernel's allocation functions still enforce INT_MAX as a maximum size regardless of memcg accounting 2. Programs and tests that legitimately test file descriptor limits can inadvertently trigger massive allocations 3. The resulting allocations (>8GB) are impractical and will always fail systemd's algorithm starts with INT_MAX and keeps halving the value until the kernel accepts it. On most systems, this results in nr_open being set to 1073741816 (0x3ffffff8), which is just under 1GB of file descriptors. While processes rarely use file descriptors near this limit in normal operation, certain selftests (like tools/testing/selftests/core/unshare_test.c) and programs that test file descriptor limits can trigger this issue. Fix this by adding a check in alloc_fdtable() to ensure the requested allocation size does not exceed INT_MAX. This causes the operation to fail with -EMFILE instead of triggering a kernel warning and avoids the impractical >8GB memory allocation request. Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open") Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org> Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08xfs: remove the bt_bdev_file buftarg fieldChristoph Hellwig
And use bt_file for both bdev and shmem backed buftargs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: rename the bt_bdev_* buftarg fieldsChristoph Hellwig
The extra bdev_ is weird, so drop it. Also improve the comment to make it clear these are the hardware limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: refactor xfs_calc_atomic_write_unit_maxChristoph Hellwig
This function and the helpers used by it duplicate the same logic for AGs and RTGs. Use the xfs_group_type enum to unify both variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: add a xfs_group_type_buftarg helperChristoph Hellwig
Generalize the xfs_group_type helper in the discard code to return a buftarg and move it to xfs_mount.h, and use the result in xfs_dax_notify_dev_failure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: remove the call to sync_blockdev in xfs_configure_buftargChristoph Hellwig
This extra call is not needed as xfs_alloc_buftarg already calls sync_blockdev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: clean up the initial read logic in xfs_readsbChristoph Hellwig
The initial sb read is always for a device logical block size buffer. The device logical block size is provided in the bt_logical_sectorsize in struct buftarg, so use that instead of the confusingly named xfs_getsize_buftarg buffer that reads it from the bdev. Update the comments surrounding the code to better describe what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: replace strncpy with memcpy in xattr listingPranav Tyagi
Use memcpy() in place of strncpy() in __xfs_xattr_put_listent(). The length is known and a null byte is added manually. No functional change intended. Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08fold fs_struct->{lock,seq} into a seqlockAl Viro
The combination of spinlock_t lock and seqcount_spinlock_t seq in struct fs_struct is an open-coded seqlock_t (see linux/seqlock_types.h). Combine and switch to equivalent seqlock_t primitives. AFAICS, that does end up with the same sequence of underlying operations in all cases. While we are at it, get_fs_pwd() is open-coded verbatim in get_path_from_fd(); rather than applying conversion to it, replace with the call of get_fs_pwd() there. Not worth splitting the commit for that, IMO... A bit of historical background - conversion of seqlock_t to use of seqcount_spinlock_t happened several months after the same had been done to struct fs_struct; switching fs_struct to seqlock_t could've been done immediately after that, but it looks like nobody had gotten around to that until now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250702053437.GC1880847@ZenIV Acked-by: Ahmed S. Darwish <darwi@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-08Revert "fs/ntfs3: Replace inode_trylock with inode_lock"Konstantin Komarov
This reverts commit 69505fe98f198ee813898cbcaf6770949636430b. Initially, conditional lock acquisition was removed to fix an xfstest bug that was observed during internal testing. The deadlock reported by syzbot is resolved by reintroducing conditional acquisition. The xfstest bug no longer occurs on kernel version 6.16-rc1 during internal testing. I assume that changes in other modules may have contributed to this. Fixes: 69505fe98f19 ("fs/ntfs3: Replace inode_trylock with inode_lock") Reported-by: syzbot+a91fcdbd2698f99db8f4@syzkaller.appspotmail.com Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-07-07bcachefs: Fix additional misalignment in journal space calculationsKent Overstreet
Additional fix on top of f54b2a80d0df bcachefs: Fix misaligned bucket check in journal space calculations Make sure that when we calculate space for the next entry it's not misaligned: we need to round_down() to filesystem block size in multiple places (next entry size calculation as well as total space available). Reported-by: Ondřej Kraus <neverberlerfellerer@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07bcachefs: Don't schedule non persistent passes persistentlyKent Overstreet
if (!(in_recovery && (flags & RUN_RECOVERY_PASS_nopersistent))) should have been if (!in_recovery && !(flags & RUN_RECOVERY_PASS_nopersistent))) But the !in_recovery part was also wrong: the assumption is that if we're in recovery we'll just rewind and run the recovery pass immediately, but we're not able to do so if we've already gone RW and the pass must be run before we go RW. In that case, we need to schedule it in the superblock so it can be run on the next mount attempt. Scheduling it persistently is fine, because it'll be cleared in the superblock immediately when the pass completes successfully. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-07fs/ntfs3: Exclude call make_bad_inode for live nodes.Konstantin Komarov
Use ntfs_inode field 'ni_bad' to mark inode as bad (if something went wrong) and to avoid any operations Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2025-07-07coredump: fix PIDFD_INFO_COREDUMP ioctl checkLaura Brehm
In Commit 1d8db6fd698de1f73b1a7d72aea578fdd18d9a87 ("pidfs, coredump: add PIDFD_INFO_COREDUMP"), the following code was added: if (mask & PIDFD_INFO_COREDUMP) { kinfo.mask |= PIDFD_INFO_COREDUMP; kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask); } [...] if (!(kinfo.mask & PIDFD_INFO_COREDUMP)) { task_lock(task); if (task->mm) kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags); task_unlock(task); } The second bit in particular looks off to me - the condition in essence checks whether PIDFD_INFO_COREDUMP was **not** requested, and if so fetches the coredump_mask in kinfo, since it's checking !(kinfo.mask & PIDFD_INFO_COREDUMP), which is unconditionally set in the earlier hunk. I'm tempted to assume the idea in the second hunk was to calculate the coredump mask if one was requested but fetched in the first hunk, in which case the check should be if ((kinfo.mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask)) which might be more legibly written as if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.coredump_mask)) This could also instead be achieved by changing the first hunk to be: if (mask & PIDFD_INFO_COREDUMP) { kinfo.coredump_mask = READ_ONCE(pidfs_i(inode)->__pei.coredump_mask); if (kinfo.coredump_mask) kinfo.mask |= PIDFD_INFO_COREDUMP; } and the second hunk to: if ((mask & PIDFD_INFO_COREDUMP) && !(kinfo.mask & PIDFD_INFO_COREDUMP)) { task_lock(task); if (task->mm) { kinfo.coredump_mask = pidfs_coredump_mask(task->mm->flags); kinfo.mask |= PIDFD_INFO_COREDUMP; } task_unlock(task); } However, when looking at this, the supposition that the second hunk means to cover cases where the coredump info was requested but the first hunk failed to get it starts getting doubtful, so apologies if I'm completely off-base. This patch addresses the issue by fixing the check in the second hunk. Signed-off-by: Laura Brehm <laurabrehm@hey.com> Link: https://lore.kernel.org/20250703120244.96908-3-laurabrehm@hey.com Cc: brauner@kernel.org Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: add coredump_skip() helperChristian Brauner
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-24-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: avoid pointless variableChristian Brauner
we don't use that value at all so don't bother with it in the first place. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-23-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: order auto cleanup variables at the topChristian Brauner
so they're easy to spot. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-22-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: add coredump_cleanup()Christian Brauner
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-21-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: auto cleanup prepare_creds()Christian Brauner
which will allow us to simplify the exit path in further patches. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-20-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: directly returnChristian Brauner
instead of jumping to a pointless cleanup label. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-18-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: auto cleanup argvChristian Brauner
to prepare for a simpler exit path. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-17-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: add coredump_write()Christian Brauner
to encapsulate that logic simplifying vfs_coredump() even further. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-16-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: use a single helper for the socketChristian Brauner
Don't split it into multiple functions. Just use a single one like we do for coredump_file() and coredump_pipe() now. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-15-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: move pipe specific file check into coredump_pipe()Christian Brauner
There's no point in having this eyesore in the middle of vfs_coredump(). Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-14-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-07coredump: split pipe coredumping into coredump_pipe()Christian Brauner
* Move that whole mess into a separate helper instead of having all that hanging around in vfs_coredump() directly. Cleanup paths are already centralized. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-13-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-06hfsplus: remove mutex_lock check in hfsplus_free_extentsYangtao Li
Syzbot reported an issue in hfsplus filesystem: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 4400 at fs/hfsplus/extents.c:346 hfsplus_free_extents+0x700/0xad0 Call Trace: <TASK> hfsplus_file_truncate+0x768/0xbb0 fs/hfsplus/extents.c:606 hfsplus_write_begin+0xc2/0xd0 fs/hfsplus/inode.c:56 cont_expand_zero fs/buffer.c:2383 [inline] cont_write_begin+0x2cf/0x860 fs/buffer.c:2446 hfsplus_write_begin+0x86/0xd0 fs/hfsplus/inode.c:52 generic_cont_expand_simple+0x151/0x250 fs/buffer.c:2347 hfsplus_setattr+0x168/0x280 fs/hfsplus/inode.c:263 notify_change+0xe38/0x10f0 fs/attr.c:420 do_truncate+0x1fb/0x2e0 fs/open.c:65 do_sys_ftruncate+0x2eb/0x380 fs/open.c:193 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd To avoid deadlock, Commit 31651c607151 ("hfsplus: avoid deadlock on file truncation") unlock extree before hfsplus_free_extents(), and add check wheather extree is locked in hfsplus_free_extents(). However, when operations such as hfsplus_file_release, hfsplus_setattr, hfsplus_unlink, and hfsplus_get_block are executed concurrently in different files, it is very likely to trigger the WARN_ON, which will lead syzbot and xfstest to consider it as an abnormality. The comment above this warning also describes one of the easy triggering situations, which can easily trigger and cause xfstest&syzbot to report errors. [task A] [task B] ->hfsplus_file_release ->hfsplus_file_truncate ->hfs_find_init ->mutex_lock ->mutex_unlock ->hfsplus_write_begin ->hfsplus_get_block ->hfsplus_file_extend ->hfsplus_ext_read_extent ->hfs_find_init ->mutex_lock ->hfsplus_free_extents WARN_ON(mutex_is_locked) !!! Several threads could try to lock the shared extents tree. And warning can be triggered in one thread when another thread has locked the tree. This is the wrong behavior of the code and we need to remove the warning. Fixes: 31651c607151f ("hfsplus: avoid deadlock on file truncation") Reported-by: syzbot+8c0bc9f818702ff75b76@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000057fa4605ef101c4c@google.com/ Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20250529061807.2213498-1-frank.li@vivo.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
2025-07-06hfs: make splice write available againYangtao Li
Since 5.10, splice() or sendfile() return EINVAL. This was caused by commit 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops"). This patch initializes the splice_write field in file_operations, like most file systems do, to restore the functionality. Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops") Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20250529140033.2296791-2-frank.li@vivo.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
2025-07-06hfsplus: make splice write available againYangtao Li
Since 5.10, splice() or sendfile() return EINVAL. This was caused by commit 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops"). This patch initializes the splice_write field in file_operations, like most file systems do, to restore the functionality. Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops") Signed-off-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Link: https://lore.kernel.org/r/20250529140033.2296791-1-frank.li@vivo.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
2025-07-06hfs: fix not erasing deleted b-tree node issueViacheslav Dubeyko
The generic/001 test of xfstests suite fails and corrupts the HFS volume: sudo ./check generic/001 FSTYP -- hfs PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0-rc2+ #3 SMP PREEMPT_DYNAMIC Fri Apr 25 17:13:00 PDT 2> MKFS_OPTIONS -- /dev/loop51 MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch generic/001 32s ... _check_generic_filesystem: filesystem on /dev/loop50 is inconsistent (see /home/slavad/XFSTESTS-2/xfstests-dev/results//generic/001.full for details) Ran: generic/001 Failures: generic/001 Failed 1 of 1 tests fsck.hfs -d -n ./test-image.bin ** ./test-image.bin (NO WRITE) Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K. Executing fsck_hfs (version 540.1-Linux). ** Checking HFS volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. Unused node is not erased (node = 2) Unused node is not erased (node = 4) <skipped> Unused node is not erased (node = 253) Unused node is not erased (node = 254) Unused node is not erased (node = 255) Unused node is not erased (node = 256) ** Checking catalog hierarchy. ** Checking volume bitmap. ** Checking volume information. Verify Status: VIStat = 0x0000, ABTStat = 0x0000 EBTStat = 0x0000 CBTStat = 0x0004 CatStat = 0x00000000 ** The volume untitled was found corrupt and needs to be repaired. volume type is HFS primary MDB is at block 2 0x02 alternate MDB is at block 20971518 0x13ffffe primary VHB is at block 0 0x00 alternate VHB is at block 0 0x00 sector size = 512 0x200 VolumeObject flags = 0x19 total sectors for volume = 20971520 0x1400000 total sectors for embedded volume = 0 0x00 This patch adds logic of clearing the deleted b-tree node. sudo ./check generic/001 FSTYP -- hfs PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0-rc2+ #3 SMP PREEMPT_DYNAMIC Fri Apr 25 17:13:00 PDT 2025 MKFS_OPTIONS -- /dev/loop51 MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch generic/001 9s ... 32s Ran: generic/001 Passed all 1 tests fsck.hfs -d -n ./test-image.bin ** ./test-image.bin (NO WRITE) Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K. Executing fsck_hfs (version 540.1-Linux). ** Checking HFS volume. The volume name is untitled ** Checking extents overflow file. ** Checking catalog file. ** Checking catalog hierarchy. ** Checking volume bitmap. ** Checking volume information. ** The volume untitled appears to be OK. Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250430001211.1912533-1-slava@dubeyko.com Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
2025-07-06Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull /proc/sys dcache lookup fix from Al Viro: "Fix for the breakage spotted by Neil in the interplay between /proc/sys ->d_compare() weirdness and parallel lookups" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix proc_sys_compare() handling of in-lookup dentries
2025-07-05Merge tag 'v6.16-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - Two reconnect fixes including one for a reboot/reconnect race - Fix for incorrect file type that can be returned by SMB3.1.1 POSIX extensions - tcon initialization fix - Fix for resolving Windows symlinks with absolute paths * tag 'v6.16-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: client: fix native SMB symlink traversal smb: client: fix race condition in negotiate timeout by using more precise timing cifs: all initializations for tcon should happen in tcon_info_alloc smb: client: fix warning when reconnecting channel smb: client: fix readdir returning wrong type with POSIX extensions
2025-07-05bcachefs: Fix bch2_btree_transactions_read() synchronizationKent Overstreet
Since we're accessing btree_trans objects owned by another thread, we need to guard against using pointers to freed key cache entries: we need our own srcu read lock, and we should skip a btree_trans if it didn't hold the srcu lock (and thus it might have pointers to freed key cache entries). 00693 Mem abort info: 00693 ESR = 0x0000000096000005 00693 EC = 0x25: DABT (current EL), IL = 32 bits 00693 SET = 0, FnV = 0 00693 EA = 0, S1PTW = 0 00693 FSC = 0x05: level 1 translation fault 00693 Data abort info: 00693 ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 00693 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 00693 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 00693 user pgtable: 4k pages, 39-bit VAs, pgdp=000000012e650000 00693 [000000008fb96218] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 00693 Internal error: Oops: 0000000096000005 [#1] SMP 00693 Modules linked in: 00693 CPU: 0 UID: 0 PID: 4307 Comm: cat Not tainted 6.16.0-rc2-ktest-g9e15af94fd86 #27578 NONE 00693 Hardware name: linux,dummy-virt (DT) 00693 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00693 pc : six_lock_counts+0x20/0xe8 00693 lr : bch2_btree_bkey_cached_common_to_text+0x38/0x130 00693 sp : ffffff80ca98bb60 00693 x29: ffffff80ca98bb60 x28: 000000008fb96200 x27: 0000000000000007 00693 x26: ffffff80eafd06b8 x25: 0000000000000000 x24: ffffffc080d75a60 00693 x23: ffffff80eafd0000 x22: ffffffc080bdfcc0 x21: ffffff80eafd0210 00693 x20: ffffff80c192ff08 x19: 000000008fb96200 x18: 00000000ffffffff 00693 x17: 0000000000000000 x16: 0000000000000000 x15: 00000000ffffffff 00693 x14: 0000000000000000 x13: ffffff80ceb5a29a x12: 20796220646c6568 00693 x11: 72205d3e303c5b20 x10: 0000000000000020 x9 : ffffffc0805fb6b0 00693 x8 : 0000000000000020 x7 : 0000000000000000 x6 : 0000000000000020 00693 x5 : ffffff80ceb5a29c x4 : 0000000000000001 x3 : 000000000000029c 00693 x2 : 0000000000000000 x1 : ffffff80ef66c000 x0 : 000000008fb96200 00693 Call trace: 00693 six_lock_counts+0x20/0xe8 (P) 00693 bch2_btree_bkey_cached_common_to_text+0x38/0x130 00693 bch2_btree_trans_to_text+0x260/0x2a8 00693 bch2_btree_transactions_read+0xac/0x1e8 00693 full_proxy_read+0x74/0xd8 00693 vfs_read+0x90/0x300 00693 ksys_read+0x6c/0x108 00693 __arm64_sys_read+0x20/0x30 00693 invoke_syscall.constprop.0+0x54/0xe8 00693 do_el0_svc+0x44/0xc8 00693 el0_svc+0x18/0x58 00693 el0t_64_sync_handler+0x104/0x130 00693 el0t_64_sync+0x154/0x158 00693 Code: 910003fd f9423c22 f90017e2 d2800002 (f9400c01) 00693 ---[ end trace 0000000000000000 ]--- Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: btree read retry fixesKent Overstreet
Fix btree node read retries after validate errors: __btree_err() is the wrong place to flag a topology error: that is done by btree_lost_data(). Additionally, some calls to bch2_bkey_pick_read_device() were not updated in the 6.16 rework for improved log messages; we were failing to signal that we still had a retry. Cc: Nikita Ofitserov <himikof@gmail.com> Cc: Alan Huang <mmpgouride@gmail.com> Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-05bcachefs: btree node scan no longer uses btree cacheKent Overstreet
Previously, btree node scan used the btree node cache to check if btree nodes were readable, but this is subject to interference from threads scanning different devices trying to read the same node - and more critically, nodes that we already attempted and failed to read before kicking off scan. Instead, we now allocate a 'struct btree' that does not live in the btree node cache, and call bch2_btree_node_read_done() directly. Cc: Nikita Ofitserov <himikof@gmail.com> Reviewed-by: Nikita Ofitserov <himikof@gmail.com> Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04bcachefs: Tweak btree cache helpers for use by btree node scanKent Overstreet
btree node scan needs to not use the btree node cache: that causes interference from prior failed reads and parallel workers. Instead we need to allocate btree nodes that don't live in the btree cache, so that we can call bch2_btree_node_read_done() directly. This patch tweaks the low level helpers so they don't touch the btree cache lists. Cc: Nikita Ofitserov <himikof@gmail.com> Reviewed-by: Nikita Ofitserov <himikof@gmail.com> Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04bcachefs: Fix btree for nonexistent tree depthKent Overstreet
The fix for when we should increase tree depth in journal replay was entirely bogus. We should only increase the tree depth in journal replay when recovery from btree node scan, and then only for keys found by btree node scan. This needs additional work - we should be shooting down existing interior node pointers when recovery from scan, they shouldn't be showing up here. Fixes: b47a82ff4772 ("bcachefs: Only run 'increase_depth' for keys from btree node csan") Cc: Alan Huang <mmpgouride@gmail.com> Reported-by: syzbot+8deb6ff4415db67a9f18@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04bcachefs: Fix bch2_io_failures_to_text()Kent Overstreet
This wasn't updated when we added tracking for btree validate errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04bcachefs: bch2_fpunch_snapshot()Kent Overstreet
Add a new version of fpunch for operating on a snapshot ID, not a subvolume - and use it for "extent past end of inode" repair. Previously, repair would try to delete everything at once, but deleting too many extents at once can overflow the btree_trans bump allocator, as well as causing other problems - the new helper properly uses bch2_extent_trim_atomic(). Reported-and-tested-by: Edoardo Codeglia <bcachefs@404.blue> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-04fscrypt: Don't use problematic non-inline crypto enginesEric Biggers
Make fscrypt no longer use Crypto API drivers for non-inline crypto engines, even when the Crypto API prioritizes them over CPU-based code (which unfortunately it often does). These drivers tend to be really problematic, especially for fscrypt's workload. This commit has no effect on inline crypto engines, which are different and do work well. Specifically, exclude drivers that have CRYPTO_ALG_KERN_DRIVER_ONLY or CRYPTO_ALG_ALLOCATES_MEMORY set. (Later, CRYPTO_ALG_ASYNC should be excluded too. That's omitted for now to keep this commit backportable, since until recently some CPU-based code had CRYPTO_ALG_ASYNC set.) There are two major issues with these drivers: bugs and performance. First, these drivers tend to be buggy. They're fundamentally much more error-prone and harder to test than the CPU-based code. They often don't get tested before kernel releases, and even if they do, the crypto self-tests don't properly test these drivers. Released drivers have en/decrypted or hashed data incorrectly. These bugs cause issues for fscrypt users who often didn't even want to use these drivers, e.g.: - https://github.com/google/fscryptctl/issues/32 - https://github.com/google/fscryptctl/issues/9 - https://lore.kernel.org/r/PH0PR02MB731916ECDB6C613665863B6CFFAA2@PH0PR02MB7319.namprd02.prod.outlook.com These drivers have also similarly caused issues for dm-crypt users, including data corruption and deadlocks. Since Linux v5.10, dm-crypt has disabled most of them by excluding CRYPTO_ALG_ALLOCATES_MEMORY. Second, these drivers tend to be *much* slower than the CPU-based code. This may seem counterintuitive, but benchmarks clearly show it. There's a *lot* of overhead associated with going to a hardware driver, off the CPU, and back again. To prove this, I gathered as many systems with this type of crypto engine as I could, and I measured synchronous encryption of 4096-byte messages (which matches fscrypt's workload): Intel Emerald Rapids server: AES-256-XTS: xts-aes-vaes-avx512 16171 MB/s [CPU-based, Vector AES] qat_aes_xts 289 MB/s [Offload, Intel QuickAssist] Qualcomm SM8650 HDK: AES-256-XTS: xts-aes-ce 4301 MB/s [CPU-based, ARMv8 Crypto Extensions] xts-aes-qce 73 MB/s [Offload, Qualcomm Crypto Engine] i.MX 8M Nano LPDDR4 EVK: AES-256-XTS: xts-aes-ce 647 MB/s [CPU-based, ARMv8 Crypto Extensions] xts(ecb-aes-caam) 20 MB/s [Offload, CAAM] AES-128-CBC-ESSIV: essiv(cbc-aes-caam,sha256-lib) 23 MB/s [Offload, CAAM] STM32MP157F-DK2: AES-256-XTS: xts-aes-neonbs 13.2 MB/s [CPU-based, ARM NEON] xts(stm32-ecb-aes) 3.1 MB/s [Offload, STM32 crypto engine] AES-128-CBC-ESSIV: essiv(cbc-aes-neonbs,sha256-lib) 14.7 MB/s [CPU-based, ARM NEON] essiv(stm32-cbc-aes,sha256-lib) 3.2 MB/s [Offload, STM32 crypto engine] Adiantum: adiantum(xchacha12-arm,aes-arm,nhpoly1305-neon) 52.8 MB/s [CPU-based, ARM scalar + NEON] So, there was no case in which the crypto engine was even *close* to being faster. On the first three, which have AES instructions in the CPU, the CPU was 30 to 55 times faster (!). Even on STM32MP157F-DK2 which has a Cortex-A7 CPU that doesn't have AES instructions, AES was over 4 times faster on the CPU. And Adiantum encryption, which is what actually should be used on CPUs like that, was over 17 times faster. Other justifications that have been given for these non-inline crypto engines (almost always coming from the hardware vendors, not actual users) don't seem very plausible either: - The crypto engine throughput could be improved by processing multiple requests concurrently. Currently irrelevant to fscrypt, since it doesn't do that. This would also be complex, and unhelpful in many cases. 2 of the 4 engines I tested even had only one queue. - Some of the engines, e.g. STM32, support hardware keys. Also currently irrelevant to fscrypt, since it doesn't support these. Interestingly, the STM32 driver itself doesn't support this either. - Free up CPU for other tasks and/or reduce energy usage. Not very plausible considering the "short" message length, driver overhead, and scheduling overhead. There's just very little time for the CPU to do something else like run another task or enter low-power state, before the message finishes and it's time to process the next one. - Some of these engines resist power analysis and electromagnetic attacks, while the CPU-based crypto generally does not. In theory, this sounds great. In practice, if this benefit requires the use of an off-CPU offload that massively regresses performance and has a low-quality, buggy driver, the price for this hardening (which is not relevant to most fscrypt users, and tends to be incomplete) is just too high. Inline crypto engines are much more promising here, as are on-CPU solutions like RISC-V High Assurance Cryptography. Fixes: b30ab0e03407 ("ext4 crypto: add ext4 encryption facilities") Cc: stable@vger.kernel.org Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250704070322.20692-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org>
2025-07-04Merge tag 'bcachefs-2025-07-03' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "The 'opts.casefold_disabled' patch is non critical, but would be a 6.15 backport; it's to address the casefolding + overlayfs incompatibility that was discovvered late. It's late because I was hoping that this would be addressed on the overlayfs side (and will be in 6.17), but user reports keep coming in on this one (lots of people are using docker these days)" * tag 'bcachefs-2025-07-03' of git://evilpiepirate.org/bcachefs: bcachefs: opts.casefold_disabled bcachefs: Work around deadlock to btree node rewrites in journal replay bcachefs: Fix incorrect transaction restart handling bcachefs: fix btree_trans_peek_prev_journal() bcachefs: mark invalid_btree_id autofix
2025-07-04Merge tag 'vfs-6.16-rc5.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a regression caused by the anonymous inode rework. Making them regular files causes various places in the kernel to tip over starting with io_uring. Revert to the former status quo and port our assertion to be based on checking the inode so we don't lose the valuable VFS_*_ON_*() assertions that have already helped discover weird behavior our outright bugs. - Fix the the upper bound calculation in fuse_fill_write_pages() - Fix priority inversion issues in the eventpoll code - Make secretmen use anon_inode_make_secure_inode() to avoid bypassing the LSM layer - Fix a netfs hang due to missing case in final DIO read result collection - Fix a double put of the netfs_io_request struct - Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag wrangling - Fix infinite looping in netfs_wait_for_pause/request() - Fix a netfs ref leak on an extra subrequest inserted into a request's list of subreqs - Fix various cifs RPC callbacks to set NETFS_SREQ_NEED_RETRY if a subrequest fails retriably - Fix a cifs warning in the workqueue code when reconnecting a channel - Fix the updating of i_size in netfs to avoid a race between testing if we should have extended the file with a DIO write and changing i_size - Merge the places in netfs that update i_size on write - Fix coredump socket selftests * tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: anon_inode: rework assertions netfs: Update tracepoints in a number of ways netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read netfs: Merge i_size update functions netfs: Fix i_size updating smb: client: set missing retry flag in cifs_writev_callback() smb: client: set missing retry flag in cifs_readv_callback() smb: client: set missing retry flag in smb2_writev_callback() netfs: Fix ref leak on inserted extra subreq in write retry netfs: Fix looping in wait functions netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling netfs: Fix double put of request netfs: Fix hang due to missing case in final DIO read result collection eventpoll: Fix priority inversion problem fuse: fix fuse_fill_write_pages() upper bound calculation fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass selftests/coredump: Fix "socket_detect_userspace_client" test failure
2025-07-04tree-wide: s/struct fileattr/struct file_kattr/gChristian Brauner
Now that we expose struct file_attr as our uapi struct rename all the internal struct to struct file_kattr to clearly communicate that it is a kernel internal struct. This is similar to struct mount_{k}attr and others. Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR (net-6.16-rc5). No conflicts. No adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-03fix proc_sys_compare() handling of in-lookup dentriesAl Viro
There's one case where ->d_compare() can be called for an in-lookup dentry; usually that's nothing special from ->d_compare() point of view, but... proc_sys_compare() is weird. The thing is, /proc/sys subdirectories can look differently for different processes. Up to and including having the same name resolve to different dentries - all of them hashed. The way it's done is ->d_compare() refusing to admit a match unless this dentry is supposed to be visible to this caller. The information needed to discriminate between them is stored in inode; it is set during proc_sys_lookup() and until it's done d_splice_alias() we really can't tell who should that dentry be visible for. Normally there's no negative dentries in /proc/sys; we can run into a dying dentry in RCU dcache lookup, but those can be safely rejected. However, ->d_compare() is also called for in-lookup dentries, before they get positive - or hashed, for that matter. In case of match we will wait until dentry leaves in-lookup state and repeat ->d_compare() afterwards. In other words, the right behaviour is to treat the name match as sufficient for in-lookup dentries; if dentry is not for us, we'll see that when we recheck once proc_sys_lookup() is done with it. While we are at it, fix the misspelled READ_ONCE and WRITE_ONCE there. Fixes: d9171b934526 ("parallel lookups machinery, part 4 (and last)") Reported-by: NeilBrown <neilb@brown.name> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>