summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
5 daysMerge tag 'mm-hotfixes-stable-2025-07-24-18-03' of ↵HEADmasterLinus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 9 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 7 are for MM" * tag 'mm-hotfixes-stable-2025-07-24-18-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: sprintf.h requires stdarg.h resource: fix false warning in __request_region() mm/damon/core: commit damos_quota_goal->nid kasan: use vmalloc_dump_obj() for vmalloc error reports mm/ksm: fix -Wsometimes-uninitialized from clang-21 in advisor_mode_show() mm: update MAINTAINERS entry for HMM nilfs2: reject invalid file types when reading inodes selftests/mm: fix split_huge_page_test for folio_split() tests mailmap: add entry for Senozhatsky mm/zsmalloc: do not pass __GFP_MOVABLE if CONFIG_COMPACTION=n mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
6 daysfix the regression in ufs options parsingAl Viro
A really dumb braino on rebasing and a dumber fuckup with managing #for-next Fixes: b70cb459890b ("ufs: convert ufs to the new mount API") Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
10 daysnilfs2: reject invalid file types when reading inodesRyusuke Konishi
To prevent inodes with invalid file types from tripping through the vfs and causing malfunctions or assertion failures, add a missing sanity check when reading an inode from a block device. If the file type is not valid, treat it as a filesystem error. Link: https://lkml.kernel.org/r/20250710134952.29862-1-konishi.ryusuke@gmail.com Fixes: 05fe58fdc10d ("nilfs2: inode operations") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+895c23f6917da440ed0d@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=895c23f6917da440ed0d Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 daysMerge tag 'efi-fixes-for-v6.16-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fix from Ard Biesheuvel: - Fix potential memory leak reported by kmemleak * tag 'efi-fixes-for-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efivarfs: Fix memory leak of efivarfs_fs_info in fs_context error paths
10 daysMerge tag 'vfs-6.16-rc7.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix a memory leak in fcntl_dirnotify() - Raise SB_I_NOEXEC on secrement superblock instead of messing with flags on the mount - Add fsdevel and block mailing lists to uio entry. We had a few instances were very questionable stuff was added without either block or the VFS being aware of it - Fix netfs copy-to-cache so that it performs collection with ceph+fscache - Fix netfs race between cache write completion and ALL_QUEUED being set - Verify the inode mode when loading entries from disk in isofs - Avoid state_lock in iomap_set_range_uptodate() - Fix PIDFD_INFO_COREDUMP check in PIDFD_GET_INFO ioctl - Fix the incorrect return value in __cachefiles_write() * tag 'vfs-6.16-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: MAINTAINERS: add block and fsdevel lists to iov_iter netfs: Fix race between cache write completion and ALL_QUEUED being set netfs: Fix copy-to-cache so that it performs collection with ceph+fscache fix a leak in fcntl_dirnotify() iomap: avoid unnecessary ifs_set_range_uptodate() with locks isofs: Verify inode mode when loading from disk cachefiles: Fix the incorrect return value in __cachefiles_write() secretmem: use SB_I_NOEXEC coredump: fix PIDFD_INFO_COREDUMP ioctl check
11 daysMerge tag 'v6.16-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - fix creating special files to Samba when using SMB3.1.1 POSIX Extensions - fix incorrect caching on new file creation with directory leases enabled - two use after free fixes: one in oplock_break and one in async decryption * tag 'v6.16-rc6-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: Fix SMB311 posix special file creation to servers which do not advertise reparse support smb: invalidate and close cached directory when creating child entries smb: client: fix use-after-free in crypt_message when using async crypto smb: client: fix use-after-free in cifs_oplock_break
11 daysFix SMB311 posix special file creation to servers which do not advertise ↵Steve French
reparse support Some servers (including Samba), support the SMB3.1.1 POSIX Extensions (which use reparse points for handling special files) but do not properly advertise file system attribute FILE_SUPPORTS_REPARSE_POINTS. Although we don't check for this attribute flag when querying special file information, we do check it when creating special files which causes them to fail unnecessarily. If we have negotiated SMB3.1.1 POSIX Extensions with the server we can expect the server to support creating special files via reparse points, and even if the server fails the operation due to really forbidding creating special files, then it should be no problem and is more likely to return a more accurate rc in any case (e.g. EACCES instead of EOPNOTSUPP). Allow creating special files as long as the server supports either reparse points or the SMB3.1.1 POSIX Extensions (note that if the "sfu" mount option is specified it uses a different way of storing special files that does not rely on reparse points). Cc: <stable@vger.kernel.org> Fixes: 6c06be908ca19 ("cifs: Check if server supports reparse points before using them") Acked-by: Ralph Boehme <slow@samba.org> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Steve French <stfrench@microsoft.com>
11 daysMerge tag 'xfs-fixes-6.16-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Carlos Maiolino: "This contains mostly code clean up, refactoring and comments modification. The most important patch in this series is the last one that removes an unnecessary data structure allocation of xfs busy extents which might lead to a memory leak on the zoned allocator code" * tag 'xfs-fixes-6.16-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: don't allocate the xfs_extent_busy structure for zoned RTGs xfs: remove the bt_bdev_file buftarg field xfs: rename the bt_bdev_* buftarg fields xfs: refactor xfs_calc_atomic_write_unit_max xfs: add a xfs_group_type_buftarg helper xfs: remove the call to sync_blockdev in xfs_configure_buftarg xfs: clean up the initial read logic in xfs_readsb xfs: replace strncpy with memcpy in xattr listing
11 daysxfs: don't allocate the xfs_extent_busy structure for zoned RTGsChristoph Hellwig
Busy extent tracking is primarily used to ensure that freed blocks are not reused for data allocations before the transaction that deleted them has been committed to stable storage, and secondarily to drive online discard. None of the use cases applies to zoned RTGs, as the zoned allocator can't overwrite blocks before resetting the zone, which already flushes out all transactions touching the RTGs. So the busy extent tracking is not needed for zoned RTGs, and also not called for zoned RTGs. But somehow the code to skip allocating and freeing the structure got lost during the zoned XFS upstreaming process. This not only causes these structures to unnecessarily allocated, but can also lead to memory leaks as the xg_busy_extents pointer in the xfs_group structure is overlayed with the pointer for the linked list of to be reset zones. Stop allocating and freeing the structure to not pointlessly allocate memory which is then leaked when the zone is reset. Fixes: 080d01c41d44 ("xfs: implement zoned garbage collection") Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: <stable@vger.kernel.org> # v6.15 [cem: Fix type and add stable tag] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
12 daysefivarfs: Fix memory leak of efivarfs_fs_info in fs_context error pathsBreno Leitao
When processing mount options, efivarfs allocates efivarfs_fs_info (sfi) early in fs_context initialization. However, sfi is associated with the superblock and typically freed when the superblock is destroyed. If the fs_context is released (final put) before fill_super is called—such as on error paths or during reconfiguration—the sfi structure would leak, as ownership never transfers to the superblock. Implement the .free callback in efivarfs_context_ops to ensure any allocated sfi is properly freed if the fs_context is torn down before fill_super, preventing this memory leak. Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com> Fixes: 5329aa5101f73c ("efivarfs: Add uid/gid mount options") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
13 daysbcachefs: Fix bch2_maybe_casefold() when CONFIG_UTF8=nKent Overstreet
maybe_casefold() shouldn't have been nooped, just bch2_casefold(). Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Fix build when CONFIG_UNICODE=nKent Overstreet
94426e4201fb, which added the killswitch for casefolding, accidentally removed some of the ifdefs we need to avoid build errors. It appears we need better build testing for different configurations, it took two weeks for the robots to catch this one. Fixes: 94426e4201fb ("bcachefs: opts.casefold_disabled") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Fix reference to invalid bucket in copygcKent Overstreet
Use bch2_dev_bucket_tryget() instead of bch2_dev_tryget() before checking the bucket bitmap. Reported-by: syzbot+3168625f36f4a539237e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Don't build aux search tree when still repairing nodeKent Overstreet
bch2_btree_node_drop_keys_outside_node() will (re)build aux search trees, because it's also called by topology repair. bch2_btree_node_read_done() was calling it before validating individual keys; invalid ones have to be dropped. If we call drop_keys_outside_node() first, then bch2_bset_build_aux_tree() doesn't run because the node already has an aux search tree - which was invalidated by the repair. Reported-by: syzbot+c5e7a66b3b23ae65d44f@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Tweak threshold for allocator triggering discardsKent Overstreet
The allocator path has a "if we're really low on free buckets, check if we should issue discards" - tweak this to also trigger discards if more than 1/128th of the device is in need_discard state. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
13 daysbcachefs: Fix triggering of discard by the journal pathKent Overstreet
It becomes possible to do discards after a journal flush, which naturally the journal code is reponsible for. A prior refactoring seems to have broken this - which went unnoticed because the foreground allocator path can also trigger discards. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-14netfs: Fix race between cache write completion and ALL_QUEUED being setDavid Howells
When netfslib is issuing subrequests, the subrequests start processing immediately and may complete before we reach the end of the issuing function. At the end of the issuing function we set NETFS_RREQ_ALL_QUEUED to indicate to the collector that we aren't going to issue any more subreqs and that it can do the final notifications and cleanup. Now, this isn't a problem if the request is synchronous (NETFS_RREQ_OFFLOAD_COLLECTION is unset) as the result collection will be done in-thread and we're guaranteed an opportunity to run the collector. However, if the request is asynchronous, collection is primarily triggered by the termination of subrequests queuing it on a workqueue. Now, a race can occur here if the app thread sets ALL_QUEUED after the last subrequest terminates. This can happen most easily with the copy2cache code (as used by Ceph) where, in the collection routine of a read request, an asynchronous write request is spawned to copy data to the cache. Folios are added to the write request as they're unlocked, but there may be a delay before ALL_QUEUED is set as the write subrequests may complete before we get there. If all the write subreqs have finished by the ALL_QUEUED point, no further events happen and the collection never happens, leaving the request hanging. Fix this by queuing the collector after setting ALL_QUEUED. This is a bit heavy-handed and it may be sufficient to do it only if there are no extant subreqs. Also add a tracepoint to cross-reference both requests in a copy-to-request operation and add a trace to the netfs_rreq tracepoint to indicate the setting of ALL_QUEUED. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Reported-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250711151005.2956810-3-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14netfs: Fix copy-to-cache so that it performs collection with ceph+fscacheDavid Howells
The netfs copy-to-cache that is used by Ceph with local caching sets up a new request to write data just read to the cache. The request is started and then left to look after itself whilst the app continues. The request gets notified by the backing fs upon completion of the async DIO write, but then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't set - but the app isn't waiting there, and so the request just hangs. Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION which causes the notification from the backing filesystem to put the collection onto a work queue instead. Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item") Reported-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/CAKPOu+8z_ijTLHdiCYGU_Uk7yYD=shxyGLwfe-L7AV3DhebS3w@mail.gmail.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/20250711151005.2956810-2-dhowells@redhat.com Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> cc: Paulo Alcantara <pc@manguebit.org> cc: Viacheslav Dubeyko <slava@dubeyko.com> cc: Alex Markuze <amarkuze@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: netfs@lists.linux.dev cc: ceph-devel@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-14fix a leak in fcntl_dirnotify()Al Viro
[into #fixes, unless somebody objects] Lifetime of new_dn_mark is controlled by that of its ->fsn_mark, pointed to by new_fsn_mark. Unfortunately, a failure exit had been inserted between the allocation of new_dn_mark and the call of fsnotify_init_mark(), ending up with a leak. Fixes: 1934b212615d "file: reclaim 24 bytes from f_owner" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/20250712171843.GB1880847@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-13smb: invalidate and close cached directory when creating child entriesBharath SM
When a parent lease key is passed to the server during a create operation while holding a directory lease, the server may not send a lease break to the client. In such cases, it becomes the client’s responsibility to ensure cache consistency. This led to a problem where directory listings (e.g., `ls` or `readdir`) could return stale results after a new file is created. eg: ls /mnt/share/ touch /mnt/share/file1 ls /mnt/share/ In this scenario, the final `ls` may not show `file1` due to the stale directory cache. For now, fix this by marking the cached directory as invalid if using the parent lease key during create, and explicitly closing the cached directory after successful file creation. Fixes: 037e1bae588eacf ("smb: client: use ParentLeaseKey in cifs_do_create") Signed-off-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-13smb: client: fix use-after-free in crypt_message when using async cryptoWang Zhaolong
The CVE-2024-50047 fix removed asynchronous crypto handling from crypt_message(), assuming all crypto operations are synchronous. However, when hardware crypto accelerators are used, this can cause use-after-free crashes: crypt_message() // Allocate the creq buffer containing the req creq = smb2_get_aead_req(..., &req); // Async encryption returns -EINPROGRESS immediately rc = enc ? crypto_aead_encrypt(req) : crypto_aead_decrypt(req); // Free creq while async operation is still in progress kvfree_sensitive(creq, ...); Hardware crypto modules often implement async AEAD operations for performance. When crypto_aead_encrypt/decrypt() returns -EINPROGRESS, the operation completes asynchronously. Without crypto_wait_req(), the function immediately frees the request buffer, leading to crashes when the driver later accesses the freed memory. This results in a use-after-free condition when the hardware crypto driver later accesses the freed request structure, leading to kernel crashes with NULL pointer dereferences. The issue occurs because crypto_alloc_aead() with mask=0 doesn't guarantee synchronous operation. Even without CRYPTO_ALG_ASYNC in the mask, async implementations can be selected. Fix by restoring the async crypto handling: - DECLARE_CRYPTO_WAIT(wait) for completion tracking - aead_request_set_callback() for async completion notification - crypto_wait_req() to wait for operation completion This ensures the request buffer isn't freed until the crypto operation completes, whether synchronous or asynchronous, while preserving the CVE-2024-50047 fix. Fixes: b0abcd65ec54 ("smb: client: fix UAF in async decryption") Link: https://lore.kernel.org/all/8b784a13-87b0-4131-9ff9-7a8993538749@huaweicloud.com/ Cc: stable@vger.kernel.org Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-13smb: client: fix use-after-free in cifs_oplock_breakWang Zhaolong
A race condition can occur in cifs_oplock_break() leading to a use-after-free of the cinode structure when unmounting: cifs_oplock_break() _cifsFileInfo_put(cfile) cifsFileInfo_put_final() cifs_sb_deactive() [last ref, start releasing sb] kill_sb() kill_anon_super() generic_shutdown_super() evict_inodes() dispose_list() evict() destroy_inode() call_rcu(&inode->i_rcu, i_callback) spin_lock(&cinode->open_file_lock) <- OK [later] i_callback() cifs_free_inode() kmem_cache_free(cinode) spin_unlock(&cinode->open_file_lock) <- UAF cifs_done_oplock_break(cinode) <- UAF The issue occurs when umount has already released its reference to the superblock. When _cifsFileInfo_put() calls cifs_sb_deactive(), this releases the last reference, triggering the immediate cleanup of all inodes under RCU. However, cifs_oplock_break() continues to access the cinode after this point, resulting in use-after-free. Fix this by holding an extra reference to the superblock during the entire oplock break operation. This ensures that the superblock and its inodes remain valid until the oplock break completes. Link: https://bugzilla.kernel.org/show_bug.cgi?id=220309 Fixes: b98749cac4a6 ("CIFS: keep FileInfo handle live during oplock break") Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-13bcachefs: io_read: remove from async obj list in rbio_done()Kent Overstreet
Previously, only split rbios allocated in io_read.c would be removed from the async obj list. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-12Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "19 hotfixes. A whopping 16 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. 14 are for MM. Three gdb-script fixes and a kallsyms build fix" * tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Revert "sched/numa: add statistics of numa balance task" mm: fix the inaccurate memory statistics issue for users mm/damon: fix divide by zero in damon_get_intervals_score() samples/damon: fix damon sample mtier for start failure samples/damon: fix damon sample wsse for start failure samples/damon: fix damon sample prcl for start failure kasan: remove kasan_find_vm_area() to prevent possible deadlock scripts: gdb: vfs: support external dentry names mm/migrate: fix do_pages_stat in compat mode mm/damon/core: handle damon_call_control as normal under kdmond deactivation mm/rmap: fix potential out-of-bounds page table access during batched unmap mm/hugetlb: don't crash when allocating a folio if there are no resv scripts/gdb: de-reference per-CPU MCE interrupts scripts/gdb: fix interrupts.py after maple tree conversion maple_tree: fix mt_destroy_walk() on root leaf node mm/vmalloc: leave lazy MMU mode on PTE mapping error scripts/gdb: fix interrupts display after MCP on x86 lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users() kallsyms: fix build without execinfo
2025-07-12Merge tag 'erofs-for-6.16-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "Fix for a cache aliasing issue by adding missing flush_dcache_folio(), which causes execution failures on some arm32 setups. Fix for large compressed fragments, which could be generated by -Eall-fragments option (but should be rare) and was rejected by mistake due to an on-disk hardening commit. The remaining ones are small fixes. Summary: - Address cache aliasing for mappable page cache folios - Allow readdir() to be interrupted - Fix large fragment handling which was errored out by mistake - Add missing tracepoints - Use memcpy_to_folio() to replace copy_to_iter() for inline data" * tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix large fragment handling erofs: allow readdir() to be interrupted erofs: address D-cache aliasing erofs: use memcpy_to_folio() to replace copy_to_iter() erofs: fix to add missing tracepoint in erofs_read_folio() erofs: fix to add missing tracepoint in erofs_readahead()
2025-07-12Merge tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet. * tag 'bcachefs-2025-07-11' of git://evilpiepirate.org/bcachefs: bcachefs: Don't set BCH_FS_error on transaction restart bcachefs: Fix additional misalignment in journal space calculations bcachefs: Don't schedule non persistent passes persistently bcachefs: Fix bch2_btree_transactions_read() synchronization bcachefs: btree read retry fixes bcachefs: btree node scan no longer uses btree cache bcachefs: Tweak btree cache helpers for use by btree node scan bcachefs: Fix btree for nonexistent tree depth bcachefs: Fix bch2_io_failures_to_text() bcachefs: bch2_fpunch_snapshot()
2025-07-12Merge tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: - fix use after free in lease break - small fix for freeing rdma transport (fixes missing logging of cm_qp_destroy) - fix write count leak * tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix potential use-after-free in oplock/lease break ack ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked() smb: server: make use of rdma_destroy_qp()
2025-07-11Revert "eventpoll: Fix priority inversion problem"Linus Torvalds
This reverts commit 8c44dac8add7503c345c0f6c7962e4863b88ba42. I haven't figured out what the actual bug in this commit is, but I did spend a lot of time chasing it down and eventually succeeded in bisecting it down to this. For some reason, this eventpoll commit ends up causing delays and stuck user space processes, but it only happens on one of my machines, and only during early boot or during the flurry of initial activity when logging in. I must be triggering some very subtle timing issue, but once I figured out the behavior pattern that made it reasonably reliable to trigger, it did bisect right to this, and reverting the commit fixes the problem. Of course, that was only after I had failed at bisecting it several times, and had flailed around blaming both the drm people and the netlink people for the odd problems. The most obvious of which happened at the time of the first graphical login (the most common symptom being that some gnome app aborted due to a 30s timeout, often leading to the whole session then failing if it was some critical component like gnome-shell or similar). Acked-by: Nam Cao <namcao@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-07-12erofs: fix large fragment handlingGao Xiang
Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE but the fragment is not the whole inode, it currently returns -EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set. It is not intended by design but should be rare, as it can only be reproduced by mkfs with `-Eall-fragments` in a specific case. Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT. Reported-by: Axel Fontaine <axel@axelfontaine.com> Closes: https://github.com/erofs/erofs-utils/issues/23 Fixes: 7c3ca1838a78 ("erofs: restrict pcluster size limitations") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250711195826.3601157-1-hsiangkao@linux.alibaba.com
2025-07-11iomap: avoid unnecessary ifs_set_range_uptodate() with locksJinliang Zheng
In the buffer write path, iomap_set_range_uptodate() is called every time iomap_end_write() is called. But if folio_test_uptodate() holds, we know that all blocks in this folio are already in the uptodate state, so there is no need to go deep into the critical section of state_lock to execute bitmap_set(). This is because the folios always creep towards ifs_is_fully_uptodate() state and once they've gotten there folio_mark_uptodate() is called, which means the folio is uptodate. Then once a folio is uptodate, there is no route back to !uptodate without going through the removal of the folio from the page cache. Therefore, it's fine to use folio_test_uptodate() to short-circuit unnecessary code paths. Although state_lock may not have significant lock contention due to folio lock, this patch at least reduces the number of instructions, especially the expensive lock-prefixed instructions. Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Link: https://lore.kernel.org/20250711081207.1782667-1-alexjlzheng@tencent.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-11isofs: Verify inode mode when loading from diskJan Kara
Verify that the inode mode is sane when loading it from the disk to avoid complaints from VFS about setting up invalid inodes. Reported-by: syzbot+895c23f6917da440ed0d@syzkaller.appspotmail.com CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/20250709095545.31062-2-jack@suse.cz Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-10erofs: allow readdir() to be interruptedChao Yu
In a quick slow device, readdir() may loop for long time in large directory, let's give a chance to allow it to be interrupted by userspace. Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250710073619.4083422-1-chao@kernel.org [ Gao Xiang: move cond_resched() to the end of the while loop. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10erofs: address D-cache aliasingGao Xiang
Flush the D-cache before unlocking folios for compressed inodes, as they are dirtied during decompression. Avoid calling flush_dcache_folio() on every CPU write, since it's more like playing whack-a-mole without real benefit. It has no impact on x86 and arm64/risc-v: on x86, flush_dcache_folio() is a no-op, and on arm64/risc-v, PG_dcache_clean (PG_arch_1) is clear for new page cache folios. However, certain ARM boards are affected, as reported. Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support") Closes: https://lore.kernel.org/r/c1e51e16-6cc6-49d0-a63e-4e9ff6c4dd53@pengutronix.de Closes: https://lore.kernel.org/r/38d43fae-1182-4155-9c5b-ffc7382d9917@siemens.com Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Tested-by: Stefan Kerkmann <s.kerkmann@pengutronix.de> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250709034614.2780117-2-hsiangkao@linux.alibaba.com
2025-07-10erofs: use memcpy_to_folio() to replace copy_to_iter()Gao Xiang
Using copy_to_iter() here is overkill and even messy. Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250709034614.2780117-1-hsiangkao@linux.alibaba.com
2025-07-10erofs: fix to add missing tracepoint in erofs_read_folio()Chao Yu
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap") converts to use iomap interface, it removed trace_erofs_readpage() tracepoint in the meantime, let's add it back. Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap") Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250708111942.3120926-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10erofs: fix to add missing tracepoint in erofs_readahead()Chao Yu
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap") converts to use iomap interface, it removed trace_erofs_readahead() tracepoint in the meantime, let's add it back. Fixes: 771c994ea51f ("erofs: convert all uncompressed cases to iomap") Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250707084832.2725677-1-chao@kernel.org Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2025-07-10cachefiles: Fix the incorrect return value in __cachefiles_write()Zizhi Wo
In __cachefiles_write(), if the return value of the write operation > 0, it is set to 0. This makes it impossible to distinguish scenarios where a partial write has occurred, and will affect the outer calling functions: 1) cachefiles_write_complete() will call "term_func" such as netfs_write_subrequest_terminated(). When "ret" in __cachefiles_write() is used as the "transferred_or_error" of this function, it can not distinguish the amount of data written, makes the WARN meaningless. 2) cachefiles_ondemand_fd_write_iter() can only assume all writes were successful by default when "ret" is 0, and unconditionally return the full length specified by user space. Fix it by modifying "ret" to reflect the actual number of bytes written. Furthermore, returning a value greater than 0 from __cachefiles_write() does not affect other call paths, such as cachefiles_issue_write() and fscache_write(). Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Link: https://lore.kernel.org/20250703024418.2809353-1-wozizhi@huaweicloud.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-09mm: fix the inaccurate memory statistics issue for usersBaolin Wang
On some large machines with a high number of CPUs running a 64K pagesize kernel, we found that the 'RES' field is always 0 displayed by the top command for some processes, which will cause a lot of confusion for users. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top 1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd The main reason is that the batch size of the percpu counter is quite large on these machines, caching a significant percpu value, since converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter"). Intuitively, the batch number should be optimized, but on some paths, performance may take precedence over statistical accuracy. Therefore, introducing a new interface to add the percpu statistical count and display it to users, which can remove the confusion. In addition, this change is not expected to be on a performance-critical path, so the modification should be acceptable. In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and dec/inc_mm_counter(), which are all wrappers around percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu batch caching to avoid 'fbc->lock' contention. This patch changes task_mem() and task_statm() to get the accurate mm counters under the 'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock contention due to the percpu batch caching of the mm counters. The following test also confirm the theoretical analysis. I run the stress-ng that stresses anon page faults in 32 threads on my 32 cores machine, while simultaneously running a script that starts 32 threads to busy-loop pread each stress-ng thread's /proc/pid/status interface. From the following data, I did not observe any obvious impact of this patch on the stress-ng tests. w/o patch: stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle) stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec w/patch: stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle) stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec On comparing a very simple app which just allocates & touches some memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus tree (4c06e63b9203) I can see that on latest Linus tree the values for VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while they do report values on v6.1 and a Linus tree with this patch. Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by Donet Tom <donettom@linux.ibm.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09eventpoll: don't decrement ep refcount while still holding the ep mutexLinus Torvalds
Jann Horn points out that epoll is decrementing the ep refcount and then doing a mutex_unlock(&ep->mtx); afterwards. That's very wrong, because it can lead to a use-after-free. That pattern is actually fine for the very last reference, because the code in question will delay the actual call to "ep_free(ep)" until after it has unlocked the mutex. But it's wrong for the much subtler "next to last" case when somebody *else* may also be dropping their reference and free the ep while we're still using the mutex. Note that this is true even if that other user is also using the same ep mutex: mutexes, unlike spinlocks, can not be used for object ownership, even if they guarantee mutual exclusion. A mutex "unlock" operation is not atomic, and as one user is still accessing the mutex as part of unlocking it, another user can come in and get the now released mutex and free the data structure while the first user is still cleaning up. See our mutex documentation in Documentation/locking/mutex-design.rst, in particular the section [1] about semantics: "mutex_unlock() may access the mutex structure even after it has internally released the lock already - so it's not safe for another context to acquire the mutex and assume that the mutex_unlock() context is not using the structure anymore" So if we drop our ep ref before the mutex unlock, but we weren't the last one, we may then unlock the mutex, another user comes in, drops _their_ reference and releases the 'ep' as it now has no users - all while the mutex_unlock() is still accessing it. Fix this by simply moving the ep refcount dropping to outside the mutex: the refcount itself is atomic, and doesn't need mutex protection (that's the whole _point_ of refcounts: unlike mutexes, they are inherently about object lifetimes). Reported-by: Jann Horn <jannh@google.com> Link: https://docs.kernel.org/locking/mutex-design.html#semantics [1] Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-07-08bcachefs: Don't set BCH_FS_error on transaction restartKent Overstreet
This started showing up more when we started logging the error being corrected in the journal - but __bch2_fsck_err() could return transaction restarts before that. Setting BCH_FS_error incorrectly causes recovery passes to not be cleared, among other issues. Fixes: b43f72492768 ("bcachefs: Log fsck errors in the journal") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-07-08ksmbd: fix potential use-after-free in oplock/lease break ackNamjae Jeon
If ksmbd_iov_pin_rsp return error, use-after-free can happen by accessing opinfo->state and opinfo_put and ksmbd_fd_put could called twice. Reported-by: Ziyan Xu <research@securitygossip.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()Al Viro
If the call of ksmbd_vfs_lock_parent() fails, we drop the parent_path references and return an error. We need to drop the write access we just got on parent_path->mnt before we drop the mount reference - callers assume that ksmbd_vfs_kern_path_locked() returns with mount write access grabbed if and only if it has returned 0. Fixes: 864fb5d37163 ("ksmbd: fix possible deadlock in smb2_open") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08smb: server: make use of rdma_destroy_qp()Stefan Metzmacher
The qp is created by rdma_create_qp() as t->cm_id->qp and t->qp is just a shortcut. rdma_destroy_qp() also calls ib_destroy_qp(cm_id->qp) internally, but it is protected by a mutex, clears the cm_id and also calls trace_cm_qp_destroy(). This should make the tracing more useful as both rdma_create_qp() and rdma_destroy_qp() are traces and it makes the code look more sane as functions from the same layer are used for the specific qp object. trace-cmd stream -e rdma_cma:cm_qp_create -e rdma_cma:cm_qp_destroy shows this now while doing a mount and unmount from a client: <...>-80 [002] 378.514182: cm_qp_create: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 pd.id=0 qp_type=RC send_wr=867 recv_wr=255 qp_num=1 rc=0 <...>-6283 [001] 381.686172: cm_qp_destroy: cm.id=1 src=172.31.9.167:5445 dst=172.31.9.166:37113 tos=0 qp_num=1 Before we only saw the first line. Cc: Namjae Jeon <linkinjeon@kernel.org> Cc: Steve French <stfrench@microsoft.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hyunchul Lee <hyc.lee@gmail.com> Cc: Tom Talpey <tom@talpey.com> Cc: linux-cifs@vger.kernel.org Fixes: 0626e6641f6b ("cifsd: add server handler for central processing and tranport layers") Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Tom Talpey <tom@talpey.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-07-08xfs: remove the bt_bdev_file buftarg fieldChristoph Hellwig
And use bt_file for both bdev and shmem backed buftargs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: rename the bt_bdev_* buftarg fieldsChristoph Hellwig
The extra bdev_ is weird, so drop it. Also improve the comment to make it clear these are the hardware limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: refactor xfs_calc_atomic_write_unit_maxChristoph Hellwig
This function and the helpers used by it duplicate the same logic for AGs and RTGs. Use the xfs_group_type enum to unify both variants. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: add a xfs_group_type_buftarg helperChristoph Hellwig
Generalize the xfs_group_type helper in the discard code to return a buftarg and move it to xfs_mount.h, and use the result in xfs_dax_notify_dev_failure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: remove the call to sync_blockdev in xfs_configure_buftargChristoph Hellwig
This extra call is not needed as xfs_alloc_buftarg already calls sync_blockdev. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: clean up the initial read logic in xfs_readsbChristoph Hellwig
The initial sb read is always for a device logical block size buffer. The device logical block size is provided in the bt_logical_sectorsize in struct buftarg, so use that instead of the confusingly named xfs_getsize_buftarg buffer that reads it from the bdev. Update the comments surrounding the code to better describe what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
2025-07-08xfs: replace strncpy with memcpy in xattr listingPranav Tyagi
Use memcpy() in place of strncpy() in __xfs_xattr_put_listent(). The length is known and a null byte is added manually. No functional change intended. Signed-off-by: Pranav Tyagi <pranav.tyagi03@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>