summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2025-05-08treewide, timers: Rename destroy_timer_on_stack() as timer_destroy_on_stack()Ingo Molnar
Move this API to the canonical timer_*() namespace. Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org
2025-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc6). No conflicts. Adjacent changes: net/core/dev.c: 08e9f2d584c4 ("net: Lock netdevices during dev_shutdown") a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-08f2fs: return bool from __write_node_folioChristoph Hellwig
__write_node_folio can only return 0 or AOP_WRITEPAGE_ACTIVATE. As part of phasing out AOP_WRITEPAGE_ACTIVATE, switch to a bool return instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: simplify return value handling in f2fs_fsync_node_pagesChristoph Hellwig
Always assign ret where the error happens, and jump to out instead of multiple loop exit conditions to prepare for changes in the __write_node_folio calling convention. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: always unlock the page in f2fs_write_single_data_pageChristoph Hellwig
Consolidate the code to unlock the page in f2fs_write_single_data_page instead of leaving it to the callers for the AOP_WRITEPAGE_ACTIVATE case. Replace AOP_WRITEPAGE_ACTIVATE with a positive return of 1 as this case now doesn't match the historic ->writepage special return code that is on it's way out now that ->writepage has been removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: remove wbc->for_reclaim handlingChristoph Hellwig
Since commits 7ff0104a8052 ("f2fs: Remove f2fs_write_node_page()") and 3b47398d9861 ("f2fs: Remove f2fs_write_meta_page()'), f2fs can't be called from reclaim context any more. Remove all code keyed of the wbc->for_reclaim flag, which is now only set for writing out swap or shmem pages inside the swap code, but never passed to file systems. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08Merge tag 'v6.15-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: - Fix UAF closing file table (e.g. in tree disconnect) - Fix potential out of bounds write - Fix potential memory leak parsing lease state in open - Fix oops in rename with empty target * tag 'v6.15-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: Fix UAF in __close_file_table_ids ksmbd: prevent out-of-bounds stream writes by validating *pos ksmbd: fix memory leak in parse_lease_state() ksmbd: prevent rename with empty string
2025-05-08f2fs: return bool from __f2fs_write_meta_folioChristoph Hellwig
__f2fs_write_meta_folio can only return 0 or AOP_WRITEPAGE_ACTIVATE. As part of phasing out AOP_WRITEPAGE_ACTIVATE, switch to a bool return instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-08f2fs: fix to return correct error number in f2fs_sync_node_pages()Chao Yu
If __write_node_folio() failed, it will return AOP_WRITEPAGE_ACTIVATE, the incorrect return value may be passed to userspace in below path, fix it. - sync_filesystem - sync_fs - f2fs_issue_checkpoint - block_operations - f2fs_sync_node_pages - __write_node_folio : return AOP_WRITEPAGE_ACTIVATE Cc: stable@vger.kernel.org Reported-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2025-05-07nilfs2: fix deadlock warnings caused by lock dependency in init_nilfs()Ryusuke Konishi
After commit c0e473a0d226 ("block: fix race between set_blocksize and read paths") was merged, set_blocksize() called by sb_set_blocksize() now locks the inode of the backing device file. As a result of this change, syzbot started reporting deadlock warnings due to a circular dependency involving the semaphore "ns_sem" of the nilfs object, the inode lock of the backing device file, and the locks that this inode lock is transitively dependent on. This is caused by a new lock dependency added by the above change, since init_nilfs() calls sb_set_blocksize() in the lock section of "ns_sem". However, these warnings are false positives because init_nilfs() is called in the early stage of the mount operation and the filesystem has not yet started. The reason why "ns_sem" is locked in init_nilfs() was to avoid a race condition in nilfs_fill_super() caused by sharing a nilfs object among multiple filesystem instances (super block structures) in the early implementation. However, nilfs objects and super block structures have long ago become one-to-one, and there is no longer any need to use the semaphore there. So, fix this issue by removing the use of the semaphore "ns_sem" in init_nilfs(). Link: https://lkml.kernel.org/r/20250503053327.12294-1-konishi.ryusuke@gmail.com Fixes: c0e473a0d226 ("block: fix race between set_blocksize and read paths") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=00f7f5b884b117ee6773 Tested-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com Reported-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f30591e72bfc24d4715b Tested-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07ocfs2: stop quota recovery before disabling quotasJan Kara
Currently quota recovery is synchronized with unmount using sb->s_umount semaphore. That is however prone to deadlocks because flush_workqueue(osb->ocfs2_wq) called from umount code can wait for quota recovery to complete while ocfs2_finish_quota_recovery() waits for sb->s_umount semaphore. Grabbing of sb->s_umount semaphore in ocfs2_finish_quota_recovery() is only needed to protect that function from disabling of quotas from ocfs2_dismount_volume(). Handle this problem by disabling quota recovery early during unmount in ocfs2_dismount_volume() instead so that we can drop acquisition of sb->s_umount from ocfs2_finish_quota_recovery(). Link: https://lkml.kernel.org/r/20250424134515.18933-6-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Shichangkuo <shi.changkuo@h3c.com> Reported-by: Murad Masimov <m.masimov@mt-integration.ru> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07ocfs2: implement handshaking with ocfs2 recovery threadJan Kara
We will need ocfs2 recovery thread to acknowledge transitions of recovery_state when disabling particular types of recovery. This is similar to what currently happens when disabling recovery completely, just more general. Implement the handshake and use it for exit from recovery. Link: https://lkml.kernel.org/r/20250424134515.18933-5-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Murad Masimov <m.masimov@mt-integration.ru> Cc: Shichangkuo <shi.changkuo@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07ocfs2: switch osb->disable_recovery to enumJan Kara
Patch series "ocfs2: Fix deadlocks in quota recovery", v3. This implements another approach to fixing quota recovery deadlocks. We avoid grabbing sb->s_umount semaphore from ocfs2_finish_quota_recovery() and instead stop quota recovery early in ocfs2_dismount_volume(). This patch (of 3): We will need more recovery states than just pure enable / disable to fix deadlocks with quota recovery. Switch osb->disable_recovery to enum. Link: https://lkml.kernel.org/r/20250424134301.1392-1-jack@suse.cz Link: https://lkml.kernel.org/r/20250424134515.18933-4-jack@suse.cz Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Tested-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Murad Masimov <m.masimov@mt-integration.ru> Cc: Shichangkuo <shi.changkuo@h3c.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07mm/userfaultfd: fix uninitialized output field for -EAGAIN racePeter Xu
While discussing some userfaultfd relevant issues recently, Andrea noticed a potential ABI breakage with -EAGAIN on almost all userfaultfd ioctl()s. Quote from Andrea, explaining how -EAGAIN was processed, and how this should fix it (taking example of UFFDIO_COPY ioctl): The "mmap_changing" and "stale pmd" conditions are already reported as -EAGAIN written in the copy field, this does not change it. This change removes the subnormal case that left copy.copy uninitialized and required apps to explicitly set the copy field to get deterministic behavior (which is a requirement contrary to the documentation in both the manpage and source code). In turn there's no alteration to backwards compatibility as result of this change because userland will find the copy field consistently set to -EAGAIN, and not anymore sometime -EAGAIN and sometime uninitialized. Even then the change only can make a difference to non cooperative users of userfaultfd, so when UFFD_FEATURE_EVENT_* is enabled, which is not true for the vast majority of apps using userfaultfd or this unintended uninitialized field may have been noticed sooner. Meanwhile, since this bug existed for years, it also almost affects all ioctl()s that was introduced later. Besides UFFDIO_ZEROPAGE, these also get affected in the same way: - UFFDIO_CONTINUE - UFFDIO_POISON - UFFDIO_MOVE This patch should have fixed all of them. Link: https://lkml.kernel.org/r/20250424215729.194656-2-peterx@redhat.com Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl") Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07ocfs2: fix panic in failed foilio allocationMark Tinguely
commit 7e119cff9d0a ("ocfs2: convert w_pages to w_folios") and commit 9a5e08652dc4b ("ocfs2: use an array of folios instead of an array of pages") save -ENOMEM in the folio array upon allocation failure and call the folio array free code. The folio array free code expects either valid folio pointers or NULL. Finding the -ENOMEM will result in a panic. Fix by NULLing the error folio entry. Link: https://lkml.kernel.org/r/c879a52b-835c-4fa0-902b-8b2e9196dcbd@oracle.com Fixes: 7e119cff9d0a ("ocfs2: convert w_pages to w_folios") Fixes: 9a5e08652dc4b ("ocfs2: use an array of folios instead of an array of pages") Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07ocfs2: fix the issue with discontiguous allocation in the global_bitmapHeming Zhao
commit 4eb7b93e0310 ("ocfs2: improve write IO performance when fragmentation is high") introduced another regression. The following ocfs2-test case can trigger this issue: > discontig_runner.sh => activate_discontig_bg.sh => resv_unwritten: > ${RESV_UNWRITTEN_BIN} -f ${WORK_PLACE}/large_testfile -s 0 -l \ > $((${FILE_MAJOR_SIZE_M}*1024*1024)) In my env, test disk size (by "fdisk -l <dev>"): > 53687091200 bytes, 104857600 sectors. Above command is: > /usr/local/ocfs2-test/bin/resv_unwritten -f \ > /mnt/ocfs2/ocfs2-activate-discontig-bg-dir/large_testfile -s 0 -l \ > 53187969024 Error log: > [*] Reserve 50724M space for a LARGE file, reserve 200M space for future test. > ioctl error 28: "No space left on device" > resv allocation failed Unknown error -1 > reserve unwritten region from 0 to 53187969024. Call flow: __ocfs2_change_file_space //by ioctl OCFS2_IOC_RESVSP64 ocfs2_allocate_unwritten_extents //start:0 len:53187969024 while() + ocfs2_get_clusters //cpos:0, alloc_size:1623168 (cluster number) + ocfs2_extend_allocation + ocfs2_lock_allocators | + choose OCFS2_AC_USE_MAIN & ocfs2_cluster_group_search | + ocfs2_add_inode_data ocfs2_add_clusters_in_btree __ocfs2_claim_clusters ocfs2_claim_suballoc_bits + During the allocation of the final part of the large file (after ~47GB), no chain had the required contiguous bits_wanted. Consequently, the allocation failed. How to fix: When OCFS2 is encountering fragmented allocation, the file system should stop attempting bits_wanted contiguous allocation and instead provide the largest available contiguous free bits from the cluster groups. Link: https://lkml.kernel.org/r/20250414060125.19938-2-heming.zhao@suse.com Fixes: 4eb7b93e0310 ("ocfs2: improve write IO performance when fragmentation is high") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reported-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-07xfs: allow sysadmins to specify a maximum atomic write limit at mount timeDarrick J. Wong
Introduce a mount option to allow sysadmins to specify the maximum size of an atomic write. If the filesystem can work with the supplied value, that becomes the new guaranteed maximum. The value mustn't be too big for the existing filesystem geometry (max write size, max AG/rtgroup size). We dynamically recompute the tr_atomic_write transaction reservation based on the given block size, check that the current log size isn't less than the new minimum log size constraints, and set a new maximum. The actual software atomic write max is still computed based off of tr_atomic_ioend the same way it has for the past few commits. Note also that xfs_calc_atomic_write_log_geometry is non-static because mkfs will need that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: update atomic write limitsJohn Garry
Update the limits returned from xfs_get_atomic_write_{min, max, max_opt)(). No reflink support always means no CoW-based atomic writes. For updating xfs_get_atomic_write_min(), we support blocksize only and that depends on HW or reflink support. For updating xfs_get_atomic_write_max(), for no reflink, we are limited to blocksize but only if HW support. Otherwise we are limited to combined limit in mp->m_atomic_write_unit_max. For updating xfs_get_atomic_write_max_opt(), ultimately we are limited by the bdev atomic write limit. If xfs_get_atomic_write_max() does not report > 1x blocksize, then just continue to report 0 as before. Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: update comments in the helper functions] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add xfs_calc_atomic_write_unit_max()John Garry
Now that CoW-based atomic writes are supported, update the max size of an atomic write for the data device. The limit of a CoW-based atomic write will be the limit of the number of logitems which can fit into a single transaction. In addition, the max atomic write size needs to be aligned to the agsize. Limit the size of atomic writes to the greatest power-of-two factor of the agsize so that allocations for an atomic write will always be aligned compatibly with the alignment requirements of the storage. Function xfs_atomic_write_logitems() is added to find the limit the number of log items which can fit in a single transaction. Amend the max atomic write computation to create a new transaction reservation type, and compute the maximum size of an atomic write completion (in fsblocks) based on this new transaction reservation. Initially, tr_atomic_write is a clone of tr_itruncate, which provides a reasonable level of parallelism. In the next patch, we'll add a mount option so that sysadmins can configure their own limits. [djwong: use a new reservation type for atomic write ioends, refactor group limit calculations] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> [jpg: rounddown power-of-2 always] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add xfs_file_dio_write_atomic()John Garry
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes. Now HW offload will not be required to support atomic writes and is an optional feature. CoW-based atomic writes will be supported with out-of-places write and atomic extent remapping. Either mode of operation may be used for an atomic write, depending on the circumstances. The preferred method is HW offload as it will be faster. If HW offload is not available then we always use the CoW-based method. If HW offload is available but not possible to use, then again we use the CoW-based method. If available, HW offload would not be possible for the write length exceeding the HW offload limit, the write spanning multiple extents, unaligned disk blocks, etc. Apart from the write exceeding the HW offload limit, other conditions for HW offload usage can only be detected in the iomap handling for the write. As such, we use a fallback method to issue the write if we detect in the ->iomap_begin() handler that HW offload is not possible. Special code -ENOPROTOOPT is returned from ->iomap_begin() to inform that HW offload is not possible. In other words, atomic writes are supported on any filesystem that can perform out of place write remapping atomically (i.e. reflink) up to some fairly large size. If the conditions are right (a single correctly aligned overwrite mapping) then the filesystem will use any available hardware support to avoid the filesystem metadata updates. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: commit CoW-based atomic writes atomicallyJohn Garry
When completing a CoW-based write, each extent range mapping update is covered by a separate transaction. For a CoW-based atomic write, all mappings must be changed at once, so change to use a single transaction. Note that there is a limit on the amount of log intent items which can be fit into a single transaction, but this is being ignored for now since the count of items for a typical atomic write would be much less than is typically supported. A typical atomic write would be expected to be 64KB or less, which means only 16 possible extents unmaps, which is quite small. Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: add tr_atomic_ioend] Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add large atomic writes checks in xfs_direct_write_iomap_begin()John Garry
For when large atomic writes (> 1x FS block) are supported, there will be various occasions when HW offload may not be possible. Such instances include: - unaligned extent mapping wrt write length - extent mappings which do not cover the full write, e.g. the write spans sparse or mixed-mapping extents - the write length is greater than HW offload can support - no hardware support at all In those cases, we need to fallback to the CoW-based atomic write mode. For this, report special code -ENOPROTOOPT to inform the caller that HW offload-based method is not possible. In addition to the occasions mentioned, if the write covers an unallocated range, we again judge that we need to rely on the CoW-based method when we would need to allocate anything more than 1x block. This is because if we allocate less blocks that is required for the write, then again HW offload-based method would not be possible. So we are taking a pessimistic approach to writes covering unallocated space. Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: various cleanups] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add xfs_atomic_write_cow_iomap_begin()John Garry
For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork support. Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create staging mappings in the CoW fork for atomic write updates. The general steps in the function are as follows: - find extent mapping in the CoW fork for the FS block range being written - if part or full extent is found, proceed to process found extent - if no extent found, map in new blocks to the CoW fork - convert unwritten blocks in extent if required - update iomap extent mapping and return The bulk of this function is quite similar to the processing in xfs_reflink_allocate_cow(), where we try to find an extent mapping; if none exists, then allocate a new extent in the CoW fork, convert unwritten blocks, and return a mapping. Performance testing has shown the XFS_ILOCK_EXCL locking to be quite a bottleneck, so this is an area which could be optimised in future. Christoph Hellwig contributed almost all of the code in xfs_atomic_write_cow_iomap_begin(). Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: add a new xfs_can_sw_atomic_write to convey intent better] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: refine atomic write size check in xfs_file_write_iter()John Garry
Currently the size of atomic write allowed is fixed at the blocksize. To start to lift this restriction, partly refactor xfs_report_atomic_write() to into helpers - xfs_get_atomic_write_{min, max}() - and use those helpers to find the per-inode atomic write limits and check according to that. Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and just return 0 since large atomics aren't supported yet. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: refactor xfs_reflink_end_cow_extent()John Garry
Refactor xfs_reflink_end_cow_extent() into separate parts which process the CoW range and commit the transaction. This refactoring will be used in future for when it is required to commit a range of extents as a single transaction, similar to how it was done pre-commit d6f215f359637. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: allow block allocator to take an alignment hintJohn Garry
Add a BMAPI flag to provide a hint to the block allocator to align extents according to the extszhint. This will be useful for atomic writes to ensure that we are not being allocated extents which are not suitable (for atomic writes). Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: ignore HW which cannot atomic write a single blockDarrick J. Wong
Currently only HW which can write at least 1x block is supported. For supporting atomic writes > 1x block, a CoW-based method will also be used and this will not be resticted to using HW which can write >= 1x block. However for deciding if HW-based atomic writes can be used, we need to start adding checks for write length < HW min, which complicates the code. Indeed, a statx field similar to unit_max_opt should also be added for this minimum, which is undesirable. HW which can only write > 1x blocks would be uncommon and quite weird, so let's just not support it. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07xfs: add helpers to compute transaction reservation for finishing intent itemsDarrick J. Wong
In the transaction reservation code, hoist the logic that computes the reservation needed to finish one log intent item into separate helper functions. These will be used in subsequent patches to estimate the number of blocks that an online repair can commit to reaping in the same transaction as the change committing the new data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: add helpers to compute log item overheadDarrick J. Wong
Add selected helpers to estimate the transaction reservation required to write various log intent and buffer items to the log. These helpers will be used by the online repair code for more precise estimations of how much work can be done in a single transaction. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: separate out setting buftarg atomic writes limitsDarrick J. Wong
Separate out setting buftarg atomic writes limits into a dedicated function, xfs_configure_buftarg_atomic_writes(), to keep the specific functionality self-contained. For naming consistency, rename xfs_setsize_buftarg() -> xfs_configure_buftarg(). Signed-off-by: Darrick J. Wong <djwong@kernel.org> [jpg: separate out from patch "xfs: ignore HW which ..."] Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07xfs: rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomic_write()John Garry
In future we will want to be able to check if specifically HW offload-based atomic writes are possible, so rename xfs_inode_can_atomicwrite() -> xfs_inode_can_hw_atomicwrite(). Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> [djwong: add an underscore to be consistent with everything else] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07xfs: only call xfs_setsize_buftarg once per buffer targetDarrick J. Wong
It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the block device LBA size because we don't need to ask the block layer to validate a geometry number that it provided us. Instead, set the preliminary bt_meta_sector* fields to the LBA size in preparation for reading the primary super. However, we still want to flush and invalidate the pagecache for all three block devices before we start reading metadata from those devices, so call sync_blockdev() per bdev in xfs_alloc_buftarg(). This will enable a subsequent patch to validate hw atomic write geometry against the filesystem geometry. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> [jpg: call sync_blockdev() from xfs_alloc_buftarg()] Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-07fs: add atomic write unit max opt to statxJohn Garry
XFS will be able to support large atomic writes (atomic write > 1x block) in future. This will be achieved by using different operating methods, depending on the size of the write. Specifically a new method of operation based in FS atomic extent remapping will be supported in addition to the current HW offload-based method. The FS method will generally be appreciably slower performing than the HW-offload method. However the FS method will be typically able to contribute to achieving a larger atomic write unit max limit. XFS will support a hybrid mode, where HW offload method will be used when possible, i.e. HW offload is used when the length of the write is supported, and for other times FS-based atomic writes will be used. As such, there is an atomic write length at which the user may experience appreciably slower performance. Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt. When zero, it means that there is no such performance boundary. Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is ok for older kernels which don't support this new field, as they would report 0 in this field (from zeroing in cp_statx()) already. Furthermore those older kernels don't support large atomic writes - apart from block fops, but there would be consistent performance there for atomic writes in range [unit min, unit max]. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: John Garry <john.g.garry@oracle.com>
2025-05-07bcachefs: Don't aggressively discard the journalKent Overstreet
We frequently use 'bcachefs list_journal -a' for debugging, as it provides a record of all btree transactions, and a history of what happened. But it's not so useful if we immediately discard journal buckets right after they're no longer dirty. This tweaks journal reclaim to only discard when we're low on space, keeping the journal mostly un-discarded. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07bcachefs: Ensure superblock gets written when we go EROKent Overstreet
When we go emergency read-only, make sure we do a final write_super() to persist counters and error counts - this can be critical for piecing together what fsck was doing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07bcachefs: Filter out harmless EROFS error messagesKent Overstreet
These just indicate that we're shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07bcachefs: journal_shutdown is EROFS, not EIOKent Overstreet
We often filter out EROFS errors to avoid log spew after an emergency shutdown - journal_shutdown is just another emergency shutdown error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2025-05-07smb: client: Avoid race in open_cached_dir with lease breaksPaul Aurich
A pre-existing valid cfid returned from find_or_create_cached_dir might race with a lease break, meaning open_cached_dir doesn't consider it valid, and thinks it's newly-constructed. This leaks a dentry reference if the allocation occurs before the queued lease break work runs. Avoid the race by extending holding the cfid_list_lock across find_or_create_cached_dir and when the result is checked. Cc: stable@vger.kernel.org Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com> Signed-off-by: Paul Aurich <paul@darkrain42.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2025-05-07Merge tag 'erofs-for-6.15-rc6-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: - Add a new reviewer, Hongbo Li, for better community development - Fix an I/O hang out of file-backed mounts - Address a rare data corruption caused by concurrent I/Os on the same deduplicated compressed data - Minor cleanup * tag 'erofs-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: ensure the extra temporary copy is valid for shortened bvecs erofs: remove unused enum type fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio() MAINTAINERS: erofs: add myself as reviewer
2025-05-07fs: aio: initialize .ki_write_stream of read-write requestMing Lei
AIO needs to initialize .ki_write_stream explicitly for read/write request, otherwise random .ki_write_stream is used, and cause -EINVAL returned for aio write randomly. Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Cc: Kanchan Joshi <joshi.k@samsung.com> Fixes: c27683da6406 ("block: expose write streams for block device nodes") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250507133328.3040255-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07hfsplus: use bdev_rw_virt in hfsplus_submit_bioChristoph Hellwig
Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07btrfs: use bdev_rw_virt in scrub_one_superChristoph Hellwig
Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07xfs: simplify building the bio in xlog_write_iclogChristoph Hellwig
Use the bio_add_virt_nofail and bio_add_vmalloc helpers to abstract away the details of the memory allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07xfs: simplify xfs_rw_bdevChristoph Hellwig
Delegate to bdev_rw_virt when operating on non-vmalloc memory and use bio_add_vmalloc_chunk to insulate xfs from the details of adding vmalloc memory to a bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07xfs: simplify xfs_buf_submit_bioChristoph Hellwig
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it and use bio_add_vmalloc to insulate xfs from the details of adding vmalloc memory to a bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07zonefs: use bdev_rw_virt in zonefs_read_superChristoph Hellwig
Switch zonefs_read_super to allocate the superblock buffer using kmalloc which falls back to the page allocator for PAGE_SIZE allocation but gives us a kernel virtual address and then use bdev_rw_virt to perform the synchronous read into it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07gfs2: use bdev_rw_virt in gfs2_read_superChristoph Hellwig
Switch gfs2_read_super to allocate the superblock buffer using kmalloc which falls back to the page allocator for PAGE_SIZE allocation but gives us a kernel virtual address and then use bdev_rw_virt to perform the synchronous read into it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07udf: Make sure i_lenExtents is uptodate on inode evictionJan Kara
UDF maintains total length of all extents in i_lenExtents. Generally we keep extent lengths (and thus i_lenExtents) block aligned because it makes the file appending logic simpler. However the standard mandates that the inode size must match the length of all extents and thus we trim the last extent when closing the file. To catch possible bugs we also verify that i_lenExtents matches i_size when evicting inode from memory. Commit b405c1e58b73 ("udf: refactor udf_next_aext() to handle error") however broke the code updating i_lenExtents and thus udf_evict_inode() ended up spewing lots of errors about incorrectly sized extents although the extents were actually sized properly. Fix the updating of i_lenExtents to silence the errors. Fixes: b405c1e58b73 ("udf: refactor udf_next_aext() to handle error") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
2025-05-07erofs: ensure the extra temporary copy is valid for shortened bvecsGao Xiang
When compressed data deduplication is enabled, multiple logical extents may reference the same compressed physical cluster. The previous commit 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents") already avoids using shortened bvecs. However, in such cases, the extra temporary buffers also need to be preserved for later use in z_erofs_fill_other_copies() to to prevent data corruption. IOWs, extra temporary buffers have to be retained not only due to varying start relative offsets (`pageofs_out`, as indicated by `pcl->multibases`) but also because of shortened bvecs. android.hardware.graphics.composer@2.1.so : 270696 bytes 0: 0.. 204185 | 204185 : 628019200.. 628084736 | 65536 -> 1: 204185.. 225536 | 21351 : 544063488.. 544129024 | 65536 2: 225536.. 270696 | 45160 : 0.. 0 | 0 com.android.vndk.v28.apex : 93814897 bytes ... 364: 53869896..54095257 | 225361 : 543997952.. 544063488 | 65536 -> 365: 54095257..54309344 | 214087 : 544063488.. 544129024 | 65536 366: 54309344..54514557 | 205213 : 544129024.. 544194560 | 65536 ... Both 204185 and 54095257 have the same start relative offset of 3481, but the logical page 55 of `android.hardware.graphics.composer@2.1.so` ranges from 225280 to 229632, forming a shortened bvec [225280, 225536) that cannot be used for decompressing the range from 54095257 to 54309344 of `com.android.vndk.v28.apex`. Since `pcl->multibases` is already meaningless, just mark `be->keepxcpy` on demand for simplicity. Again, this issue can only lead to data corruption if `-Ededupe` is on. Fixes: 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents") Reviewed-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250506101850.191506-1-hsiangkao@linux.alibaba.com
2025-05-06kill vfs_submount()Al Viro
The last remaining user of vfs_submount() (tracefs) is easy to convert to fs_context_for_submount(); do that and bury that thing, along with SB_SUBMOUNT Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>