Age | Commit message | Author |
|
Move this API to the canonical timer_*() namespace.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250507175338.672442-10-mingo@kernel.org
|
|
Cross-merge networking fixes after downstream PR (net-6.15-rc6).
No conflicts.
Adjacent changes:
net/core/dev.c:
08e9f2d584c4 ("net: Lock netdevices during dev_shutdown")
a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
__write_node_folio can only return 0 or AOP_WRITEPAGE_ACTIVATE.
As part of phasing out AOP_WRITEPAGE_ACTIVATE, switch to a bool return
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Always assign ret where the error happens, and jump to out instead
of multiple loop exit conditions to prepare for changes in the
__write_node_folio calling convention.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Consolidate the code to unlock the page in f2fs_write_single_data_page
instead of leaving it to the callers for the AOP_WRITEPAGE_ACTIVATE case.
Replace AOP_WRITEPAGE_ACTIVATE with a positive return value of 1, as this case
no longer matches the historic ->writepage special return code, which is
on its way out now that ->writepage has been removed.
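As a rough illustration of the new convention (not the actual f2fs code; the
helper name below is a placeholder), a caller would treat 0 as "written",
1 as "skipped, folio kept" and a negative value as an error:
  /* Placeholder stand-in for the write-out helper described above:
   * 0 = folio written, 1 = skipped and kept (old AOP_WRITEPAGE_ACTIVATE),
   * <0 = error. */
  int write_one_folio_sketch(int idx);

  /* Hedged sketch of a writeback loop consuming that convention. */
  static int writeback_loop_sketch(int nr, int *skipped)
  {
          int i;

          for (i = 0; i < nr; i++) {
                  int ret = write_one_folio_sketch(i);

                  if (ret < 0)
                          return ret;     /* hard error: stop */
                  if (ret == 1)
                          (*skipped)++;   /* folio left dirty for later */
          }
          return 0;
  }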
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Since commits 7ff0104a8052 ("f2fs: Remove f2fs_write_node_page()") and
3b47398d9861 ("f2fs: Remove f2fs_write_meta_page()"), f2fs can't be
called from reclaim context any more. Remove all code keyed off the
wbc->for_reclaim flag, which is now only set for writing out swap or
shmem pages inside the swap code, but never passed to file systems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Pull smb server fixes from Steve French:
- Fix UAF closing file table (e.g. in tree disconnect)
- Fix potential out of bounds write
- Fix potential memory leak parsing lease state in open
- Fix oops in rename with empty target
* tag 'v6.15-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: Fix UAF in __close_file_table_ids
ksmbd: prevent out-of-bounds stream writes by validating *pos
ksmbd: fix memory leak in parse_lease_state()
ksmbd: prevent rename with empty string
|
|
__f2fs_write_meta_folio can only return 0 or AOP_WRITEPAGE_ACTIVATE.
As part of phasing out AOP_WRITEPAGE_ACTIVATE, switch to a bool return
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If __write_node_folio() fails, it returns AOP_WRITEPAGE_ACTIVATE; this
incorrect return value may be passed to userspace via the path below.
Fix it.
- sync_filesystem
- sync_fs
- f2fs_issue_checkpoint
- block_operations
- f2fs_sync_node_pages
- __write_node_folio
: return AOP_WRITEPAGE_ACTIVATE
Cc: stable@vger.kernel.org
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
After commit c0e473a0d226 ("block: fix race between set_blocksize and read
paths") was merged, set_blocksize() called by sb_set_blocksize() now locks
the inode of the backing device file. As a result of this change, syzbot
started reporting deadlock warnings due to a circular dependency involving
the semaphore "ns_sem" of the nilfs object, the inode lock of the backing
device file, and the locks that this inode lock is transitively dependent
on.
This is caused by a new lock dependency added by the above change, since
init_nilfs() calls sb_set_blocksize() in the lock section of "ns_sem".
However, these warnings are false positives because init_nilfs() is called
in the early stage of the mount operation and the filesystem has not yet
started.
The reason why "ns_sem" is locked in init_nilfs() was to avoid a race
condition in nilfs_fill_super() caused by sharing a nilfs object among
multiple filesystem instances (super block structures) in the early
implementation. However, nilfs objects and super block structures have
long ago become one-to-one, and there is no longer any need to use the
semaphore there.
So, fix this issue by removing the use of the semaphore "ns_sem" in
init_nilfs().
Link: https://lkml.kernel.org/r/20250503053327.12294-1-konishi.ryusuke@gmail.com
Fixes: c0e473a0d226 ("block: fix race between set_blocksize and read paths")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=00f7f5b884b117ee6773
Tested-by: syzbot+00f7f5b884b117ee6773@syzkaller.appspotmail.com
Reported-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f30591e72bfc24d4715b
Tested-by: syzbot+f30591e72bfc24d4715b@syzkaller.appspotmail.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently quota recovery is synchronized with unmount using sb->s_umount
semaphore. That is however prone to deadlocks because
flush_workqueue(osb->ocfs2_wq) called from umount code can wait for quota
recovery to complete while ocfs2_finish_quota_recovery() waits for
sb->s_umount semaphore.
Grabbing of sb->s_umount semaphore in ocfs2_finish_quota_recovery() is
only needed to protect that function from disabling of quotas from
ocfs2_dismount_volume(). Handle this problem by disabling quota recovery
early during unmount in ocfs2_dismount_volume() instead so that we can
drop acquisition of sb->s_umount from ocfs2_finish_quota_recovery().
Link: https://lkml.kernel.org/r/20250424134515.18933-6-jack@suse.cz
Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection")
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Shichangkuo <shi.changkuo@h3c.com>
Reported-by: Murad Masimov <m.masimov@mt-integration.ru>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Tested-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
We will need the ocfs2 recovery thread to acknowledge transitions of
recovery_state when disabling particular types of recovery. This is
similar to what currently happens when disabling recovery completely, just
more general. Implement the handshake and use it for exit from recovery.
Link: https://lkml.kernel.org/r/20250424134515.18933-5-jack@suse.cz
Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Tested-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Murad Masimov <m.masimov@mt-integration.ru>
Cc: Shichangkuo <shi.changkuo@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "ocfs2: Fix deadlocks in quota recovery", v3.
This implements another approach to fixing quota recovery deadlocks. We
avoid grabbing sb->s_umount semaphore from ocfs2_finish_quota_recovery()
and instead stop quota recovery early in ocfs2_dismount_volume().
This patch (of 3):
We will need more recovery states than just pure enable / disable to fix
deadlocks with quota recovery. Switch osb->disable_recovery to an enum.
Link: https://lkml.kernel.org/r/20250424134301.1392-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20250424134515.18933-4-jack@suse.cz
Fixes: 5f530de63cfc ("ocfs2: Use s_umount for quota recovery protection")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Tested-by: Heming Zhao <heming.zhao@suse.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Murad Masimov <m.masimov@mt-integration.ru>
Cc: Shichangkuo <shi.changkuo@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
While discussing some userfaultfd relevant issues recently, Andrea noticed
a potential ABI breakage with -EAGAIN on almost all userfaultfd ioctl()s.
Quote from Andrea, explaining how -EAGAIN was processed, and how this
should fix it (taking the UFFDIO_COPY ioctl as an example):
The "mmap_changing" and "stale pmd" conditions are already reported as
-EAGAIN written in the copy field, this does not change it. This change
removes the subnormal case that left copy.copy uninitialized and required
apps to explicitly set the copy field to get deterministic
behavior (which is a requirement contrary to the documentation in both
the manpage and source code). In turn there's no alteration to backwards
compatibility as a result of this change, because userland will find the
copy field consistently set to -EAGAIN, and no longer sometimes -EAGAIN
and sometimes uninitialized.
Even then the change can only make a difference to non-cooperative users
of userfaultfd, i.e. when UFFD_FEATURE_EVENT_* is enabled, which is not
true for the vast majority of apps using userfaultfd, or this unintended
uninitialized field may have been noticed sooner.
Meanwhile, since this bug has existed for years, it also affects almost
all ioctl()s that were introduced later. Besides UFFDIO_ZEROPAGE, these
are also affected in the same way:
- UFFDIO_CONTINUE
- UFFDIO_POISON
- UFFDIO_MOVE
This patch should have fixed all of them.
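For illustration, a minimal userspace sketch of how a caller is expected to
consume the copy field after this change; setup of the userfaultfd descriptor
via UFFDIO_API/UFFDIO_REGISTER is omitted, and the retry policy is just one
reasonable choice, not part of the patch itself:
  #include <errno.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/userfaultfd.h>

  /* Resolve one fault with UFFDIO_COPY; 'uffd' is an already registered
   * userfaultfd file descriptor. */
  static long uffd_copy_retry(int uffd, void *dst, void *src, unsigned long len)
  {
          struct uffdio_copy copy;

          for (;;) {
                  copy.dst = (uintptr_t)dst;
                  copy.src = (uintptr_t)src;
                  copy.len = len;
                  copy.mode = 0;
                  copy.copy = 0;  /* no longer needed for determinism, kept for clarity */

                  if (ioctl(uffd, UFFDIO_COPY, &copy) == 0)
                          return copy.copy;       /* bytes copied */
                  if (copy.copy == -EAGAIN)
                          continue;               /* mmap_changing/stale-pmd race: retry
                                                     (a real caller may consult events) */
                  return -errno;                  /* genuine failure */
          }
  }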
Link: https://lkml.kernel.org/r/20250424215729.194656-2-peterx@redhat.com
Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl")
Fixes: fc71884a5f59 ("mm: userfaultfd: add new UFFDIO_POISON ioctl")
Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit 7e119cff9d0a ("ocfs2: convert w_pages to w_folios") and commit
9a5e08652dc4b ("ocfs2: use an array of folios instead of an array of
pages") save -ENOMEM in the folio array upon allocation failure and call
the folio array free code.
The folio array free code expects either valid folio pointers or NULL.
Finding the -ENOMEM will result in a panic. Fix by NULLing the error
folio entry.
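The contract being violated can be sketched in plain C (illustrative only,
not the ocfs2 code; the names are placeholders): the free loop tolerates
NULL entries but not an error sentinel left behind by a failed allocation.
  #include <stdlib.h>

  static int fill_slots(void **slots, int n)
  {
          for (int i = 0; i < n; i++)
                  slots[i] = NULL;        /* free_slots() must only see valid pointers or NULL */
          for (int i = 0; i < n; i++) {
                  slots[i] = malloc(4096);
                  if (!slots[i])
                          return -1;      /* failed slot stays NULL, not an error code */
          }
          return 0;
  }

  static void free_slots(void **slots, int n)
  {
          for (int i = 0; i < n; i++)
                  free(slots[i]);         /* free(NULL) is a no-op; a sentinel would crash */
  }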
Link: https://lkml.kernel.org/r/c879a52b-835c-4fa0-902b-8b2e9196dcbd@oracle.com
Fixes: 7e119cff9d0a ("ocfs2: convert w_pages to w_folios")
Fixes: 9a5e08652dc4b ("ocfs2: use an array of folios instead of an array of pages")
Signed-off-by: Mark Tinguely <mark.tinguely@oracle.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit 4eb7b93e0310 ("ocfs2: improve write IO performance when
fragmentation is high") introduced another regression.
The following ocfs2-test case can trigger this issue:
> discontig_runner.sh => activate_discontig_bg.sh => resv_unwritten:
> ${RESV_UNWRITTEN_BIN} -f ${WORK_PLACE}/large_testfile -s 0 -l \
> $((${FILE_MAJOR_SIZE_M}*1024*1024))
In my env, test disk size (by "fdisk -l <dev>"):
> 53687091200 bytes, 104857600 sectors.
Above command is:
> /usr/local/ocfs2-test/bin/resv_unwritten -f \
> /mnt/ocfs2/ocfs2-activate-discontig-bg-dir/large_testfile -s 0 -l \
> 53187969024
Error log:
> [*] Reserve 50724M space for a LARGE file, reserve 200M space for future test.
> ioctl error 28: "No space left on device"
> resv allocation failed Unknown error -1
> reserve unwritten region from 0 to 53187969024.
Call flow:
__ocfs2_change_file_space //by ioctl OCFS2_IOC_RESVSP64
ocfs2_allocate_unwritten_extents //start:0 len:53187969024
while()
+ ocfs2_get_clusters //cpos:0, alloc_size:1623168 (cluster number)
+ ocfs2_extend_allocation
+ ocfs2_lock_allocators
| + choose OCFS2_AC_USE_MAIN & ocfs2_cluster_group_search
|
+ ocfs2_add_inode_data
ocfs2_add_clusters_in_btree
__ocfs2_claim_clusters
ocfs2_claim_suballoc_bits
+ During the allocation of the final part of the large file
(after ~47GB), no chain had the required contiguous
bits_wanted. Consequently, the allocation failed.
How to fix:
When OCFS2 encounters fragmented allocation, the file system should
stop attempting a bits_wanted contiguous allocation and instead provide the
largest available contiguous run of free bits from the cluster groups.
Link: https://lkml.kernel.org/r/20250414060125.19938-2-heming.zhao@suse.com
Fixes: 4eb7b93e0310 ("ocfs2: improve write IO performance when fragmentation is high")
Signed-off-by: Heming Zhao <heming.zhao@suse.com>
Reported-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Introduce a mount option to allow sysadmins to specify the maximum size
of an atomic write. If the filesystem can work with the supplied value,
that becomes the new guaranteed maximum.
The value mustn't be too big for the existing filesystem geometry (max
write size, max AG/rtgroup size). We dynamically recompute the
tr_atomic_write transaction reservation based on the given block size,
check that the current log size isn't less than the new minimum log size
constraints, and set a new maximum.
The actual software atomic write max is still computed based off of
tr_atomic_ioend, the same way it has been for the past few commits. Note also
that xfs_calc_atomic_write_log_geometry is non-static because mkfs will
need it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
|
|
Update the limits returned from xfs_get_atomic_write_{min, max, max_opt}().
No reflink support always means no CoW-based atomic writes.
For updating xfs_get_atomic_write_min(), we support blocksize only, and that
depends on HW or reflink support.
For updating xfs_get_atomic_write_max(), without reflink we are limited to
blocksize, and only if there is HW support. Otherwise we are limited to the
combined limit in mp->m_atomic_write_unit_max.
For updating xfs_get_atomic_write_max_opt(), ultimately we are limited by
the bdev atomic write limit. If xfs_get_atomic_write_max() does not report
> 1x blocksize, then just continue to report 0 as before.
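The decision table above can be sketched as stand-alone C; the boolean inputs
and the combined_max parameter are placeholders for the real XFS state, not
actual fields:
  #include <stdbool.h>
  #include <stdint.h>

  /* Minimum atomic write size: block size only, and only if HW or reflink
   * can back it. */
  static uint32_t atomic_write_min(bool has_hw, bool has_reflink, uint32_t blocksize)
  {
          return (has_hw || has_reflink) ? blocksize : 0;
  }

  /* Maximum atomic write size: without reflink there are no CoW-based
   * atomic writes, so we are limited to one block and only with HW support;
   * otherwise the combined limit (mp->m_atomic_write_unit_max analogue) applies. */
  static uint32_t atomic_write_max(bool has_hw, bool has_reflink,
                                   uint32_t blocksize, uint32_t combined_max)
  {
          if (!has_reflink)
                  return has_hw ? blocksize : 0;
          return combined_max;
  }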
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: update comments in the helper functions]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Now that CoW-based atomic writes are supported, update the max size of an
atomic write for the data device.
The limit of a CoW-based atomic write will be the limit of the number of
logitems which can fit into a single transaction.
In addition, the max atomic write size needs to be aligned to the agsize.
Limit the size of atomic writes to the greatest power-of-two factor of the
agsize so that allocations for an atomic write will always be aligned
compatibly with the alignment requirements of the storage.
Function xfs_atomic_write_logitems() is added to find the limit on the
number of log items which can fit in a single transaction.
Amend the max atomic write computation to create a new transaction
reservation type, and compute the maximum size of an atomic write
completion (in fsblocks) based on this new transaction reservation.
Initially, tr_atomic_write is a clone of tr_itruncate, which provides a
reasonable level of parallelism. In the next patch, we'll add a mount
option so that sysadmins can configure their own limits.
[djwong: use a new reservation type for atomic write ioends, refactor
group limit calculations]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: rounddown power-of-2 always]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
Now HW offload will not be required to support atomic writes and is
an optional feature.
CoW-based atomic writes will be supported with out-of-place writes and
atomic extent remapping.
Either mode of operation may be used for an atomic write, depending on the
circumstances.
The preferred method is HW offload as it will be faster. If HW offload is
not available then we always use the CoW-based method. If HW offload is
available but not possible to use, then again we use the CoW-based method.
Even when available, HW offload is not possible when the write length
exceeds the HW offload limit, the write spans multiple extents, disk
blocks are unaligned, etc.
Apart from the write exceeding the HW offload limit, other conditions for
HW offload usage can only be detected in the iomap handling for the write.
As such, we use a fallback method to issue the write if we detect in the
->iomap_begin() handler that HW offload is not possible. The special code
-ENOPROTOOPT is returned from ->iomap_begin() to signal that HW offload is
not possible.
In other words, atomic writes are supported on any filesystem that can
perform out of place write remapping atomically (i.e. reflink) up to
some fairly large size. If the conditions are right (a single correctly
aligned overwrite mapping) then the filesystem will use any available
hardware support to avoid the filesystem metadata updates.
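A rough stand-alone sketch of that fallback policy; the two submission
helpers are hypothetical placeholders, not the real XFS functions:
  #include <errno.h>
  #include <stddef.h>
  #include <sys/types.h>

  /* Hypothetical submission paths standing in for the HW offload and
   * CoW-based atomic write methods described above. */
  ssize_t submit_hw_atomic_write(const void *buf, size_t len, off_t pos);
  ssize_t submit_cow_atomic_write(const void *buf, size_t len, off_t pos);

  /* Prefer HW offload; when the mapping step reports -ENOPROTOOPT,
   * retry the whole write with the CoW-based method. */
  static ssize_t atomic_write_sketch(const void *buf, size_t len, off_t pos)
  {
          ssize_t ret = submit_hw_atomic_write(buf, len, pos);

          if (ret == -ENOPROTOOPT)
                  ret = submit_cow_atomic_write(buf, len, pos);
          return ret;
  }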
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
When completing a CoW-based write, each extent range mapping update is
covered by a separate transaction.
For a CoW-based atomic write, all mappings must be changed at once, so
change to use a single transaction.
Note that there is a limit on the number of log intent items which can
fit into a single transaction, but this is being ignored for now since
the count of items for a typical atomic write would be much less than is
typically supported. A typical atomic write would be expected to be 64KB
or less, which (with a 4KB block size) means at most 16 extent unmaps -
quite small.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add tr_atomic_ioend]
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Once large atomic writes (> 1x FS block) are supported, there will be
various occasions when HW offload may not be possible.
Such instances include:
- unaligned extent mapping wrt write length
- extent mappings which do not cover the full write, e.g. the write spans
sparse or mixed-mapping extents
- the write length is greater than HW offload can support
- no hardware support at all
In those cases, we need to fall back to the CoW-based atomic write mode. For
this, report the special code -ENOPROTOOPT to inform the caller that the HW
offload-based method is not possible.
In addition to the occasions mentioned, if the write covers an unallocated
range, we again judge that we need to rely on the CoW-based method when we
would need to allocate anything more than 1x block. This is because if we
allocate fewer blocks than are required for the write, then again the HW
offload-based method would not be possible. So we are taking a pessimistic
approach to writes covering unallocated space.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: various cleanups]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
For CoW-based atomic writes, reuse the infrastructure for reflink CoW fork
support.
Add ->iomap_begin() callback xfs_atomic_write_cow_iomap_begin() to create
staging mappings in the CoW fork for atomic write updates.
The general steps in the function are as follows:
- find extent mapping in the CoW fork for the FS block range being written
- if part or full extent is found, proceed to process found extent
- if no extent found, map in new blocks to the CoW fork
- convert unwritten blocks in extent if required
- update iomap extent mapping and return
The bulk of this function is quite similar to the processing in
xfs_reflink_allocate_cow(), where we try to find an extent mapping; if
none exists, then allocate a new extent in the CoW fork, convert unwritten
blocks, and return a mapping.
Performance testing has shown the XFS_ILOCK_EXCL locking to be quite
a bottleneck, so this is an area which could be optimised in future.
Christoph Hellwig contributed almost all of the code in
xfs_atomic_write_cow_iomap_begin().
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: add a new xfs_can_sw_atomic_write to convey intent better]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Currently the size of atomic write allowed is fixed at the blocksize.
To start to lift this restriction, partly refactor
xfs_report_atomic_write() into helpers -
xfs_get_atomic_write_{min, max}() - and use those helpers to find the
per-inode atomic write limits and check against them.
Also add xfs_get_atomic_write_max_opt() to return the optimal limit, and
just return 0 since large atomics aren't supported yet.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Refactor xfs_reflink_end_cow_extent() into separate parts which process
the CoW range and commit the transaction.
This refactoring will be used in future for when it is required to commit
a range of extents as a single transaction, similar to how it was done
pre-commit d6f215f359637.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Add a BMAPI flag to provide a hint to the block allocator to align extents
according to the extszhint.
This will be useful for atomic writes, to ensure that we are not
allocated extents which are unsuitable for atomic writes.
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Currently only HW which can write at least 1x block is supported.
For supporting atomic writes > 1x block, a CoW-based method will also be
used and this will not be restricted to using HW which can write >= 1x
block.
However for deciding if HW-based atomic writes can be used, we need to
start adding checks for write length < HW min, which complicates the
code. Indeed, a statx field similar to unit_max_opt should also be
added for this minimum, which is undesirable.
HW which can only write > 1x blocks would be uncommon and quite weird,
so let's just not support it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In the transaction reservation code, hoist the logic that computes the
reservation needed to finish one log intent item into separate helper
functions. These will be used in subsequent patches to estimate the
number of blocks that an online repair can commit to reaping in the same
transaction as the change committing the new data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Add selected helpers to estimate the transaction reservation required to
write various log intent and buffer items to the log. These helpers
will be used by the online repair code for more precise estimations of
how much work can be done in a single transaction.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
Separate out setting the buftarg atomic write limits into a dedicated
function, xfs_configure_buftarg_atomic_writes(), to keep the specific
functionality self-contained.
For naming consistency, rename xfs_setsize_buftarg() ->
xfs_configure_buftarg().
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[jpg: separate out from patch "xfs: ignore HW which ..."]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
In future we will want to be able to check if specifically HW offload-based
atomic writes are possible, so rename xfs_inode_can_atomicwrite() ->
xfs_inode_can_hw_atomicwrite().
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: add an underscore to be consistent with everything else]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
It's silly to call xfs_setsize_buftarg from xfs_alloc_buftarg with the
block device LBA size because we don't need to ask the block layer to
validate a geometry number that it provided us. Instead, set the
preliminary bt_meta_sector* fields to the LBA size in preparation for
reading the primary super.
However, we still want to flush and invalidate the pagecache for all
three block devices before we start reading metadata from those devices,
so call sync_blockdev() per bdev in xfs_alloc_buftarg().
This will enable a subsequent patch to validate hw atomic write geometry
against the filesystem geometry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
[jpg: call sync_blockdev() from xfs_alloc_buftarg()]
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
XFS will be able to support large atomic writes (atomic write > 1x block)
in future. This will be achieved by using different operating methods,
depending on the size of the write.
Specifically, a new method of operation based on FS atomic extent remapping
will be supported in addition to the current HW offload-based method.
The FS method will generally perform appreciably slower than the
HW offload method. However, the FS method will typically be able to
achieve a larger atomic write unit max limit.
XFS will support a hybrid mode, where the HW offload method will be used when
possible, i.e. when the length of the write is supported, and otherwise
FS-based atomic writes will be used.
As such, there is an atomic write length at which the user may experience
appreciably slower performance.
Advertise this limit in a new statx field, stx_atomic_write_unit_max_opt.
When zero, it means that there is no such performance boundary.
Masks STATX{_ATTR}_WRITE_ATOMIC can be used to get this new field. This is
ok for older kernels which don't support this new field, as they would
report 0 in this field (from zeroing in cp_statx()) already. Furthermore
those older kernels don't support large atomic writes - apart from block
fops, but there would be consistent performance there for atomic writes
in range [unit min, unit max].
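For reference, a minimal userspace query of the new field might look like the
sketch below; it assumes kernel and libc headers recent enough to expose
STATX_WRITE_ATOMIC and the stx_atomic_write_unit_* fields:
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>

  int main(int argc, char **argv)
  {
          struct statx stx;

          if (argc < 2 ||
              statx(AT_FDCWD, argv[1], 0, STATX_WRITE_ATOMIC, &stx) != 0) {
                  perror("statx");
                  return 1;
          }
          printf("unit min:     %u\n", stx.stx_atomic_write_unit_min);
          printf("unit max:     %u\n", stx.stx_atomic_write_unit_max);
          /* 0 means no performance boundary below unit max */
          printf("unit max opt: %u\n", stx.stx_atomic_write_unit_max_opt);
          return 0;
  }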
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
|
|
We frequently use 'bcachefs list_journal -a' for debugging, as it
provides a record of all btree transactions, and a history of what
happened.
But it's not so useful if we immediately discard journal buckets right
after they're no longer dirty.
This tweaks journal reclaim to only discard when we're low on space,
keeping the journal mostly un-discarded.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
When we go emergency read-only, make sure we do a final write_super() to
persist counters and error counts - this can be critical for piecing
together what fsck was doing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
These just indicate that we're shutting down.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
We often filter out EROFS errors to avoid log spew after an emergency
shutdown - journal_shutdown is just another emergency shutdown error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
|
|
A pre-existing valid cfid returned from find_or_create_cached_dir might
race with a lease break, meaning open_cached_dir doesn't consider it
valid, and thinks it's newly-constructed. This leaks a dentry reference
if the allocation occurs before the queued lease break work runs.
Avoid the race by holding the cfid_list_lock across
find_or_create_cached_dir and the check of its result.
Cc: stable@vger.kernel.org
Reviewed-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
- Add a new reviewer, Hongbo Li, for better community development
- Fix an I/O hang out of file-backed mounts
- Address a rare data corruption caused by concurrent I/Os on the same
deduplicated compressed data
- Minor cleanup
* tag 'erofs-for-6.15-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: ensure the extra temporary copy is valid for shortened bvecs
erofs: remove unused enum type
fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()
MAINTAINERS: erofs: add myself as reviewer
|
|
AIO needs to initialize .ki_write_stream explicitly for read/write requests;
otherwise a random .ki_write_stream value is used, causing -EINVAL to be
returned randomly for AIO writes.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Fixes: c27683da6406 ("block: expose write streams for block device nodes")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250507133328.3040255-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the code building a bio from a kernel direct map address and
submitting it synchronously with the bdev_rw_virt helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the code building a bio from a kernel direct map address and
submitting it synchronously with the bdev_rw_virt helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the bio_add_virt_nofail and bio_add_vmalloc helpers to abstract
away the details of the memory allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Delegate to bdev_rw_virt when operating on non-vmalloc memory and use
bio_add_vmalloc_chunk to insulate xfs from the details of adding vmalloc
memory to a bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it and use bio_add_vmalloc
to insulate xfs from the details of adding vmalloc memory to a bio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Switch zonefs_read_super to allocate the superblock buffer using kmalloc,
which falls back to the page allocator for a PAGE_SIZE allocation but
gives us a kernel virtual address, and then use bdev_rw_virt to perform
the synchronous read into it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Switch gfs2_read_super to allocate the superblock buffer using kmalloc,
which falls back to the page allocator for a PAGE_SIZE allocation but
gives us a kernel virtual address, and then use bdev_rw_virt to perform
the synchronous read into it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
UDF maintains total length of all extents in i_lenExtents. Generally we
keep extent lengths (and thus i_lenExtents) block aligned because it
makes the file appending logic simpler. However the standard mandates
that the inode size must match the length of all extents and thus we
trim the last extent when closing the file. To catch possible bugs we
also verify that i_lenExtents matches i_size when evicting inode from
memory. Commit b405c1e58b73 ("udf: refactor udf_next_aext() to handle
error") however broke the code updating i_lenExtents and thus
udf_evict_inode() ended up spewing lots of errors about incorrectly
sized extents although the extents were actually sized properly. Fix the
updating of i_lenExtents to silence the errors.
Fixes: b405c1e58b73 ("udf: refactor udf_next_aext() to handle error")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
When compressed data deduplication is enabled, multiple logical extents
may reference the same compressed physical cluster.
The previous commit 94c43de73521 ("erofs: fix wrong primary bvec
selection on deduplicated extents") already avoids using shortened
bvecs. However, in such cases, the extra temporary buffers also
need to be preserved for later use in z_erofs_fill_other_copies() to
prevent data corruption.
IOWs, extra temporary buffers have to be retained not only due to
varying start relative offsets (`pageofs_out`, as indicated by
`pcl->multibases`) but also because of shortened bvecs.
android.hardware.graphics.composer@2.1.so : 270696 bytes
0: 0.. 204185 | 204185 : 628019200.. 628084736 | 65536
-> 1: 204185.. 225536 | 21351 : 544063488.. 544129024 | 65536
2: 225536.. 270696 | 45160 : 0.. 0 | 0
com.android.vndk.v28.apex : 93814897 bytes
...
364: 53869896..54095257 | 225361 : 543997952.. 544063488 | 65536
-> 365: 54095257..54309344 | 214087 : 544063488.. 544129024 | 65536
366: 54309344..54514557 | 205213 : 544129024.. 544194560 | 65536
...
Both 204185 and 54095257 have the same start relative offset of 3481,
but the logical page 55 of `android.hardware.graphics.composer@2.1.so`
ranges from 225280 to 229632, forming a shortened bvec [225280, 225536)
that cannot be used for decompressing the range from 54095257 to
54309344 of `com.android.vndk.v28.apex`.
Since `pcl->multibases` is already meaningless, just mark `be->keepxcpy`
on demand for simplicity.
Again, this issue can only lead to data corruption if `-Ededupe` is on.
Fixes: 94c43de73521 ("erofs: fix wrong primary bvec selection on deduplicated extents")
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250506101850.191506-1-hsiangkao@linux.alibaba.com
|
|
The last remaining user of vfs_submount() (tracefs) is easy to convert
to fs_context_for_submount(); do that and bury that thing, along with
SB_SUBMOUNT.
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Tested-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|