Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more regression fix for a problem in zoned mode: mounting would
fail if the number of open and active zones reached a common limit
that didn't use to be checked"
* tag 'for-6.17-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: don't fail mount needlessly due to too many active zones
|
|
Jan Kara reported that the shared ILOCK held across the journal
flush during fdatasync operations slows down O_DSYNC DIO on
unwritten extents significantly. The underlying issue is that
unwritten extent conversion needs the ILOCK exclusive, whilst the
datasync operation after the extent conversion holds it shared.
Hence we cannot be flushing the journal for one IO completion whilst
at the same time doing unwritten extent conversion on another IO
completion on the same inode. This means that IO completions
lock-step, and IO performance is dependent on the journal flush
latency.
Jan demonstrated that reducing the ifdatasync lock hold time can
improve O_DSYNC DIO to unwritten extents performance by 2.5x.
Discussion on that patch found issues with the method, and we
came to the conclusion that separately tracking datasync flush
sequences was the best approach to solving the problem.
The fsync code uses the ILOCK to serialise against concurrent
modifications in the transaction commit phase. In a transaction
commit, there are several disjoint updates to inode log item state
that need to be considered atomically by the fsync code. These
operations are all done under ILOCK_EXCL context:
1. ili_fsync_flags is updated in ->iop_precommit
2. i_pincount is updated in ->iop_pin before it is added to the CIL
3. ili_commit_seq is updated in ->iop_committing, after it has been
added to the CIL
In fsync, we need to:
1. check that the inode is dirty in the journal (ipincount)
2. check that ili_fsync_flags is set
3. grab the ili_commit_seq if a journal flush is needed
4. clear the ili_fsync_flags to ensure that new modifications that
require fsync are tracked in ->iop_precommit correctly
The serialisation of ipincount/ili_commit_seq is needed
to ensure that we don't try to unnecessarily flush the journal.
The serialisation of ili_fsync_flags being set in
->iop_precommit and cleared in fsync post journal flush is
required for correctness.
Hence holding the ILOCK_SHARED in xfs_file_fsync() performs all this
serialisation for us. Ideally, we want to remove the need to hold
the ILOCK_SHARED in xfs_file_fsync() for best performance.
We start with the observation that fsync/fdatasync() only need to
wait for operations that have been completed. Hence operations that
are still being committed have not completed and datasync operations
do not need to wait for them.
This means we can use a single point in time in the commit process
to signal "this modification is complete". This is what
->iop_committing is supposed to provide - it is the point at which
the object is unlocked after the modification has been recorded in
the CIL. Hence we could use ili_commit_seq to determine if we should
flush the journal.
In theory, we can already do this. However, in practice this will
expose an internal global CIL lock to the IO path. The ipincount()
checks optimise away the need to take this lock - if the inode is
not pinned, then it is not in the CIL and we don't need to check if
a journal flush at ili_commit_seq needs to be performed.
The reason this is needed is that the ili_commit_seq is never
cleared. Once it is set, it remains set even once the journal has
been committed and the object has been unpinned. Hence we have to
look that journal internal commit sequence state to determine if
ili_commit_seq needs to be acted on or not.
We can solve this by clearing ili_commit_seq when the inode is
unpinned. If we clear it atomically with the last unpin going away,
then we are guaranteed that new modifications will order correctly
as they add a new pin counts and we won't clear a sequence number
for an active modification in the CIL.
Further, we can then allow the per-transaction flag state to
propagate into ->iop_committing (instead of clearing it in
->iop_precommit) and that will allow us to determine if the
modification needs a full fsync or just a datasync, and so we can
record a separate datasync sequence number (Jan's idea!) and then
use that in the fdatasync path instead of the full fsync sequence
number.
With this infrastructure in place, we no longer need the
ILOCK_SHARED in the fsync path. All serialisation is done against
the commit sequence numbers - if the sequence number is set, then we
have to flush the journal. If it is not set, then we have nothing to
do. This greatly simplifies the fsync implementation....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
There are similar extsize checks and updates done inside and outside
the inode item lock, which could all be done under a single top
level logic branch outside the ili_lock. The COW extsize fixup can
potentially miss updating the XFS_ILOG_CORE in ili_fsync_fields, so
moving this code up above the ili_fsync_fields update could also be
considered a fix.
Further, to make the next change a bit cleaner, move where we
calculate the on-disk flag mask to after we attach the cluster
buffer to the the inode log item.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
|
|
start_creating() is a generic name which I would like to use for a
function similar to simple_start_creating(), only not quite so simple.
debugfs is using this name which, though static, will cause complaints
if then name is given a different signature in a header file.
So rename it to debugfs_start_creating().
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
kern_path_locked() is now only used to prepare for removing an object
from the filesystem (and that is the only credible reason for wanting a
positive locked dentry). Thus it corresponds to kern_path_create() and
so should have a corresponding name.
Unfortunately the name "kern_path_create" is somewhat misleading as it
doesn't actually create anything. The recently added
simple_start_creating() provides a better pattern I believe. The
"start" can be matched with "end" to bracket the creating or removing.
So this patch changes names:
kern_path_locked -> start_removing_path
kern_path_create -> start_creating_path
user_path_create -> start_creating_user_path
user_path_locked_at -> start_removing_user_path_at
done_path_create -> end_creating_path
and also introduces end_removing_path() which is identical to
end_creating_path().
__start_removing_path (which was __kern_path_locked) is enhanced to
call mnt_want_write() for consistency with the start_creating_path().
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
audit_alloc_mark() and audit_get_nd() both need to perform a path
lookup getting the parent dentry (which must exist) and the final
target (following a LAST_NORM name) which sometimes doesn't need to
exist.
They don't need the parent to be locked, but use kern_path_locked() or
kern_path_locked_negative() anyway. This is somewhat misleading to the
casual reader.
This patch introduces a more targeted function, kern_path_parent(),
which returns not holding locks. On success the "path" will
be set to the parent, which must be found, and the return value is the
dentry of the target, which might be negative.
This will clear the way to rename kern_path_locked() which is
otherwise only used to prepare for removing something.
It also allows us to remove kern_path_locked_negative(), which is
transformed into the new kern_path_parent().
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
A rename operation can only rename within a single mount. Callers of
vfs_rename() must and do ensure this is the case.
So there is no point in having two mnt_idmaps in renamedata as they are
always the same. Only one of them is passed to ->rename in any case.
This patch replaces both with a single "mnt_idmap" and changes all
callers.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Since 204a575e91f3 "VFS: add common error checks to lookup_one_qstr_excl()"
filename_create() does not need to stash the error value from mnt_want_write()
into a separate variable - the logic that used to clobber 'error' after the
call of mnt_want_write() has migrated into lookup_one_qstr_excl().
So there is no need for two different err variables.
This patch discards "err2" and uses "error' throughout.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
ovl wants a lookup which won't block on a fatal signal. It currently
uses down_write_killable() and then repeatedly calls to lookup_one()
The lock may not be needed if the name is already in the dcache and it
aids proposed future changes if the locking is kept internal to namei.c
So this patch adds lookup_one_positive_killable() which is like
lookup_one_positive() but will abort in the face of a fatal signal.
overlayfs is changed to use this.
Note that instead of always getting an exclusive lock, ovl now only gets
a shared lock, and only sometimes. The exclusive lock was never needed.
However down_read_killable() was only added in v4.15 but overlayfs started
using down_write_killable() here in v4.7.
Note that the linked list ->first_maybe_whiteout ->next_maybe_white is
local to the thread so there is no concurrency in that list which could
be threatened by removing the locking.
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Christian Brauner <brauner@kernel.org>
|
|
Previously BTRFS did not look at a device's reported max_open_zones limit,
but starting with commit 04147d8394e8 ("btrfs: zoned: limit active zones
to max_open_zones"), zoned BTRFS limited the number of concurrently used
block-groups to the number of max_open_zones a device reported, if it
hadn't already reported a number of max_active_zones.
Starting with commit 04147d8394e8 the number of open zones is treated the
same way as active zones. But this leads to mount failures on filesystems
which have been used before 04147d8394e8 because too many zones are in an
open state.
Ignore the new limitations on these filesystems, so zones can be finished
or evacuated.
Reported-by: Yuwei Han <hrx@bupt.moe>
Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/
Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since all real encoded extents (directly handled by the decompression
subsystem) have a sane, limited maximum decoded length
(Z_EROFS_PCLUSTER_MAX_DSIZE), and the read-more policy is only applied
if needed.
However, it makes no sense to read more for non-encoded maps, such as
fragment extents, since such extents can be huge (up to i_size) and
there is no benefit to reading more at this layer.
For normal images, it does not really matter, but for crafted images
generated by syzbot, excessively large fragment extents can cause
read-more to run for an overly long time.
Reported-and-tested-by: syzbot+1a9af3ef3c84c5e14dcc@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/68c8583d.050a0220.2ff435.03a3.GAE@google.com
Fixes: b44686c8391b ("erofs: fix large fragment handling")
Fixes: b15b2e307c3a ("erofs: support on-disk compressed fragments data")
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a
full write barrier, smp_wmb(), but we don't need to, all we need is a
smp_mb__after_atomic(). The use of the smp_wmb() is from the old days
when we didn't use a bit and used instead an int field in the root to
signal if cow is forced. After the int field was changed to a bit in
the root's state (flags field), we forgot to update the memory barrier
in create_pending_snapshot() to smp_mb__after_atomic(), but we did the
change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That
happened in commit 27cdeb7096b8 ("Btrfs: use bitfield instead of integer
data type for the some variants in btrfs_root"). On the reader side, in
should_cow_block(), we also use the counterpart smp_mb__before_atomic()
which generates further confusion.
So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't
even need any barrier at all since create_pending_snapshot() is called
in the critical section of a transaction commit and therefore no one
can concurrently join/attach the transaction, or start a new one, until
the transaction is unblocked. By the time someone starts a new transaction
and enters should_cow_block(), a lot of implicit memory barriers already
took place by having acquired several locks such as fs_info->trans_lock
and extent buffer locks on the root node at least. Nevertlheless, for
consistency use smp_mb__after_atomic() after setting the force cow bit
in create_pending_snapshot().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.
Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.
The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Trivial pattern for the auto freeing with goto -> return conversions
if possible.
The following cases are considered trivial in this patch:
1. Cases where there are no operations between btrfs_free_path() and the
function returns.
2. Cases where only simple cleanup operations (such as kfree(), kvfree(),
clear_bit(), and fs_path_free()) are present between
btrfs_free_path() and the function return.
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Previously BTRFS did not look at a device's reported max_open_zones limit,
but starting with commit 04147d8394e8 ("btrfs: zoned: limit active zones
to max_open_zones"), zoned BTRFS limited the number of concurrently used
block-groups to the number of max_open_zones a device reported, if it
hadn't already reported a number of max_active_zones.
Starting with commit 04147d8394e8 the number of open zones is treated the
same way as active zones. But this leads to mount failures on filesystems
which have been used before 04147d8394e8 because too many zones are in an
open state.
Ignore the new limitations on these filesystems, so zones can be finished
or evacuated.
Reported-by: Yuwei Han <hrx@bupt.moe>
Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/
Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As pointed out in the documentation, calling 'kmalloc' with open-coded
arithmetic can lead to unfortunate overflows and this particular way of
using it has been deprecated. Instead, it's preferred to use
'kmalloc_array' in cases where it might apply so an overflow check is
performed.
Note this is an API cleanup and is not fixing any overflows because in
all cases the multipliers are bounded small numbers derived from number
of items in leaves/nodes.
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With all the preparation patches, we're able to finally enable btrfs
block size (sector size) larger than page size support and give it a
full fstests run.
And obviously this new feature is hidden behind experimental flags, and
should not be considered as a core feature yet as btrfs' default block
size is still 4K.
But this is still a feature that will shine in the future where 16K
block sized device are widely adopted.
For now there are some features explicitly disabled:
- Direct IO
This is the most complex part to support, the root reason is we can
not control the pages of iov iter passed in.
User space programs can only ensure the virtual addresses are
contiguous, but have no control on their physical addresses.
Our bs > ps support heavily relies on large folios, and direct IO
memory can easily break it.
So direct IO is disabled and will always fall back to buffered IO.
- RAID56
In theory we can convert RAID56 to use large folios, but it will need
to be converted back to page based if we want to support direct IO in
the future.
So just reject it for now.
- Encoded send
- Encoded read
Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send
is utilizing vmallocated memory.
Unfortunately for vmallocated memory we can not guarantee the minimal
folio order.
For send, it will just always fallback to regular writes, which reads
from page cache and will follow the existing folio order requirement.
- Encoded write
Encoded write itself is allocating pages by themselves, and we can
easily change it to follow the minimal order.
But since encoded read is already disabled, there is no need to only
enable encoded write.
Finally just like what we did for bs < ps support in the past, add a
warning message for bs > ps mounts.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Btrfs uses btrfs_bio to handle read/write of logical address, for the
incoming bs > ps support, btrfs has extra requirements:
- One folio must contain at least one fs block
- No fs block can cross folio boundaries
This requirement is not hard to maintain, thanks to the address space's
minimal folio order.
But not all btrfs bios are generated through address space, e.g.
compression and scrub.
To catch possible unaligned bios, introduce a helper,
assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio().
This will check the following things:
- bv_offset is aligned to block size
- bv_len is aligned to block size
With a btrfs bio passing above checks, unless it's empty it will ensure
the requirements for bs > ps support.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:
# mkfs.btrfs -s 8k $dev
# mount $dev $mnt
# mkdir $mnt/dir
# ln -s dir $mnt/link
# ls $mnt/link
The call trace looks like this:
BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
BTRFS info (device dm-2): checking UUID tree
BTRFS info (device dm-2): enabling ssd optimizations
BTRFS info (device dm-2): enabling free space tree
------------[ cut here ]------------
kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
Oops: invalid opcode: 0000 [#1] SMP
CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G OE 6.17.0-rc4-custom+ #283 PREEMPT(full)
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
Call Trace:
<TASK>
btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
filemap_read_folio+0x37/0xe0
do_read_cache_folio+0x94/0x3e0
__page_get_link.isra.0+0x20/0x90
page_get_link+0x16/0x40
step_into+0x69b/0x830
path_lookupat+0xa7/0x170
filename_lookup+0xf7/0x200
? set_ptes.isra.0+0x36/0x70
vfs_statx+0x7a/0x160
do_statx+0x63/0xa0
__x64_sys_statx+0x90/0xe0
do_syscall_64+0x82/0xae0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
</TASK>
Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.
[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.
But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.
Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.
[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().
For symbolic links this will fix the crash as now the folio will meet
the minimal order.
For regular files this brings no change.
For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.
Fixes: cc38d178ff33 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This involves:
- Migrate scrub_stripe::pages[] to folios[]
- Use btrfs_alloc_folio_array() and folio_put() to alloc above array.
- Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use
folio interfaces
- Migrate raid56_parity_cache_data_pages() to
raid56_parity_cache_data_folios()
Since scrub is the only caller still using pages.
This helper will copy the folio array contents into rbio::stripe_pages,
with sector uptodate flags updated.
And a new ASSERT() to make sure bs > ps cases will not hit this path.
Since most scrub code is based on kaddr/paddr, the migration itself is
pretty straightforward.
And since we're here, also move the loop to set the
stripe_sectors[].uptodate out of the copy loop.
As we always mark all the sectors as uptodate for the data stripe, it's
easier to do in one go, other than doing it inside the copy loop.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This involves converting the following functions to use correct folio
sizes/shifts:
- zlib_compress_folios()
- zlib_decompress_bio()
There is a special handling for s390 hardware acceleration.
With bs > ps cases, we can go with 16K block size on s390 (which uses
fixed 4K page size).
In that case we do not need to do the buffer copy as our folio is large
enough for hardware acceleration.
So factor out the s390 specific and folio size check into a helper,
need_special_buffer().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This involves converting the following functions to use correct folio
sizes/shifts:
- copy_compress_data_to_page()
- lzo_compress_folios()
- lzo_decompress_bio()
Just like zstd, lzo has some extra incorrect usage of kmap_local_folio()
that the offset is always 0.
This will not handle HIGHMEM large folios correctly, but those cases are
already rejected explicitly so it should not cause problems when bs > ps
support is enabled.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This involves converting the following functions to use proper folio
sizes/shifts:
- zstd_compress_folios()
- zstd_decompress_bio()
The function zstd_decompress() is already using block size correctly
without using page size, thus it needs no modification.
And since zstd compression is calling kmap_local_folio(), the existing
code cannot handle large folios with HIGHMEM, as kmap_local_folio()
requires us to handle one page range each time.
I do not really think it's worth to spend time on some feature that will
be deprecated eventually. So here just add an extra explicit rejection
for bs > ps with HIGHMEM feature enabled kernels.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This includes the following preparation for bs > ps cases:
- Always alloc/free the folio directly if bs > ps
This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
affecting all compression algorithms.
For btrfs_free_compr_folio() it needs no parameter for now, as we can
use the folio size to skip the caching part.
For now the change is just to passing a @fs_info into the function,
all the folio size assumption is still based on page size.
- Properly zero the last folio in compress_file_range()
Since the compressed folios can be larger than a page, we need to
properly zero the whole folio.
- Use correct folio size for btrfs_add_compressed_bio_folios()
Instead of page size, use the correct folio size.
- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
As we are not only using simple page sized folios anymore.
- Use correct folio size for btrfs_decompress()
There is an ASSERT() making sure the decompressed range is no larger
than a page, which will be triggered for bs > ps cases.
- Skip readahead for compressed pages
Similar to subpage cases.
- Make btrfs_alloc_folio_array() to accept a new @order parameter
- Add a helper to calculate the minimal folio size
All those changes should not affect the existing bs <= ps handling.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
With my local branch to enable bs > ps support for btrfs, sometimes I
hit the following ASSERT() inside submit_one_sector():
ASSERT(block_start != EXTENT_MAP_HOLE);
Please note that it's not yet possible to hit this ASSERT() in the wild
yet, as it requires btrfs bs > ps support, which is not even in the
development branch.
But on the other hand, there is also a very low chance to hit above
ASSERT() with bs < ps cases, so this is an existing bug affect not only
the incoming bs > ps support but also the existing bs < ps support.
[CAUSE]
Firstly that ASSERT() means we're trying to submit a dirty block but
without a real extent map nor ordered extent map backing it.
Furthermore with extra debugging, the folio triggering such ASSERT() is
always larger than the fs block size in my bs > ps case.
(8K block size, 4K page size)
After some more debugging, the ASSERT() is trigger by the following
sequence:
extent_writepage()
| We got a 32K folio (4 fs blocks) at file offset 0, and the fs block
| size is 8K, page size is 4K.
| And there is another 8K folio at file offset 32K, which is also
| dirty.
| So the filemap layout looks like the following:
|
| "||" is the filio boundary in the filemap.
| "//| is the dirty range.
|
| 0 8K 16K 24K 32K 40K
| |////////| |//////////////////////||////////|
|
|- writepage_delalloc()
| |- find_lock_delalloc_range() for [0, 8K)
| | Now range [0, 8K) is properly locked.
| |
| |- find_lock_delalloc_range() for [16K, 40K)
| | |- btrfs_find_delalloc_range() returned range [16K, 40K)
| | |- lock_delalloc_folios() locked folio 0 successfully
| | |
| | | The filemap range [32K, 40K) got dropped from filemap.
| | |
| | |- lock_delalloc_folios() failed with -EAGAIN on folio 32K
| | | As the folio at 32K is dropped.
| | |
| | |- loops = 1;
| | |- max_bytes = PAGE_SIZE;
| | |- goto again;
| | | This will re-do the lookup for dirty delalloc ranges.
| | |
| | |- btrfs_find_delalloc_range() called with @max_bytes == 4K
| | | This is smaller than block size, so
| | | btrfs_find_delalloc_range() is unable to return any range.
| | \- return false;
| |
| \- Now only range [0, 8K) has an OE for it, but for dirty range
| [16K, 32K) it's dirty without an OE.
| This breaks the assumption that writepage_delalloc() will find
| and lock all dirty ranges inside the folio.
|
|- extent_writepage_io()
|- submit_one_sector() for [0, 8K)
| Succeeded
|
|- submit_one_sector() for [16K, 24K)
Triggering the ASSERT(), as there is no OE, and the original
extent map is a hole.
Please note that, this also exposed the same problem for bs < ps
support. E.g. with 64K page size and 4K block size.
If we failed to lock a folio, and falls back into the "loops = 1;"
branch, we will re-do the search using 64K as max_bytes.
Which may fail again to lock the next folio, and exit early without
handling all dirty blocks inside the folio.
[FIX]
Instead of using the fixed size PAGE_SIZE as @max_bytes, use
@sectorsize, so that we are ensured to find and lock any remaining
blocks inside the folio.
And since we're here, add an extra ASSERT() to
before calling btrfs_find_delalloc_range() to make sure the @max_bytes is
at least no smaller than a block to avoid false negative.
Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no point in setting the key's offset to (u64)-1 since we never
use it before setting it to the current transaction's ID. So remove the
assignment of (u64)-1 to the key's offset and move the remainder of the
key initialization close to where it's used.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can annotate btrfs_is_testing() as unlikely since that's the most
expected scenario and it's desirable for the compiler to optimize for
the case we are not running the self tests. So add the annotation to
btrfs_is_testing() and while at it also make it return bool instead of
int.
Also make two of the existing callers use btrfs_is_testing() directly
instead of storing its result in a local variable.
On x86_64 with Debian's gcc 14.2.0-19 this resulted in a very tiny object
code reduction.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913257 161567 15592 2090416 1fe5b0 fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's quite hard and unreadable the way the rule checks are organized in
should_cow_block(). We have a single if statement that returns 0 (false)
and it checks several conditions, with one them being a negated compound
condition which is particularly hard to reason immediately.
Improve on this by using multiple if statements, each checking a single
condition and returning immediately. Also change the return type from an
integer to a boolean, since all we need is to return true or false.
At least on x86_64 with Debian's gcc 14.2.0-19, this also reduces the
object code size by 64 bytes.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913327 161567 15592 2090486 1fe5f6 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1913263 161567 15592 2090422 1fe5b6 fs/btrfs/btrfs.ko
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no need to store the extent's ram_bytes in two variables,
further more one of them, named 'size', is used only for the extent's end
offset calculation. So remove the 'size' variable and use 'nbytes' only.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The comment is wrong about the part where it says a prealloc extent does
not contribute to an inode's nbytes - it does. Only holes don't contribute
and that's what we are checking for, as prealloc extents always have a
disk_bytenr different from 0. So fix the comment and re-organize the code
to not set nbytes twice and set it to the extent item's number of bytes
only if it doesn't represent a hole - in case it's a hole we have already
initialized nbytes to 0 when we declared it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Since the support of bs < ps support, extent_writepage_io() will submit
multiple blocks inside the folio.
But if we hit error submitting one sector, but the next sector can still
be submitted successfully, the function extent_writepage_io() will still
return 0.
This will make btrfs to silently ignore the error without setting error
flag for the filemap.
Fix it by recording the first error hit, and always return that value.
Fixes: 8bf334beb349 ("btrfs: fix double accounting race when extent_writepage_io() failed")
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have several sanity checks when inserting or extending items in a btree
that verify we didn't overflow the leaf or access a slot beyond the last
one. These are cases that are never expected to be hit so mark them as
unlikely, allowing the compiler to potentially generate better code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We expect that after attempting to read an extent buffer we had no errors
therefore the extent buffer is up to date, so mark the checks for a not up
to date extent buffer as unlikely and allow the compiler to pontentially
generate better code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are not expecting to ever fail the extent buffer alignment checks, so
mark them as unlikely to allow the compiler to potentially generate more
optimized code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of dereferencing fs_info every time we need to access the node
size, store in a local variable to make the code less verbose and avoid
a line split too.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Looking at a leaf dump from the kernel's print-tree implementation is not
so friendly to analyze since key types are printed as numbers. Improve on
this by printing key types as strings that are a diminutive of the macro
names for key types, just like we do in btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The code for processing file extent items is quite large and it's better
to have it in a dedicated helper rather than in a huge switch statement,
just like we do in btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are not printing anything about the compression type, so add that
useful information in the same format as btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are advertising the ram_bytes of an inline extent as its data size, but
that is not true for compressed extents. The ram_bytes corresponds to the
uncompressed data size while the data size (compressed data) is given by
btrfs_file_extent_inline_item_len(). So fix this and print both values in
the same format as in btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we don't print anything for extent csum items other than the
generic line with the key, item offset and item size. While one can still
determine the range the extent csum covers by doing a few simple
computations, it makes it more time consuming to analyse a leaf dump.
So add a line that prints information about the range covered by the
checksum using the same format as btrfs-progs. This is useful when
debugging log tree issues since we log extent csum items for new extents.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We currently don't print information about dir log items (other than the
key, item offset and item size), which is useful to look at when debugging
problems with a log tree. So print their specific information (currently
they only have an end index number) in a format similar to btrfs-progs.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we ignore inode extref items, we just print their key, item
offset in the leaf and their size, no information about their content
like the index number, parent inode, name length and name.
Improve on this by printing the index, parent and name length in the same
format as btrfs-progs. Note that we don't print the name, as that would
require some processing and escaping like we do in btrfs-progs, and that
could expose sensitive information for some users in case they share their
dmesg/syslog and it contains a leaf dump. So for now leave names out.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we ignore inode ref items, we just print their key, item offset
in the leaf and their size, no information about their content like the
index number, name length and name.
Improve on this by printing the index and name length in the same format
as btrfs-progs. Note that we don't print the name, as that would require
some processing and escaping like we do in btrfs-progs, and that could
expose sensitive information for some users in case they share their
dmesg/syslog and it contains a leaf dump. So for now leave names out.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we only print the dir items for BTRFS_DIR_ITEM_KEY keys, but
we also have dir items for BTRFS_DIR_INDEX_KEY and BTRFS_XATTR_ITEM_KEY
keys too. So print them for those keys too.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently we only print the object id component of the location key from a
dir item and the flags. We are missing the whole key, transid and the name
and data lengths. We are also ignoring the fact that we can have multiple
dir item objects encoded in a single item for a BTRFS_DIR_ITEM_KEY key, so
what we print is only for the first item.
Improve on this by iterating on all dir items and print the missing
information. This is done with the same format as in btrfs-progs, what
we miss is printing the names and data since not only that would require
some processing and escaping like in btrfs-progs, but it would also reveal
information that may be sensitive and users may not want to share that in
case that get a leaf dumped in dmesg.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are not dumping a lot of fields for an inode item which are useful for
debugging whenever we dump a leaf (log replay failure for example), so add
them and make it as close as possible to the print tree implementation in
btrfs-progs (things like converting timespecs to human readable dates and
converting flags to strings are missing since they are not so practical to
do in the kernel).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Like inode refs, inode extrefs have a variable length name, which means
we have to do a proper check to make sure no header nor name can exceed
the item limits.
The check itself is very similar to check_inode_ref(), just a different
structure (btrfs_inode_extref vs btrfs_inode_ref).
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We now have a nodesize_bits member in fs_info so we can index an extent
buffer in the backref cache by node number instead of by sector number.
While this allows for a denser index space with the possibility of using
less maple tree nodes, in practice it's unlikely to hit such benefits
since we currently limit the maximum number of keys in the cache to 128,
so unless all extent buffers are contiguous we are unlikely to see a
memory usage reduction in the backing maple tree due to fewer nodes.
Nevertheless it doesn't cost anything to index by node number and it's
more logical.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|