|
In preparation to use a helper for folio_pos + folio_size, rename the
variables for the locked range so they don't use the 'folio_' prefix. As
the locking functions take an inclusive end of the range (hence the
"-1"), keeping the prefix would be confusing, as the folio helpers
typically use an exclusive end of the range.
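As a minimal sketch of the distinction (the variable names here are
hypothetical, only illustrating the convention):

    /* The extent locking API takes an inclusive end, hence the "-1". */
    u64 locked_start = folio_pos(folio);
    u64 locked_end = folio_pos(folio) + folio_size(folio) - 1;

    lock_extent(&inode->io_tree, locked_start, locked_end, &cached_state);
    /* A folio helper with exclusive end semantics would instead return
     * folio_pos(folio) + folio_size(folio), without the "-1". */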
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The way zero_end is calculated and used does a -1 and +1 that
effectively cancel out, so this can be simplified. This is also a
preparatory patch for using a helper for folio_pos + folio_size with the
semantics of an exclusive end of the range.
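Schematically, the simplification looks like this (hypothetical sketch,
not the exact code):

    /* Before: compute an inclusive end, then add 1 back to get a length. */
    zero_end = folio_pos(folio) + folio_size(folio) - 1;
    len = zero_end + 1 - zero_start;

    /* After: use the exclusive end directly, the -1 and +1 cancel out. */
    zero_end = folio_pos(folio) + folio_size(folio);
    len = zero_end - zero_start;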
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Every caller of __add_block_group_free_space() checks if the flag
BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set before calling it. This is
duplicated code and prone to mistakes in case we add more callers in the
future. So move the check for that flag into the start of
__add_block_group_free_space() and, as a consequence, move the path
allocation from add_block_group_free_space() into
__add_block_group_free_space(), to preserve the behaviour of allocating
a path only if the flag BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is set.
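A rough sketch of the resulting structure (simplified, not the exact
code):

    static int __add_block_group_free_space(struct btrfs_trans_handle *trans,
                                            struct btrfs_block_group *block_group)
    {
        struct btrfs_path *path;

        if (!test_and_clear_bit(BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE,
                                &block_group->runtime_flags))
            return 0;

        /* Allocate the path only when the flag was set. */
        path = btrfs_alloc_path();
        if (!path)
            return -ENOMEM;
        (...)
    }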
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Only one of the callers of __add_block_group_free_space() aborts the
transaction if the call fails, while the others don't do it, and either
it's never done up the call chain or only much higher in the call chain.
So make sure we abort the transaction at __add_block_group_free_space()
if it fails, which brings a couple of benefits:
1) If some call chain never aborts the transaction, we avoid having some
metadata inconsistency because BLOCK_GROUP_FLAG_NEEDS_FREE_SPACE is
cleared when we enter __add_block_group_free_space() and therefore
__add_block_group_free_space() is never called again to add the block
group items to the free space tree, since the function is only called
when that flag is set in a block group;
2) If the call chain already aborts the transaction, then we get a better
trace that points to the exact step from __add_block_group_free_space()
which failed, which is better for analysis.
So abort the transaction at __add_block_group_free_space() if any of its
steps fails.
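With that, each step can follow the usual pattern (sketch):

    ret = add_new_free_space_info(trans, block_group, path);
    if (ret) {
        btrfs_abort_transaction(trans, ret);
        goto out;
    }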
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unlike qgroup rescan, which always shows whether it cleared the
inconsistent flag, we do not have a proper way to show if qgroup is
marked inconsistent.
This was not a big deal before as there aren't that many locations that
can mark qgroup inconsistent.
But with the introduction of drop_subtree_threshold, qgroup can be
marked inconsistent very frequently, especially when dropping
subvolumes.
Although most user space tools relying on qgroup should do their own
checks and queue a rescan if needed, we have no idea when qgroup is
marked inconsistent, and this would be much harder to debug.
So this patch adds an extra warning (btrfs_warn_rl()) when the qgroup
flag is flipped to inconsistent for the first time, and also prints the
reason why qgroup flipped to inconsistent. This means we can move the
error message immediately preceding qgroup_inconsistent_warning() into
that function.
For call sites without an obvious reason, or ones sharing common error
handling, output the function that failed and the error code instead.
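The idea behind qgroup_inconsistent_warning(), as a simplified sketch
(the real function and message format may differ):

    static void qgroup_inconsistent_warning(struct btrfs_fs_info *fs_info,
                                            const char *reason, int error)
    {
        /* Warn only when the flag flips for the first time. */
        if (fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT)
            return;
        fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
        btrfs_warn_rl(fs_info, "qgroup marked inconsistent, %s: %d",
                      reason, error);
    }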
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's only one caller of btrfs_printk_ratelimited(), so merge it
there.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The btrfs_debug() helpers depend on various config options. In case of
"no_printk" we don't need to define a special helper that in the end
does nothing but validate the parameters. As the default build config
validates the parameters anyway, we can simplify this to let the debug
helpers expand to nothing and remove btrfs_no_printk_in_rcu().
To avoid warnings, use fs_info and inline one variable in
extent_from_logical().
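Conceptually the change amounts to something like this (hypothetical
sketch of the macro shape, not the exact definitions):

    #ifdef CONFIG_BTRFS_DEBUG
    #define btrfs_debug(fs_info, fmt, args...) \
            btrfs_printk(fs_info, KERN_DEBUG fmt, ##args)
    #else
    /* Expand to nothing instead of wrapping no_printk(). */
    #define btrfs_debug(fs_info, fmt, args...) do { } while (0)
    #endif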
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove the critical level message helpers as they're not used; the RCU
protection is provided by the plain helpers.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The RCU protection is now done in the plain helpers, so we can remove
the "_in_rcu" and "_rl_in_rcu" variants.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The RCU protection is now done in the plain helpers, so we can remove
the "_in_rcu" and "_rl_in_rcu" variants.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The RCU protection is now done in the plain helpers, so we can remove
the "_in_rcu" and "_rl_in_rcu" variants.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The RCU protection is now done in the plain helpers, so we can remove
the "_in_rcu" and "_rl_in_rcu" variants.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have two versions of the message helpers, one of which provides RCU
protection around the call in case we need to dereference a device
name. As messages are not performance critical, we can set up the RCU
protection for all of them and drop the distinction for those where the
device name is needed. This will lead to further simplifications.
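The shape of the change, roughly (sketch, assuming a hypothetical inner
helper _btrfs_printk() that does the actual printing):

    #define btrfs_printk(fs_info, fmt, args...)     \
    do {                                            \
        rcu_read_lock();                            \
        _btrfs_printk(fs_info, fmt, ##args);        \
        rcu_read_unlock();                          \
    } while (0)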
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We're using the following levels: crit, err, warn, info, debug. This
covers our needs and further specialization is not needed, so let's
remove emerg, alert and notice.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The RCU-string API has never taken off, and we don't use the provided
printk helpers as we do the protection in our own helpers. Remove the
"in RCU" wrappers.
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The helper is trivial and we can simply use kfree_rcu() if needed. In
our case it's just one place where we rename a device in
device_list_add() and the old name can still be used until the end of
the RCU grace period. The other case is freeing a device, where nothing
should reach the device anymore, so we can use plain kfree().
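The two cases, schematically (sketch, assuming the name object embeds a
struct rcu_head called 'rcu'):

    /* device_list_add(): readers may still use the old name under RCU
     * until the grace period ends, so defer freeing it. */
    rcu_assign_pointer(device->name, new_name);
    kfree_rcu(old_name, rcu);

    /* Freeing the whole device: nothing can reach it anymore. */
    kfree(device->name);
    kfree(device);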
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Create a block group dedicated for data relocation on mount of a zoned
filesystem.
If there already is more than one empty DATA block group on mount, one
of them is picked for the data relocation block group, instead of
creating a new one.
This is done to ensure there is always space for performing garbage
collection and the filesystem does not hit ENOSPC under heavy overwrite
workloads.
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
A few more remaining cases where we can use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Two remaining cases where we can use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can avoid potential memory allocation failure in
btrfs_replace_file_extents() as the block reserve lifetime is limited to
the scope of the function. This requires +48 bytes on stack.
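The pattern, roughly (sketch):

    /* Before: heap allocation that can fail. */
    struct btrfs_block_rsv *rsv;

    rsv = btrfs_alloc_block_rsv(fs_info, BTRFS_BLOCK_RSV_TEMP);
    if (!rsv)
        return -ENOMEM;

    /* After: the reserve lives on the stack (+48 bytes), no failure path. */
    struct btrfs_block_rsv rsv;

    btrfs_init_block_rsv(&rsv, BTRFS_BLOCK_RSV_TEMP);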
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can avoid potential memory allocation failure in btrfs_truncate() as
the block reserve lifetime is limited to the scope of the function. This
requires +48 bytes on stack.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can avoid potential memory allocation failure in btrfs_evict_inode()
as the block reserve lifetime is limited to the scope of the function.
This requires +48 bytes on stack.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The inode_lock field of struct btrfs_root was removed in commit
e2844cce75c9e61 ("btrfs: remove inode_lock from struct btrfs_root and
use xarray locks") but the related comment hasn't been updated.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
With all the preparation patches already merged, it's pretty easy to
enable large data folios:
- Remove the ASSERT() on folio size in btrfs_end_repair_bio()
- Add a helper to properly set the max folio order
Currently due to several call sites that are fetching the bitmap
content directly into an unsigned long, we can only support
BITS_PER_LONG blocks for each bitmap.
- Call the helper when reading/creating an inode
The support has the following limitations:
- No large folios for the data reloc inode
The relocation code still requires page sized folios.
But it's not that hot nor common compared to regular buffered I/O.
Will be improved in the future.
- Requires CONFIG_BTRFS_EXPERIMENTAL
- Will require all folio related operations to check if they need the
extra btrfs_subpage structure
Now any folio larger than the block size will need btrfs_subpage
structure handling.
Unfortunately I do not have a physical machine for performance tests,
but if everything goes like XFS/EXT4, it should mostly bring a single
digit percentage performance improvement in the real world.
Although I believe there are still quite some optimizations to be done,
let's focus on testing the current large data folio support first.
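A sketch of what such a max folio order helper could look like (the
helper name and details here are assumptions, not necessarily the exact
patch):

    static void btrfs_set_inode_mapping_order(struct btrfs_inode *inode)
    {
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        int max_order;

        /* The data reloc inode still requires page sized folios. */
        if (btrfs_is_data_reloc_root(inode->root))
            return;

        /*
         * Several call sites read bitmap content directly into an
         * unsigned long, so a folio must not cover more than
         * BITS_PER_LONG blocks.
         */
        max_order = ilog2(BITS_PER_LONG) -
                    (PAGE_SHIFT - fs_info->sectorsize_bits);
        max_order = min_t(int, max_order, MAX_PAGECACHE_ORDER);
        mapping_set_folio_order_range(inode->vfs_inode.i_mapping, 0,
                                      max_order);
    }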
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of using a bare atomic, use the refcount_t type, which despite
being a structure that contains only an atomic, has an API that checks
for underflows and other hazards. This doesn't change the size of the
extent_buffer structure.
This removes the need to do things like this:

    WARN_ON(atomic_read(&eb->refs) == 0);
    if (atomic_dec_and_test(&eb->refs)) {
        (...)
    }

And do just:

    if (refcount_dec_and_test(&eb->refs)) {
        (...)
    }

since refcount_dec_and_test() already triggers a warning when we
decrement a ref count that has a value of 0 (or below zero).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's this special atomic compare-and-exchange logic which serves to
avoid locking the extent buffer's refs_lock spinlock and therefore
reduce lock contention, so add a comment to make it more obvious.
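The pattern being documented is essentially this (generic sketch, not
the exact code):

    refs = atomic_read(&eb->refs);
    while (refs > 3) {
        /*
         * The eb cannot be freed from under us, so drop one ref with a
         * bare compare-and-exchange, avoiding refs_lock entirely.
         */
        if (atomic_try_cmpxchg(&eb->refs, &refs, refs - 1))
            return;
    }
    spin_lock(&eb->refs_lock);
    (...)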
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
It's hard to read the logic to break out of the while loop, since it's
a very long expression consisting of a logical or of two composite
expressions, each one composed of a logical and. Furthermore, each one
also tests for the EXTENT_BUFFER_UNMAPPED bit, making it more verbose
than necessary.
So change from this:

    if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
        || (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
            refs == 1))
        break;

To this:

    if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
        if (refs == 1)
            break;
    } else if (refs <= 3) {
        break;
    }
At least on x86_64 using gcc 9.3.0, this doesn't change the object size.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to return errors, all we do is return 1 or 0 depending
on whether we should or should not stop iterating over delayed dir
indexes. So change the function to return bool instead of an int.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to return errors and we currently return 1 in case the
index should be deleted and 0 otherwise, so change the return type from
int to bool as this is a boolean function and it's more clear this way.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Update the error messages with:
1) Fix typo in the first one, deltiona -> deletion;
2) Remove the redundant part of the first message (the part following
the comma), while still including all the useful information: root,
inode, index and error value;
3) Update the second message to use more formal language (for example
'error' instead of 'err'), remove the redundant part about "deletion
tree of delayed node..." and print the relevant information in the
same format and order as the first message, without the ugly opening
parenthesis that has no space separating it from the previous word.
This also makes the message similar in format to the one we have at
btrfs_insert_delayed_dir_index().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to return an integer as all we need to do is return
true or false to tell whether we deleted a delayed item or not. Also
the logic is inverted, since we return 1 (true) if we didn't delete and
0 (false) if we did, which is somewhat counterintuitive. Change the
return type to a boolean and make it return true if we deleted and
false otherwise.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The argument has boolean semantics, so change its type from int to bool,
making it more clear.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are two callers of btrfs_del_inode_ref() that declare a local index
variable and then pass a pointer for it to btrfs_del_inode_ref(), but then
don't use that index at all. Since btrfs_del_inode_ref() accepts a NULL
index pointer, pass NULL instead and stop declaring those useless index
variables.
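That is, instead of (sketch):

    u64 index;

    ret = btrfs_del_inode_ref(trans, root, name, ino, dir_ino, &index);

simply do:

    ret = btrfs_del_inode_ref(trans, root, name, ino, dir_ino, NULL);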
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of allocating the scratch eb after joining the log transaction,
allocate it before so that we're not delaying log commits for longer
than necessary, as allocating the scratch eb means allocating an
extent_buffer structure, which comes from a dedicated kmem_cache, plus
pages/folios to attach to the eb. Both of these allocations may take time
when we're under memory pressure.
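Schematically (sketch, details simplified):

    /* Allocate the scratch eb before joining the log transaction. */
    eb = alloc_dummy_extent_buffer(fs_info, 0);
    if (!eb)
        return -ENOMEM;

    ret = join_running_log_trans(root);
    if (ret)
        goto out;       /* No log transaction to join. */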
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).
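Roughly (sketch):

    /* Allocate the path before joining the log transaction. */
    path = btrfs_alloc_path();
    if (!path)
        return -ENOMEM;

    ret = join_running_log_trans(root);
    if (ret) {
        /* No running log transaction, nothing to do. */
        btrfs_free_path(path);
        return 0;
    }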
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Instead of allocating the path after joining the log transaction, allocate
it before so that we're not delaying log commits for the rare cases where
the allocation takes a significant time (under memory pressure and all
slabs are full, there's the need to allocate a new page, etc).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.
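That is, something like (sketch):

    ret = join_running_log_trans(root);
    /*
     * inode_logged() returned true, so a log transaction must be
     * running and joining it must succeed.
     */
    ASSERT(ret == 0);
    if (WARN_ON_ONCE(ret))
        return ret;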
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There's no need to use btrfs_delete_one_dir_name() at del_logged_dentry()
because we are processing a dir index key which can contain only a single
name, unlike dir item keys which can encode multiple names in case of name
hash collisions. We have explicitly looked up for a dir index key by
calling btrfs_lookup_dir_index_item() and we don't log dir item keys
anymore (since commit 339d03542484 ("btrfs: only copy dir index keys when
logging a directory")). So simplify and use btrfs_del_item() directly
instead of btrfs_delete_one_dir_name().
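That is, roughly (sketch):

    di = btrfs_lookup_dir_index_item(trans, log, path, dir_ino, index,
                                     name, -1);
    (...)
    /* A dir index key holds exactly one name, delete the whole item. */
    ret = btrfs_del_item(trans, log, path);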
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
After calling btrfs_delete_one_dir_name() there's no need for the path
anymore so we can free it immediately after. After that point we do
some btree operations that take time and in those call chains we end up
allocating paths for these operations, so we're unnecessarily holding on
to the path we allocated early at __btrfs_unlink_inode().
So free the path as soon as we don't need it and add a comment. This
also allows us to simplify the error path, removing two exit labels and
returning directly when an error happens.
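The resulting flow, roughly (sketch):

    ret = btrfs_delete_one_dir_name(trans, root, path, di);
    /* The path is not needed anymore, later steps allocate their own. */
    btrfs_free_path(path);
    if (ret)
        return ret;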
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We are supposed to be able to join a log transaction at that point, since
we have determined that the inode was logged in the current transaction
with the call to inode_logged(). So ASSERT() we joined a log transaction
and also warn if we didn't in case assertions are disabled (the kernel
config doesn't have CONFIG_BTRFS_ASSERT=y), so that the issue gets noticed
and reported if it ever happens.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[CURRENT BEHAVIOR]
Currently inside btrfs_get_tree_subvol(), we call fc_mount() to grab a
tree, then re-lock s_umount inside btrfs_reconfigure_for_mount() to
avoid race with remount.
However fc_mount() itself is just doing two things:
1. Call vfs_get_tree()
2. Release s_umount then call vfs_create_mount()
[ENHANCEMENT]
Instead of calling fc_mount(), we can open-code it, starting with
vfs_get_tree().
This provides the benefit that, since we have full control of s_umount,
we do not need to re-lock that rw_semaphore when calling
btrfs_reconfigure_for_mount(), meaning fewer races with RO/RW remounts.
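The open-coded variant, as a sketch (error handling trimmed; the
btrfs_reconfigure_for_mount() signature is an assumption):

    ret = vfs_get_tree(fc);
    if (ret)
        return ERR_PTR(ret);

    /* s_umount is still held from vfs_get_tree(), so reconfigure
     * without re-locking it. */
    ret = btrfs_reconfigure_for_mount(fc);
    up_write(&fc->root->d_sb->s_umount);
    if (ret)
        return ERR_PTR(ret);

    return vfs_create_mount(fc);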
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Rework the subject and commit message, refactor the error handling ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Tested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
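That is, the btrfs-preferred pattern (sketch, with a hypothetical
callee):

    int ret;        /* Preferred over 'err' or 'error'. */

    ret = btrfs_some_operation();
    if (ret)
        return ret;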
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Unify naming of return value to the preferred way.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|