Age | Commit message (Collapse) | Author |
|
Embed a btrfs_bio into struct compressed_bio. This avoids potential
(so far theoretical) deadlocks due to nesting of btrfs_bioset allocations
for the original read bio and the compressed bio, and avoids an extra
memory allocation in the I/O path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.
But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.
So why we waste the space and time (for sorting) for raid_map?
This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:
- Replace btrfs_io_context::raid_map with full_stripe_start
- Replace call sites using raid_map[0] to use full_stripe_start
- Replace call sites using raid_map[i] to compare with nr_data_stripes.
The benefits are:
- Less memory wasted on raid_map
It's sizeof(u64) * num_stripes vs sizeof(u64).
It'll always save at least one u64, and the benefit grows larger with
num_stripes.
- No more weird alloc_btrfs_io_context() behavior
As there is only one fixed size + one variable length array.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
That structure is our ultimate object for all __btrfs_map_block()
related functions. We have some hard to understand members, like
tgtdev_map, but without any comments.
This patch will improve the situation:
- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
tgtdev_map[]
Especially for the last two members, add a dedicated (thus very long)
comments for them, with example to explain it.
- Shrink those int members to u16.
In fact our on-disk format is only using u16 for num_stripes, thus
no need to use int at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There is no memory re-allocation for handle_ops_on_dev_replace(), thus
we don't need to pass a btrfs_io_context pointer.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
There are quite some div64 calls inside btrfs_map_block() and its
variants.
Such calls are for @stripe_nr, where @stripe_nr is the number of
stripes before our logical bytenr inside a chunk.
However we can eliminate such div64 calls by just reducing the width of
@stripe_nr from 64 to 32.
This can be done because our chunk size limit is already 10G, with fixed
stripe length 64K.
Thus a U32 is definitely enough to contain the number of stripes.
With such width reduction, we can get rid of slower div64, and extra
warning for certain 32bit arch.
This patch would do:
- Add a new tree-checker chunk validation on chunk length
Make sure no chunk can reach 256G, which can also act as a bitflip
checker.
- Reduce the width from u64 to u32 for @stripe_nr variables
- Replace unnecessary div64 calls with regular modulo and division
32bit division and modulo are much faster than 64bit operations, and
we are finally free of the div64 fear at least in those involved
functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently btrfs doesn't support stripe lengths other than 64KiB.
This is already set in the tree-checker.
There is really no meaning to record that fixed value in map_lookup for
now, and can all be replaced with BTRFS_STRIPE_LEN.
Furthermore we can use the fix stripe length to do the following
optimization:
- Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
Now we only need to do a right shift.
And the value of BTRFS_STRIPE_LEN itself is already too large for bit
shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift,
a compiler warning would be triggered.
Thus this bit shift optimization would be safe.
- Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Move the remaining code that deals with initializing the btree
inode into btrfs_init_btree_inode instead of splitting it between
that helpers and its only caller.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Function search_file_offset_in_bio() finds the file offset in the
file_offset_ret, and we use the return value to indicate if it is
successful, so use bool.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The function btrfs_lookup_bio_sums() and a nested if statement declare
ret respectively as blk_status_t and int.
There is no need to store the return value of
search_file_offset_in_bio() to ret as this is a one-time call.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove btrfs_csum_ptr() and fold it into it's only caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
These days all the operations that take locks in the raid56.c code are
run from user context (mostly workqueues). Drop all the irqsafe locking
that is not required any more.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We were seeing weird errors when we were testing our btrfs backports
before we had the incorrect level check fix. These errors appeared to
be improper error handling, but error injection testing uncovered that
the errors were a result of corruption that occurred from improper error
handling during snapshot delete.
With snapshot delete if we encounter any errors during walk_down or
walk_up we'll simply return an error, we won't abort the transaction.
This is problematic because we will be dropping references for nodes and
leaves along the way, and if we fail in the middle we will leave the
file system corrupt because we don't know where we left off in the drop.
Fix this by making sure we abort if we hit any errors during the walk
down or walk up operations, as we have no idea what operations could
have been left half done at this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We can get errors in walk_down_proc as we try and lookup extent info for
the snapshot dropping to act on. However if we get an error we simply
return 1 which indicates we're done with walking down, which will lead
us to improperly continue with the snapshot drop with the incorrect
information. Instead break if we get any error from walk_down_proc or
do_walk_down, and handle the case of ret == 1 by returning 0, otherwise
return the ret value that we have.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
When we mount the file system we do something like this:
while (1) {
lookup fs roots;
for (i = 0; i < num_roots; i++) {
ret = btrfs_orphan_cleanup(roots[i]);
if (ret)
break;
btrfs_put_root(roots[i]);
}
}
for (; i < num_roots; i++)
btrfs_put_root(roots[i]);
As you can see if we break in that inner loop we just go back to the
outer loop and lose the fact that we have to drop references on the
remaining roots we looked up. Fix this by making an out label and
jumping to that on error so we don't leak a reference to the roots we
looked up.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We missed a couple of iput()s in the orphan cleanup failure paths, add
them so we don't get refcount errors. The iput needs to be done in the
check and not under a common label due to the way the code is
structured.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While investigating a problem with error injection I tripped over
curious behavior in the node/leaf splitting code. If we get an EIO when
trying to read either the left or right leaf/node for splitting we'll
simply treat the node as if it were full and continue on. The end
result of this isn't too bad, we simply end up allocating a block when
we may have pushed items into the adjacent blocks.
However this does essentially allow us to continue to modify a file
system that we've gotten errors on, either from a bad disk or csum
mismatch or other corruption. This isn't particularly safe, so instead
handle these btrfs_read_node_slot() usages differently. We allow you to
pass in any slot, the idea being that we save some code if the slot
number is outside of the range of the parent. This means we treat all
errors the same, when in reality we only want to ignore -ENOENT.
Fix this by changing how we call btrfs_read_node_slot(), which is to
only call it for slots we know are valid. This way if we get an error
back from reading the block we can properly pass the error up the chain.
This was validated with the error injection testing I was doing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
In btrfs_read_node_slot() we have a BUG_ON() that can be converted to an
ASSERT(), it's from an extent buffer and the level is validated at the
time it's read from disk.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
While trying to track down a lost EIO problem I hit the following
assertion while doing my error injection testing
BTRFS warning (device nvme1n1): transaction 1609 (with 180224 dirty metadata bytes) is not committed
assertion failed: !found, in fs/btrfs/disk-io.c:4456
------------[ cut here ]------------
kernel BUG at fs/btrfs/messages.h:169!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 1445 Comm: mount Tainted: G W 6.2.0-rc5+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
RIP: 0010:btrfs_assertfail.constprop.0+0x18/0x1a
RSP: 0018:ffffb95fc3b0bc68 EFLAGS: 00010286
RAX: 0000000000000034 RBX: ffff9941c2ac2000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffb6741f7d RDI: 00000000ffffffff
RBP: ffff9941c2ac2428 R08: 0000000000000000 R09: ffffb95fc3b0bb38
R10: 0000000000000003 R11: ffffffffb71438a8 R12: ffff9941c2ac2428
R13: ffff9941c2ac2450 R14: ffff9941c2ac2450 R15: 000000000002c000
FS: 00007fcea2d07800(0000) GS:ffff9941fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f00cc7c83a8 CR3: 000000010c686000 CR4: 0000000000350ef0
Call Trace:
<TASK>
close_ctree+0x426/0x48f
btrfs_mount_root.cold+0x7e/0xee
? legacy_parse_param+0x2b/0x220
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x29/0xc0
vfs_kern_mount.part.0+0x73/0xb0
btrfs_mount+0x11d/0x3d0
? legacy_parse_param+0x2b/0x220
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x29/0xc0
path_mount+0x438/0xa40
__x64_sys_mount+0xe9/0x130
do_syscall_64+0x3e/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
This is because the error injection did an EIO for the root inode lookup
and we simply jumped to closing the ctree. However because we didn't
mark the file system as having an error we skipped all of the broken
transaction cleanup stuff, and thus triggered this ASSERT(). Fix this
by calling btrfs_handle_fs_error() in this case so we have the error set
on the file system.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix fast checksum detection, this affects filesystems with non-crc32c
checksum, calculation would not be offloaded to worker threads
- restore thread_pool mount option behaviour for endio workers, the new
value for maximum active threads would not be set to the actual work
queues
* tag 'for-6.3-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix fast csum implementation detection
btrfs: restore the thread_pool= behavior in remount for the end I/O workqueues
|
|
The BTRFS_FS_CSUM_IMPL_FAST flag is currently set whenever a non-generic
crc32c is detected, which is the incorrect check if the file system uses
a different checksumming algorithm. Refactor the code to only check
this if crc32c is actually used. Note that in an ideal world the
information if an algorithm is hardware accelerated or not should be
provided by the crypto API instead, but that's left for another day.
CC: stable@vger.kernel.org # 5.4.x: c8a5f8ca9a9c: btrfs: print checksum type and implementation at mount time
CC: stable@vger.kernel.org # 5.4.x
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit d7b9416fe5c5 ("btrfs: remove btrfs_end_io_wq") converted the read
and I/O handling from btrfs_workqueues to Linux workqueues, and as part
of that lost the code to apply the thread_pool= based max_active limit
on remount. Restore it.
Fixes: d7b9416fe5c5 ("btrfs: remove btrfs_end_io_wq")
CC: stable@vger.kernel.org # 6.0+
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- scan block devices in non-exclusive mode to avoid temporary mkfs
failures
- fix race between quota disable and quota assign ioctls
- fix deadlock when aborting transaction during relocation with scrub
- ignore fiemap path cache when there are multiple paths for a node
* tag 'for-6.3-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: ignore fiemap path cache when there are multiple paths for a node
btrfs: fix deadlock when aborting transaction during relocation with scrub
btrfs: scan device in non-exclusive mode
btrfs: fix race between quota disable and quota assign ioctls
|
|
This returns a pointer to the current iovec entry in the iterator. Only
useful with ITER_IOVEC right now, but it prepares us to treat ITER_UBUF
and ITER_IOVEC identically for the first segment.
Rename struct iov_iter->iov to iov_iter->__iov to find any potentially
troublesome spots, and also to prevent anyone from adding new code that
accesses iter->iov directly.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
During fiemap, when walking backreferences to determine if a b+tree
node/leaf is shared, we may find a tree block (leaf or node) for which
two parents were added to the references ulist. This happens if we get
for example one direct ref (shared tree block ref) and one indirect ref
(non-shared tree block ref) for the tree block at the current level,
which can happen during relocation.
In that case the fiemap path cache can not be used since it's meant for
a single path, with one tree block at each possible level, so having
multiple references for a tree block at any level may result in getting
the level counter exceed BTRFS_MAX_LEVEL and eventually trigger the
warning:
WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL)
at lookup_backref_shared_cache() and at store_backref_shared_cache().
This is harmless since the code ignores any level >= BTRFS_MAX_LEVEL, the
warning is there just to catch any unexpected case like the one described
above. However if a user finds this it may be scary and get reported.
So just ignore the path cache once we find a tree block for which there
are more than one reference, which is the less common case, and update
the cache with the sharedness check result for all levels below the level
for which we found multiple references.
Reported-by: Jarno Pelkonen <jarno.pelkonen@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAKv8qLmDNAGJGCtsevxx_VZ_YOvvs1L83iEJkTgyA4joJertng@mail.gmail.com/
Fixes: 12a824dc67a6 ("btrfs: speedup checking for extent sharedness during fiemap")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Before relocating a block group we pause scrub, then do the relocation and
then unpause scrub. The relocation process requires starting and committing
a transaction, and if we have a failure in the critical section of the
transaction commit path (transaction state >= TRANS_STATE_COMMIT_START),
we will deadlock if there is a paused scrub.
That results in stack traces like the following:
[42.479] BTRFS info (device sdc): relocating block group 53876686848 flags metadata|raid6
[42.936] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[42.936] ------------[ cut here ]------------
[42.936] BTRFS: Transaction aborted (error -28)
[42.936] WARNING: CPU: 11 PID: 346822 at fs/btrfs/transaction.c:1977 btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Modules linked in: dm_flakey dm_mod loop btrfs (...)
[42.936] CPU: 11 PID: 346822 Comm: btrfs Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[42.936] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[42.936] RIP: 0010:btrfs_commit_transaction+0xcc8/0xeb0 [btrfs]
[42.936] Code: ff ff 45 8b (...)
[42.936] RSP: 0018:ffffb58649633b48 EFLAGS: 00010282
[42.936] RAX: 0000000000000000 RBX: ffff8be6ef4d5bd8 RCX: 0000000000000000
[42.936] RDX: 0000000000000002 RSI: ffffffffb35e7782 RDI: 00000000ffffffff
[42.936] RBP: ffff8be6ef4d5c98 R08: 0000000000000000 R09: ffffb586496339e8
[42.936] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8be6d38c7c00
[42.936] R13: 00000000ffffffe4 R14: ffff8be6c268c000 R15: ffff8be6ef4d5cf0
[42.936] FS: 00007f381a82b340(0000) GS:ffff8beddfcc0000(0000) knlGS:0000000000000000
[42.936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42.936] CR2: 00007f1e35fb7638 CR3: 0000000117680006 CR4: 0000000000370ee0
[42.936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42.936] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42.936] Call Trace:
[42.936] <TASK>
[42.936] ? start_transaction+0xcb/0x610 [btrfs]
[42.936] prepare_to_relocate+0x111/0x1a0 [btrfs]
[42.936] relocate_block_group+0x57/0x5d0 [btrfs]
[42.936] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[42.936] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[42.936] ? __pfx_autoremove_wake_function+0x10/0x10
[42.936] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[42.936] btrfs_balance+0x8ff/0x11d0 [btrfs]
[42.936] ? __kmem_cache_alloc_node+0x14a/0x410
[42.936] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[42.937] ? mod_objcg_state+0xd2/0x360
[42.937] ? refill_obj_stock+0xb0/0x160
[42.937] ? seq_release+0x25/0x30
[42.937] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[42.937] ? percpu_counter_add_batch+0x2e/0xa0
[42.937] ? __x64_sys_ioctl+0x88/0xc0
[42.937] __x64_sys_ioctl+0x88/0xc0
[42.937] do_syscall_64+0x38/0x90
[42.937] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[42.937] RIP: 0033:0x7f381a6ffe9b
[42.937] Code: 00 48 89 44 24 (...)
[42.937] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[42.937] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[42.937] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[42.937] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[42.937] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[42.937] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[42.937] </TASK>
[42.937] ---[ end trace 0000000000000000 ]---
[42.937] BTRFS: error (device sdc: state A) in cleanup_transaction:1977: errno=-28 No space left
[59.196] INFO: task btrfs:346772 blocked for more than 120 seconds.
[59.196] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.196] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.196] task:btrfs state:D stack:0 pid:346772 ppid:1 flags:0x00004002
[59.196] Call Trace:
[59.196] <TASK>
[59.196] __schedule+0x392/0xa70
[59.196] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.196] schedule+0x5d/0xd0
[59.196] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.197] ? __pfx_autoremove_wake_function+0x10/0x10
[59.197] scrub_pause_off+0x21/0x50 [btrfs]
[59.197] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.197] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] scrub_stripe+0x20d/0x740 [btrfs]
[59.198] scrub_chunk+0xc4/0x130 [btrfs]
[59.198] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.198] ? __pfx_autoremove_wake_function+0x10/0x10
[59.198] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.199] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.199] ? _copy_from_user+0x7b/0x80
[59.199] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.199] ? refill_stock+0x33/0x50
[59.199] ? should_failslab+0xa/0x20
[59.199] ? kmem_cache_alloc_node+0x151/0x460
[59.199] ? alloc_io_context+0x1b/0x80
[59.199] ? preempt_count_add+0x70/0xa0
[59.199] ? __x64_sys_ioctl+0x88/0xc0
[59.199] __x64_sys_ioctl+0x88/0xc0
[59.199] do_syscall_64+0x38/0x90
[59.199] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.199] RIP: 0033:0x7f82ffaffe9b
[59.199] RSP: 002b:00007f82ff9fcc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.199] RAX: ffffffffffffffda RBX: 000055b191e36310 RCX: 00007f82ffaffe9b
[59.199] RDX: 000055b191e36310 RSI: 00000000c400941b RDI: 0000000000000003
[59.199] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.199] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff9fd640
[59.199] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.199] </TASK>
[59.199] INFO: task btrfs:346773 blocked for more than 120 seconds.
[59.200] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.201] task:btrfs state:D stack:0 pid:346773 ppid:1 flags:0x00004002
[59.201] Call Trace:
[59.201] <TASK>
[59.201] __schedule+0x392/0xa70
[59.201] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.201] schedule+0x5d/0xd0
[59.201] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.201] ? __pfx_autoremove_wake_function+0x10/0x10
[59.201] scrub_pause_off+0x21/0x50 [btrfs]
[59.202] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.202] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.202] ? __pfx_autoremove_wake_function+0x10/0x10
[59.202] scrub_stripe+0x20d/0x740 [btrfs]
[59.202] scrub_chunk+0xc4/0x130 [btrfs]
[59.203] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.203] ? __pfx_autoremove_wake_function+0x10/0x10
[59.203] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.203] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.203] ? _copy_from_user+0x7b/0x80
[59.203] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.204] ? should_failslab+0xa/0x20
[59.204] ? kmem_cache_alloc_node+0x151/0x460
[59.204] ? alloc_io_context+0x1b/0x80
[59.204] ? preempt_count_add+0x70/0xa0
[59.204] ? __x64_sys_ioctl+0x88/0xc0
[59.204] __x64_sys_ioctl+0x88/0xc0
[59.204] do_syscall_64+0x38/0x90
[59.204] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.204] RIP: 0033:0x7f82ffaffe9b
[59.204] RSP: 002b:00007f82ff1fbc50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.204] RAX: ffffffffffffffda RBX: 000055b191e36790 RCX: 00007f82ffaffe9b
[59.204] RDX: 000055b191e36790 RSI: 00000000c400941b RDI: 0000000000000003
[59.204] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.204] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82ff1fc640
[59.204] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.204] </TASK>
[59.204] INFO: task btrfs:346774 blocked for more than 120 seconds.
[59.205] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.206] task:btrfs state:D stack:0 pid:346774 ppid:1 flags:0x00004002
[59.206] Call Trace:
[59.206] <TASK>
[59.206] __schedule+0x392/0xa70
[59.206] schedule+0x5d/0xd0
[59.206] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.206] ? __pfx_autoremove_wake_function+0x10/0x10
[59.206] scrub_pause_off+0x21/0x50 [btrfs]
[59.207] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.207] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.207] ? __pfx_autoremove_wake_function+0x10/0x10
[59.207] scrub_stripe+0x20d/0x740 [btrfs]
[59.208] scrub_chunk+0xc4/0x130 [btrfs]
[59.208] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.208] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.208] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.208] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.209] ? _copy_from_user+0x7b/0x80
[59.209] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.209] ? should_failslab+0xa/0x20
[59.209] ? kmem_cache_alloc_node+0x151/0x460
[59.209] ? alloc_io_context+0x1b/0x80
[59.209] ? preempt_count_add+0x70/0xa0
[59.209] ? __x64_sys_ioctl+0x88/0xc0
[59.209] __x64_sys_ioctl+0x88/0xc0
[59.209] do_syscall_64+0x38/0x90
[59.209] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.209] RIP: 0033:0x7f82ffaffe9b
[59.209] RSP: 002b:00007f82fe9fac50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.209] RAX: ffffffffffffffda RBX: 000055b191e36c10 RCX: 00007f82ffaffe9b
[59.209] RDX: 000055b191e36c10 RSI: 00000000c400941b RDI: 0000000000000003
[59.209] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.209] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe9fb640
[59.209] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.209] </TASK>
[59.209] INFO: task btrfs:346775 blocked for more than 120 seconds.
[59.210] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.210] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.211] task:btrfs state:D stack:0 pid:346775 ppid:1 flags:0x00004002
[59.211] Call Trace:
[59.211] <TASK>
[59.211] __schedule+0x392/0xa70
[59.211] schedule+0x5d/0xd0
[59.211] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.211] ? __pfx_autoremove_wake_function+0x10/0x10
[59.211] scrub_pause_off+0x21/0x50 [btrfs]
[59.212] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.212] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.212] ? __pfx_autoremove_wake_function+0x10/0x10
[59.212] scrub_stripe+0x20d/0x740 [btrfs]
[59.213] scrub_chunk+0xc4/0x130 [btrfs]
[59.213] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.213] ? __mutex_unlock_slowpath.isra.0+0x9a/0x120
[59.213] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.213] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.214] ? _copy_from_user+0x7b/0x80
[59.214] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.214] ? should_failslab+0xa/0x20
[59.214] ? kmem_cache_alloc_node+0x151/0x460
[59.214] ? alloc_io_context+0x1b/0x80
[59.214] ? preempt_count_add+0x70/0xa0
[59.214] ? __x64_sys_ioctl+0x88/0xc0
[59.214] __x64_sys_ioctl+0x88/0xc0
[59.214] do_syscall_64+0x38/0x90
[59.214] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.214] RIP: 0033:0x7f82ffaffe9b
[59.214] RSP: 002b:00007f82fe1f9c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.214] RAX: ffffffffffffffda RBX: 000055b191e37090 RCX: 00007f82ffaffe9b
[59.214] RDX: 000055b191e37090 RSI: 00000000c400941b RDI: 0000000000000003
[59.214] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.214] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fe1fa640
[59.214] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.214] </TASK>
[59.214] INFO: task btrfs:346776 blocked for more than 120 seconds.
[59.215] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.217] task:btrfs state:D stack:0 pid:346776 ppid:1 flags:0x00004002
[59.217] Call Trace:
[59.217] <TASK>
[59.217] __schedule+0x392/0xa70
[59.217] ? __pv_queued_spin_lock_slowpath+0x165/0x370
[59.217] schedule+0x5d/0xd0
[59.217] __scrub_blocked_if_needed+0x74/0xc0 [btrfs]
[59.217] ? __pfx_autoremove_wake_function+0x10/0x10
[59.217] scrub_pause_off+0x21/0x50 [btrfs]
[59.217] scrub_simple_mirror+0x1c7/0x950 [btrfs]
[59.217] ? scrub_parity_put+0x1a5/0x1d0 [btrfs]
[59.218] ? __pfx_autoremove_wake_function+0x10/0x10
[59.218] scrub_stripe+0x20d/0x740 [btrfs]
[59.218] scrub_chunk+0xc4/0x130 [btrfs]
[59.218] scrub_enumerate_chunks+0x3e4/0x7a0 [btrfs]
[59.219] ? __pfx_autoremove_wake_function+0x10/0x10
[59.219] btrfs_scrub_dev+0x236/0x6a0 [btrfs]
[59.219] ? btrfs_ioctl+0xd97/0x32c0 [btrfs]
[59.219] ? _copy_from_user+0x7b/0x80
[59.219] btrfs_ioctl+0xde1/0x32c0 [btrfs]
[59.219] ? should_failslab+0xa/0x20
[59.219] ? kmem_cache_alloc_node+0x151/0x460
[59.219] ? alloc_io_context+0x1b/0x80
[59.219] ? preempt_count_add+0x70/0xa0
[59.219] ? __x64_sys_ioctl+0x88/0xc0
[59.219] __x64_sys_ioctl+0x88/0xc0
[59.219] do_syscall_64+0x38/0x90
[59.219] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.219] RIP: 0033:0x7f82ffaffe9b
[59.219] RSP: 002b:00007f82fd9f8c50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.219] RAX: ffffffffffffffda RBX: 000055b191e37510 RCX: 00007f82ffaffe9b
[59.219] RDX: 000055b191e37510 RSI: 00000000c400941b RDI: 0000000000000003
[59.219] RBP: 0000000000000000 R08: 00007fff1575016f R09: 0000000000000000
[59.219] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f82fd9f9640
[59.219] R13: 000000000000006b R14: 00007f82ffa87580 R15: 0000000000000000
[59.219] </TASK>
[59.219] INFO: task btrfs:346822 blocked for more than 120 seconds.
[59.220] Tainted: G W 6.3.0-rc2-btrfs-next-127+ #1
[59.221] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[59.222] task:btrfs state:D stack:0 pid:346822 ppid:1 flags:0x00004002
[59.222] Call Trace:
[59.222] <TASK>
[59.222] __schedule+0x392/0xa70
[59.222] schedule+0x5d/0xd0
[59.222] btrfs_scrub_cancel+0x91/0x100 [btrfs]
[59.222] ? __pfx_autoremove_wake_function+0x10/0x10
[59.222] btrfs_commit_transaction+0x572/0xeb0 [btrfs]
[59.223] ? start_transaction+0xcb/0x610 [btrfs]
[59.223] prepare_to_relocate+0x111/0x1a0 [btrfs]
[59.223] relocate_block_group+0x57/0x5d0 [btrfs]
[59.223] ? btrfs_wait_nocow_writers+0x25/0xb0 [btrfs]
[59.223] btrfs_relocate_block_group+0x248/0x3c0 [btrfs]
[59.224] ? __pfx_autoremove_wake_function+0x10/0x10
[59.224] btrfs_relocate_chunk+0x3b/0x150 [btrfs]
[59.224] btrfs_balance+0x8ff/0x11d0 [btrfs]
[59.224] ? __kmem_cache_alloc_node+0x14a/0x410
[59.224] btrfs_ioctl+0x2334/0x32c0 [btrfs]
[59.225] ? mod_objcg_state+0xd2/0x360
[59.225] ? refill_obj_stock+0xb0/0x160
[59.225] ? seq_release+0x25/0x30
[59.225] ? __rseq_handle_notify_resume+0x3b5/0x4b0
[59.225] ? percpu_counter_add_batch+0x2e/0xa0
[59.225] ? __x64_sys_ioctl+0x88/0xc0
[59.225] __x64_sys_ioctl+0x88/0xc0
[59.225] do_syscall_64+0x38/0x90
[59.225] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[59.225] RIP: 0033:0x7f381a6ffe9b
[59.225] RSP: 002b:00007ffd45ecf060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[59.225] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f381a6ffe9b
[59.225] RDX: 00007ffd45ecf150 RSI: 00000000c4009420 RDI: 0000000000000003
[59.225] RBP: 0000000000000003 R08: 0000000000000013 R09: 0000000000000000
[59.225] R10: 00007f381a60c878 R11: 0000000000000246 R12: 00007ffd45ed0423
[59.225] R13: 00007ffd45ecf150 R14: 0000000000000000 R15: 00007ffd45ecf148
[59.225] </TASK>
What happens is the following:
1) A scrub is running, so fs_info->scrubs_running is 1;
2) Task A starts block group relocation, and at btrfs_relocate_chunk() it
pauses scrub by calling btrfs_scrub_pause(). That increments
fs_info->scrub_pause_req from 0 to 1 and waits for the scrub task to
pause (for fs_info->scrubs_paused to be == to fs_info->scrubs_running);
3) The scrub task pauses at scrub_pause_off(), waiting for
fs_info->scrub_pause_req to decrease to 0;
4) Task A then enters btrfs_relocate_block_group(), and down that call
chain we start a transaction and then attempt to commit it;
5) When task A calls btrfs_commit_transaction(), it either will do the
commit itself or wait for some other task that already started the
commit of the transaction - it doesn't matter which case;
6) The transaction commit enters state TRANS_STATE_COMMIT_START;
7) An error happens during the transaction commit, like -ENOSPC when
running delayed refs or delayed items for example;
8) This results in calling transaction.c:cleanup_transaction(), where
we call btrfs_scrub_cancel(), incrementing fs_info->scrub_cancel_req
from 0 to 1, and blocking this task waiting for fs_info->scrubs_running
to decrease to 0;
9) From this point on, both the transaction commit and the scrub task
hang forever:
1) The transaction commit is waiting for fs_info->scrubs_running to
be decreased to 0;
2) The scrub task is at scrub_pause_off() waiting for
fs_info->scrub_pause_req to decrease to 0 - so it can not proceed
to stop the scrub and decrement fs_info->scrubs_running from 0 to 1.
Therefore resulting in a deadlock.
Fix this by having cleanup_transaction(), called if a transaction commit
fails, not call btrfs_scrub_cancel() if relocation is in progress, and
having btrfs_relocate_block_group() call btrfs_scrub_cancel() instead if
the relocation failed and a transaction abort happened.
This was triggered with btrfs/061 from fstests.
Fixes: 55e3a601c81c ("btrfs: Fix data checksum error cause by replace with io-load.")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This fixes mkfs/mount/check failures due to race with systemd-udevd
scan.
During the device scan initiated by systemd-udevd, other user space
EXCL operations such as mkfs, mount, or check may get blocked and result
in a "Device or resource busy" error. This is because the device
scan process opens the device with the EXCL flag in the kernel.
Two reports were received:
- btrfs/179 test case, where the fsck command failed with the -EBUSY
error
- LTP pwritev03 test case, where mkfs.vfs failed with
the -EBUSY error, when mkfs.vfs tried to overwrite old btrfs filesystem
on the device.
In both cases, fsck and mkfs (respectively) were racing with a
systemd-udevd device scan, and systemd-udevd won, resulting in the
-EBUSY error for fsck and mkfs.
Reproducing the problem has been difficult because there is a very
small window during which these userspace threads can race to
acquire the exclusive device open. Even on the system where the problem
was observed, the problem occurrences were anywhere between 10 to 400
iterations and chances of reproducing decreases with debug printk()s.
However, an exclusive device open is unnecessary for the scan process,
as there are no write operations on the device during scan. Furthermore,
during the mount process, the superblock is re-read in the below
function call chain:
btrfs_mount_root
btrfs_open_devices
open_fs_devices
btrfs_open_one_device
btrfs_get_bdev_and_sb
So, to fix this issue, removes the FMODE_EXCL flag from the scan
operation, and add a comment.
The case where mkfs may still write to the device and a scan is running,
the btrfs signature is not written at that time so scan will not
recognize such device.
Reported-by: Sherry Yang <sherry.yang@oracle.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202303170839.fdf23068-oliver.sang@intel.com
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The quota assign ioctl can currently run in parallel with a quota disable
ioctl call. The assign ioctl uses the quota root, while the disable ioctl
frees that root, and therefore we can have a use-after-free triggered in
the assign ioctl, leading to a trace like the following when KASAN is
enabled:
[672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0
[672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736
[672.724][T736]
[672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37
[672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[672.727][T736] Call Trace:
[672.728][T736] <TASK>
[672.728][T736] dump_stack_lvl+0xd9/0x150
[672.725][T736] print_report+0xc1/0x5e0
[672.720][T736] ? __virt_addr_valid+0x61/0x2e0
[672.727][T736] ? __phys_addr+0xc9/0x150
[672.725][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.722][T736] kasan_report+0xc0/0xf0
[672.729][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.724][T736] btrfs_search_slot+0x2962/0x2db0
[672.723][T736] ? fs_reclaim_acquire+0xba/0x160
[672.722][T736] ? split_leaf+0x13d0/0x13d0
[672.726][T736] ? rcu_is_watching+0x12/0xb0
[672.723][T736] ? kmem_cache_alloc+0x338/0x3c0
[672.722][T736] update_qgroup_status_item+0xf7/0x320
[672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0
[672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0
[672.730][T736] ? spin_bug+0x1d0/0x1d0
[672.737][T736] btrfs_run_qgroups+0x5de/0x840
[672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70
[672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00
[672.738][T736] btrfs_ioctl+0x3d58/0x5d80
[672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550
[672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0
[672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90
[672.734][T736] ? do_vfs_ioctl+0x132/0x1660
[672.730][T736] ? vfs_fileattr_set+0xc40/0xc40
[672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50
[672.732][T736] ? sigprocmask+0xf2/0x340
[672.737][T736] ? __fget_files+0x26a/0x480
[672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10
[672.738][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.736][T736] __x64_sys_ioctl+0x198/0x210
[672.736][T736] do_syscall_64+0x39/0xb0
[672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.739][T736] RIP: 0033:0x4556ad
[672.742][T736] </TASK>
[672.743][T736]
[672.748][T736] Allocated by task 27677:
[672.743][T736] kasan_save_stack+0x22/0x40
[672.741][T736] kasan_set_track+0x25/0x30
[672.741][T736] __kasan_kmalloc+0xa4/0xb0
[672.749][T736] btrfs_alloc_root+0x48/0x90
[672.746][T736] btrfs_create_tree+0x146/0xa20
[672.744][T736] btrfs_quota_enable+0x461/0x1d20
[672.743][T736] btrfs_ioctl+0x4a1c/0x5d80
[672.747][T736] __x64_sys_ioctl+0x198/0x210
[672.749][T736] do_syscall_64+0x39/0xb0
[672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.756][T736]
[672.757][T736] Freed by task 27677:
[672.759][T736] kasan_save_stack+0x22/0x40
[672.759][T736] kasan_set_track+0x25/0x30
[672.756][T736] kasan_save_free_info+0x2e/0x50
[672.751][T736] ____kasan_slab_free+0x162/0x1c0
[672.758][T736] slab_free_freelist_hook+0x89/0x1c0
[672.752][T736] __kmem_cache_free+0xaf/0x2e0
[672.752][T736] btrfs_put_root+0x1ff/0x2b0
[672.759][T736] btrfs_quota_disable+0x80a/0xbc0
[672.752][T736] btrfs_ioctl+0x3e5f/0x5d80
[672.756][T736] __x64_sys_ioctl+0x198/0x210
[672.753][T736] do_syscall_64+0x39/0xb0
[672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.769][T736]
[672.768][T736] The buggy address belongs to the object at ffff888022ec0000
[672.768][T736] which belongs to the cache kmalloc-4k of size 4096
[672.769][T736] The buggy address is located 520 bytes inside of
[672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000)
[672.760][T736]
[672.764][T736] The buggy address belongs to the physical page:
[672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0
[672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002
[672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000
[672.771][T736] page dumped because: kasan: bad access detected
[672.778][T736] page_owner tracks the page as allocated
[672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88
[672.779][T736] get_page_from_freelist+0x119c/0x2d50
[672.779][T736] __alloc_pages+0x1cb/0x4a0
[672.776][T736] alloc_pages+0x1aa/0x270
[672.773][T736] allocate_slab+0x260/0x390
[672.771][T736] ___slab_alloc+0xa9a/0x13e0
[672.778][T736] __slab_alloc.constprop.0+0x56/0xb0
[672.771][T736] __kmem_cache_alloc_node+0x136/0x320
[672.789][T736] __kmalloc+0x4e/0x1a0
[672.783][T736] tomoyo_realpath_from_path+0xc3/0x600
[672.781][T736] tomoyo_path_perm+0x22f/0x420
[672.782][T736] tomoyo_path_unlink+0x92/0xd0
[672.780][T736] security_path_unlink+0xdb/0x150
[672.788][T736] do_unlinkat+0x377/0x680
[672.788][T736] __x64_sys_unlink+0xca/0x110
[672.789][T736] do_syscall_64+0x39/0xb0
[672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.784][T736] page last free stack trace:
[672.787][T736] free_pcp_prepare+0x4e5/0x920
[672.787][T736] free_unref_page+0x1d/0x4e0
[672.784][T736] __unfreeze_partials+0x17c/0x1a0
[672.797][T736] qlist_free_all+0x6a/0x180
[672.796][T736] kasan_quarantine_reduce+0x189/0x1d0
[672.797][T736] __kasan_slab_alloc+0x64/0x90
[672.793][T736] kmem_cache_alloc+0x17c/0x3c0
[672.799][T736] getname_flags.part.0+0x50/0x4e0
[672.799][T736] getname_flags+0x9e/0xe0
[672.792][T736] vfs_fstatat+0x77/0xb0
[672.791][T736] __do_sys_newlstat+0x84/0x100
[672.798][T736] do_syscall_64+0x39/0xb0
[672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.790][T736]
[672.791][T736] Memory state around the buggy address:
[672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ^
[672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex
before calling btrfs_run_qgroups(), which is what all qgroup ioctls should
call.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8waEPciCHJvMA@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes, the zoned accounting fix is spread across a few
patches, preparatory and the actual fixes:
- zoned mode:
- fix accounting of unusable zone space
- fix zone activation condition for DUP profile
- preparatory patches
- improved error handling of missing chunks
- fix compiler warning"
* tag 'for-6.3-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: drop space_info->active_total_bytes
btrfs: zoned: count fresh BG region as zone unusable
btrfs: use temporary variable for space_info in btrfs_update_block_group
btrfs: rename BTRFS_FS_NO_OVERCOMMIT to BTRFS_FS_ACTIVE_ZONE_TRACKING
btrfs: zoned: fix btrfs_can_activate_zone() to support DUP profile
btrfs: fix compiler warning on SPARC/PA-RISC handling fscrypt_setup_filename
btrfs: handle missing chunk mapping more gracefully
|
|
The space_info->active_total_bytes is no longer necessary as we now
count the region of newly allocated block group as zone_unusable. Drop
its usage.
Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
The naming of space_info->active_total_bytes is misleading. It counts
not only active block groups but also full ones which are previously
active but now inactive. That confusion results in a bug not counting
the full BGs into active_total_bytes on mount time.
For a background, there are three kinds of block groups in terms of
activation.
1. Block groups never activated
2. Block groups currently active
3. Block groups previously active and currently inactive (due to fully
written or zone finish)
What we really wanted to exclude from "total_bytes" is the total size of
BGs #1. They seem empty and allocatable but since they are not activated,
we cannot rely on them to do the space reservation.
And, since BGs #1 never get activated, they should have no "used",
"reserved" and "pinned" bytes.
OTOH, BGs #3 can be counted in the "total", since they are already full
we cannot allocate from them anyway. For them, "total_bytes == used +
reserved + pinned + zone_unusable" should hold.
Tracking #2 and #3 as "active_total_bytes" (current implementation) is
confusing. And, tracking #1 and subtract that properly from "total_bytes"
every time you need space reservation is cumbersome.
Instead, we can count the whole region of a newly allocated block group as
zone_unusable. Then, once that block group is activated, release
[0 .. zone_capacity] from the zone_unusable counters. With this, we can
eliminate the confusing ->active_total_bytes and the code will be common
among regular and the zoned mode. Also, no additional counter is needed
with this approach.
Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We do
cache->space_info->counter += num_bytes;
everywhere in here. This is makes the lines longer than they need to
be, and will be especially noticeable when we add the active tracking in,
so add a temp variable for the space_info so this is cleaner.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
This flag only gets set when we're doing active zone tracking, and we're
going to need to use this flag for things related to this behavior.
Rename the flag to represent what it actually means for the file system
so it can be used in other ways and still make sense.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
btrfs_can_activate_zone() returns true if at least one device has one zone
available for activation. This is OK for the single profile, but not OK for
DUP profile. We need two zones to create a DUP block group. Fix it by
properly handling the case with the profile flags.
Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Commit 1ec49744ba83 ("btrfs: turn on -Wmaybe-uninitialized") exposed
that on SPARC and PA-RISC, gcc is unaware that fscrypt_setup_filename()
only returns negative error values or 0. This ultimately results in a
maybe-uninitialized warning in btrfs_lookup_dentry().
Change to only return negative error values or 0 from
fscrypt_setup_filename() at the relevant call site, and assert that no
positive error codes are returned (which would have wider implications
involving other users).
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/all/481b19b5-83a0-4793-b4fd-194ad7b978c3@roeck-us.net/
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
[BUG]
During my scrub rework, I did a stupid thing like this:
bio->bi_iter.bi_sector = stripe->logical;
btrfs_submit_bio(fs_info, bio, stripe->mirror_num);
Above bi_sector assignment is using logical address directly, which
lacks ">> SECTOR_SHIFT".
This results a read on a range which has no chunk mapping.
This results the following crash:
BTRFS critical (device dm-1): unable to find logical 11274289152 length 65536
assertion failed: !IS_ERR(em), in fs/btrfs/volumes.c:6387
Sure this is all my fault, but this shows a possible problem in real
world, that some bit flip in file extents/tree block can point to
unmapped ranges, and trigger above ASSERT(), or if CONFIG_BTRFS_ASSERT
is not configured, cause invalid pointer access.
[PROBLEMS]
In the above call chain, we just don't handle the possible error from
btrfs_get_chunk_map() inside __btrfs_map_block().
[FIX]
The fix is straightforward, replace the ASSERT() with proper error
handling (callers handle errors already).
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"First batch of fixes. Among them there are two updates to sysfs and
ioctl which are not strictly fixes but are used for testing so there's
no reason to delay them.
- fix block group item corruption after inserting new block group
- fix extent map logging bit not cleared for split maps after
dropping range
- fix calculation of unusable block group space reporting bogus
values due to 32/64b division
- fix unnecessary increment of read error stat on write error
- improve error handling in inode update
- export per-device fsid in DEV_INFO ioctl to distinguish seeding
devices, needed for testing
- allocator size classes:
- fix potential dead lock in size class loading logic
- print sysfs stats for the allocation classes"
* tag 'for-6.3-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix block group item corruption after inserting new block group
btrfs: fix extent map logging bit not cleared for split maps after dropping range
btrfs: fix percent calculation for bg reclaim message
btrfs: fix unnecessary increment of read error stat on write error
btrfs: handle btrfs_del_item errors in __btrfs_update_delayed_inode
btrfs: ioctl: return device fsid from DEV_INFO ioctl
btrfs: fix potential dead lock in size class loading logic
btrfs: sysfs: add size class stats
|
|
We can often end up inserting a block group item, for a new block group,
with a wrong value for the used bytes field.
This happens if for the new allocated block group, in the same transaction
that created the block group, we have tasks allocating extents from it as
well as tasks removing extents from it.
For example:
1) Task A creates a metadata block group X;
2) Two extents are allocated from block group X, so its "used" field is
updated to 32K, and its "commit_used" field remains as 0;
3) Transaction commit starts, by some task B, and it enters
btrfs_start_dirty_block_groups(). There it tries to update the block
group item for block group X, which currently has its "used" field with
a value of 32K. But that fails since the block group item was not yet
inserted, and so on failure update_block_group_item() sets the
"commit_used" field of the block group back to 0;
4) The block group item is inserted by task A, when for example
btrfs_create_pending_block_groups() is called when releasing its
transaction handle. This results in insert_block_group_item() inserting
the block group item in the extent tree (or block group tree), with a
"used" field having a value of 32K, but without updating the
"commit_used" field in the block group, which remains with value of 0;
5) The two extents are freed from block X, so its "used" field changes
from 32K to 0;
6) The transaction commit by task B continues, it enters
btrfs_write_dirty_block_groups() which calls update_block_group_item()
for block group X, and there it decides to skip the block group item
update, because "used" has a value of 0 and "commit_used" has a value
of 0 too.
As a result, we end up with a block item having a 32K "used" field but
no extents allocated from it.
When this issue happens, a btrfs check reports an error like this:
[1/7] checking root items
[2/7] checking extents
block group [1104150528 1073741824] used 39796736 but extent items used 0
ERROR: errors found in extent allocation tree or chunk allocation
(...)
Fix this by making insert_block_group_item() update the block group's
"commit_used" field.
Fixes: 7248e0cebbef ("btrfs: skip update of block group item if used bytes are the same")
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
range
At btrfs_drop_extent_map_range() we are clearing the EXTENT_FLAG_LOGGING
bit on a 'flags' variable that was not initialized. This makes static
checkers complain about it, so initialize the 'flags' variable before
clearing the bit.
In practice this has no consequences, because EXTENT_FLAG_LOGGING should
not be set when btrfs_drop_extent_map_range() is called, as an fsync locks
the inode in exclusive mode, locks the inode's mmap semaphore in exclusive
mode too and it always flushes all delalloc.
Also add a comment about why we clear EXTENT_FLAG_LOGGING on a copy of the
flags of the split extent map.
Reported-by: Dan Carpenter <error27@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/Y%2FyipSVozUDEZKow@kili/
Fixes: db21370bffbc ("btrfs: drop extent map range more efficiently")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We have a report, that the info message for block-group reclaim is
crossing the 100% used mark.
This is happening as we were truncating the divisor for the division
(the block_group->length) to a 32bit value.
Fix this by using div64_u64() to not truncate the divisor.
In the worst case, it can lead to a div by zero error and should be
possible to trigger on 4 disks RAID0, and each device is large enough:
$ mkfs.btrfs -f /dev/test/scratch[1234] -m raid1 -d raid0
btrfs-progs v6.1
[...]
Filesystem size: 40.00GiB
Block group profiles:
Data: RAID0 4.00GiB <<<
Metadata: RAID1 256.00MiB
System: RAID1 8.00MiB
Reported-by: Forza <forza@tnonline.net>
Link: https://lore.kernel.org/linux-btrfs/e99483.c11a58d.1863591ca52@tnonline.net/
Fixes: 5f93e776c673 ("btrfs: zoned: print unusable percentage when reclaiming block groups")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add Qu's note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Current btrfs_log_dev_io_error() increases the read error count even if the
erroneous IO is a WRITE request. This is because it forget to use "else
if", and all the error WRITE requests counts as READ error as there is (of
course) no REQ_RAHEAD bit set.
Fixes: c3a62baf21ad ("btrfs: use chained bios when cloning")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Even if the slot is already read out, we may still need to re-balance
the tree, thus it can cause error in that btrfs_del_item() call and we
need to handle it properly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: void0red <void0red@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Currently user space utilizes dev info ioctl to grab the info of a
certain devid, this includes its device uuid. But the returned info is
not enough to determine if a device is a seed.
Commit a26d60dedf9a ("btrfs: sysfs: add devinfo/fsid to retrieve actual
fsid from the device") exports the same value in sysfs so this is for
parity with ioctl. Add a new member, fsid, into
btrfs_ioctl_dev_info_args, and populate the member with fsid value.
This should not cause any compatibility problem, following the
combinations:
- Old user space, old kernel
- Old user space, new kernel
User space tool won't even check the new member.
- New user space, old kernel
The kernel won't touch the new member, and user space tool should
zero out its argument, thus the new member is all zero.
User space tool can then know the kernel doesn't support this fsid
reporting, and falls back to whatever they can.
- New user space, new kernel
Go as planned.
Would find the fsid member is no longer zero, and trust its value.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
As reported by Filipe, there's a potential deadlock caused by
using btrfs_search_forward on commit_root. The locking there is
unconditional, even if ->skip_locking and ->search_commit_root is set.
It's not meant to be used for commit roots, so it always needs to do
locking.
So if another task is COWing a child node of the same root node and
then needs to wait for block group caching to complete when trying to
allocate a metadata extent, it deadlocks.
For example:
[539604.239315] sysrq: Show Blocked State
[539604.240133] task:kworker/u16:6 state:D stack:0 pid:2119594 ppid:2 flags:0x00004000
[539604.241613] Workqueue: btrfs-cache btrfs_work_helper [btrfs]
[539604.242673] Call Trace:
[539604.243129] <TASK>
[539604.243925] __schedule+0x41d/0xee0
[539604.244797] ? rcu_read_lock_sched_held+0x12/0x70
[539604.245399] ? rwsem_down_read_slowpath+0x185/0x490
[539604.246111] schedule+0x5d/0xf0
[539604.246593] rwsem_down_read_slowpath+0x2da/0x490
[539604.247290] ? rcu_barrier_tasks_trace+0x10/0x20
[539604.248090] __down_read_common+0x3d/0x150
[539604.248702] down_read_nested+0xc3/0x140
[539604.249280] __btrfs_tree_read_lock+0x24/0x100 [btrfs]
[539604.250097] btrfs_read_lock_root_node+0x48/0x60 [btrfs]
[539604.250915] btrfs_search_forward+0x59/0x460 [btrfs]
[539604.251781] ? btrfs_global_root+0x50/0x70 [btrfs]
[539604.252476] caching_thread+0x1be/0x920 [btrfs]
[539604.253167] btrfs_work_helper+0xf6/0x400 [btrfs]
[539604.253848] process_one_work+0x24f/0x5a0
[539604.254476] worker_thread+0x52/0x3b0
[539604.255166] ? __pfx_worker_thread+0x10/0x10
[539604.256047] kthread+0xf0/0x120
[539604.256591] ? __pfx_kthread+0x10/0x10
[539604.257212] ret_from_fork+0x29/0x50
[539604.257822] </TASK>
[539604.258233] task:btrfs-transacti state:D stack:0 pid:2236474 ppid:2 flags:0x00004000
[539604.259802] Call Trace:
[539604.260243] <TASK>
[539604.260615] __schedule+0x41d/0xee0
[539604.261205] ? rcu_read_lock_sched_held+0x12/0x70
[539604.262000] ? rwsem_down_read_slowpath+0x185/0x490
[539604.262822] schedule+0x5d/0xf0
[539604.263374] rwsem_down_read_slowpath+0x2da/0x490
[539604.266228] ? lock_acquire+0x160/0x310
[539604.266917] ? rcu_read_lock_sched_held+0x12/0x70
[539604.267996] ? lock_contended+0x19e/0x500
[539604.268720] __down_read_common+0x3d/0x150
[539604.269400] down_read_nested+0xc3/0x140
[539604.270057] __btrfs_tree_read_lock+0x24/0x100 [btrfs]
[539604.271129] btrfs_read_lock_root_node+0x48/0x60 [btrfs]
[539604.272372] btrfs_search_slot+0x143/0xf70 [btrfs]
[539604.273295] update_block_group_item+0x9e/0x190 [btrfs]
[539604.274282] btrfs_start_dirty_block_groups+0x1c4/0x4f0 [btrfs]
[539604.275381] ? __mutex_unlock_slowpath+0x45/0x280
[539604.276390] btrfs_commit_transaction+0xee/0xed0 [btrfs]
[539604.277391] ? lock_acquire+0x1a4/0x310
[539604.278080] ? start_transaction+0xcb/0x6c0 [btrfs]
[539604.279099] transaction_kthread+0x142/0x1c0 [btrfs]
[539604.279996] ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
[539604.280673] kthread+0xf0/0x120
[539604.281050] ? __pfx_kthread+0x10/0x10
[539604.281496] ret_from_fork+0x29/0x50
[539604.281966] </TASK>
[539604.282255] task:fsstress state:D stack:0 pid:2236483 ppid:1 flags:0x00004006
[539604.283897] Call Trace:
[539604.284700] <TASK>
[539604.285088] __schedule+0x41d/0xee0
[539604.285660] schedule+0x5d/0xf0
[539604.286175] btrfs_wait_block_group_cache_progress+0xf2/0x170 [btrfs]
[539604.287342] ? __pfx_autoremove_wake_function+0x10/0x10
[539604.288450] find_free_extent+0xd93/0x1750 [btrfs]
[539604.289256] ? _raw_spin_unlock+0x29/0x50
[539604.289911] ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
[539604.290843] btrfs_reserve_extent+0x147/0x290 [btrfs]
[539604.291943] btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
[539604.292903] __btrfs_cow_block+0x138/0x580 [btrfs]
[539604.293773] btrfs_cow_block+0x10e/0x240 [btrfs]
[539604.294595] btrfs_search_slot+0x7f3/0xf70 [btrfs]
[539604.295585] btrfs_update_device+0x71/0x1b0 [btrfs]
[539604.296459] btrfs_chunk_alloc_add_chunk_item+0xe0/0x340 [btrfs]
[539604.297489] btrfs_chunk_alloc+0x1bf/0x490 [btrfs]
[539604.298335] find_free_extent+0x6fa/0x1750 [btrfs]
[539604.299174] ? _raw_spin_unlock+0x29/0x50
[539604.299950] ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
[539604.300918] btrfs_reserve_extent+0x147/0x290 [btrfs]
[539604.301797] btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
[539604.303017] ? lock_release+0x224/0x4a0
[539604.303855] __btrfs_cow_block+0x138/0x580 [btrfs]
[539604.304789] btrfs_cow_block+0x10e/0x240 [btrfs]
[539604.305611] btrfs_search_slot+0x7f3/0xf70 [btrfs]
[539604.306682] ? btrfs_global_root+0x50/0x70 [btrfs]
[539604.308198] lookup_inline_extent_backref+0x17b/0x7a0 [btrfs]
[539604.309254] lookup_extent_backref+0x43/0xd0 [btrfs]
[539604.310122] __btrfs_free_extent+0xf8/0x810 [btrfs]
[539604.310874] ? lock_release+0x224/0x4a0
[539604.311724] ? btrfs_merge_delayed_refs+0x17b/0x1d0 [btrfs]
[539604.313023] __btrfs_run_delayed_refs+0x2ba/0x1260 [btrfs]
[539604.314271] btrfs_run_delayed_refs+0x8f/0x1c0 [btrfs]
[539604.315445] ? rcu_read_lock_sched_held+0x12/0x70
[539604.316706] btrfs_commit_transaction+0xa2/0xed0 [btrfs]
[539604.317855] ? do_raw_spin_unlock+0x4b/0xa0
[539604.318544] ? _raw_spin_unlock+0x29/0x50
[539604.319240] create_subvol+0x53d/0x6e0 [btrfs]
[539604.320283] btrfs_mksubvol+0x4f5/0x590 [btrfs]
[539604.321220] __btrfs_ioctl_snap_create+0x11b/0x180 [btrfs]
[539604.322307] btrfs_ioctl_snap_create_v2+0xc6/0x150 [btrfs]
[539604.323295] btrfs_ioctl+0x9f7/0x33e0 [btrfs]
[539604.324331] ? rcu_read_lock_sched_held+0x12/0x70
[539604.325137] ? lock_release+0x224/0x4a0
[539604.325808] ? __x64_sys_ioctl+0x87/0xc0
[539604.326467] __x64_sys_ioctl+0x87/0xc0
[539604.327109] do_syscall_64+0x38/0x90
[539604.327875] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[539604.328792] RIP: 0033:0x7f05a7babaeb
This needs to use regular btrfs_search_slot() with some skip and stop
logic.
Since we only consider five samples (five search slots), don't bother
with the complexity of looking for commit_root_sem contention. If
necessary, it can be added to the load function in between samples.
Reported-by: Filipe Manana <fdmanana@kernel.org>
Link: https://lore.kernel.org/linux-btrfs/CAL3q7H7eKMD44Z1+=Kb-1RFMMeZpAm2fwyO59yeBwCcSOU80Pg@mail.gmail.com/
Fixes: c7eec3d9aa95 ("btrfs: load block group size class when caching")
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
Remove struct posix_acl_{access,default}_handler for all filesystems
that don't depend on the xattr handler in their inode->i_op->listxattr()
method in any way. There's nothing more to do than to simply remove the
handler. It's been effectively unused ever since we introduced the new
posix acl api.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|
Make it possible to see the distribution of size classes for block
groups. Helpful for testing and debugging the allocator w.r.t. to size
classes.
The new stats can be found at the path:
/sys/fs/btrfs/<FSID>/allocation/<bg-type>/size_class
but they will only be non-zero for bg-type = data.
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Daniel Verkamp has contributed a memfd series ("mm/memfd: add
F_SEAL_EXEC") which permits the setting of the memfd execute bit at
memfd creation time, with the option of sealing the state of the X
bit.
- Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset()
thread-safe for pmd unshare") which addresses a rare race condition
related to PMD unsharing.
- Several folioification patch serieses from Matthew Wilcox, Vishal
Moola, Sidhartha Kumar and Lorenzo Stoakes
- Johannes Weiner has a series ("mm: push down lock_page_memcg()")
which does perform some memcg maintenance and cleanup work.
- SeongJae Park has added DAMOS filtering to DAMON, with the series
"mm/damon/core: implement damos filter".
These filters provide users with finer-grained control over DAMOS's
actions. SeongJae has also done some DAMON cleanup work.
- Kairui Song adds a series ("Clean up and fixes for swap").
- Vernon Yang contributed the series "Clean up and refinement for maple
tree".
- Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It
adds to MGLRU an LRU of memcgs, to improve the scalability of global
reclaim.
- David Hildenbrand has added some userfaultfd cleanup work in the
series "mm: uffd-wp + change_protection() cleanups".
- Christoph Hellwig has removed the generic_writepages() library
function in the series "remove generic_writepages".
- Baolin Wang has performed some maintenance on the compaction code in
his series "Some small improvements for compaction".
- Sidhartha Kumar is doing some maintenance work on struct page in his
series "Get rid of tail page fields".
- David Hildenbrand contributed some cleanup, bugfixing and
generalization of pte management and of pte debugging in his series
"mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with
swap PTEs".
- Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation
flag in the series "Discard __GFP_ATOMIC".
- Sergey Senozhatsky has improved zsmalloc's memory utilization with
his series "zsmalloc: make zspage chain size configurable".
- Joey Gouly has added prctl() support for prohibiting the creation of
writeable+executable mappings.
The previous BPF-based approach had shortcomings. See "mm: In-kernel
support for memory-deny-write-execute (MDWE)".
- Waiman Long did some kmemleak cleanup and bugfixing in the series
"mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF".
- T.J. Alumbaugh has contributed some MGLRU cleanup work in his series
"mm: multi-gen LRU: improve".
- Jiaqi Yan has provided some enhancements to our memory error
statistics reporting, mainly by presenting the statistics on a
per-node basis. See the series "Introduce per NUMA node memory error
statistics".
- Mel Gorman has a second and hopefully final shot at fixing a CPU-hog
regression in compaction via his series "Fix excessive CPU usage
during compaction".
- Christoph Hellwig does some vmalloc maintenance work in the series
"cleanup vfree and vunmap".
- Christoph Hellwig has removed block_device_operations.rw_page() in
ths series "remove ->rw_page".
- We get some maple_tree improvements and cleanups in Liam Howlett's
series "VMA tree type safety and remove __vma_adjust()".
- Suren Baghdasaryan has done some work on the maintainability of our
vm_flags handling in the series "introduce vm_flags modifier
functions".
- Some pagemap cleanup and generalization work in Mike Rapoport's
series "mm, arch: add generic implementation of pfn_valid() for
FLATMEM" and "fixups for generic implementation of pfn_valid()"
- Baoquan He has done some work to make /proc/vmallocinfo and
/proc/kcore better represent the real state of things in his series
"mm/vmalloc.c: allow vread() to read out vm_map_ram areas".
- Jason Gunthorpe rationalized the GUP system's interface to the rest
of the kernel in the series "Simplify the external interface for
GUP".
- SeongJae Park wishes to migrate people from DAMON's debugfs interface
over to its sysfs interface. To support this, we'll temporarily be
printing warnings when people use the debugfs interface. See the
series "mm/damon: deprecate DAMON debugfs interface".
- Andrey Konovalov provided the accurately named "lib/stackdepot: fixes
and clean-ups" series.
- Huang Ying has provided a dramatic reduction in migration's TLB flush
IPI rates with the series "migrate_pages(): batch TLB flushing".
- Arnd Bergmann has some objtool fixups in "objtool warning fixes".
* tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits)
include/linux/migrate.h: remove unneeded externs
mm/memory_hotplug: cleanup return value handing in do_migrate_range()
mm/uffd: fix comment in handling pte markers
mm: change to return bool for isolate_movable_page()
mm: hugetlb: change to return bool for isolate_hugetlb()
mm: change to return bool for isolate_lru_page()
mm: change to return bool for folio_isolate_lru()
objtool: add UACCESS exceptions for __tsan_volatile_read/write
kmsan: disable ftrace in kmsan core code
kasan: mark addr_has_metadata __always_inline
mm: memcontrol: rename memcg_kmem_enabled()
sh: initialize max_mapnr
m68k/nommu: add missing definition of ARCH_PFN_OFFSET
mm: percpu: fix incorrect size in pcpu_obj_full_size()
maple_tree: reduce stack usage with gcc-9 and earlier
mm: page_alloc: call panic() when memoryless node allocation fails
mm: multi-gen LRU: avoid futile retries
migrate_pages: move THP/hugetlb migration support check to simplify code
migrate_pages: batch flushing TLB
migrate_pages: share more code between _unmap and _move
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes, perhaps most notably:
- Throttling callback invocation based on the number of callbacks
that are now ready to invoke instead of on the total number of
callbacks
- Several patches that suppress false-positive boot-time
diagnostics, for example, due to lockdep not yet being
initialized
- Make expedited RCU CPU stall warnings dump stacks of any tasks
that are blocking the stalled grace period. (Normal RCU CPU
stall warnings have done this for many years)
- Lazy-callback fixes to avoid delays during boot, suspend, and
resume. (Note that lazy callbacks must be explicitly enabled, so
this should not (yet) affect production use cases)
- Make kfree_rcu() and friends take advantage of polled grace periods,
thus reducing memory footprint by almost two orders of magnitude,
admittedly on a microbenchmark
This also begins the transition from kfree_rcu(p) to
kfree_rcu_mightsleep(p). This transition was motivated by bugs where
kfree_rcu(p), which can block, was typed instead of the intended
kfree_rcu(p, rh)
- SRCU updates, perhaps most notably fixing a bug that causes SRCU to
fail when booted on a system with a non-zero boot CPU. This
surprising situation actually happens for kdump kernels on the
powerpc architecture
This also adds an srcu_down_read() and srcu_up_read(), which act like
srcu_read_lock() and srcu_read_unlock(), but allow an SRCU read-side
critical section to be handed off from one task to another
- Clean up the now-useless SRCU Kconfig option
There are a few more commits that are not yet acked or pulled into
maintainer trees, and these will be in a pull request for a later
merge window
- RCU-tasks updates, perhaps most notably these fixes:
- A strange interaction between PID-namespace unshare and the
RCU-tasks grace period that results in a low-probability but
very real hang
- A race between an RCU tasks rude grace period on a single-CPU
system and CPU-hotplug addition of the second CPU that can
result in a too-short grace period
- A race between shrinking RCU tasks down to a single callback
list and queuing a new callback to some other CPU, but where
that queuing is delayed for more than an RCU grace period. This
can result in that callback being stranded on the non-boot CPU
- Torture-test updates and fixes
- Torture-test scripting updates and fixes
- Provide additional RCU CPU stall-warning information in kernels built
with CONFIG_RCU_CPU_STALL_CPUTIME=y, and restore the full five-minute
timeout limit for expedited RCU CPU stall warnings
* tag 'rcu.2023.02.10a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (80 commits)
rcu/kvfree: Add kvfree_rcu_mightsleep() and kfree_rcu_mightsleep()
kernel/notifier: Remove CONFIG_SRCU
init: Remove "select SRCU"
fs/quota: Remove "select SRCU"
fs/notify: Remove "select SRCU"
fs/btrfs: Remove "select SRCU"
fs: Remove CONFIG_SRCU
drivers/pci/controller: Remove "select SRCU"
drivers/net: Remove "select SRCU"
drivers/md: Remove "select SRCU"
drivers/hwtracing/stm: Remove "select SRCU"
drivers/dax: Remove "select SRCU"
drivers/base: Remove CONFIG_SRCU
rcu: Disable laziness if lazy-tracking says so
rcu: Track laziness during boot and suspend
rcu: Remove redundant call to rcu_boost_kthread_setaffinity()
rcu: Allow up to five minutes expedited RCU CPU stall-warning timeouts
rcu: Align the output of RCU CPU stall warning messages
rcu: Add RCU stall diagnosis information
sched: Add helper nr_context_switches_cpu()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"The usual mix of performance improvements and new features.
The core change is reworking how checksums are processed, with
followup cleanups and simplifications. There are two minor changes in
block layer and iomap code.
Features:
- block group allocation class heuristics:
- pack files by size (up to 128k, up to 8M, more) to avoid
fragmentation in block groups, assuming that file size and life
time is correlated, in particular this may help during balance
- with tracepoints and extensible in the future
Performance:
- send: cache directory utimes and only emit the command when
necessary
- speedup up to 10x
- smaller final stream produced (no redundant utimes commands
issued)
- compatibility not affected
- fiemap: skip backref checks for shared leaves
- speedup 3x on sample filesystem with all leaves shared (e.g. on
snapshots)
- micro optimized b-tree key lookup, speedup in metadata operations
(sample benchmark: fs_mark +10% of files/sec)
Core changes:
- change where checksumming is done in the io path:
- checksum and read repair does verification at lower layer
- cascaded cleanups and simplifications
- raid56 refactoring and cleanups
Fixes:
- sysfs: make sure that a run-time change of a feature is correctly
tracked by the feature files
- scrub: better reporting of tree block errors
Other:
- locally enable -Wmaybe-uninitialized after fixing all warnings
- misc cleanups, spelling fixes
Other code:
- block: export bio_split_rw
- iomap: remove IOMAP_F_ZONE_APPEND"
* tag 'for-6.3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (109 commits)
btrfs: make kobj_type structures constant
btrfs: remove the bdev argument to btrfs_rmap_block
btrfs: don't rely on unchanging ->bi_bdev for zone append remaps
btrfs: never return true for reads in btrfs_use_zone_append
btrfs: pass a btrfs_bio to btrfs_use_append
btrfs: set bbio->file_offset in alloc_new_bio
btrfs: use file_offset to limit bios size in calc_bio_boundaries
btrfs: do unsigned integer division in the extent buffer binary search loop
btrfs: eliminate extra call when doing binary search on extent buffer
btrfs: raid56: handle endio in scrub_rbio
btrfs: raid56: handle endio in recover_rbio
btrfs: raid56: handle endio in rmw_rbio
btrfs: raid56: submit the read bios from scrub_assemble_read_bios
btrfs: raid56: fold rmw_read_wait_recover into rmw_read_bios
btrfs: raid56: fold recover_assemble_read_bios into recover_rbio
btrfs: raid56: add a bio_list_put helper
btrfs: raid56: wait for I/O completion in submit_read_bios
btrfs: raid56: simplify code flow in rmw_rbio
btrfs: raid56: simplify error handling and code flow in raid56_parity_write
btrfs: replace btrfs_wait_tree_block_writeback by wait_on_extent_buffer_writeback
...
|
|
Pull fsverity updates from Eric Biggers:
"Fix the longstanding implementation limitation that fsverity was only
supported when the Merkle tree block size, filesystem block size, and
PAGE_SIZE were all equal.
Specifically, add support for Merkle tree block sizes less than
PAGE_SIZE, and make ext4 support fsverity on filesystems where the
filesystem block size is less than PAGE_SIZE.
Effectively, this means that fsverity can now be used on systems with
non-4K pages, at least on ext4. These changes have been tested using
the verity group of xfstests, newly updated to cover the new code
paths.
Also update fs/verity/ to support verifying data from large folios.
There's also a similar patch for fs/crypto/, to support decrypting
data from large folios, which I'm including in here to avoid a merge
conflict between the fscrypt and fsverity branches"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
fscrypt: support decrypting data from large folios
fsverity: support verifying data from large folios
fsverity.rst: update git repo URL for fsverity-utils
ext4: allow verity with fs block size < PAGE_SIZE
fs/buffer.c: support fsverity in block_read_full_folio()
f2fs: simplify f2fs_readpage_limit()
ext4: simplify ext4_readpage_limit()
fsverity: support enabling with tree block size < PAGE_SIZE
fsverity: support verification with tree block size < PAGE_SIZE
fsverity: replace fsverity_hash_page() with fsverity_hash_block()
fsverity: use EFBIG for file too large to enable verity
fsverity: store log2(digest_size) precomputed
fsverity: simplify Merkle tree readahead size calculation
fsverity: use unsigned long for level_start
fsverity: remove debug messages and CONFIG_FS_VERITY_DEBUG
fsverity: pass pos and size to ->write_merkle_tree_block
fsverity: optimize fsverity_cleanup_inode() on non-verity files
fsverity: optimize fsverity_prepare_setattr() on non-verity files
fsverity: optimize fsverity_file_open() on non-verity files
|