path: root/fs/btrfs/extent-tree.c
Age    Commit message    Author
2025-01-13btrfs: use SECTOR_SIZE defines in btrfs_issue_discard()David Sterba
Use the existing define for single sector size. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
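For reference, a minimal userspace sketch (not the kernel code; the helper name is hypothetical) of what using the single-sector defines instead of a hard-coded 512 looks like:

  #include <stdio.h>
  #include <stdint.h>

  /* Mirrors the kernel's single-sector defines, for illustration only. */
  #define SECTOR_SHIFT 9
  #define SECTOR_SIZE  (1 << SECTOR_SHIFT)        /* 512 bytes */

  /* Round a byte count down to a whole number of sectors. */
  static uint64_t align_down_to_sector(uint64_t bytes)
  {
          return bytes & ~((uint64_t)SECTOR_SIZE - 1);
  }

  int main(void)
  {
          /* prints 4096 */
          printf("%llu\n", (unsigned long long)align_down_to_sector(4097));
          return 0;
  }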
2025-01-13btrfs: extent-tree: remove unnecessary calls to btrfs_mark_buffer_dirty()Filipe Manana
We have several places explicitly calling btrfs_mark_buffer_dirty(), but that is not necessary since the target leaf came from a path that was obtained by a btree search function that modifies the btree, something like btrfs_insert_empty_item() or anything else that ends up calling btrfs_search_slot() with a value of 1 for its 'cow' argument. These calls just make the code more verbose and confusing, and add a little extra overhead as well as increase the module's text size, so remove them. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add assertions and comment about path expectations to btrfs_cross_ref_exist()Filipe Manana
We should always call check_delayed_ref() with a path having a locked leaf from the extent tree where either the extent item is located or where it should be located in case it doesn't exist yet (when there's a pending unflushed delayed ref to do it), as we need to lock any existing delayed ref head while holding such leaf locked in order to avoid races with flushing delayed references, which could make us think an extent is not shared when it really is. So add some assertions and a comment about such expectations to btrfs_cross_ref_exist(), which is the only caller of check_delayed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: add function comment for check_committed_ref()Filipe Manana
There are some not immediately obvious details about the operation of check_committed_ref(), namely that when it returns 0 it must return with the path having a locked leaf from the extent tree that contains the extent's extent item, so that we can later check for delayed refs when calling check_delayed_ref() in a way that doesn't race with a task running delayed references. For similar reasons, it must also return with a locked leaf when the extent item is not found, and that leaf is where the extent item should be located, because we may have delayed references that are going to create the extent item. Also document that the function can return false positives in order to not be too slow, and that the most important thing is to not return false negatives. So add a function comment to check_committed_ref(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify arguments for btrfs_cross_ref_exist()Filipe Manana
Instead of passing a root and an objectid which matches an inode number, pass the inode instead, since the root is always the root associated to the inode and the objectid is the number of that inode. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: simplify return logic at check_committed_ref()Filipe Manana
Instead of setting the value to return in a local variable 'ret' and then jumping into a label named 'out' that does nothing but return that value, simplify everything by getting rid of the label and directly returning a value. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
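As a generic illustration (hypothetical function, not the btrfs code), this is the shape of the simplification:

  /* Before: a local 'ret' and an 'out' label that only returns it. */
  static int check_old(int cond)
  {
          int ret;

          if (cond) {
                  ret = 1;
                  goto out;
          }
          ret = 0;
  out:
          return ret;
  }

  /* After: return directly, no label or local variable needed. */
  static int check_new(int cond)
  {
          return cond ? 1 : 0;
  }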
2025-01-13btrfs: avoid redundant call to get inline ref type at check_committed_ref()Filipe Manana
At check_committed_ref() we are calling btrfs_get_extent_inline_ref_type() twice, once before we check if we have an inline extent owner ref (for simple qgroups) and then once again sometime after that check. This second call is redundant when we have simple quotas disabled or we found an inline ref that is not of the owner ref type. So avoid this second call unless we have simple quotas enabled and found an owner ref, saving a function call that does inline ref validation again. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove the snapshot check from check_committed_ref()Filipe Manana
At check_committed_ref() we have this check to see if the data extent was created in a generation lower than or equals to the generation where the last snapshot for the root was created, and if so we return immediately with 1, since it's very likely the extent is shared, referenced by other root. The only call chain for check_committed_ref() is the following: can_nocow_file_extent() btrfs_cross_ref_exist() check_committed_ref() And we already do that snapshot check at can_nocow_file_extent(), before we call btrfs_cross_ref_exist(). This makes the check done at check_committed_ref() redundant, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: remove no longer needed strict argument from can_nocow_extent()Filipe Manana
All callers of can_nocow_extent() now pass a value of false for its 'strict' argument, making it redundant. So remove the argument from can_nocow_extent() as well as can_nocow_file_extent(), btrfs_cross_ref_exist() and check_committed_ref(), because this argument was used just to influence the behavior of check_committed_ref(). Also remove the 'strict' field from struct can_nocow_file_extent_args, which is now always false as well, as its value is taken from the argument to can_nocow_extent(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: drop fs_info argument from btrfs_update_space_info_*()Naohiro Aota
Since commit e1e577aafe41 ("btrfs: store fs_info in space_info"), we have the fs_info in a space_info. So, we can drop fs_info argument from btrfs_update_space_info_*. There is no behavior change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: factor out btrfs_return_free_space()Naohiro Aota
Factor out a part of unpin_extent_range() that returns space back to the space info, prioritizing global block reserve. Also, move the "len" variable into the loop to clarify we don't need to carry it beyond an iteration. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2025-01-13btrfs: move select_delayed_ref() and export itJosef Bacik
This helper is how we select the delayed ref to run once we've selected the delayed ref head. I need this exported to add a unit test for delayed refs, and its more natural home is delayed-ref.c. Rename it to btrfs_select_delayed_ref and move it into delayed-ref.c. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-12-18Merge tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds
Pull btrfs fixes from David Sterba:

 - tree-checker catches invalid number of inline extent references

 - zoned mode fixes:
     - enhance zone append IO command so it also detects emulated writes
     - handle bio splitting at sectorsize boundary

 - when deleting a snapshot, fix a condition for visiting nodes in reloc trees

* tag 'for-6.13-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: tree-checker: reject inline extent items with 0 ref count
  btrfs: split bios to the fs sector size boundary
  btrfs: use bio_is_zone_append() in the completion handler
  btrfs: fix improper generation check in snapshot delete
2024-12-17btrfs: fix improper generation check in snapshot deleteJosef Bacik
We have been using the following check

  if (generation <= root->root_key.offset)

to make decisions about whether or not to visit a node during snapshot delete. This is because for normal subvolumes this is set to 0, and for snapshots it's set to the creation generation. The idea being that if the generation of the node is less than or equal to our creation generation then we don't need to visit that node, because it doesn't belong to us, we can simply drop our reference and move on. However reloc roots don't have their generation stored in root->root_key.offset; instead that field holds the objectid of their corresponding fs root. This means we can incorrectly not walk into nodes that need to be dropped when deleting a reloc root. There are a variety of consequences to making the wrong choice in two distinct areas.

visit_node_for_delete()

1. False positive. We think we are newer than the block when we really aren't. We don't visit the node, drop our reference to the node and carry on. This would result in leaked space.
2. False negative. We do decide to walk down into a block that we should have just dropped our reference to. However this means that the child node will have refs > 1, so we will switch to UPDATE_BACKREF, and then the subsequent walk_down_proc() will notice that btrfs_header_owner(node) != root->root_key.objectid and it'll break out of the loop, and then walk_up_proc() will drop our reference, so this appears to be ok.

do_walk_down()

1. False positive. We are in UPDATE_BACKREF and incorrectly decide that we are done and don't need to update the backref for our lower nodes. This is another case that simply won't happen with relocation, as we only have to do UPDATE_BACKREF if the node below us was shared and didn't have FULL_BACKREF set, and since we don't own that node because we're a reloc root we actually won't end up in this case.
2. False negative. Again this is tricky because, as described above, we simply wouldn't be here from relocation: we don't own any of the nodes because we never set btrfs_header_owner() to the reloc root objectid, and we always use FULL_BACKREF, so we never actually need to set FULL_BACKREF on any children.

Having spent a lot of time stressing relocation/snapshot delete recently I've not seen this pop in practice. But this is objectively incorrect, so fix this to get the correct starting generation based on the root we're dropping to keep me from thinking there's a problem here. CC: stable@vger.kernel.org Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-29btrfs: don't loop for nowait writes when checking for cross referencesFilipe Manana
When checking for delayed refs when verifying if there are cross references for a data extent, we stop if the path has nowait set and we can't try lock the delayed ref head's mutex, returning -EAGAIN with the goal of making a write fall back to a blocking context. However we ignore the -EAGAIN at btrfs_cross_ref_exist() when check_delayed_ref() returns it, and keep looping instead of immediately returning the -EAGAIN to the caller. Fix this by not looping if we get -EAGAIN and we have a nowait path. Fixes: 26ce91144631 ("btrfs: make can_nocow_extent nowait compatible") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
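A sketch of the fixed control flow, with hypothetical stand-ins for the btrfs helpers: the point is that -EAGAIN is propagated instead of retried when the caller asked for nowait semantics.

  #include <errno.h>
  #include <stdbool.h>

  /* Toy stand-in for check_delayed_ref(): 0 = not shared, 1 = shared,
   * -EAGAIN = could not trylock the delayed ref head in nowait mode. */
  static int check_delayed(bool nowait)
  {
          (void)nowait;
          return 0;
  }

  static int cross_ref_exists(bool nowait)
  {
          int ret;

          do {
                  ret = check_delayed(nowait);
                  /* The fix: stop looping and hand -EAGAIN back so a
                   * nowait write can fall back to a blocking context. */
                  if (ret == -EAGAIN && nowait)
                          return -EAGAIN;
          } while (ret == -EAGAIN);

          return ret;
  }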
2024-11-11btrfs: track delayed ref heads in an xarrayFilipe Manana
Currently we use a red black tree (rb-tree) to track the delayed ref heads (in struct btrfs_delayed_ref_root::href_root). This however is not very efficient when the number of delayed ref heads is large (and it's very common to be at least in the order of thousands) since rb-trees are binary trees. For example for 10K delayed ref heads, the tree has a depth of 13. Besides that, inserting into the tree requires navigating through it and pulling useless cache lines in the process since the red black tree nodes are embedded within the delayed ref head structure - on the other hand, by being embedded, it requires no extra memory allocations. We can improve this by using an xarray instead, which has a much higher branching factor than a red black tree (a balanced binary tree), is more cache friendly and behaves like a resizable array, with a much better search and insertion complexity than a red black tree. This only has one small disadvantage, which is that insertion will sometimes require allocating memory for the xarray - which may fail (not that often since it uses a kmem_cache) - but on the other hand we can reduce the delayed ref head structure size by 24 bytes (from 152 down to 128 bytes) after removing the embedded red black tree node, meaning that we can now fit 32 delayed ref heads per 4K page instead of 26, and that gain compensates for the occasional memory allocations needed for the xarray nodes. We also end up using only 2 cache lines instead of 3 per delayed ref head. Running the following fs_mark test showed some improvements:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  MOUNT_OPTIONS="-o ssd"
  FILES=100000
  THREADS=$(nproc --all)

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

Before this patch:

  FSUse%        Count         Size    Files/sec     App Overhead
      10      1200000            0     171845.7         12253839
      16      2400000            0     230898.7         12308254
      23      3600000            0     212292.9         12467768
      30      4800000            0     195737.8         12627554
      46      6000000            0     171055.2         12783329

After this patch:

  FSUse%        Count         Size    Files/sec     App Overhead
      10      1200000            0     173835.0         12246131
      16      2400000            0     233537.8         12271746
      23      3600000            0     220398.7         12307737
      30      4800000            0     204483.6         12392318
      40      6000000            0     182923.3         12771843

Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
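A minimal kernel-style sketch of the indexing idea (hypothetical structure and helpers, assumes a kernel build environment; not the actual btrfs code): objects live in an xarray keyed by an integer index instead of carrying an embedded rb_node.

  #include <linux/types.h>
  #include <linux/xarray.h>
  #include <linux/gfp.h>

  struct head {
          u64 bytenr;
          /* no embedded struct rb_node needed for tracking anymore */
  };

  static DEFINE_XARRAY(heads);

  /* xa_insert() returns -EBUSY if the index is already in use and may
   * allocate xarray nodes internally, hence the gfp flag. */
  static int track_head(unsigned long index, struct head *h)
  {
          return xa_insert(&heads, index, h, GFP_NOFS);
  }

  /* Lookups walk a wide, shallow tree instead of a binary tree. */
  static struct head *find_head(unsigned long index)
  {
          return xa_load(&heads, index);
  }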
2024-11-11btrfs: pass fs_info to btrfs_delete_ref_head()Filipe Manana
One of the following patches in the series will need to access fs_info at btrfs_delete_ref_head(), so pass a fs_info argument to it. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: pass fs_info to functions that search for delayed ref headsFilipe Manana
One of the following patches in the series will need to access fs_info in the function find_ref_head(), so pass a fs_info argument to it as well as to the functions btrfs_select_ref_head() and btrfs_find_delayed_ref_head() which call find_ref_head(). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: move delayed ref head unselection to delayed-ref.cFilipe Manana
The unselect_delayed_ref_head() at extent-tree.c doesn't really belong in that file as it's a delayed refs specific detail and therefore should be at delayed-ref.c. Further its inverse, btrfs_select_ref_head(), is at delayed-ref.c, so it only makes sense to have it there too. So move unselect_delayed_ref_head() into delayed-ref.c and rename it to btrfs_unselect_ref_head() so that its name closely matches its inverse (btrfs_select_ref_head()). Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: simplify obtaining a delayed ref headFilipe Manana
Instead of doing it in two steps outside of delayed-ref.c, leaking low level details such as locking, move the logic entirely to delayed-ref.c under btrfs_select_ref_head(), reducing code and making things simpler for the caller. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: change return type of btrfs_delayed_ref_lock() to booleanFilipe Manana
The function only returns 0, meaning it was able to lock the delayed ref head, or -EAGAIN in case it wasn't able to lock it. So simplify this and use a boolean return type instead, returning true if it was able to lock and false otherwise. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
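A generic before/after sketch of the return-type change (hypothetical function, not btrfs_delayed_ref_lock() itself):

  #include <errno.h>
  #include <stdbool.h>

  /* Before: an int that can only ever be 0 or -EAGAIN. */
  static int trylock_old(int *locked)
  {
          if (*locked)
                  return -EAGAIN;
          *locked = 1;
          return 0;
  }

  /* After: a bool makes the only two outcomes explicit at call sites. */
  static bool trylock_new(int *locked)
  {
          if (*locked)
                  return false;
          *locked = 1;
          return true;
  }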
2024-11-11btrfs: remove num_entries atomic counter from delayed ref rootFilipe Manana
The atomic counter 'num_entries' is not used anymore, we increment it and decrement it but then we don't ever read it to use for any logic. Its last use was removed with commit 61a56a992fcf ("btrfs: delayed refs pre-flushing should only run the heads we have"). So remove it. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: reduce extent tree lock contention when searching for inline backrefRobbie Ko
When inserting an extent backref, in order to check whether refs other than inline refs are used, we always set keep locks on the path for the tree search, which will increase the lock contention on the extent tree. We do not need the parent node every time to determine whether normal refs are used. It is only needed when the extent item is the last item in a leaf. Therefore, we change it to first use keep_locks=0 for the search. If the extent item happens to be the last item in the leaf, we then change to keep_locks=1 for a second search to reduce lock contention. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Robbie Ko <robbieko@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: drop unused parameter refs from visit_node_for_delete()David Sterba
The parameter duplicates what can be effectively obtained from wc->refs[level - 1] and this is what's actually used inside. Added in commit 2b73c7e761c4 ("btrfs: unify logic to decide if we need to walk down into a node during snapshot delete"). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11btrfs: correct typos in multiple comments across various filesShen Lichuan
Fix some confusing spelling errors that were recently identified, the details are as follows:

  block-group.c: 2800:   uncompressible  ==> incompressible
  extent-tree.c: 3131:   EXTEMT          ==> EXTENT
  extent_io.c:   3124:   utlizing        ==> utilizing
  extent_map.c:  1323:   ealier          ==> earlier
  extent_map.c:  1325:   possiblity      ==> possibility
  fiemap.c:       189:   emmitted        ==> emitted
  fiemap.c:       197:   emmitted        ==> emitted
  fiemap.c:       203:   emmitted        ==> emitted
  transaction.h:   36:   trasaction      ==> transaction
  volumes.c:     5312:   filesysmte      ==> filesystem
  zoned.c:       1977:   trasnsaction    ==> transaction

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-07btrfs: add cancellation points to trim loopsLuca Stefani
There are reports that the system cannot suspend due to a running trim, because the task responsible for trimming the device isn't able to finish in time, especially since we have a free extent discarding phase, which can trim a lot of unallocated space. There are no limits on the trim size (unlike the block group part). Since trim isn't a critical call it can be interrupted at any time; in such cases we stop the trim, report the amount of discarded bytes and return an error. Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
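A kernel-style sketch of such a cancellation point (the wrapper name is hypothetical and the exact condition used by the patch may differ; fatal_signal_pending() and freezing() are existing kernel helpers):

  #include <linux/sched/signal.h>
  #include <linux/freezer.h>

  /* True when the trimming task should give up early, e.g. because the
   * system is suspending or the task received a fatal signal. */
  static bool trim_should_stop(void)
  {
          return fatal_signal_pending(current) || freezing(current);
  }

  /*
   * Checked between discard requests inside the trim loops:
   *
   *     if (trim_should_stop())
   *             return -ERESTARTSYS;   // bytes trimmed so far are reported
   */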
2024-10-07btrfs: split remaining space to discard in chunksLuca Stefani
Per Qu Wenruo: in case we have a very large disk, e.g. an 8TiB device that is mostly empty, although we will do the split according to our super block locations, the last super block ends at 256G, so we can submit a huge discard for the range [256G, 8T), causing a large delay. Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in preparation for the introduction of cancellation points to trim. The value of the chunk size is arbitrary, it can be higher or derived from actual device capabilities but we can't easily read that using bio_discard_limit(). Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180 Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737 CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
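The splitting itself is a plain chunking loop; a self-contained sketch with hypothetical names and an illustrative chunk size (not the actual BTRFS_MAX_DISCARD_CHUNK_SIZE value):

  #include <stdint.h>

  #define MAX_DISCARD_CHUNK (1ULL << 30)          /* illustrative 1 GiB cap */

  /* Toy stub; a real implementation would submit a discard request. */
  static int issue_discard(uint64_t start, uint64_t len)
  {
          (void)start;
          (void)len;
          return 0;
  }

  /* Split [start, start + len) into bounded chunks so no single request
   * covers terabytes and the loop gains natural cancellation points. */
  static int discard_chunked(uint64_t start, uint64_t len)
  {
          while (len) {
                  uint64_t chunk = len < MAX_DISCARD_CHUNK ? len : MAX_DISCARD_CHUNK;
                  int ret = issue_discard(start, chunk);

                  if (ret)
                          return ret;
                  start += chunk;
                  len -= chunk;
          }
          return 0;
  }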
2024-09-10btrfs: always update fstrim_range on failure in FITRIM ioctlLuca Stefani
Even in case of failure we could've discarded some data and userspace should be made aware of it, so copy fstrim_range to userspace regardless. Also make sure to update the trimmed bytes amount even if btrfs_trim_free_extents fails. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
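The shape of the change, as a kernel-style sketch (the function and its arguments are hypothetical, not the actual btrfs ioctl handler): copy the range back to userspace before deciding what to return.

  #include <linux/fs.h>
  #include <linux/uaccess.h>

  static int finish_trim(void __user *arg, struct fstrim_range *range,
                         int trim_ret)
  {
          /* range->len holds the bytes trimmed so far, even on failure. */
          if (copy_to_user(arg, range, sizeof(*range)))
                  return -EFAULT;

          /* Only now propagate the trim error, if there was one. */
          return trim_ret;
  }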
2024-08-13btrfs: check delayed refs when we're checking if a ref existsJosef Bacik
In the patch 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete resume") I added some code to handle file systems that had been corrupted by a bug that incorrectly skipped updating the drop progress key while dropping a snapshot. This code would check to see if we had already deleted our reference for a child block, and skip the deletion if we had already. Unfortunately there is a bug, as the check would only check the on-disk references. I made an incorrect assumption that blocks in an already deleted snapshot that was having the deletion resume on mount wouldn't be modified. If we have 2 pending deleted snapshots that share blocks, we can easily modify the rules for a block. Take the following example: subvolume a exists, and subvolume b is a snapshot of subvolume a. They share references to block 1. Block 1 will have 2 full references, one for subvolume a and one for subvolume b, and it belongs to subvolume a (btrfs_header_owner(block 1) == subvolume a). When deleting subvolume a, we will drop our full reference for block 1, and because we are the owner we will drop our full reference for all of block 1's children, convert block 1 to FULL BACKREF, and add a shared reference to all of block 1's children. Then we will start the snapshot deletion of subvolume b. We look up the extent info for block 1, which checks delayed refs and tells us that FULL BACKREF is set, so sets parent to the bytenr of block 1. However because this is a resumed snapshot deletion, we call into check_ref_exists(). Because check_ref_exists() only looks at the disk, it doesn't find the shared backref for the child of block 1, and thus returns 0 and we skip deleting the reference for the child of block 1 and continue. This orphans the child of block 1. The fix is to look up the delayed refs, similar to what we do in btrfs_lookup_extent_info(). However we only care about whether the reference exists or not. If we fail to find our reference on disk, go look up the bytenr in the delayed refs, and if it exists look for an existing ref in the delayed ref head. If that exists then we know we can delete the reference safely and carry on. If it doesn't exist we know we have to skip over this block. This bug has existed since I introduced this fix, however it requires having multiple deleted snapshots pending when we unmount. We noticed this in production because our shutdown path stops the container on the system, which deletes a bunch of subvolumes, and then reboots the box. This gives us plenty of opportunities to hit this issue. Looking at the history we've seen this occasionally in production, but we had a big spike recently thanks to faster machines getting jobs with multiple subvolumes in the job. Chris Mason wrote a reproducer which does the following:

  mount /dev/nvme4n1 /btrfs
  btrfs subvol create /btrfs/s1
  simoop -E -f 4k -n 200000 -z /btrfs/s1
  while(true) ; do
      btrfs subvol snap /btrfs/s1 /btrfs/s2
      simoop -f 4k -n 200000 -r 10 -z /btrfs/s2
      btrfs subvol snap /btrfs/s2 /btrfs/s3
      btrfs balance start -dusage=80 /btrfs
      btrfs subvol del /btrfs/s2 /btrfs/s3
      umount /btrfs
      btrfsck /dev/nvme4n1 || exit 1
      mount /dev/nvme4n1 /btrfs
  done

On the second loop this would fail consistently, with my patch it has been running for hours and hasn't failed. I also used dm-log-writes to capture the state of the failure so I could debug the problem. Using the existing failure case to test my patch validated that it fixes the problem.
Fixes: 78c52d9eb6b7 ("btrfs: check for refs on snapshot delete resume") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-29btrfs: zoned: fix zone_unusable accounting on making block group read-write againNaohiro Aota
When btrfs makes a block group read-only, it adds all free regions in the block group to space_info->bytes_readonly. That free space excludes reserved and pinned regions. OTOH, when btrfs makes the block group read-write again, it moves all the unused regions into the block group's zone_unusable. That unused region includes reserved and pinned regions. As a result, it counts too many zone_unusable bytes. Fortunately (or unfortunately), having erroneous zone_unusable does not affect the calculation of space_info->bytes_readonly, because the free space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on the erroneous zone_unusable and it reduces the num_bytes just to cancel the error. This behavior can be easily discovered by adding a WARN_ON to check e.g., "bg->pinned > 0" in btrfs_dec_block_group_ro(), and running an fstests test case like btrfs/282. Fix it by properly considering pinned and reserved in btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: avoid allocating and running pointless delayed extent operationsFilipe Manana
We always allocate a delayed extent op structure when allocating a tree block (except for log trees), but most of the time we don't need it as we only need to set the BTRFS_BLOCK_FLAG_FULL_BACKREF if we're dealing with a relocation tree and we only need to set the key of a tree block in a btrfs_tree_block_info structure if we are not using skinny metadata (feature enabled by default since btrfs-progs 3.18 and available as of kernel 3.10). In these cases, where we need neither to update flags nor to set the key, we only use the delayed extent op structure to set the tree block's level. This is a waste of memory and besides that, the memory allocation can fail and can add additional latency. Instead of using a delayed extent op structure to store the level of the tree block, use the delayed ref head to store it. This doesn't change the size of either structure and helps us avoid allocating delayed extent op structures when using the skinny metadata feature and there's no relocation going on. This also gets rid of a BUG_ON(). For example, for a fs_mark run, with 5 iterations, 8 threads and 100K files per iteration, before this patch there were 118109 allocations of delayed extent op structures and after it there were none. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: don't BUG_ON() when 0 reference count at btrfs_lookup_extent_info()Filipe Manana
Instead of doing a BUG_ON() handle the error by returning -EUCLEAN, aborting the transaction and logging an error message. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
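A generic userspace illustration of the pattern (hypothetical function; in the kernel patch the error path also aborts the transaction):

  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Before: crash on unexpected state, analogous to BUG_ON(). */
  static int lookup_refs_old(unsigned long refs)
  {
          if (refs == 0)
                  abort();
          return 0;
  }

  /* After: log the problem and return an error the caller can act on. */
  static int lookup_refs_new(unsigned long refs)
  {
          if (refs == 0) {
                  fprintf(stderr, "unexpected zero reference count\n");
                  return -EUCLEAN;
          }
          return 0;
  }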
2024-07-11btrfs: reduce nesting for extent processing at btrfs_lookup_extent_info()Filipe Manana
Instead of using an if-else statement when processing the extent item at btrfs_lookup_extent_info(), use a single if statement for the error case since it does a goto at the end and leave the success (expected) case following the if statement, reducing indentation and making the logic a bit easier to follow. Also mark the if statement's condition as unlikely since it's not expected to ever happen, as it signals some corruption, making this clear and hinting the compiler to generate more efficient code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
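A generic sketch of the restructuring (hypothetical function; unlikely() shown via the compiler builtin it wraps in the kernel):

  #include <errno.h>

  #define unlikely(x) __builtin_expect(!!(x), 0)

  /* The corrupt case is handled first and returns, so the expected
   * path continues at the outer indentation level. */
  static int process_extent_item(unsigned int item_size, unsigned int min_size)
  {
          if (unlikely(item_size < min_size)) {
                  /* signals corruption, never expected to happen */
                  return -EUCLEAN;
          }

          /* expected (success) case continues here */
          return 0;
  }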
2024-07-11btrfs: remove superfluous metadata check at btrfs_lookup_extent_info()Filipe Manana
If we didn't find an extent item with the initial btrfs_search_slot() call, it's pointless to test if the "metadata" variable is "true", because right after that we check if the key type is BTRFS_METADATA_ITEM_KEY, and that is the case only when "metadata" is set to "true". So remove the redundant check. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove NULL transaction support for btrfs_lookup_extent_info()Filipe Manana
There are no callers of btrfs_lookup_extent_info() that pass a NULL value for the transaction handle argument, so there's no point in having special logic to deal with the NULL. The last caller that passed a NULL value was removed in commit 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves"). So remove the NULL handling from btrfs_lookup_extent_info(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: do not BUG_ON() when freeing tree block after errorFilipe Manana
When freeing a tree block, at btrfs_free_tree_block(), if we fail to create a delayed reference we don't deal with the error and just do a BUG_ON(). The error most likely to happen is -ENOMEM, and we have a comment mentioning that only -ENOMEM can happen, but that is not true, because in case qgroups are enabled any error returned from btrfs_qgroup_trace_extent_post() (can be -EUCLEAN or anything returned from btrfs_search_slot() for example) can be propagated back to btrfs_free_tree_block(). So stop doing a BUG_ON() and return the error to the callers and make them abort the transaction to prevent leaking space. Syzbot was triggering this, likely due to memory allocation failure injection. Reported-by: syzbot+a306f914b4d01b3958fe@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000fcba1e05e998263c@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: add documentation around snapshot deleteJosef Bacik
Snapshot delete has some complicated looking code that is weirdly subtle at times. I've cleaned it up the best I can without re-writing it, but there are still a lot of details that are non-obvious. Add a bunch of comments to the main parts of the code to help future developers better understand the mechanics of snapshot deletion. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: handle errors from btrfs_dec_ref() properlyJosef Bacik
In walk_up_proc() we BUG_ON(ret) from btrfs_dec_ref(). This is incorrect, we have proper error handling here, return the error. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: convert correctness BUG_ON()'s to ASSERT()'s in walk_up_proc()Josef Bacik
In walk_up_proc() we have several sanity checks that should only trip if the programmer made a mistake. Convert these to ASSERT()'s instead of BUG_ON()'s. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: clean up our handling of refs == 0 in snapshot deleteJosef Bacik
In reada we BUG_ON(refs == 0), which could be unkind since we aren't holding a lock on the extent leaf and thus could get a transient incorrect answer. In walk_down_proc we also BUG_ON(refs == 0), which could happen if we have extent tree corruption. Change that to return -EUCLEAN. In do_walk_down() we catch this case and handle it correctly, however we return -EIO, where -EUCLEAN is a more appropriate error code. Finally in walk_up_proc we have the same BUG_ON(refs == 0), so convert that to proper error handling. Also adjust the error message so we can actually do something with the information. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: replace BUG_ON with ASSERT in walk_down_proc()Josef Bacik
We have a couple of areas where we check to make sure the tree block is locked before looking up or messing with references. This is old code so it has this as BUG_ON(). Convert this to ASSERT() for developers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
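As a userspace analogy only (btrfs's ASSERT() is its own macro, disabled in non-debug builds): assert() documents a programmer-error check that costs nothing in production builds, whereas BUG_ON() always halts the kernel.

  #include <assert.h>
  #include <stdbool.h>

  static void drop_refs(bool eb_is_locked)
  {
          /* Developer sanity check: compiled out with -DNDEBUG, much like
           * a btrfs ASSERT() in a non-debug build. */
          assert(eb_is_locked);

          /* ... proceed to look up or modify references ... */
  }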
2024-07-11btrfs: handle errors from ref mods during UPDATE_BACKREF in walk_down_proc()Josef Bacik
We have blanket BUG_ON(ret) after every one of these reference mod attempts, which is just incorrect. If we encounter any errors during walk_down_tree() we will abort, so abort on any one of these failures. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: don't BUG_ON on ENOMEM from btrfs_lookup_extent_info() in walk_down_proc()Josef Bacik
We handle errors here properly; ENOMEM isn't fatal, so return the error. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: extract the reference dropping code into its own helperJosef Bacik
This is a big chunk of code in do_walk_down() that will conditionally remove the reference for the child block we're currently evaluating. Extract it out into its own helper and call that from do_walk_down() instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: unify logic to decide if we need to walk down into a node during snapshot deleteJosef Bacik
We currently duplicate the logic for walking into a node during snapshot delete. In one case it is during the actual delete, and in the other we use it for deciding if we should reada the block or not. Factor this code into its own helper and comment fully what we're doing, and then update the two users to use the new helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: remove local variable need_account in do_walk_down()Josef Bacik
We only set this if wc->refs[level - 1] > 1, and we check this way up above where we need it because the first thing we do before dropping our refs is reset wc->refs[level - 1] to 0. Reorder resetting of wc->refs to after our drop logic, and then remove the need_account variable and simply check wc->refs[level - 1] directly instead of using a variable. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: factor out eb uptodate check from do_walk_down()Josef Bacik
do_walk_down() already has a bunch of things going on, and there's a bit of code related to reading in the next eb if we decide we need it. Move this code off into its own helper to clean up do_walk_down() a little bit. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: push lookup_info into struct walk_controlJosef Bacik
Instead of using a flag we're passing around everywhere, add a field to walk_control that we're already passing around everywhere and use that instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
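A generic sketch of the pattern (hypothetical names, not the actual walk_control layout): move the per-call flag into the context structure that is already passed everywhere.

  #include <stdbool.h>

  /* Before:
   *     static int walk_down(struct walk_ctx *wc, int *lookup_info);
   *
   * After: the flag lives in the context, so the signatures shrink. */
  struct walk_ctx {
          int level;
          bool lookup_info;       /* formerly threaded as a parameter */
  };

  static int walk_down(struct walk_ctx *wc)
  {
          if (wc->lookup_info) {
                  /* ... refresh reference info for this node ... */
                  wc->lookup_info = false;
          }
          return 0;
  }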
2024-07-11btrfs: use btrfs_read_extent_buffer() in do_walk_down()Josef Bacik
Currently if our extent buffer isn't uptodate we will drop the lock, free it, and then call read_tree_block() for the bytenr. This is inefficient, we already have the extent buffer, we can simply call btrfs_read_extent_buffer(). Merge these two cases down into one if statement, if we are not uptodate we can drop the lock, trigger readahead, and do the read using btrfs_read_extent_buffer(), and carry on. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11btrfs: don't do extra find_extent_buffer() in do_walk_down()Josef Bacik
We do find_extent_buffer(), and then if we don't find the eb in cache we call btrfs_find_create_tree_block(), which calls find_extent_buffer() first and then allocates the extent buffer. The reason we're doing this is because if we don't find the extent buffer in cache we set reada = 1. However this doesn't matter, because lower down we only trigger reada if !btrfs_buffer_uptodate(eb), which would be the case if we didn't find the extent buffer in cache and had to allocate it. Clean this up to simply call btrfs_find_create_tree_block(), and then use the fact that we're having to read the extent buffer off of disk to go ahead and kick off readahead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>