summaryrefslogtreecommitdiff
path: root/fs/btrfs/tree-log.c
AgeCommit message (Collapse)Author
2023-12-15btrfs: use the flags of an extent map to identify the compression typeFilipe Manana
Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: remove now unneeded btrfs_redirty_list_addJohannes Thumshirn
Now that we're not clearing the dirty flag off of extent_buffers in zoned mode, all that is left of btrfs_redirty_list_add() is a memzero() and some ASSERT()ions. As we're also memzero()ing the buffer on write-out btrfs_redirty_list_add() has become obsolete and can be removed. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-30Merge tag 'for-6.7-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. - temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group items - adjust overcommit logic in near full situations, account for one more chunk to eventually allocate metadata chunk, this is mostly relevant for small filesystems (<10GiB) - single device filesystems are scanned but not registered (except seed devices), this allows temp_fsid to work - qgroup iterations do not need GFP_ATOMIC allocations anymore - cleanups, refactoring, reduced data structure size, function parameter simplifications, error handling fixes" * tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits) btrfs: open code timespec64 in struct btrfs_inode btrfs: remove redundant log root tree index assignment during log sync btrfs: remove redundant initialization of variable dirty in btrfs_update_time() btrfs: sysfs: show temp_fsid feature btrfs: disable the device add feature for temp-fsid btrfs: disable the seed feature for temp-fsid btrfs: update comment for temp-fsid, fsid, and metadata_uuid btrfs: remove pointless empty log context list check when syncing log btrfs: update comment for struct btrfs_inode::lock btrfs: remove pointless barrier from btrfs_sync_file() btrfs: add and use helpers for reading and writing last_trans_committed btrfs: add and use helpers for reading and writing fs_info->generation btrfs: add and use helpers for reading and writing log_transid btrfs: add and use helpers for reading and writing last_log_commit btrfs: support cloned-device mount capability btrfs: add helper function find_fsid_by_disk btrfs: stop reserving excessive space for block group item insertions btrfs: stop reserving excessive space for block group item updates btrfs: reorder btrfs_inode to fill gaps btrfs: open code btrfs_ordered_inode_tree in btrfs_inode ...
2023-10-18btrfs: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-21-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-12btrfs: remove redundant log root tree index assignment during log syncFilipe Manana
During log syncing, when we start updating the log root tree we compute an index value, stored in variable 'index2', once we lock the log root tree's mutex. This value depends on the log root's log_transid. And shortly after we compute again the same value for 'index2' - the value is exactly the same since we haven't released the mutex and therefore the log_transid of the log root is the same as before. This second 'index2' computation became pointless after commit a93e01682e28 ("btrfs: remove no longer needed use of log_writers for the log root tree"). So remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove pointless empty log context list check when syncing logFilipe Manana
When syncing the log, if we get an error when updating the log root, we check first if the log root tree context is in a log context list, and if so it deletes from the log root tree context from the list. This check however is pointless because at this moment the context is always in a list, he have just added it to a context list. The check became pointless after commit a93e01682e28 ("btrfs: remove no longer needed use of log_writers for the log root tree"). So remove this now pointless empty list check. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing log_transidFilipe Manana
Currently the log_transid field of a root is always modified while holding the root's log_mutex locked. Most readers of a root's log_transid are also holding the root's log_mutex locked, however there is one exception which is btrfs_set_inode_last_trans() where we don't take the lock to avoid blocking several operations if log syncing is happening in parallel. Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add and use helpers for reading and writing last_log_commitFilipe Manana
Currently, the last_log_commit of a root can be accessed concurrently without any lock protection. Readers can be calling btrfs_inode_in_log() early in a fsync call, which reads a root's last_log_commit, while a writer can change the last_log_commit while a log tree if being synced, at btrfs_sync_log(). Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_log_commit field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: open code btrfs_ordered_inode_tree in btrfs_inodeDavid Sterba
The structure btrfs_ordered_inode_tree is used only in one place, in btrfs_inode. The structure itself has a 4 byte hole which is wasted space. Move the btrfs_ordered_inode_tree members to btrfs_inode with a common prefix 'ordered_tree_' where the hole can be utilized and shrink inode size. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: use extent_io_tree_release() to empty dirty log pagesFilipe Manana
When freeing a log tree, during a transaction commit, we clear its dirty log pages io tree by calling clear_extent_bits() using a range from 0 to (u64)-1. This will iterate the io tree's rbtree and call rb_erase() on each node before freeing it, which will often trigger rebalance operations on the rbtree. A better alternative it to use extent_io_tree_release(), which will not do deletions and trigger rebalances. So use extent_io_tree_release() instead of clear_extent_bits(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from fixup_inode_link_count()Filipe Manana
The root argument for fixup_inode_link_count() always matches the root of the given inode, so remove the root argument and get it from the inode argument. This also applies to the helpers count_inode_extrefs() and count_inode_refs() used by fixup_inode_link_count() - they don't need the root argument, as it always matches the root of the inode passed to them. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: remove redundant root argument from btrfs_update_inode()Filipe Manana
The root argument for btrfs_update_inode() always matches the root of the given inode, so remove the root argument and get it from the inode argument. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: track owning root in btrfs_refBoris Burkov
While data extents require us to store additional inline refs to track the original owner on free, this information is available implicitly for metadata. It is found in the owner field of the header of the tree block. Even if other trees refer to this block and the original ref goes away, we will not rewrite that header field, so it will reliably give the original owner. In addition, there is a relocation case where a new data extent needs to have an owning root separate from the referring root wired through delayed refs. To use it for recording simple quota deltas, we need to wire this root id through from when we create the delayed ref until we fully process it. Store it in the generic btrfs_ref struct of the delayed ref. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: abort transaction on generation mismatch when marking eb as dirtyFilipe Manana
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reduce parameters of btrfs_pin_extent_for_log_replayDavid Sterba
Both callers of btrfs_pin_extent_for_log_replay expand the parameters to extent buffer members. We can simply pass the extent buffer instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reduce parameters of btrfs_pin_reserved_extentDavid Sterba
There is only one caller of btrfs_pin_reserved_extent that expands the parameters to extent buffer members. We can simply pass the extent buffer instead. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-26Merge tag 'for-6.6-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - delayed refs fixes: - fix race when refilling delayed refs block reserve - prevent transaction block reserve underflow when starting transaction - error message and value adjustments - fix build warnings with CONFIG_CC_OPTIMIZE_FOR_SIZE and -Wmaybe-uninitialized - fix for smatch report where uninitialized data from invalid extent buffer range could be returned to the caller - fix numeric overflow in statfs when calculating lower threshold for a full filesystem * tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize start_slot in btrfs_log_prealloc_extents btrfs: make sure to initialize start and len in find_free_dev_extent btrfs: reset destination buffer when read_extent_buffer() gets invalid range btrfs: properly report 0 avail for very full file systems btrfs: log message if extent item not found when running delayed extent op btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref() btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1 btrfs: prevent transaction block reserve underflow when starting transaction btrfs: fix race when refilling delayed refs block reserve
2023-09-21btrfs: initialize start_slot in btrfs_log_prealloc_extentsJosef Bacik
Jens reported a compiler warning when using CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’: fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used uninitialized [-Wmaybe-uninitialized] 4828 | ret = copy_items(trans, inode, dst_path, path, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 4829 | start_slot, ins_nr, 1, 0); | ~~~~~~~~~~~~~~~~~~~~~~~~~ fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here 4725 | int start_slot; | ^~~~~~~~~~ The compiler is incorrect, as we only use this code when ins_len > 0, and when ins_len > 0 we have start_slot properly initialized. However we generally find the -Wmaybe-uninitialized warnings valuable, so initialize start_slot to get rid of the warning. Reported-by: Jens Axboe <axboe@kernel.dk> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-28Merge tag 'for-6.6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "No new features, the bulk of the changes are fixes, refactoring and cleanups. The notable fix is the scrub performance restoration after rewrite in 6.4, though still only partial. Fixes: - scrub performance drop due to rewrite in 6.4 partially restored: - do IO grouping by blg_plug/blk_unplug again - avoid unnecessary tree searches when processing stripes, in extent and checksum trees - the drop is noticeable on fast PCIe devices, -66% and restored to -33% of the original - backports to 6.4 planned - handle more corner cases of transaction commit during orphan cleanup or delayed ref processing - use correct fsid/metadata_uuid when validating super block - copy directory permissions and time when creating a stub subvolume Core: - debugging feature integrity checker deprecated, to be removed in 6.7 - in zoned mode, zones are activated just before the write, making error handling easier, now the overcommit mechanism can be enabled again which improves performance by avoiding more frequent flushing - v0 extent handling completely removed, deprecated long time ago - error handling improvements - tests: - extent buffer bitmap tests - pinned extent splitting tests - cleanups and refactoring: - compression writeback - extent buffer bitmap - space flushing, ENOSPC handling" * tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits) btrfs: zoned: skip splitting and logical rewriting on pre-alloc write btrfs: tests: test invalid splitting when skipping pinned drop extent_map btrfs: tests: add a test for btrfs_add_extent_mapping btrfs: tests: add extent_map tests for dropping with odd layouts btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() btrfs: scrub: don't go ordered workqueue for dev-replace btrfs: scrub: fix grouping of read IO btrfs: scrub: avoid unnecessary csum tree search preparing stripes btrfs: scrub: avoid unnecessary extent tree search preparing stripes btrfs: copy dir permission and time when creating a stub subvolume btrfs: remove pointless empty list check when reading delayed dir indexes btrfs: drop redundant check to use fs_devices::metadata_uuid btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super btrfs: use the correct superblock to compare fsid in btrfs_validate_super btrfs: simplify memcpy either of metadata_uuid or fsid btrfs: add a helper to read the superblock metadata_uuid btrfs: remove v0 extent handling btrfs: output extra debug info if we failed to find an inline backref btrfs: move the !zoned assert into run_delalloc_cow btrfs: consolidate the error handling in run_delalloc_nocow ...
2023-08-21btrfs: use LIST_HEAD() to initialize the list_headRuan Jinjie
Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove redundant initialization of variables in log_new_ancestorsColin Ian King
The variables leaf and slot are initialized when declared but the values assigned to them are never read as they are being re-assigned later on. The initializations are redundant and can be removed. Cleans up clang scan build warnings: fs/btrfs/tree-log.c:6797:25: warning: Value stored to 'leaf' during its initialization is never read [deadcode.DeadStores] fs/btrfs/tree-log.c:6798:7: warning: Value stored to 'slot' during its initialization is never read [deadcode.DeadStores] It's been there since b8aa330d2acb ("Btrfs: improve performance on fsync of files with multiple hardlinks") without any usage so it's safe to be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-13btrfs: convert to ctime accessor functionsJeff Layton
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-27-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-06-19btrfs: do not BUG_ON() when dropping inode items from log rootFilipe Manana
When dropping inode items from a log tree at drop_inode_items(), we this BUG_ON() on the result of btrfs_search_slot() because we don't expect an exact match since having a key with an offset of (u64)-1 is unexpected. That is generally true, but for dir index keys for example, we can get a key with such an offset value, even though it's very unlikely and it would take ages to increase the sequence counter for a dir index up to (u64)-1. We can deal with an exact match, we just have to delete the key at that slot, so there is really no need to BUG_ON(), error out or trigger any warning. So remove the BUG_ON(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: rename the bytenr field in struct btrfs_ordered_sum to logicalChristoph Hellwig
btrfs_ordered_sum::bytendr stores a logical address. Make that clear by renaming it to ->logical. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: change for_rename argument of btrfs_record_unlink_dir() to boolFilipe Manana
The for_rename argument of btrfs_record_unlink_dir() is defined as an integer, but the argument is in fact used as a boolean. So change it to a boolean to make its use more clear. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove pointless label and goto at btrfs_record_unlink_dir()Filipe Manana
There's no point of having a label and goto at btrfs_record_unlink_dir() because the function is trivial and can just return early if we are not in a rename context. So remove the label and goto and instead return early if we are not in a rename. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: update comments at btrfs_record_unlink_dir() to be more clearFilipe Manana
Update the comments at btrfs_record_unlink_dir() so that they mention where new names are logged and where old names are removed. Also, while at it make the width of the comments closer to 80 columns and capitalize the sentences and finish them with punctuation. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use inode_logged() at btrfs_record_unlink_dir()Filipe Manana
At btrfs_record_unlink_dir() we directly check the logged_trans field of the given inodes to check if they were previously logged in the current transaction, and if any of them were, then we can avoid setting the field last_unlink_trans of the directory to the id of the current transaction if we are in a rename path. Avoiding that can later prevent falling back to a transaction commit if anyone attempts to log the directory. However the logged_trans field, store in struct btrfs_inode, is transient, not persisted in the inode item on its subvolume b+tree, so that means that if an inode is evicted and then loaded again, its original value is lost and it's reset to 0. So directly checking the logged_trans field can lead to some false negative, and that only results in a performance impact as mentioned before. Instead of directly checking the logged_trans field of the inodes, use the inode_logged() helper, which will check in the log tree if an inode was logged before in case its logged_trans field has a value of 0. This way we can avoid setting the directory inode's last_unlink_trans and cause future logging attempts of it to fallback to transaction commits. The following test script shows one example where this happens without this patch: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 num_init_files=10000 num_new_files=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $num_init_files; i++)); do echo -n > $MNT/testdir/file_$i done echo -n > $MNT/testdir/foo sync # Add some files so that there's more work in the transaction other # than just renaming file foo. for ((i = 1; i <= $num_new_files; i++)); do echo -n > $MNT/testdir/new_file_$i done # Change the file, fsync it. setfattr -n user.x1 -v 123 $MNT/testdir/foo xfs_io -c "fsync" $MNT/testdir/foo # Now triggger eviction of file foo but no eviction for our test # directory, since it is being used by the process below. This will # set logged_trans of the file's inode to 0 once it is loaded again. ( cd $MNT/testdir while true; do : done ) & pid=$! echo 2 > /proc/sys/vm/drop_caches kill $pid wait $pid # Move foo out of our testdir. This will set last_unlink_trans # of the directory inode to the current transaction, because # logged_trans of both the directory and the file are set to 0. mv $MNT/testdir/foo $MNT/foo # Change file foo again and fsync it. # This fsync will result in a transaction commit because the rename # above has set last_unlink_trans of the parent directory to the id # of the current transaction and because our inode for file foo has # last_unlink_trans set to the current transaction, since it was # evicted and reloaded and it was previously modified in the current # transaction (the xattr addition). xfs_io -c "pwrite 0 64K" $MNT/foo start=$(date +%s%N) xfs_io -c "fsync" $MNT/foo end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "file fsync took: $dur milliseconds" umount $MNT Before this patch: fsync took 19 milliseconds After this patch: fsync took 5 milliseconds Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use inode_logged() at need_log_inode()Filipe Manana
At need_log_inode() we directly check the ->logged_trans field of the given inode to check if it was previously logged in the transaction, with the goal of skipping logging the inode again when it's not necessary. The ->logged_trans field in not persisted in the inode item or elsewhere, it's only stored in memory (struct btrfs_inode), so it's transient and lost once the inode is evicted and then loaded again. Once an inode is loaded, we are conservative and set ->logged_trans to 0, which may mean that either the inode was never logged in the current transaction or it was logged but evicted before being loaded again. Instead of checking the inode's ->logged_trans field directly, we can use instead the helper inode_logged(), which will really check if the inode was logged before in the current transaction in case we have a ->logged_trans field with a value of 0. This will prevent unnecessarily logging an inode when it's not needed, and in some cases preventing a transaction commit, in case the logging requires a fallback to a transaction commit. The following test script shows a scenario where due to eviction we fallback a transaction commit when trying to fsync a file that was renamed: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 num_init_files=10000 num_new_files=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $num_init_files; i++)); do echo -n > $MNT/testdir/file_$i done echo -n > $MNT/testdir/foo sync # Add some files so that there's more work in the transaction other # than just renaming file foo. for ((i = 1; i <= $num_new_files; i++)); do echo -n > $MNT/testdir/new_file_$i done # Fsync the directory first. xfs_io -c "fsync" $MNT/testdir # Rename file foo. mv $MNT/testdir/foo $MNT/testdir/bar # Now trigger eviction of the test directory's inode. # Once loaded again, it will have logged_trans set to 0 and # last_unlink_trans set to the current transaction. echo 2 > /proc/sys/vm/drop_caches # Fsync file bar (ex-foo). # Before the patch the fsync would result in a transaction commit # because the inode for file bar has last_unlink_trans set to the # current transaction, so it will attempt to log the parent directory # as well, which will fallback to a full transaction commit because # it also has its last_unlink_trans set to the current transaction, # due to the inode eviction. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir/bar end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "file fsync took: $dur milliseconds" umount $MNT Before this patch: fsync took 22 milliseconds After this patch: fsync took 8 milliseconds Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-26btrfs: fix an uninitialized variable warning in btrfs_log_inodeShida Zhang
This fixes the following warning reported by gcc 10.2.1 under x86_64: ../fs/btrfs/tree-log.c: In function ‘btrfs_log_inode’: ../fs/btrfs/tree-log.c:6211:9: error: ‘last_range_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 6211 | ret = insert_dir_log_key(trans, log, path, key.objectid, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 6212 | first_dir_index, last_dir_index); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ../fs/btrfs/tree-log.c:6161:6: note: ‘last_range_start’ was declared here 6161 | u64 last_range_start; | ^~~~~~~~~~~~~~~~ This might be a false positive fixed in later compiler versions but we want to have it fixed. Reported-by: k2ci <kernel-bot@kylinos.cn> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: use log root when iterating over index keys when logging directoryFilipe Manana
When logging dir dentries of a directory, we iterate over the subvolume tree to find dir index keys on leaves modified in the current transaction. This however is heavy on locking, since btrfs_search_forward() may often keep locks on extent buffers for quite a while when walking the tree to find a suitable leaf modified in the current transaction and with a key not smaller than then the provided minimum key. That means it will block other tasks trying to access the subvolume tree, which may be common fs operations like creating, renaming, linking, unlinking, reflinking files, etc. A better solution is to iterate the log tree, since it's much smaller than a subvolume tree and just use plain btrfs_search_slot() (or the wrapper btrfs_for_each_slot()) and only contains dir index keys added in the current transaction. The following bonnie++ test on a non-debug kernel (with Debian's default kernel config) on a 20G null block device, was used to measure the impact: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 NR_DIRECTORIES=20 NR_FILES=20480 # must be a multiple of 1024 DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES )) NR_FILES=$(( NR_FILES / 1024 )) umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount $DEV $MNT bonnie++ -u root -d $MNT \ -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \ -r 0 -s $DATASET_SIZE -b umount $MNT Before patchset: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 8G 376k 99 1.1g 98 939m 92 1527k 99 3.2g 99 9060 256 Latency 24920us 207us 680ms 5594us 171us 2891us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20/20 20480 96 +++++ +++ 20480 95 20480 99 +++++ +++ 20480 97 Latency 8708us 137us 5128us 6743us 60us 19712us After patchset: Version 2.00a ------Sequential Output------ --Sequential Input- --Random- -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP debian0 8G 384k 99 1.2g 99 971m 91 1533k 99 3.3g 99 9180 309 Latency 24930us 125us 661ms 5587us 46us 2020us Version 2.00a ------Sequential Create------ --------Random Create-------- debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 20/20 20480 90 +++++ +++ 20480 99 20480 99 +++++ +++ 20480 97 Latency 7030us 61us 1246us 4942us 56us 16855us The patchset consists of this patch plus a previous one that has the following subject: "btrfs: avoid iterating over all indexes when logging directory" Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: avoid iterating over all indexes when logging directoryFilipe Manana
When logging a directory, after copying all directory index items from the subvolume tree to the log tree, we iterate over the subvolume tree to find all dir index items that are located in leaves COWed (or created) in the current transaction. If we keep logging a directory several times during the same transaction, we end up iterating over the same dir index items everytime we log the directory, wasting time and adding extra lock contention on the subvolume tree. So just keep track of the last logged dir index offset in order to start the search for that index (+1) the next time the directory is logged, as dir index values (key offsets) come from a monotonically increasing counter. The following test measures the difference before and after this change: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT # Time values in milliseconds. declare -a fsync_times # Total number of files added to the test directory. num_files=1000000 # Fsync directory after every N files are added. fsync_period=100 mkdir $MNT/testdir fsync_total_time=0 for ((i = 1; i <= $num_files; i++)); do echo -n > $MNT/testdir/file_$i if [ $((i % fsync_period)) -eq 0 ]; then start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) fsync_total_time=$((fsync_total_time + (end - start))) fsync_times[i]=$(( (end - start) / 1000000 )) echo -n -e "Progress $i / $num_files\r" fi done echo -e "\nHistogram of directory fsync duration in ms:\n" printf '%s\n' "${fsync_times[@]}" | \ perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);' fsync_total_time=$((fsync_total_time / 1000000)) echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n" echo umount $MNT The test was run on a non-debug kernel (Debian's default kernel config) against a 15G null block device. Result before this change: Histogram of directory fsync duration in ms: Count: 10000 Range: 3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751 Percentiles: 90th: 71.000; 95th: 77.000; 99th: 81.000 3.000 - 5.278: 1423 ################################# 5.278 - 8.854: 1173 ########################### 8.854 - 14.467: 591 ############## 14.467 - 23.277: 1025 ####################### 23.277 - 37.105: 1422 ################################# 37.105 - 58.809: 2036 ############################################### 58.809 - 92.876: 2316 ##################################################### 92.876 - 146.346: 6 | 146.346 - 230.271: 6 | 230.271 - 362.000: 2 | Total time spent in fsync: 350527 ms Result after this change: Histogram of directory fsync duration in ms: Count: 10000 Range: 3.000 - 1088.000; Mean: 8.704; Median: 8.000; Stddev: 12.576 Percentiles: 90th: 12.000; 95th: 14.000; 99th: 17.000 3.000 - 6.007: 3222 ################################# 6.007 - 11.276: 5197 ##################################################### 11.276 - 20.506: 1551 ################ 20.506 - 36.674: 24 | 36.674 - 201.552: 1 | 201.552 - 353.841: 4 | 353.841 - 1088.000: 1 | Total time spent in fsync: 92114 ms Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: tree-log: factor out a clean_log_buffer helperChristoph Hellwig
The tree-log code has three almost identical copies for the accounting on an extent_buffer that doesn't need to be written any more. The only difference is that walk_down_log_tree passed the bytenr used to find the buffer instead of extent_buffer.start and calculates the length using the nodesize, while the other two callers look at the extent_buffer.len field that must always be equivalent to the nodesize. Factor the code into a common helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: open code btrfs_bin_search()Anand Jain
btrfs_bin_search() is a simple wrapper that searches for the whole slots by calling btrfs_generic_bin_search() with the starting slot/first_slot preset to 0. This simple wrapper can be open coded as btrfs_bin_search(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: replace btrfs_wait_tree_block_writeback by ↵Josef Bacik
wait_on_extent_buffer_writeback This is used in the tree-log code and is a holdover from previous iterations of extent buffer writeback. We can simply use wait_on_extent_buffer_writeback here, and remove btrfs_wait_tree_block_writeback completely as it's equivalent (waiting on page write writeback). Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: rename btrfs_clean_tree_block to btrfs_clear_buffer_dirtyJosef Bacik
btrfs_clean_tree_block is a misnomer, it's just clear_extent_buffer_dirty with some extra accounting around it. Rename this to btrfs_clear_buffer_dirty to make it more clear it belongs with it's setter, btrfs_mark_buffer_dirty. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: replace clearing extent buffer dirty bit with btrfs_clean_blockJosef Bacik
Now that we're passing in the trans into btrfs_clean_tree_block, we can easily roll in the handling of the !trans case and replace all occurrences of if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) clear_extent_buffer_dirty(eb); with btrfs_tree_lock(eb); btrfs_clean_tree_block(eb); btrfs_tree_unlock(eb); We need the lock because if we are actually dirty we need to make sure we aren't racing with anything that's starting writeout currently. This also makes sure that we're accounting fs_info->dirty_metadata_bytes appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: add trans argument to btrfs_clean_tree_blockJosef Bacik
We check the header generation in the extent buffer against the current running transaction id to see if it's safe to clear DIRTY on this buffer. Generally speaking if we're clearing the buffer dirty we're holding the transaction open, but in the case of cleaning up an aborted transaction we don't, so we have extra checks in that path to check the transid. To allow for a future cleanup go ahead and pass in the trans handle so we don't have to rely on ->running_transaction being set. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: use a single variable to track return value for log_dir_items()Filipe Manana
We currently use 'ret' and 'err' to track the return value for log_dir_items(), which is confusing and likely the cause for previous bugs where log_dir_items() did not return an error when it should, fixed in previous patches. So change this and use only a single variable, 'ret', to track the return value. This is simpler and makes it similar to most of the existing code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: use a negative value for BTRFS_LOG_FORCE_COMMITFilipe Manana
Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value has a few inconveniences: 1) If it's ever used by btrfs_log_inode(), or any function down the call chain, we have to remember to btrfs_set_log_full_commit(), which is repetitive and has a chance to be forgotten in future use cases. btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when it gets a negative value from btrfs_log_inode(); 2) Down the call chain of btrfs_log_inode(), we may have functions that need to force a log commit, but can return either an error (negative value), false (0) or true (1). So they are forced to return some random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT would make the intention more clear. Currently the only example is flush_dir_items_batch(). So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes it easier to debug. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-06btrfs: simplify update of last_dir_index_offset when logging a directoryFilipe Manana
When logging a directory, we always set the inode's last_dir_index_offset to the offset of the last dir index item we found. This is using an extra field in the log context structure, and it makes more sense to update it only after we insert dir index items, and we could directly update the inode's last_dir_index_offset field instead. So make this simpler by updating the inode's last_dir_index_offset only when we actually insert dir index keys in the log tree, and getting rid of the last_dir_item_offset field in the log context structure. Reported-by: David Arendt <admin@prnet.org> Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com> Link: https://lore.kernel.org/linux-btrfs/Y8voyTXdnPDz8xwY@mail.gmail.com/ Reported-by: Hunter Wardlaw <wardlawhunter@gmail.com> Link: https://bugzilla.suse.com/show_bug.cgi?id=1207231 Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216851 CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-12btrfs: do not abort transaction on failure to update log rootFilipe Manana
When syncing a log, if we fail to update a log root in the log root tree, we are aborting the transaction if the failure was not -ENOSPC. This is excessive because there is a chance that a transaction commit can succeed, and therefore avoid to turn the filesystem into RO mode. All we need to be careful about is to mark the log for a full commit, which we already do, to make sure no one commits a super block pointing to an outdated log root tree. So don't abort the transaction if we fail to update a log root in the log root tree, and log an error if the failure is not -ENOSPC, so that it does not go completely unnoticed. CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-12btrfs: do not abort transaction on failure to write log tree when syncing logFilipe Manana
When syncing the log, if we fail to write log tree extent buffers, we mark the log for a full commit and abort the transaction. However we don't need to abort the transaction, all we really need to do is to make sure no one can commit a superblock pointing to new log tree roots. Just because we got a failure writing extent buffers for a log tree, it does not mean we will also fail to do a transaction commit. One particular case is if due to a bug somewhere, when writing log tree extent buffers, the tree checker detects some corruption and the writeout fails because of that. Aborting the transaction can be very disruptive for a user, specially if the issue happened on a root filesystem. One example is the scenario in the Link tag below, where an isolated corruption on log tree leaves was causing transaction aborts when syncing the log. Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-12btrfs: add missing setup of log for full commit at add_conflicting_inode()Filipe Manana
When logging conflicting inodes, if we reach the maximum limit of inodes, we return BTRFS_LOG_FORCE_COMMIT to force a transaction commit. However we don't mark the log for full commit (with btrfs_set_log_full_commit()), which means that once we leave the log transaction and before we commit the transaction, some other task may sync the log, which is incomplete as we have not logged all conflicting inodes, leading to some inconsistent in case that log ends up being replayed. So also call btrfs_set_log_full_commit() at add_conflicting_inode(). Fixes: e09d94c9e448 ("btrfs: log conflicting inodes without holding log mutex of the initial inode") CC: stable@vger.kernel.org # 6.1 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-12btrfs: fix directory logging due to race with concurrent index key deletionFilipe Manana
Sometimes we log a directory without holding its VFS lock, so while we logging it, dir index entries may be added or removed. This typically happens when logging a dentry from a parent directory that points to a new directory, through log_new_dir_dentries(), or when while logging some other inode we also need to log its parent directories (through btrfs_log_all_parents()). This means that while we are at log_dir_items(), we may not find a dir index key we found before, because it was deleted in the meanwhile, so a call to btrfs_search_slot() may return 1 (key not found). In that case we return from log_dir_items() with a success value (the variable 'err' has a value of 0). This can lead to a few problems, specially in the case where the variable 'last_offset' has a value of (u64)-1 (and it's initialized to that when it was declared): 1) By returning from log_dir_items() with success (0) and a value of (u64)-1 for '*last_offset_ret', we end up not logging any other dir index keys that follow the missing, just deleted, index key. The (u64)-1 value makes log_directory_changes() not call log_dir_items() again; 2) Before returning with success (0), log_dir_items(), will log a dir index range item covering a range from the last old dentry index (stored in the variable 'last_old_dentry_offset') to the value of 'last_offset'. If 'last_offset' has a value of (u64)-1, then it means if the log is persisted and replayed after a power failure, it will cause deletion of all the directory entries that have an index number between last_old_dentry_offset + 1 and (u64)-1; 3) We can end up returning from log_dir_items() with ctx->last_dir_item_offset having a lower value than inode->last_dir_index_offset, because the former is set to the current key we are processing at process_dir_items_leaf(), and at the end of log_directory_changes() we set inode->last_dir_index_offset to the current value of ctx->last_dir_item_offset. So if for example a deletion of a lower dir index key happened, we set ctx->last_dir_item_offset to that index value, then if we return from log_dir_items() because btrfs_search_slot() returned 1, we end up returning from log_dir_items() with success (0) and then log_directory_changes() sets inode->last_dir_index_offset to a lower value than it had before. This can result in unpredictable and unexpected behaviour when we need to log again the directory in the same transaction, and can result in ending up with a log tree leaf that has duplicated keys, as we do batch insertions of dir index keys into a log tree. So fix this by making log_dir_items() move on to the next dir index key if it does not find the one it was looking for. Reported-by: David Arendt <admin@prnet.org> Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-12btrfs: fix missing error handling when logging directory itemsFilipe Manana
When logging a directory, at log_dir_items(), if we get an error when attempting to search the subvolume tree for a dir index item, we end up returning 0 (success) from log_dir_items() because 'err' is left with a value of 0. This can lead to a few problems, specially in the case the variable 'last_offset' has a value of (u64)-1 (and it's initialized to that when it was declared): 1) By returning from log_dir_items() with success (0) and a value of (u64)-1 for '*last_offset_ret', we end up not logging any other dir index keys that follow the missing, just deleted, index key. The (u64)-1 value makes log_directory_changes() not call log_dir_items() again; 2) Before returning with success (0), log_dir_items(), will log a dir index range item covering a range from the last old dentry index (stored in the variable 'last_old_dentry_offset') to the value of 'last_offset'. If 'last_offset' has a value of (u64)-1, then it means if the log is persisted and replayed after a power failure, it will cause deletion of all the directory entries that have an index number between last_old_dentry_offset + 1 and (u64)-1; 3) We can end up returning from log_dir_items() with ctx->last_dir_item_offset having a lower value than inode->last_dir_index_offset, because the former is set to the current key we are processing at process_dir_items_leaf(), and at the end of log_directory_changes() we set inode->last_dir_index_offset to the current value of ctx->last_dir_item_offset. So if for example a deletion of a lower dir index key happened, we set ctx->last_dir_item_offset to that index value, then if we return from log_dir_items() because btrfs_search_slot() returned an error, we end up returning without any error from log_dir_items() and then log_directory_changes() sets inode->last_dir_index_offset to a lower value than it had before. This can result in unpredictable and unexpected behaviour when we need to log again the directory in the same transaction, and can result in ending up with a log tree leaf that has duplicated keys, as we do batch insertions of dir index keys into a log tree. Fix this by setting 'err' to the value of 'ret' in case btrfs_search_slot() or btrfs_previous_item() returned an error. That will result in falling back to a full transaction commit. Reported-by: David Arendt <admin@prnet.org> Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/ Fixes: e02119d5a7b4 ("Btrfs: Add a write ahead tree log to optimize synchronous operations") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-20btrfs: fix fscrypt name leak after failure to join log transactionFilipe Manana
When logging a new name, we don't expect to fail joining a log transaction since we know at least one of the inodes was logged before in the current transaction. However if we fail for some unexpected reason, we end up not freeing the fscrypt name we previously allocated. So fix that by freeing the name in case we failed to join a log transaction. Fixes: ab3c5c18e8fa ("btrfs: setup qstr from dentrys using fscrypt helper") Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove outdated logic from overwrite_item() and add assertionFilipe Manana
As of commit 193df6245704 ("btrfs: search for last logged dir index if it's not cached in the inode"), the overwrite_item() function is always called for a root that is from a fs/subvolume tree. In other words, now it's only used during log replay to modify a fs/subvolume tree. Therefore we can remove the logic that checks if we are dealing with a log tree at overwrite_item(). So remove that logic, replacing it with an assertion and document that if we ever need to support a log root there, we will need to clone the leaf from the fs/subvolume tree and then release it before modifying the log tree, which is needed to avoid a potential deadlock, similar to the one recently fixed by a patch with the subject: "btrfs: do not modify log tree while holding a leaf from fs tree locked" Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: unify overwrite_item() and do_overwrite_item()Filipe Manana
After commit 193df6245704 ("btrfs: search for last logged dir index if it's not cached in the inode"), there are no more callers of do_overwrite_item(), except overwrite_item(). Originally both used to be the same function, but were split in commit 086dcbfa50d3 ("btrfs: insert items in batches when logging a directory when possible"), as there was the need to execute all logic of overwrite_item() but skip the tree search, since in the context of directory logging we already had a path with a leaf to copy data from. So unify them again as there is no more need to have them split. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>