summaryrefslogtreecommitdiff
path: root/fs/btrfs/ordered-data.c
AgeCommit message (Collapse)Author
2023-12-15btrfs: migrate subpage code to folio interfacesQu Wenruo
Although subpage itself is conflicting with higher folio, since subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never need higher order folio, there is a hidden pitfall: - btrfs_page_*() helpers Those helpers are an abstraction to handle both subpage and non-subpage cases, which means we're going to pass pages pointers to those helpers. And since those helpers are shared between data and metadata paths, it's unavoidable to let them to handle folios, including higher order folios). Meanwhile for true subpage case, we should only have a single page backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert() to ensure that. Also since those helpers are shared between both data and metadata, add some extra ASSERT()s for data path to make sure we only get single page backed folio for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06btrfs: fix qgroup_free_reserved_data int overflowBoris Burkov
The reserved data counter and input parameter is a u64, but we inadvertently accumulate it in an int. Overflowing that int results in freeing the wrong amount of data and breaking reserve accounting. Unfortunately, this overflow rot spreads from there, as the qgroup release/free functions rely on returning an int to take advantage of negative values for error codes. Therefore, the full fix is to return the "released" or "freed" amount by a u64 argument and to return 0 or negative error code via the return value. Most of the call sites simply ignore the return value, though some of them handle the error and count the returned bytes. Change all of them accordingly. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06btrfs: free qgroup reserve when ORDERED_IOERR is setBoris Burkov
An ordered extent completing is a critical moment in qgroup reserve handling, as the ownership of the reservation is handed off from the ordered extent to the delayed ref. In the happy path we release (unlock) but do not free (decrement counter) the reservation, and the delayed ref drives the free. However, on an error, we don't create a delayed ref, since there is no ref to add. Therefore, free on the error path. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: open code btrfs_ordered_inode_tree in btrfs_inodeDavid Sterba
The structure btrfs_ordered_inode_tree is used only in one place, in btrfs_inode. The structure itself has a 4 byte hole which is wasted space. Move the btrfs_ordered_inode_tree members to btrfs_inode with a common prefix 'ordered_tree_' where the hole can be utilized and shrink inode size. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: merge ordered work callbacks in btrfs_work into oneDavid Sterba
There are two callbacks defined in btrfs_work but only two actually make use of them, otherwise there are NULLs. We can get rid of the freeing callback making it a special case of the normal work. This reduces the size of btrfs_work by 8 bytes, final layout: struct btrfs_work { btrfs_func_t func; /* 0 8 */ btrfs_ordered_func_t ordered_func; /* 8 8 */ struct work_struct normal_work; /* 16 32 */ struct list_head ordered_list; /* 48 16 */ /* --- cacheline 1 boundary (64 bytes) --- */ struct btrfs_workqueue * wq; /* 64 8 */ long unsigned int flags; /* 72 8 */ /* size: 80, cachelines: 2, members: 6 */ /* last cacheline: 16 bytes */ }; This in turn reduces size of other structures (on a release config): - async_chunk 160 -> 152 - async_submit_bio 152 -> 144 - btrfs_async_delayed_work 104 -> 96 - btrfs_caching_control 176 -> 168 - btrfs_delalloc_work 144 -> 136 - btrfs_fs_info 3608 -> 3600 - btrfs_ordered_extent 440 -> 424 - btrfs_writepage_fixup 104 -> 96 Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add support for inserting raid stripe extentsJohannes Thumshirn
Add support for inserting stripe extents into the raid stripe tree on completion of every write that needs an extra logical-to-physical translation when using RAID. Inserting the stripe extents happens after the data I/O has completed, this is done to a) support zone-append and b) rule out the possibility of a RAID-write-hole. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08btrfs: check for BTRFS_FS_ERROR in pending ordered assertJosef Bacik
If we do fast tree logging we increment a counter on the current transaction for every ordered extent we need to wait for. This means we expect the transaction to still be there when we clear pending on the ordered extent. However if we happen to abort the transaction and clean it up, there could be no running transaction, and thus we'll trip the "ASSERT(trans)" check. This is obviously incorrect, and the code properly deals with the case that the transaction doesn't exist. Fix this ASSERT() to only fire if there's no trans and we don't have BTRFS_FS_ERROR() set on the file system. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use LIST_HEAD() to initialize the list_headRuan Jinjie
Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove btrfs_writepage_endio_finish_orderedChristoph Hellwig
btrfs_writepage_endio_finish_ordered is a small wrapper around btrfs_mark_ordered_io_finished that just changs the argument passing slightly, and adds a tracepoint. Move the tracpoint to btrfs_mark_ordered_io_finished, which means it now also covers the error handling in btrfs_cleanup_ordered_extent and switch all callers to just call btrfs_mark_ordered_io_finished directly. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: add a btrfs_finish_ordered_extent helperChristoph Hellwig
Add a helper to complete an ordered_extent without first doing a lookup. The tracepoint cannot use the ordered_extent class as we also want to print the range. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: factor out a btrfs_queue_ordered_fn helperChristoph Hellwig
Factor out a helper to queue up an ordered_extent completion in a work queue. This new helper will later be used complete an ordered_extent without first doing a lookup. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: factor out a can_finish_ordered_extent helperChristoph Hellwig
Factor out a helper from btrfs_mark_ordered_io_finished that does the actual per-ordered_extent work to check if we want to schedule an I/O completion. This new helper will later be used complete an ordered_extent without first doing a lookup. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove btrfs_add_ordered_extentChristoph Hellwig
All callers are gone now. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: handle completed ordered extents in btrfs_split_ordered_extentChristoph Hellwig
To delay splitting ordered_extents to I/O completion time we need to be able to handle fully completed ordered extents in btrfs_split_ordered_extent. Besides a bit of accounting this primarily involved moving over the csums to the split bio for the range that it covers, which is simple enough because we always have one btrfs_ordered_sum per bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: atomically insert the new extent in btrfs_split_ordered_extentChristoph Hellwig
Currently there is a small race window in btrfs_split_ordered_extent, where the reduced old extent can be looked up on the per-inode rbtree or the per-root list while the newly split out one isn't visible yet. Fix this by open coding btrfs_alloc_ordered_extent in btrfs_split_ordered_extent, and holding the tree lock and root->ordered_extent_lock over the entire tree and extent manipulation. Note that this introduces new lock ordering because previously ordered_extent_lock was never held over the tree lock. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpersChristoph Hellwig
Split two low-level helpers out of btrfs_alloc_ordered_extent to allocate and insert the logic extent. The pure alloc helper will be used to improve btrfs_split_ordered_extent. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: return the new ordered_extent from btrfs_split_ordered_extentChristoph Hellwig
Return the ordered_extent split from the passed in one. This will be needed to be able to store an ordered_extent in the btrfs_bio. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: optimize the logical to physical mapping for zoned writesChristoph Hellwig
The current code to store the final logical to physical mapping for a zone append write in the extent tree is rather inefficient. It first has to split the ordered extent so that there is one ordered extent per bio, so that it can look up the ordered extent on I/O completion in btrfs_record_physical_zoned and store the physical LBA returned by the block driver in the ordered extent. btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to see what physical address the logical address for this bio / ordered extent is mapped to, and then rewrite it in the extent tree. To optimize this process, we can store the physical address assigned in the chunk tree to the original logical address and a pointer to btrfs_ordered_sum structure the in the btrfs_bio structure, and then use this information to rewrite the logical address in the btrfs_ordered_sum structure directly at I/O completion time in btrfs_record_physical_zoned. btrfs_rewrite_logical_zoned then simply updates the logical address in the extent tree and the ordered_extent itself. The code in btrfs_rewrite_logical_zoned now runs for all data I/O completions in zoned file systems, which is fine as there is no remapping to do for non-append writes to conventional zones or for relocation, and the overhead for quickly breaking out of the loop is very low. Because zoned file systems now need the ordered_sums structure to record the actual write location returned by zone append, allocate dummy structures without the csum array for them when the I/O doesn't use checksums, and free them when completing the ordered_extent. Note that the btrfs_bio doesn't grow as the new field are places into a union that is so far not used for data writes and has plenty of space left in it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extentChristoph Hellwig
The function btrfs_clone_ordered_extent is very specific to the usage in btrfs_split_ordered_extent. Now that only a single call to btrfs_clone_ordered_extent is left, just fold it into btrfs_split_ordered_extent to make the operation more clear. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: sink parameter len to btrfs_split_ordered_extentChristoph Hellwig
btrfs_split_ordered_extent is only ever asked to split out the beginning of an ordered_extent (i.e. post == 0). Change it to only take a len to split out, and switch it to allocate the new extent for the beginning, as that helps with callers that want to keep a pointer to the ordered_extent that it is stealing from. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: move ordered_extent internal sanity checks into ↵Christoph Hellwig
btrfs_split_ordered_extent Move the three checks that are about ordered extent internal sanity checking into btrfs_split_ordered_extent instead of doing them in the higher level btrfs_extract_ordered_extent routine. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: pass flags as unsigned long to btrfs_add_ordered_extentBoris Burkov
The ordered_extent flags are declared as unsigned long, so pass them as such to btrfs_add_ordered_extent. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> [ hch: split from a larger patch ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: add function to create and return an ordered extentBoris Burkov
Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds it to the rb_tree, but doesn't return a referenced pointer to the caller. There are cases where it is useful for the creator of a new ordered_extent to hang on to such a pointer, so add a new function btrfs_alloc_ordered_extent which is the same as btrfs_add_ordered_extent, except it takes an additional reference count and returns a pointer to the ordered_extent. Implement btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by dropping the new reference and handling the IS_ERR case. The type of flags in btrfs_alloc_ordered_extent and btrfs_add_ordered_extent is changed from unsigned int to unsigned long so it's unified with the other ordered extent functions. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: remove the wait argument to btrfs_start_ordered_extentChristoph Hellwig
Given that wait is always set to 1, so remove the argument. Last use of wait with 0 was in 0c304304feab ("Btrfs: remove csum_bytes_left"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-12Merge tag 'for-6.2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This round there are a lot of cleanups and moved code so the diffstat looks huge, otherwise there are some nice performance improvements and an update to raid56 reliability. User visible features: - raid56 reliability vs performance trade off: - fix destructive RMW for raid5 data (raid6 still needs work): do full checksum verification for all data during RMW cycle, this should prevent rewriting potentially corrupted data without notice - stripes are cached in memory which should reduce the performance impact but still can hurt some workloads - checksums are verified after repair again - this is the last option without introducing additional features (write intent bitmap, journal, another tree), the extra checksum read/verification was supposed to be avoided by the original implementation exactly for performance reasons but that caused all the reliability problems - discard=async by default for devices that support it - implement emergency flush reserve to avoid almost all unnecessary transaction aborts due to ENOSPC in cases where there are too many delayed refs or delayed allocation - skip block group synchronization if there's no change in used bytes, can reduce transaction commit count for some workloads Performance improvements: - fiemap and lseek: - overall speedup due to skipping unnecessary or duplicate searches (-40% run time) - cache some data structures and sharedness of extents (-30% run time) - send: - faster backref resolution when finding clones - cached leaf to root mapping for faster backref walking - improved clone/sharing detection - overall run time improvements (-70%) Core: - module initialization converted to a table of function pointers run in a sequence - preparation for fscrypt, extend passing file names across calls, dir item can store encryption status - raid56 updates: - more accurate error tracking of sectors within stripe - simplify recovery path and remove dedicated endio worker kthread - simplify scrub call paths - refactoring to support the extra data checksum verification during RMW cycle - tree block parentness checks consolidated and done at metadata read time - improved error handling - cleanups: - move a lot of code for better synchronization between kernel and user space sources, split big files - enum cleanups - GFP flag cleanups - header file cleanups, prototypes, dependencies - redundant parameter cleanups - inline extent handling simplifications - inode parameter conversion - data structure cleanups, reductions, renames, merges" * tag 'for-6.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (249 commits) btrfs: print transaction aborted messages with an error level btrfs: sync some cleanups from progs into uapi/btrfs.h btrfs: do not BUG_ON() on ENOMEM when dropping extent items for a range btrfs: fix extent map use-after-free when handling missing device in read_one_chunk btrfs: remove outdated logic from overwrite_item() and add assertion btrfs: unify overwrite_item() and do_overwrite_item() btrfs: replace strncpy() with strscpy() btrfs: fix uninitialized variable in find_first_clear_extent_bit btrfs: fix uninitialized parent in insert_state btrfs: add might_sleep() annotations btrfs: add stack helpers for a few btrfs items btrfs: add nr_global_roots to the super block definition btrfs: remove BTRFS_LEAF_DATA_OFFSET btrfs: add helpers for manipulating leaf items and data btrfs: add eb to btrfs_node_key_ptr_offset btrfs: pass the extent buffer for the btrfs_item_nr helpers btrfs: move the csum helpers into ctree.h btrfs: move eb offset helpers into extent_io.h btrfs: move file_extent_item helpers into file-item.h btrfs: move leaf_data_end into ctree.c ...
2022-12-05btrfs: pass btrfs_inode to btrfs_add_delayed_iputDavid Sterba
The function is for internal interfaces so we should use the btrfs_inode. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move super_block specific helpers into super.hJosef Bacik
This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move file prototypes to file.hJosef Bacik
Move these out of ctree.h into file.h to cut down on code in ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update function commentsDavid Sterba
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move the printk helpers out of ctree.hJosef Bacik
We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: use cached_state for btrfs_check_nocow_lockJosef Bacik
Now that try_lock_extent() takes a cached_state, plumb the cached_state through btrfs_try_lock_ordered_range() and then use a cached_state in btrfs_check_nocow_lock everywhere to avoid extra tree searches on the extent_io_tree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: add a cached_state to try_lock_extentJosef Bacik
With nowait becoming more pervasive throughout our codebase go ahead and add a cached_state to try_lock_extent(). This allows us to be faster about clearing the locked area if we have contention, and then gives us the same optimization for unlock if we are able to lock the range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-25btrfs: replace INT_LIMIT(loff_t) with OFFSET_MAXZhen Lei
OFFSET_MAX is self-annotated and more readable. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-09-29btrfs: add btrfs_try_lock_ordered_rangeJosef Bacik
For IOCB_NOWAIT we're going to want to use try lock on the extent lock, and simply bail if there's an ordered extent in the range because the only choice there is to wait for the ordered extent to complete. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Stefan Roesch <shr@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: unify the lock/unlock extent variantsJosef Bacik
We have two variants of lock/unlock extent, one set that takes a cached state, another that does not. This is slightly annoying, and generally speaking there are only a few places where we don't have a cached state. Simplify this by making lock_extent/unlock_extent the only variant and make it take a cached state, then convert all the callers appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add lockdep annotations for the ordered extents wait eventIoannis Angelakopoulos
This wait event is very similar to the pending ordered wait event in the sense that it occurs in a different context than the condition signaling for the event. The signaling occurs in btrfs_remove_ordered_extent() while the wait event is implemented in btrfs_start_ordered_extent() in fs/btrfs/ordered-data.c However, in this case a thread must not acquire the lockdep map for the ordered extents wait event when the ordered extent is related to a free space inode. That is because lockdep creates dependencies between locks acquired both in execution paths related to normal inodes and paths related to free space inodes, thus leading to false positives. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: add lockdep annotations for pending_ordered wait eventIoannis Angelakopoulos
In contrast to the num_writers and num_extwriters wait events, the condition for the pending ordered wait event is signaled in a different context from the wait event itself. The condition signaling occurs in btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait event is implemented in btrfs_commit_transaction() in fs/btrfs/transaction.c Thus the thread signaling the condition has to acquire the lockdep map as a reader at the start of btrfs_remove_ordered_extent() and release it after it has signaled the condition. In this case some dependencies might be left out due to the placement of the annotation, but it is better than no annotation at all. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finishedChristoph Hellwig
finish_func is always set to finish_ordered_fn, so remove it and also the now pointless and somewhat confusingly named __endio_write_update_ordered wrapper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: add tracepoints for ordered extentsJohannes Thumshirn
When debugging a reference counting issue with ordered extents, I've found we're lacking a lot of tracepoint coverage in the ordered extent code. Close these gaps by adding tracepoints after every refcount_inc() in the ordered extent code. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: fix typos in commentsDavid Sterba
Codespell has found a few typos. Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: add BTRFS_IOC_ENCODED_WRITEOmar Sandoval
The implementation resembles direct I/O: we have to flush any ordered extents, invalidate the page cache, and do the io tree/delalloc/extent map/ordered extent dance. From there, we can reuse the compression code with a minor modification to distinguish the write from writeback. This also creates inline extents when possible. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: add ram_bytes and offset to btrfs_ordered_extentOmar Sandoval
Currently, we only create ordered extents when ram_bytes == num_bytes and offset == 0. However, BTRFS_IOC_ENCODED_WRITE writes may create extents which only refer to a subset of the full unencoded extent, so we need to plumb these fields through the ordered extent infrastructure and pass them down to insert_reserved_file_extent(). Since we're changing the btrfs_add_ordered_extent* signature, let's get rid of the trivial wrappers and add a kernel-doc. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-07btrfs: zoned: fix double counting of split ordered extentNaohiro Aota
btrfs_add_ordered_extent_*() add num_bytes to fs_info->ordered_bytes. Then, splitting an ordered extent will call btrfs_add_ordered_extent_*() again for split extents, leading to double counting of the region of a split extent. These leaked bytes are finally reported at unmount time as follow: BTRFS info (device dm-1): at unmount dio bytes count 364544 Fix the double counting by subtracting split extent's size from fs_info->ordered_bytes. Fixes: d22002fd37bd ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-08-23btrfs: remove uptodate parameter from btrfs_dec_test_first_ordered_pendingDavid Sterba
In commit e65f152e4348 ("btrfs: refactor how we finish ordered extent io for endio functions") there was last caller not using 1 for the uptodate parameter. Now there's only one, passing 1, so we can remove it and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-22btrfs: store a block_device in struct btrfs_ordered_extentChristoph Hellwig
Store the block device instead of the gendisk in the btrfs_ordered_extent structure instead of acquiring a reference to it later. Note: this is from series removing bdgrab/bdput, btrfs is one of the last users. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: make page Ordered bit to be subpage compatibleQu Wenruo
This involves the following modification: - Ordered extent creation This is done in process_one_page(), now PAGE_SET_ORDERED will call subpage helper to do the work. - endio functions This is done in btrfs_mark_ordered_io_finished(). - btrfs_invalidatepage() - btrfs_cleanup_ordered_extents() Use the subpage page helper, and add an extra branch to exit if the locked page have covered the full range. Now the usage of page Ordered flag for ordered extent accounting is fully subpage compatible. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: rename PagePrivate2 to PageOrdered inside btrfsQu Wenruo
Inside btrfs we use Private2 page status to indicate we have an ordered extent with pending IO for the sector. But the page status name, Private2, tells us nothing about the bit itself, so this patch will rename it to Ordered. And with extra comment about the bit added, so reader who is still uncertain about the page Ordered status, will find the comment pretty easily. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: introduce btrfs_lookup_first_ordered_range()Qu Wenruo
Although we already have btrfs_lookup_first_ordered_extent() and btrfs_lookup_ordered_extent(), they all have their own limitations: - btrfs_lookup_ordered_extent() can't do extra range check It's only designed to lookup any ordered extent before certain bytenr. - btrfs_lookup_first_ordered_extent() may not return the first ordered extent in the range It doesn't ensure the first ordered extent is returned. The existing callers are only interested in exhausting all ordered extents in a range, the order is not important. For incoming btrfs_invalidatepage() refactoring, we need a way to properly iterate all ordered extents in their bytenr order of a range. So this patch will introduce a new function, btrfs_lookup_first_ordered_range(), to do ordered extent with bytenr order awareness and extra range check. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-21btrfs: refactor how we finish ordered extent io for endio functionsQu Wenruo
Btrfs has two endio functions to mark certain io range finished for ordered extents: - __endio_write_update_ordered() This is for direct IO - btrfs_writepage_endio_finish_ordered() This for buffered IO. However they go different routines to handle ordered extent io: - Whether to iterate through all ordered extents __endio_write_update_ordered() will but btrfs_writepage_endio_finish_ordered() will not. In fact, iterating through all ordered extents will benefit later subpage support, while for current PAGE_SIZE == sectorsize requirement this behavior makes no difference. - Whether to update page Private2 flag __endio_write_update_ordered() will not update page Private2 flag as for iomap direct IO, the page can not be even mapped. While btrfs_writepage_endio_finish_ordered() will clear Private2 to prevent double accounting against btrfs_invalidatepage(). Those differences are pretty subtle, and the ordered extent iterations code in callers makes code much harder to read. So this patch will introduce a new function, btrfs_mark_ordered_io_finished(), to do the heavy lifting: - Iterate through all ordered extents in the range - Do the ordered extent accounting - Queue the work for finished ordered extent This function has two new feature: - Proper underflow detection and recovery The old underflow detection will only detect the problem, then continue. No proper info like root/inode/ordered extent info, nor noisy enough to be caught by fstests. Furthermore when underflow happens, the ordered extent will never finish. New error detection will reset the bytes_left to 0, do proper kernel warning, and output extra info including root, ino, ordered extent range, the underflow value. - Prevent double accounting based on Private2 flag Now if we find a range without Private2 flag, we will skip to next range. As that means someone else has already finished the accounting of ordered extent. This makes no difference for current code, but will be a critical part for incoming subpage support, as we can call btrfs_mark_ordered_io_finished() for multiple sectors if they are beyond inode size. Thus such double accounting prevention is a key feature for subpage. Now both endio functions only need to call that new function. And since the only caller of btrfs_dec_test_first_ordered_pending() is removed, also remove btrfs_dec_test_first_ordered_pending() completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-28btrfs: zoned: fix silent data loss after failure splitting ordered extentFilipe Manana
On a zoned filesystem, sometimes we need to split an ordered extent into 3 different ordered extents. The original ordered extent is shortened, at the front and at the rear, and we create two other new ordered extents to represent the trimmed parts of the original ordered extent. After adjusting the original ordered extent, we create an ordered extent to represent the pre-range, and that may fail with ENOMEM for example. After that we always try to create the ordered extent for the post-range, and if that happens to succeed we end up returning success to the caller as we overwrite the 'ret' variable which contained the previous error. This means we end up with a file range for which there is no ordered extent, which results in the range never getting a new file extent item pointing to the new data location. And since the split operation did not return an error, writeback does not fail and the inode's mapping is not flagged with an error, resulting in a subsequent fsync not reporting an error either. It's possibly very unlikely to have the creation of the post-range ordered extent succeed after the creation of the pre-range ordered extent failed, but it's not impossible. So fix this by making sure we only create the post-range ordered extent if there was no error creating the ordered extent for the pre-range. Fixes: d22002fd37bd97 ("btrfs: zoned: split ordered extent when bio is sent") CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>