summaryrefslogtreecommitdiff
path: root/fs/btrfs/disk-io.c
AgeCommit message (Collapse)Author
2022-05-16btrfs: do not return errors from btrfs_submit_metadata_bioChristoph Hellwig
btrfs_submit_metadata_bio already calls ->bi_end_io on error and the caller must ignore the return value, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove unused bio_flags argument to btrfs_submit_metadata_bioChristoph Hellwig
This argument is unused since commit 953651eb308f ("btrfs: factor out helper adding a page to bio") and commit 1b36294a6cd5 ("btrfs: call submit_bio_hook directly for metadata pages") reworked the way metadata bio submission is handled. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use a read/write lock for protecting the block groups treeFilipe Manana
Currently we use a spin lock to protect the red black tree that we use to track block groups. Most accesses to that tree are actually read only and for large filesystems, with thousands of block groups, it actually has a bad impact on performance, as concurrent read only searches on the tree are serialized. Read only searches on the tree are very frequent and done when: 1) Pinning and unpinning extents, as we need to lookup the respective block group from the tree; 2) Freeing the last reference of a tree block, regardless if we pin the underlying extent or add it back to free space cache/tree; 3) During NOCOW writes, both buffered IO and direct IO, we need to check if the block group that contains an extent is read only or not and to increment the number of NOCOW writers in the block group. For those operations we need to search for the block group in the tree. Similarly, after creating the ordered extent for the NOCOW write, we need to decrement the number of NOCOW writers from the same block group, which requires searching for it in the tree; 4) Decreasing the number of extent reservations in a block group; 5) When allocating extents and freeing reserved extents; 6) Adding and removing free space to the free space tree; 7) When releasing delalloc bytes during ordered extent completion; 8) When relocating a block group; 9) During fitrim, to iterate over the block groups; 10) etc; Write accesses to the tree, to add or remove block groups, are much less frequent as they happen only when allocating a new block group or when deleting a block group. We also use the same spin lock to protect the list of currently caching block groups. Additions to this list are made when we need to cache a block group, because we don't have a free space cache for it (or we have but it's invalid), and removals from this list are done when caching of the block group's free space finishes. These cases are also not very common, but when they happen, they happen only once when the filesystem is mounted. So switch the lock that protects the tree of block groups from a spinning lock to a read/write lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use rbtree with leftmost node cached for tracking lowest block groupFilipe Manana
We keep track of the start offset of the block group with the lowest start offset at fs_info->first_logical_byte. This requires explicitly updating that field every time we add, delete or lookup a block group to/from the red black tree at fs_info->block_group_cache_tree. Since the block group with the lowest start address happens to always be the one that is the leftmost node of the tree, we can use a red black tree that caches the left most node. Then when we need the start address of that block group, we can just quickly get the leftmost node in the tree and extract the start offset of that node's block group. This avoids the need to explicitly keep track of that address in the dedicated member fs_info->first_logical_byte, and it also allows the next patch in the series to switch the lock that protects the red black tree from a spin lock to a read/write lock - without this change it would be tricky because block group searches also update fs_info->first_logical_byte. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: check-integrity: split submit_bio from btrfsic checkingChristoph Hellwig
Require a separate call to the integrity checking helpers from the actual bio submission. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove unnecessary type castsYu Zhe
Explicit type casts are not necessary when it's void* to another pointer type. Signed-off-by: Yu Zhe <yuzhe@nfschina.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: expand subpage support to any PAGE_SIZE > 4KQu Wenruo
With the recent change in metadata handling, we can handle metadata in the following cases: - nodesize < PAGE_SIZE and sectorsize < PAGE_SIZE Go subpage routine for both metadata and data. - nodesize < PAGE_SIZE and sectorsize >= PAGE_SIZE Invalid case for now. As we require nodesize >= sectorsize. - nodesize >= PAGE_SIZE and sectorsize < PAGE_SIZE Go subpage routine for data, but regular page routine for metadata. - nodesize >= PAGE_SIZE and sectorsize >= PAGE_SIZE Go regular page routine for both metadata and data. Now we can handle any sectorsize < PAGE_SIZE, plus the existing sectorsize == PAGE_SIZE support. But here we introduce an artificial limit, any PAGE_SIZE > 4K case, we will only support 4K and PAGE_SIZE as sector size. The idea here is to reduce the test combinations, and push 4K as the default standard in the future. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routineQu Wenruo
The reason why we only support 64K page size for subpage is, for 64K page size we can ensure no matter what the nodesize is, we can fit it into one page. When other page size come, especially like 16K, the limitation is a bit limiting. To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the non-subpage routine. By this, we can allow 4K sectorsize on 16K page size. Although this introduces another smaller limitation, the metadata can not cross page boundary, which is already met by most recent mkfs. Another small improvement is, we can avoid the overhead for metadata if nodesize >= PAGE_SIZE. For 4K sector size and 64K page size/node size, or 4K sector size and 16K page size/node size, we don't need to allocate extra memory for the metadata pages. Please note that, this patch will not yet enable other page size support yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: tree-checker: check extent buffer owner against owner rootidQu Wenruo
Btrfs doesn't check whether the tree block respects the root owner. This means, if a tree block referred by a parent in extent tree, but has owner of 5, btrfs can still continue reading the tree block, as long as it doesn't trigger other sanity checks. Normally this is fine, but combined with the empty tree check in check_leaf(), if we hit an empty extent tree, but the root node has csum tree owner, we can let such extent buffer to sneak in. Shrink the hole by: - Do extra eb owner check at tree read time - Make sure the root owner extent buffer exactly matches the root id. Unfortunately we can't yet completely patch the hole, there are several call sites can't pass all info we need: - For reloc/log trees Their owner is key::offset, not key::objectid. We need the full root key to do that accurate check. For now, we just skip the ownership check for those trees. - For add_data_references() of relocation That call site doesn't have any parent/ownership info, as all the bytenrs are all from btrfs_find_all_leafs(). - For direct backref items walk Direct backref items records the parent bytenr directly, thus unlike indirect backref item, we don't do a full tree search. Thus in that case, we don't have full parent owner to check. For the later two cases, they all pass 0 as @owner_root, thus we can skip those cases if @owner_root is 0. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove trivial wrapper btrfs_read_buffer()Filipe Manana
The function btrfs_read_buffer() is useless, it just calls btree_read_extent_buffer_pages() with exactly the same arguments. So remove it and rename btree_read_extent_buffer_pages() to btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_" prefix (since it's used outside disk-io.c) and the name is clear enough about what it does. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-09btrfs: Convert to release_folioMatthew Wilcox (Oracle)
I've only converted the outer layers of the btrfs release_folio paths to use folios; the use of folios should be pushed further down into btrfs from here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-05-06Merge tag 'for-5.18-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Regression fixes in zone activation: - move a loop invariant out of the loop to avoid checking space status - properly handle unlimited activation Other fixes: - for subpage, force the free space v2 mount to avoid a warning and make it easy to switch a filesystem on different page size systems - export sysfs status of exclusive operation 'balance paused', so the user space tools can recognize it and allow adding a device with paused balance - fix assertion failure when logging directory key range item" * tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: sysfs: export the balance paused state of exclusive operation btrfs: fix assertion failure when logging directory key range item btrfs: zoned: activate block group properly on unlimited active zone device btrfs: zoned: move non-changing condition check out of the loop btrfs: force v2 space cache usage for subpage mount
2022-05-05btrfs: force v2 space cache usage for subpage mountQu Wenruo
[BUG] For a 4K sector sized btrfs with v1 cache enabled and only mounted on systems with 4K page size, if it's mounted on subpage (64K page size) systems, it can cause the following warning on v1 space cache: BTRFS error (device dm-1): csum mismatch on free space cache BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now Although not a big deal, as kernel can rebuild it without problem, such warning will bother end users, especially if they want to switch the same btrfs seamlessly between different page sized systems. [CAUSE] V1 free space cache is still using fixed PAGE_SIZE for various bitmap, like BITS_PER_BITMAP. Such hard-coded PAGE_SIZE usage will cause various mismatch, from v1 cache size to checksum. Thus kernel will always reject v1 cache with a different PAGE_SIZE with csum mismatch. [FIX] Although we should fix v1 cache, it's already going to be marked deprecated soon. And we have v2 cache based on metadata (which is already fully subpage compatible), and it has almost everything superior than v1 cache. So just force subpage mount to use v2 cache on mount. Reported-by: Matt Corallo <blnxfsl@bluematt.me> CC: stable@vger.kernel.org # 5.15+ Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-26Merge tag 'for-5.18-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - direct IO fixes: - restore passing file offset to correctly calculate checksums when repairing on read and bio split happens - use correct bio when sumitting IO on zoned filesystem - zoned mode fixes: - fix selection of device to correctly calculate device capabilities when allocating a new bio - use a dedicated lock for exclusion during relocation - fix leaked plug after failure syncing log - fix assertion during scrub and relocation * tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: use dedicated lock for data relocation btrfs: fix assertion failure during scrub due to block group reallocation btrfs: fix direct I/O writes for split bios on zoned devices btrfs: fix direct I/O read repair for split bios btrfs: fix and document the zoned device choice in alloc_new_bio btrfs: fix leaked plug after failure syncing log on zoned filesystems
2022-04-21btrfs: zoned: use dedicated lock for data relocationNaohiro Aota
Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive writeback of the relocation data inode in btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock in the following path. Thread A takes btrfs_inode_lock() and waits for metadata reservation by e.g, waiting for writeback: prealloc_file_extent_cluster() - btrfs_inode_lock(&inode->vfs_inode, 0); - btrfs_prealloc_file_range() ... - btrfs_replace_file_extents() - btrfs_start_transaction ... - btrfs_reserve_metadata_bytes() Thread B (e.g, doing a writeback work) needs to wait for the inode lock to continue writeback process: do_writepages - btrfs_writepages - extent_writpages - btrfs_zoned_data_reloc_lock(BTRFS_I(inode)); - btrfs_inode_lock() The deadlock is caused by relying on the vfs_inode's lock. By using it, we introduced unnecessary exclusion of writeback and btrfs_prealloc_file_range(). Also, the lock at this point is useless as we don't have any dirty pages in the inode yet. Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive writeback. Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16.x: 869f4cdc73f9: btrfs: zoned: encapsulate inode locking for zoned relocation CC: stable@vger.kernel.org # 5.16.x CC: stable@vger.kernel.org # 5.17 Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-17block: add a bdev_write_cache helperChristoph Hellwig
Add a helper to check the write cache flag based on the block_device instead of having to poke into the block layer internal request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220415045258.199825-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-17btrfs: simplify ->flush_bio handlingChristoph Hellwig
Use and embedded bios that is initialized when used instead of bio_kmalloc plus bio_reset. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220406061228.410163-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-04-14Merge tag 'for-5.18-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more code and warning fixes. There's one feature ioctl removal patch slated for 5.18 that did not make it to the main pull request. It's just a one-liner and the ioctl has a v2 that's in use for a long time, no point to postpone it to 5.19. Late update: - remove balance v1 ioctl, superseded by v2 in 2012 Fixes: - add back cgroup attribution for compressed writes - add super block write start/end annotations to asynchronous balance - fix root reference count on an error handling path - in zoned mode, activate zone at the chunk allocation time to avoid ENOSPC due to timing issues - fix delayed allocation accounting for direct IO Warning fixes: - simplify assertion condition in zoned check - remove an unused variable" * tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix btrfs_submit_compressed_write cgroup attribution btrfs: fix root ref counts in error handling in btrfs_get_root_ref btrfs: zoned: activate block group only for extent allocation btrfs: return allocated block group from do_chunk_alloc() btrfs: mark resumed async balance as writing btrfs: remove support of balance v1 ioctl btrfs: release correct delalloc amount in direct IO write path btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups() btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range
2022-04-06btrfs: fix root ref counts in error handling in btrfs_get_root_refJia-Ju Bai
In btrfs_get_root_ref(), when btrfs_insert_fs_root() fails, btrfs_put_root() can happen for two reasons: - the root already exists in the tree, in that case it returns the reference obtained in btrfs_lookup_fs_root() - another error so the cleanup is done in the fail label Calling btrfs_put_root() unconditionally would lead to double decrement of the root reference possibly freeing it in the second case. Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Fixes: bc44d7c4b2b1 ("btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root") CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22Merge tag 'for-5.18-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This contains feature updates, performance improvements, preparatory and core work and some related VFS updates: Features: - encoded read/write ioctls, allows user space to read or write raw data directly to extents (now compressed, encrypted in the future), will be used by send/receive v2 where it saves processing time - zoned mode now works with metadata DUP (the mkfs.btrfs default) - error message header updates: - print error state: transaction abort, other error, log tree errors - print transient filesystem state: remount, device replace, ignored checksum verifications - tree-checker: verify the transaction id of the to-be-written dirty extent buffer Performance improvements for fsync: - directory logging speedups (up to -90% run time) - avoid logging all directory changes during renames (up to -60% run time) - avoid inode logging during rename and link when possible (up to -60% run time) - prepare extents to be logged before locking a log tree path (throughput +7%) - stop copying old file extents when doing a full fsync() - improved logging of old extents after truncate Core, fixes: - improved stale device identification by dev_t and not just path (for devices that are behind other layers like device mapper) - continued extent tree v2 preparatory work - disable features that won't work yet - add wrappers and abstractions for new tree roots - improved error handling - add super block write annotations around background block group reclaim - fix device scanning messages potentially accessing stale pointer - cleanups and refactoring VFS: - allow reflinks/deduplication from two different mounts of the same filesystem - export and add helpers for read/write range verification, for the encoded ioctls" * tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (98 commits) btrfs: zoned: put block group after final usage btrfs: don't access possibly stale fs_info data in device_list_add btrfs: add lockdep_assert_held to need_preemptive_reclaim btrfs: verify the tranisd of the to-be-written dirty extent buffer btrfs: unify the error handling of btrfs_read_buffer() btrfs: unify the error handling pattern for read_tree_block() btrfs: factor out do_free_extent_accounting helper btrfs: remove last_ref from the extent freeing code btrfs: add a alloc_reserved_extent helper btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block btrfs: add and use helper for unlinking inode during log replay btrfs: extend locking to all space_info members accesses btrfs: zoned: mark relocation as writing fs: allow cross-vfsmount reflink/dedupe btrfs: remove the cross file system checks from remap btrfs: pass btrfs_fs_info to btrfs_recover_relocation btrfs: pass btrfs_fs_info for deleting snapshots and cleaner btrfs: add filesystems state details to error messages btrfs: deal with unexpected extent type during reflinking btrfs: fix unexpected error path when reflinking an inline extent ...
2022-03-21Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...
2022-03-15btrfs: Convert from set_page_dirty to dirty_folioMatthew Wilcox (Oracle)
Optimise the non-DEBUG case to just call filemap_dirty_folio directly. The DEBUG case doesn't actually compile, but convert it to dirty_folio anyway. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15btrfs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)
A lot of the underlying infrastructure in btrfs needs to be switched over to folios, but this at least documents that invalidatepage can't be passed a tail page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-14btrfs: verify the tranisd of the to-be-written dirty extent bufferQu Wenruo
[BUG] There is a bug report that a bitflip in the transid part of an extent buffer makes btrfs to reject certain tree blocks: BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22 [CAUSE] Note the failed transid check, hex(262166) = 0x40016, while hex(22) = 0x16. It's an obvious bitflip. Furthermore, the reporter also confirmed the bitflip is from the hardware, so it's a real hardware caused bitflip, and such problem can not be detected by the existing tree-checker framework. As tree-checker can only verify the content inside one tree block, while generation of a tree block can only be verified against its parent. So such problem remain undetected. [FIX] Although tree-checker can not verify it at write-time, we still have a quick (but not the most accurate) way to catch such obvious corruption. Function csum_one_extent_buffer() is called before we submit metadata write. Thus it means, all the extent buffer passed in should be dirty tree blocks, and should be newer than last committed transaction. Using that we can catch the above bitflip. Although it's not a perfect solution, as if the corrupted generation is higher than the correct value, we have no way to catch it at all. Reported-by: Christoph Anton Mitterer <calestyo@scientia.org> Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/ CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@sus,ree.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: unify the error handling pattern for read_tree_block()Qu Wenruo
We had an error handling pattern for read_tree_block() like this: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here * Normally ended up with return or goto out. */ } else if (!extent_buffer_uptodate(eb)) { /* * Different error handling here * Normally also ended up with return or goto out; */ } This is fine, but if we want to add extra check for each read_tree_block(), the existing if-else-if is not that expandable and will take reader some seconds to figure out there is no extra branch. Here we change it to a more common way, without the extra else: eb = read_tree_block(); if (IS_ERR(eb)) { /* * Handling error here */ return eb or goto out; } if (!extent_buffer_uptodate(eb)) { /* * Different error handling here */ return eb or goto out; } This also removes some oddball call sites which uses some creative way to check error. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: pass btrfs_fs_info to btrfs_recover_relocationJosef Bacik
We don't need a root here, we just need the btrfs_fs_info, we can just get the specific roots we need from fs_info. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: pass btrfs_fs_info for deleting snapshots and cleanerJosef Bacik
We're passing a root around here, but we only really need the fs_info, so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead, and then fix up all the callers appropriately. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: add support for multiple global rootsJosef Bacik
With extent tree v2 you will be able to create multiple csum, extent, and free space trees. They will be used based on the block group, which will now use the block_group_item->chunk_objectid to point to the set of global roots that it will use. When allocating new block groups we'll simply mod the gigabyte offset of the block group against the number of global roots we have and that will be the block groups global id. >From there we can take the bytenr that we're modifying in the respective tree, look up the block group and get that block groups corresponding global root id. From there we can get to the appropriate global root for that bytenr. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: add code to support the block group rootJosef Bacik
This code adds the on disk structures for the block group root, which will hold the block group items for extent tree v2. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: abstract out loading the tree rootJosef Bacik
We're going to be adding more roots that need to be loaded from the super block, so abstract out the code to read the tree_root from the super block, and use this helper for the chunk root as well. This will make it simpler to load the new trees in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-02btrfs: do not start relocation until in progress drops are doneJosef Bacik
We hit a bug with a recovering relocation on mount for one of our file systems in production. I reproduced this locally by injecting errors into snapshot delete with balance running at the same time. This presented as an error while looking up an extent item WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 RIP: 0010:lookup_inline_extent_backref+0x647/0x680 RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 Call Trace: <TASK> insert_inline_extent_backref+0x46/0xd0 __btrfs_inc_extent_ref.isra.0+0x5f/0x200 ? btrfs_merge_delayed_refs+0x164/0x190 __btrfs_run_delayed_refs+0x561/0xfa0 ? btrfs_search_slot+0x7b4/0xb30 ? btrfs_update_root+0x1a9/0x2c0 btrfs_run_delayed_refs+0x73/0x1f0 ? btrfs_update_root+0x1a9/0x2c0 btrfs_commit_transaction+0x50/0xa50 ? btrfs_update_reloc_root+0x122/0x220 prepare_to_merge+0x29f/0x320 relocate_block_group+0x2b8/0x550 btrfs_relocate_block_group+0x1a6/0x350 btrfs_relocate_chunk+0x27/0xe0 btrfs_balance+0x777/0xe60 balance_kthread+0x35/0x50 ? btrfs_balance+0xe60/0xe60 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Normally snapshot deletion and relocation are excluded from running at the same time by the fs_info->cleaner_mutex. However if we had a pending balance waiting to get the ->cleaner_mutex, and a snapshot deletion was running, and then the box crashed, we would come up in a state where we have a half deleted snapshot. Again, in the normal case the snapshot deletion needs to complete before relocation can start, but in this case relocation could very well start before the snapshot deletion completes, as we simply add the root to the dead roots list and wait for the next time the cleaner runs to clean up the snapshot. Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that had a pending drop_progress key. If they do then we know we were in the middle of the drop operation and set a flag on the fs_info. Then balance can wait until this flag is cleared to start up again. If there are DEAD_ROOT's that don't have a drop_progress set then we're safe to start balance right away as we'll be properly protected by the cleaner_mutex. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-02block: pass a block_device and opf to bio_resetChristoph Hellwig
Pass the block_device that we plan to use this bio for and the operation to bio_reset to optimize the assigment. A NULL block_device can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02block: pass a block_device and opf to bio_allocChristoph Hellwig
Pass the block_device and operation that we plan to use this bio for to bio_alloc to optimize the assignment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-07btrfs: output more debug messages for uncommitted transactionQu Wenruo
Print extra information about how many dirty bytes an uncommitted has at the end of mount. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: remove reada infrastructureQu Wenruo
Currently there is only one user for btrfs metadata readahead, and that's scrub. But even for the single user, it's not providing the correct functionality it needs, as scrub needs reada for commit root, which current readahead can't provide. (Although it's pretty easy to add such feature). Despite this, there are some extra problems related to metadata readahead: - Duplicated feature with btrfs_path::reada - Partly duplicated feature of btrfs_fs_info::buffer_radix Btrfs already caches its metadata in buffer_radix, while readahead tries to read the tree block no matter if it's already cached. - Poor layer separation Metadata readahead works kinda at device level. This is definitely not the correct layer it should be, since metadata is at btrfs logical address space, it should not bother device at all. This brings extra chance for bugs to sneak in, while brings unnecessary complexity. - Dead code In the very beginning of scrub.c we have #undef DEBUG, rendering all the debug related code useless and unable to test. Thus here I purpose to remove the metadata readahead mechanism completely. [BENCHMARK] There is a full benchmark for the scrub performance difference using the old btrfs_reada_add() and btrfs_path::reada. For the worst case (no dirty metadata, slow HDD), there could be a 5% performance drop for scrub. For other cases (even SATA SSD), there is no distinguishable performance difference. The number is reported scrub speed, in MiB/s. The resolution is limited by the reported duration, which only has a resolution of 1 second. Old New Diff SSD 455.3 466.332 +2.42% HDD 103.927 98.012 -5.69% Comprehensive test methodology is in the cover letter of the patch. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-07btrfs: make send work with concurrent block group relocationFilipe Manana
We don't allow send and balance/relocation to run in parallel in order to prevent send failing or silently producing some bad stream. This is because while send is using an extent (specially metadata) or about to read a metadata extent and expecting it belongs to a specific parent node, relocation can run, the transaction used for the relocation is committed and the extent gets reallocated while send is still using the extent, so it ends up with a different content than expected. This can result in just failing to read a metadata extent due to failure of the validation checks (parent transid, level, etc), failure to find a backreference for a data extent, and other unexpected failures. Besides reallocation, there's also a similar problem of an extent getting discarded when it's unpinned after the transaction used for block group relocation is committed. The restriction between balance and send was added in commit 9e967495e0e0 ("Btrfs: prevent send failures and crashes due to concurrent relocation"), kernel 5.3, while the more general restriction between send and relocation was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs while we have send operations running"), kernel 5.14. Both send and relocation can be very long running operations. Relocation because it has to do a lot of IO and expensive backreference lookups in case there are many snapshots, and send due to read IO when operating on very large trees. This makes it inconvenient for users and tools to deal with scheduling both operations. For zoned filesystem we also have automatic block group relocation, so send can fail with -EAGAIN when users least expect it or send can end up delaying the block group relocation for too long. In the future we might also get the automatic block group relocation for non zoned filesystems. This change makes it possible for send and relocation to run in parallel. This is achieved the following way: 1) For all tree searches, send acquires a read lock on the commit root semaphore; 2) After each tree search, and before releasing the commit root semaphore, the leaf is cloned and placed in the search path (struct btrfs_path); 3) After releasing the commit root semaphore, the changed_cb() callback is invoked, which operates on the leaf and writes commands to the pipe (or file in case send/receive is not used with a pipe). It's important here to not hold a lock on the commit root semaphore, because if we did we could deadlock when sending and receiving to the same filesystem using a pipe - the send task blocks on the pipe because it's full, the receive task, which is the only consumer of the pipe, triggers a transaction commit when attempting to create a subvolume or reserve space for a write operation for example, but the transaction commit blocks trying to write lock the commit root semaphore, resulting in a deadlock; 4) Before moving to the next key, or advancing to the next change in case of an incremental send, check if a transaction used for relocation was committed (or is about to finish its commit). If so, release the search path(s) and restart the search, to where we were before, so that we don't operate on stale extent buffers. The search restarts are always possible because both the send and parent roots are RO, and no one can add, remove of update keys (change their offset) in RO trees - the only exception is deduplication, but that is still not allowed to run in parallel with send; 5) Periodically check if there is contention on the commit root semaphore, which means there is a transaction commit trying to write lock it, and release the semaphore and reschedule if there is contention, so as to avoid causing any significant delays to transaction commits. This leaves some room for optimizations for send to have less path releases and re searching the trees when there's relocation running, but for now it's kept simple as it performs quite well (on very large trees with resulting send streams in the order of a few hundred gigabytes). Test case btrfs/187, from fstests, stresses relocation, send and deduplication attempting to run in parallel, but without verifying if send succeeds and if it produces correct streams. A new test case will be added that exercises relocation happening in parallel with send and then checks that send succeeds and the resulting streams are correct. A final note is that for now this still leaves the mutual exclusion between send operations and deduplication on files belonging to a root used by send operations. A solution for that will be slightly more complex but it will eventually be built on top of this change. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: track the csum, extent, and free space trees in a rb treeJosef Bacik
In the future we are going to have multiple copies of these trees. To facilitate this we need a way to lookup the different roots we are looking for. Handle this by adding a global root rb tree that is indexed on the root->root_key. Then instead of loading the roots at mount time with individually targeted keys, simply search the tree_root for anything with the specific objectid we want. This will make it straightforward to support both old style and new style file systems. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: stop accessing ->free_space_root directlyJosef Bacik
We're going to have multiple free space roots in the future, so adjust all the users of the free space root to use a helper to access the root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: stop accessing ->csum_root directlyJosef Bacik
We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: set BTRFS_FS_STATE_NO_CSUMS if we fail to load the csum rootJosef Bacik
We have a few places where we skip doing csums if we mounted with one of the rescue options that ignores bad csum roots. In the future when there are multiple csum roots it'll be costly to check and see if there are any missing csum roots, so simply add a flag to indicate the fs should skip loading csums in case of errors. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: stop accessing ->extent_root directlyJosef Bacik
When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: init root block_rsv at init root timeJosef Bacik
In the future we're going to have multiple csum and extent root trees, so init the roots block_rsv at setup_root time based on their root key objectid. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: pass fs_info to trace_btrfs_transaction_commitJosef Bacik
The root on the trans->root can be anything, and generally we're committing from the transaction kthread so it's usually the tree_root. Change this to just take an fs_info, and to maintain compatibility simply put the ROOT_TREE_OBJECTID as the root objectid for the tracepoint. This will allow use to remove trans->root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: rework async transaction committingJosef Bacik
Currently we do this awful thing where we get another ref on a trans handle, async off that handle and commit the transaction from that work. Because we do this we have to mess with current->journal_info and the freeze counting stuff. We already have an async thing to kick for the transaction commit, the transaction kthread. Replace this work struct with a flag on the fs_info to tell the kthread to go ahead and commit even if it's before our timeout. Then we can drastically simplify the async transaction commit path. Note: this can be simplified and functionality based on the pending operation COMMIT. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add note ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove unused BTRFS_FS_BARRIER flagJosef Bacik
This is no longer used, the -o nobarrier is handled by BTRFS_MOUNT_NOBARRIER. Remove the flag. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: get rid of root->orphan_cleanup_stateJosef Bacik
Now that we don't care about the stage of the orphan_cleanup_state, simply replace it with a bit on ->state to make sure we don't call the orphan cleanup every time we wander into this root. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: zoned: cache reported zone during mountNaohiro Aota
When mounting a device, we are reporting the zones twice: once for checking the zone attributes in btrfs_get_dev_zone_info and once for loading block groups' zone info in btrfs_load_block_group_zone_info(). With a lot of block groups, that leads to a lot of REPORT ZONE commands and slows down the mount process. This patch introduces a zone info cache in struct btrfs_zoned_device_info. The cache is populated while in btrfs_get_dev_zone_info() and used for btrfs_load_block_group_zone_info() to reduce the number of REPORT ZONE commands. The zone cache is then released after loading the block groups, as it will not be much effective during the run time. Benchmark: Mount an HDD with 57,007 block groups Before patch: 171.368 seconds After patch: 64.064 seconds While it still takes a minute due to the slowness of loading all the block groups, the patch reduces the mount time by 1/3. Link: https://lore.kernel.org/linux-btrfs/CAHQ7scUiLtcTqZOMMY5kbWUBOhGRwKo6J6wYPT5WY+C=cD49nQ@mail.gmail.com/ Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") CC: stable@vger.kernel.org Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-03btrfs: remove unused parameter fs_devices from btrfs_init_workqueuesSu Yue
Since commit ba8a9d079543 ("Btrfs: delete the entire async bio submission framework") removed submit workqueues, the parameter fs_devices is not used anymore. Remove it, no functional changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-17Merge tag 'for-5.16-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes, almost all error handling one-liners and for stable. - regression fix in directory logging items - regression fix of extent buffer status bits handling after an error - fix memory leak in error handling path in tree-log - fix freeing invalid anon device number when handling errors during subvolume creation - fix warning when freeing leaf after subvolume creation failure - fix missing blkdev put in device scan error handling - fix invalid delayed ref after subvolume creation failure" * tag 'for-5.16-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix missing blkdev_put() call in btrfs_scan_one_device() btrfs: fix warning when freeing leaf after subvolume creation failure btrfs: fix invalid delayed ref after subvolume creation failure btrfs: check WRITE_ERR when trying to read an extent buffer btrfs: fix missing last dir item offset update when logging directory btrfs: fix double free of anon_dev after failure to create subvolume btrfs: fix memory leak in __add_inode_ref()