summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_io.c
AgeCommit message (Collapse)Author
2022-06-26Merge tag 'for-5.19-rc3-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - zoned relocation fixes: - fix critical section end for extent writeback, this could lead to out of order write - prevent writing to previous data relocation block group if space gets low - reflink fixes: - fix race between reflinking and ordered extent completion - proper error handling when block reserve migration fails - add missing inode iversion/mtime/ctime updates on each iteration when replacing extents - fix deadlock when running fsync/fiemap/commit at the same time - fix false-positive KCSAN report regarding pid tracking for read locks and data race - minor documentation update and link to new site * tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Documentation: update btrfs list of features and link to readthedocs.io btrfs: fix deadlock with fsync+fiemap+transaction commit btrfs: don't set lock_owner when locking extent buffer for reading btrfs: zoned: fix critical section of relocation inode writeback btrfs: zoned: prevent allocation from previous data relocation BG btrfs: do not BUG_ON() on failure to migrate space when replacing extents btrfs: add missing inode updates on each iteration when replacing extents btrfs: fix race between reflinking and ordered extent completion
2022-06-21btrfs: zoned: fix critical section of relocation inode writebackNaohiro Aota
We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to write out to the relocation inode. That critical section must include all the IO submission for the inode. However, flush_write_bio() in extent_writepages() is out of the critical section, causing an IO submission outside of the lock. This leads to an out of the order IO submission and fail the relocation process. Fix it by extending the critical section. Fixes: 35156d852762 ("btrfs: zoned: only allow one process to add pages to a relocation inode") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-24Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull page cache updates from Matthew Wilcox: - Appoint myself page cache maintainer - Fix how scsicam uses the page cache - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS - Remove the AOP flags entirely - Remove pagecache_write_begin() and pagecache_write_end() - Documentation updates - Convert several address_space operations to use folios: - is_dirty_writeback - readpage becomes read_folio - releasepage becomes release_folio - freepage becomes free_folio - Change filler_t to require a struct file pointer be the first argument like ->read_folio * tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits) nilfs2: Fix some kernel-doc comments Appoint myself page cache maintainer fs: Remove aops->freepage secretmem: Convert to free_folio nfs: Convert to free_folio orangefs: Convert to free_folio fs: Add free_folio address space operation fs: Convert drop_buffers() to use a folio fs: Change try_to_free_buffers() to take a folio jbd2: Convert release_buffer_page() to use a folio jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio reiserfs: Convert release_buffer_page() to use a folio fs: Remove last vestiges of releasepage ubifs: Convert to release_folio reiserfs: Convert to release_folio orangefs: Convert to release_folio ocfs2: Convert to release_folio nilfs2: Remove comment about releasepage nfs: Convert to release_folio jfs: Convert to release_folio ...
2022-05-16btrfs: zoned: properly finish block group on metadata writeNaohiro Aota
Commit be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") introduced zone finishing code both for data and metadata end_io path. However, the metadata side is not working as it should. First, it compares logical address (eb->start + eb->len) with offset within a block group (cache->zone_capacity) in submit_eb_page(). That essentially disabled zone finishing on metadata end_io path. Furthermore, fixing the issue above revealed we cannot call btrfs_zone_finish_endio() in end_extent_buffer_writeback(). We cannot call btrfs_lookup_block_group() which require spin lock inside end_io context. Introduce btrfs_schedule_zone_finish_bg() to wait for the extent buffer writeback and do the zone finish IO in a workqueue. Also, drop EXTENT_BUFFER_ZONE_FINISH as it is no longer used. Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: rename bio_ctrl::bio_flags to compress_typeDavid Sterba
The bio_ctrl is the last use of bio_flags that has been converted to compress type everywhere else. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: rename bio_flags in parameters and switch typeDavid Sterba
Several functions take parameter bio_flags that was simplified to just compress type, unify it and change the type accordingly. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: rename io_failure_record::bio_flags to compress_typeDavid Sterba
The bio_flags is now used to store unchanged compress type, so unify that. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: open code extent_set_compress_type helpersDavid Sterba
The helpers extent_set_compress_type and extent_compress_type have become trivial after previous cleanups and can be removed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: simplify handling of bio_ctrl::bio_flagsDavid Sterba
The bio_flags are used only to encode the compression and there are no other EXTENT_BIO_* flags, so the compress type can be stored directly. The struct member name is left unchanged and will be cleaned in later patches. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove trivial helper update_nr_writtenDavid Sterba
The helper used to do more with the wbc state but now it's just one subtraction, no need to have a special helper. It became trivial in a91326679f2a ("Btrfs: make mapping->writeback_index point to the last written page"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: turn fs_info member buffer_radix into XArrayGabriel Niebler
… named 'extent_buffers'. Also adjust all usages of this object to use the XArray API, which greatly simplifies the code as it takes care of locking and is generally easier to use and understand, providing notionally simpler array semantics. Also perform some light refactoring. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Gabriel Niebler <gniebler@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: remove unused bio_flags argument to btrfs_submit_metadata_bioChristoph Hellwig
This argument is unused since commit 953651eb308f ("btrfs: factor out helper adding a page to bio") and commit 1b36294a6cd5 ("btrfs: call submit_bio_hook directly for metadata pages") reworked the way metadata bio submission is handled. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: move btrfs_readpage to extent_io.cChristoph Hellwig
Keep btrfs_readpage next to btrfs_do_readpage and the other address space operations. This allows to keep submit_one_bio and struct btrfs_bio_ctrl file local in extent_io.c. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: return correct error number for __extent_writepage_io()Qu Wenruo
[BUG] If we hit an error from submit_extent_page() inside __extent_writepage_io(), we could still return 0 to the caller, and even trigger the warning in btrfs_page_assert_not_dirty(). [CAUSE] In __extent_writepage_io(), if we hit an error from submit_extent_page(), we will just clean up the range and continue. This is completely fine for regular PAGE_SIZE == sectorsize, as we can only hit one sector in one page, thus after the error we're ensured to exit and @ret will be saved. But for subpage case, we may have other dirty subpage range in the page, and in the next loop, we may succeeded submitting the next range. In that case, @ret will be overwritten, and we return 0 to the caller, while we have hit some error. [FIX] Introduce @has_error and @saved_ret to record the first error we hit, so we will never forget what error we hit. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: fix the error handling for submit_extent_page() for btrfs_do_readpage()Qu Wenruo
[BUG] Test case generic/475 have a very high chance (almost 100%) to hit a fs hang, where a data page will never be unlocked and hang all later operations. [CAUSE] In btrfs_do_readpage(), if we hit an error from submit_extent_page() we will try to do the cleanup for our current io range, and exit. This works fine for PAGE_SIZE == sectorsize cases, but not for subpage. For subpage btrfs_do_readpage() will lock the full page first, which can contain several different sectors and extents: btrfs_do_readpage() |- begin_page_read() | |- btrfs_subpage_start_reader(); | Now the page will have PAGE_SIZE / sectorsize reader pending, | and the page is locked. | |- end_page_read() for different branches | This function will reduce subpage readers, and when readers | reach 0, it will unlock the page. But when submit_extent_page() failed, we only cleanup the current io range, while the remaining io range will never be cleaned up, and the page remains locked forever. [FIX] Update the error handling of submit_extent_page() to cleanup all the remaining subpage range before exiting the loop. Please note that, now submit_extent_page() can only fail due to sanity check in alloc_new_bio(). Thus regular IO errors are impossible to trigger the error path. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: avoid double clean up when submit_one_bio() failedQu Wenruo
[BUG] When running generic/475 with 64K page size and 4K sector size, it has a very high chance (almost 100%) to hang, with mostly data page locked but no one is going to unlock it. [CAUSE] With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on reads"), if we failed to lookup checksum due to metadata IO error, we will return error for btrfs_submit_data_bio(). This will cause the page to be unlocked twice in btrfs_do_readpage(): btrfs_do_readpage() |- submit_extent_page() | |- submit_one_bio() | |- btrfs_submit_data_bio() | |- if (ret) { | |- bio->bi_status = ret; | |- bio_endio(bio); } | In the endio function, we will call end_page_read() | and unlock_extent() to cleanup the subpage range. | |- if (ret) { |- unlock_extent(); end_page_read() } Here we unlock the extent and cleanup the subpage range again. For unlock_extent(), it's mostly double unlock safe. But for end_page_read(), it's not, especially for subpage case, as for subpage case we will call btrfs_subpage_end_reader() to reduce the reader number, and use that to number to determine if we need to unlock the full page. If double accounted, it can underflow the number and leave the page locked without anyone to unlock it. [FIX] The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on reads") itself is completely fine, it's our existing code not properly handling the error from bio submission hook properly. This patch will make submit_one_bio() to return void so that the callers will never be able to do cleanup when bio submission hook fails. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: simplify parameters of submit_read_repair() and renameQu Wenruo
Cleanup the function submit_read_repair() by: - Remove the fixed argument submit_bio_hook() The function is only called on buffered data read path, so the @submit_bio_hook argument is always btrfs_submit_data_bio(). Since it's fixed, then there is no need to pass that argument at all. - Rename the function to submit_data_read_repair() Just to be more explicit on all the 3 things, data, read and repair. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: pass a block_device to btrfs_bio_cloneChristoph Hellwig
Pass the block_device to bio_alloc_clone instead of setting it later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: use on-stack bio in repair_io_failureChristoph Hellwig
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio, so just use an on-stack bio. Also cleanup the error handling to use goto labels and not discard the actual return values. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: check-integrity: split submit_bio from btrfsic checkingChristoph Hellwig
Require a separate call to the integrity checking helpers from the actual bio submission. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: wait between incomplete batch memory allocationsSweet Tea Dorminy
When allocating memory in a loop, each iteration should call memalloc_retry_wait() in order to prevent starving memory-freeing processes (and to mark where allocation loops are). Other filesystems do that as well. The bulk page allocation is the only place in btrfs with an allocation retry loop, so add an appropriate call to it. Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: allocate page arrays using bulk page allocatorSweet Tea Dorminy
While calling alloc_page() in a loop is an effective way to populate an array of pages, the MM subsystem provides a method to allocate pages in bulk. alloc_pages_bulk_array() populates the NULL slots in a page array, trying to grab more than one page at a time. Unfortunately, it doesn't guarantee allocating all slots in the array, but it's easy to call it in a loop and return an error if no progress occurs. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: factor out allocating an array of pagesSweet Tea Dorminy
Several functions currently populate an array of page pointers one allocated page at a time. Factor out the common code so as to allow improvements to all of the sites at once. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routineQu Wenruo
The reason why we only support 64K page size for subpage is, for 64K page size we can ensure no matter what the nodesize is, we can fit it into one page. When other page size come, especially like 16K, the limitation is a bit limiting. To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the non-subpage routine. By this, we can allow 4K sectorsize on 16K page size. Although this introduces another smaller limitation, the metadata can not cross page boundary, which is already met by most recent mkfs. Another small improvement is, we can avoid the overhead for metadata if nodesize >= PAGE_SIZE. For 4K sector size and 64K page size/node size, or 4K sector size and 16K page size/node size, we don't need to allocate extra memory for the metadata pages. Please note that, this patch will not yet enable other page size support yet. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16btrfs: warn when extent buffer leak test failsQu Wenruo
Although we have btrfs_extent_buffer_leak_debug_check() (enabled by CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have some extent buffer leakage, it's just pr_err(), not noisy enough for fstests to cache. So here we trigger a WARN_ON() if the allocated_ebs list is not empty. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-09btrfs: Convert to release_folioMatthew Wilcox (Oracle)
I've only converted the outer layers of the btrfs release_folio paths to use folios; the use of folios should be pushed further down into btrfs from here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org>
2022-04-26Merge tag 'for-5.18-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - direct IO fixes: - restore passing file offset to correctly calculate checksums when repairing on read and bio split happens - use correct bio when sumitting IO on zoned filesystem - zoned mode fixes: - fix selection of device to correctly calculate device capabilities when allocating a new bio - use a dedicated lock for exclusion during relocation - fix leaked plug after failure syncing log - fix assertion during scrub and relocation * tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: use dedicated lock for data relocation btrfs: fix assertion failure during scrub due to block group reallocation btrfs: fix direct I/O writes for split bios on zoned devices btrfs: fix direct I/O read repair for split bios btrfs: fix and document the zoned device choice in alloc_new_bio btrfs: fix leaked plug after failure syncing log on zoned filesystems
2022-04-19btrfs: fix direct I/O read repair for split biosChristoph Hellwig
When a bio is split in btrfs_submit_direct, dip->file_offset contains the file offset for the first bio. But this means the start value used in btrfs_check_read_dio_bio is incorrect for subsequent bios. Add a file_offset field to struct btrfs_bio to pass along the correct offset. Given that check_data_csum only uses start of an error message this means problems with this miscalculation will only show up when I/O fails or checksums mismatch. The logic was removed in f4f39fc5dc30 ("btrfs: remove btrfs_bio::logical member") but we need it due to the bio splitting. CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19btrfs: fix and document the zoned device choice in alloc_new_bioChristoph Hellwig
Zone Append bios only need a valid block device in struct bio, but not the device in the btrfs_bio. Use the information from btrfs_zoned_get_device to set up bi_bdev and fix zoned writes on multi-device file system with non-homogeneous capabilities and remove the pointless btrfs_bio.device assignment. Add big fat comments explaining what is going on here. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-26Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22Merge tag 'for-5.18-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This contains feature updates, performance improvements, preparatory and core work and some related VFS updates: Features: - encoded read/write ioctls, allows user space to read or write raw data directly to extents (now compressed, encrypted in the future), will be used by send/receive v2 where it saves processing time - zoned mode now works with metadata DUP (the mkfs.btrfs default) - error message header updates: - print error state: transaction abort, other error, log tree errors - print transient filesystem state: remount, device replace, ignored checksum verifications - tree-checker: verify the transaction id of the to-be-written dirty extent buffer Performance improvements for fsync: - directory logging speedups (up to -90% run time) - avoid logging all directory changes during renames (up to -60% run time) - avoid inode logging during rename and link when possible (up to -60% run time) - prepare extents to be logged before locking a log tree path (throughput +7%) - stop copying old file extents when doing a full fsync() - improved logging of old extents after truncate Core, fixes: - improved stale device identification by dev_t and not just path (for devices that are behind other layers like device mapper) - continued extent tree v2 preparatory work - disable features that won't work yet - add wrappers and abstractions for new tree roots - improved error handling - add super block write annotations around background block group reclaim - fix device scanning messages potentially accessing stale pointer - cleanups and refactoring VFS: - allow reflinks/deduplication from two different mounts of the same filesystem - export and add helpers for read/write range verification, for the encoded ioctls" * tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (98 commits) btrfs: zoned: put block group after final usage btrfs: don't access possibly stale fs_info data in device_list_add btrfs: add lockdep_assert_held to need_preemptive_reclaim btrfs: verify the tranisd of the to-be-written dirty extent buffer btrfs: unify the error handling of btrfs_read_buffer() btrfs: unify the error handling pattern for read_tree_block() btrfs: factor out do_free_extent_accounting helper btrfs: remove last_ref from the extent freeing code btrfs: add a alloc_reserved_extent helper btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block btrfs: add and use helper for unlinking inode during log replay btrfs: extend locking to all space_info members accesses btrfs: zoned: mark relocation as writing fs: allow cross-vfsmount reflink/dedupe btrfs: remove the cross file system checks from remap btrfs: pass btrfs_fs_info to btrfs_recover_relocation btrfs: pass btrfs_fs_info for deleting snapshots and cleaner btrfs: add filesystems state details to error messages btrfs: deal with unexpected extent type during reflinking btrfs: fix unexpected error path when reflinking an inline extent ...
2022-03-21Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo) - blk-rq-qos completion fix (Tejun) - blk-cgroup merge fix (Tejun) - Add offline error return value to distinguish it from an IO error on the device (Song) - IO stats fixes (Zhang, Christoph) - blkcg refcount fixes (Ming, Yu) - Fix for indefinite dispatch loop softlockup (Shin'ichiro) - blk-mq hardware queue management improvements (Ming) - sbitmap dead code removal (Ming, John) - Plugging merge improvements (me) - Show blk-crypto capabilities in sysfs (Eric) - Multiple delayed queue run improvement (David) - Block throttling fixes (Ming) - Start deprecating auto module loading based on dev_t (Christoph) - bio allocation improvements (Christoph, Chaitanya) - Get rid of bio_devname (Christoph) - bio clone improvements (Christoph) - Block plugging improvements (Christoph) - Get rid of genhd.h header (Christoph) - Ensure drivers use appropriate flush helpers (Christoph) - Refcounting improvements (Christoph) - Queue initialization and teardown improvements (Ming, Christoph) - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng, Lukas, Nian, Yang, Eric, Chengming) * tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits) block: cancel all throttled bios in del_gendisk() block: let blkcg_gq grab request queue's refcnt block: avoid use-after-free on throttle data block: limit request dispatch loop duration block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative" sr: simplify the local variable initialization in sr_block_open() block: don't merge across cgroup boundaries if blkcg is enabled block: fix rq-qos breakage from skipping rq_qos_done_bio() block: flush plug based on hardware and software queue order block: ensure plug merging checks the correct queue at least once block: move rq_qos_exit() into disk_release() block: do more work in elevator_exit block: move blk_exit_queue into disk_release block: move q_usage_counter release into blk_queue_release block: don't remove hctx debugfs dir from blk_mq_exit_queue block: move blkcg initialization/destroy into disk allocation/release handler sr: implement ->free_disk to simplify refcounting sd: implement ->free_disk to simplify refcounting sd: delay calling free_opal_dev sd: call sd_zbc_release_disk before releasing the scsi_device reference ...
2022-03-15btrfs: Convert extent_range_redirty_for_io() to use foliosMatthew Wilcox (Oracle)
This removes a call to __set_page_dirty_nobuffers(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15btrfs: Convert from invalidatepage to invalidate_folioMatthew Wilcox (Oracle)
A lot of the underlying infrastructure in btrfs needs to be switched over to folios, but this at least documents that invalidatepage can't be passed a tail page. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15btrfs: Use folio_invalidate()Matthew Wilcox (Oracle)
Instead of calling ->invalidatepage directly, use folio_invalidate(). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-14btrfs: zoned: put block group after final usageNikolay Borisov
It's counter-intuitive (and wrong) to put the block group _before_ the final usage in submit_eb_page. Fix it by re-ordering the call to btrfs_put_block_group after its final reference. Also fix a minor typo in 'implies' Fixes: be1a1d7a5d24 ("btrfs: zoned: finish fully written block group") CC: stable@vger.kernel.org # 5.16+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: do not clean up repair bio if submit failsJosef Bacik
The submit helper will always run bio_endio() on the bio if it fails to submit, so cleaning up the bio just leads to a variety of use-after-free and NULL pointer dereference bugs because we race with the endio function that is cleaning up the bio. Instead just return BLK_STS_OK as the repair function has to continue to process the rest of the pages, and the endio for the repair bio will do the appropriate cleanup for the page that it was given. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: do not try to repair bio that has no mirror setJosef Bacik
If we fail to submit a bio for whatever reason, we may not have setup a mirror_num for that bio. This means we shouldn't try to do the repair workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in clean_io_failure. Instead simply skip the repair workflow if we have no mirror set, and add an assert to btrfs_check_repairable() to make it easier to catch what is happening in the future. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: remove no longer used counter when reading data pageFilipe Manana
After commit 92082d40976ed0 ("btrfs: integrate page status update for data read path into begin/end_page_read"), the 'nr' counter at btrfs_do_readpage() is no longer used, we increment it but we never read from it. So just remove it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: fix lost error return value when reading a data pageFilipe Manana
At btrfs_do_readpage(), if we get an error when trying to lookup for an extent map, we end up marking the page with the error bit, clearing the uptodate bit on it, and doing everything else that should be done. However we return success (0) to the caller, when we should return the error encoded in the extent map pointer. So fix that by returning the error encoded in the pointer. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: stop checking for NULL return from btrfs_get_extent()Filipe Manana
At extent_io.c, in the read page and write page code paths, we are testing if the return value from btrfs_get_extent() can be NULL. However that is not possible, as btrfs_get_extent() always returns either an error pointer or a (non-NULL) pointer to an extent map structure. Everywhere else outside extent_io.c we never check for NULL, we always treat any returned value as a non-NULL pointer if it does not encode an error. So check only for the IS_ERR() case at extent_io.c. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: stop checking for NULL return from btrfs_get_extent_fiemap()Johannes Thumshirn
In get_extent_skip_holes() we're checking the return of btrfs_get_extent_fiemap() for an error pointer or NULL, but btrfs_get_extent_fiemap() will never return NULL, only error pointers or a valid extent_map. The other caller of btrfs_get_extent_fiemap(), find_desired_extent(), correctly only checks for error-pointers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-07block: remove the per-bio/request write hintChristoph Hellwig
With the NVMe support for this gone, there are no consumers of these hints left, so remove them. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220304175556.407719-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-07Merge branch 'for-5.18/block' into for-5.18/write-streamsJens Axboe
* for-5.18/block: (96 commits) block: remove bio_devname ext4: stop using bio_devname raid5-ppl: stop using bio_devname raid1: stop using bio_devname md-multipath: stop using bio_devname dm-integrity: stop using bio_devname dm-crypt: stop using bio_devname pktcdvd: remove a pointless debug check in pkt_submit_bio block: remove handle_bad_sector block: fix and cleanup bio_check_ro bfq: fix use-after-free in bfq_dispatch_request blk-crypto: show crypto capabilities in sysfs block: don't delete queue kobject before its children block: simplify calling convention of elv_unregister_queue() block: remove redundant semicolon block: default BLOCK_LEGACY_AUTOLOAD to y block: update io_ticks when io hang block, bfq: don't move oom_bfqq block, bfq: avoid moving bfqq to it's parent bfqg block, bfq: cleanup bfq_bfqq_to_bfqg() ...
2022-03-06Merge tag 'for-5.17-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes for various problems that have user visible effects or seem to be urgent: - fix corruption when combining DIO and non-blocking io_uring over multiple extents (seen on MariaDB) - fix relocation crash due to premature return from commit - fix quota deadlock between rescan and qgroup removal - fix item data bounds checks in tree-checker (found on a fuzzed image) - fix fsync of prealloc extents after EOF - add missing run of delayed items after unlink during log replay - don't start relocation until snapshot drop is finished - fix reversed condition for subpage writers locking - fix warning on page error" * tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fallback to blocking mode when doing async dio over multiple extents btrfs: add missing run of delayed items after unlink during log replay btrfs: qgroup: fix deadlock between rescan worker and remove qgroup btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() btrfs: do not start relocation until in progress drops are done btrfs: tree-checker: use u64 for item data end to avoid overflow btrfs: do not WARN_ON() if we have PageError set btrfs: fix lost prealloc extents beyond eof after full fsync btrfs: subpage: fix a wrong check on subpage->writers
2022-03-02btrfs: do not WARN_ON() if we have PageError setJosef Bacik
Whenever we do any extent buffer operations we call assert_eb_page_uptodate() to complain loudly if we're operating on an non-uptodate page. Our overnight tests caught this warning earlier this week WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50 CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000 RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0 RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000 R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1 R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000 FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1f6/0x470 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x26d/0x580 ? process_one_work+0x580/0x580 worker_thread+0x55/0x3b0 ? process_one_work+0x580/0x580 kthread+0xf0/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it"), however all that fix did was keep us from finding extent buffers after a failed writeout. It didn't keep us from continuing to use a buffer that we already had found. In this case we're searching the commit root to cache the block group, so we can start committing the transaction and switch the commit root and then start writing. After the switch we can look up an extent buffer that hasn't been written yet and start processing that block group. Then we fail to write that block out and clear Uptodate on the page, and then we start spewing these errors. Normally we're protected by the tree lock to a certain degree here. If we read a block we have that block read locked, and we block the writer from locking the block before we submit it for the write. However this isn't necessarily fool proof because the read could happen before we do the submit_bio and after we locked and unlocked the extent buffer. Also in this particular case we have path->skip_locking set, so that won't save us here. We'll simply get a block that was valid when we read it, but became invalid while we were using it. What we really want is to catch the case where we've "read" a block but it's not marked Uptodate. On read we ClearPageError(), so if we're !Uptodate and !Error we know we didn't do the right thing for reading the page. Fix this by checking !Uptodate && !Error, this way we will not complain if our buffer gets invalidated while we're using it, and we'll maintain the spirit of the check which is to make sure we have a fully in-cache block while we're messing with it. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-04block: pass a block_device to bio_clone_fastChristoph Hellwig
Pass a block_device to bio_clone_fast and __bio_clone_fast and give the functions more suitable names. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02block: pass a block_device and opf to bio_alloc_biosetChristoph Hellwig
Pass the block_device and operation that we plan to use this bio for to bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both for the passthrough case on a raw request_queue and to temporarily avoid refactoring some nasty code. Also move the gfp_mask argument after the nr_vecs argument for a much more logical calling convention matching what most of the kernel does. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-22mm: remove cleancacheChristoph Hellwig
Patch series "remove Xen tmem leftovers". Since the removal of the Xen tmem driver in 2019, the cleancache hooks are entirely unused, as are large parts of frontswap. This series against linux-next (with the folio changes included) removes cleancaches, and cuts down frontswap to the bits actually used by zswap. This patch (of 13): The cleancache subsystem is unused since the removal of Xen tmem driver in commit 814bbf49dcd0 ("xen: remove tmem driver"). [akpm@linux-foundation.org: remove now-unreachable code] Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>