summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_io.c
AgeCommit message (Collapse)Author
2023-12-15btrfs: rename EXTENT_BUFFER_NO_CHECK to EXTENT_BUFFER_ZONED_ZEROOUTJohannes Thumshirn
EXTENT_BUFFER_ZONED_ZEROOUT better describes the state of the extent buffer, namely it is written as all zeros. This is needed in zoned mode, to preserve I/O ordering. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-15btrfs: migrate to use folio private instead of page privateQu Wenruo
As a cleanup and preparation for future folio migration, this patch would replace all page->private to folio version. This includes: - PagePrivate() -> folio_test_private() - page->private -> folio_get_private() - attach_page_private() -> folio_attach_private() - detach_page_private() -> folio_detach_private() Since we're here, also remove the forced cast on page->private, since it's (void *) already, we don't really need to do the cast. For now even if we missed some call sites, it won't cause any problem yet, as we're only using order 0 folio (single page), thus all those folio/page flags should be synced. But for the future conversion to utilize higher order folio, the page <-> folio flag sync is no longer guaranteed, thus we have to migrate to utilize folio flags. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-06btrfs: don't clear qgroup reserved bit in release_folioBoris Burkov
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for duplicate reservations. That is two writes to that range in one transaction shouldn't create two reservations, as the reservation will only be freed once when the write finally goes down. Therefore, it is never OK to clear that bit without freeing the associated qgroup reserve. At this point, we don't want to be freeing the reserve, so mask off the bit. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24btrfs: free the allocated memory if btrfs_alloc_page_array() failsQu Wenruo
[BUG] If btrfs_alloc_page_array() fail to allocate all pages but part of the slots, then the partially allocated pages would be leaked in function btrfs_submit_compressed_read(). [CAUSE] As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM, caller is responsible to free the partially allocated pages. For the existing call sites, most of them are fine: - btrfs_raid_bio::stripe_pages Handled by free_raid_bio(). - extent_buffer::pages[] Handled btrfs_release_extent_buffer_pages(). - scrub_stripe::pages[] Handled by release_scrub_stripe(). But there is one exception in btrfs_submit_compressed_read(), if btrfs_alloc_page_array() failed, we didn't cleanup the array and freed the array pointer directly. Initially there is still the error handling in commit dd137dd1f2d7 ("btrfs: factor out allocating an array of pages"), but later in commit 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"), the error handling is removed, leading to the possible memory leak. [FIX] This patch would add back the error handling first, then to prevent such situation from happening again, also Make btrfs_alloc_page_array() to free the allocated pages as a extra safety net, then we don't need to add the error handling to btrfs_submit_compressed_read(). Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: change test_range_bit to scan the whole rangeDavid Sterba
The semantics of test_range_bit() with filled == 0 is now in it's own helper so test_range_bit will check the whole range unconditionally. The detection logic is flipped and assumes success by default and catches exceptions. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: add specific helper for range bit test existsDavid Sterba
The existing helper test_range_bit works in two ways, checks if the whole range contains all the bits, or stop on the first occurrence. By adding a specific helper for the latter case, the inner loop can be simplified and contains fewer conditionals, making it a bit faster. There's no caller that uses the cached state pointer so this reduces the argument count further. Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: warn on tree blocks which are not nodesize alignedQu Wenruo
A long time ago, we had some metadata chunks which started at sector boundary but not aligned to nodesize boundary. This led to some older filesystems which can have tree blocks only aligned to sectorsize, but not nodesize. Later 'btrfs check' gained the ability to detect and warn about such tree blocks, and kernel fixed the chunk allocation behavior, nowadays those tree blocks should be pretty rare. But in the future, if we want to migrate metadata to folio, we cannot have such tree blocks, as filemap_add_folio() requires the page index to be aligned with the folio number of pages. Such unaligned tree blocks can lead to VM_BUG_ON(). So this patch adds extra warning for those unaligned tree blocks, as a preparation for the future folio migration. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: check-integrity: remove btrfsic_unmount() functionQu Wenruo
The function btrfsic_mount() is part of the deprecated check-integrity functionality. Now let's remove the main entry point of check-integrity, and thankfully most of the check-integrity code is self-contained inside check-integrity.c, we can safely remove the function without huge changes to btrfs code base. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12btrfs: reformat remaining kdoc style commentsDavid Sterba
Function name in the comment does not bring much value to code not exposed as API and we don't stick to the kdoc format anymore. Update formatting of parameter descriptions. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-20btrfs: reset destination buffer when read_extent_buffer() gets invalid rangeQu Wenruo
Commit f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") changed how we handle invalid extent buffer range for read_extent_buffer(). Previously if the range is invalid we just set the destination to zero, but after the patch we do nothing and error out. This can lead to smatch static checker errors like: fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'. fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'. fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'. fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'. fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'. fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'. Fix those warnings by reverting back to the old memset() behavior. By this we keep the static checker happy and would still make a lot of noise when such invalid ranges are passed in. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-13btrfs: don't clear uptodate on write errorsJosef Bacik
We have been consistently seeing hangs with generic/648 in our subpage GitHub CI setup. This is a classic deadlock, we are calling btrfs_read_folio() on a folio, which requires holding the folio lock on the folio, and then finding a ordered extent that overlaps that range and calling btrfs_start_ordered_extent(), which then tries to write out the dirty page, which requires taking the folio lock and then we deadlock. The hang happens because we're writing to range [1271750656, 1271767040), page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty, so we call btrfs_read_folio() for 77621 and which does btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered extent which is [1271644160, 1271746560), page index [77615, 77621]. The page indexes overlap, but the actual bytes don't overlap. We're holding the page lock for 77621, then call btrfs_lock_and_flush_ordered_range() which tries to flush the dirty page, and tries to lock 77621 again and then we deadlock. The byte ranges do not overlap, but with subpage support if we clear uptodate on any portion of the page we mark the entire thing as not uptodate. We have been clearing page uptodate on write errors, but no other file system does this, and is in fact incorrect. This doesn't hurt us in the !subpage case because we can't end up with overlapped ranges that don't also overlap on the page. Fix this by not clearing uptodate when we have a write error. The only thing we should be doing in this case is setting the mapping error and carrying on. This makes it so we would no longer call btrfs_read_folio() on the page as it's uptodate and eliminates the deadlock. With this patch we're now able to make it through a full fstests run on our subpage blocksize VMs. Note for stable backports: this probably goes beyond 6.1 but the code has been cleaned up and clearing the uptodate bit must be verified on each version independently. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: zoned: defer advancing meta write pointerNaohiro Aota
We currently advance the meta_write_pointer in btrfs_check_meta_write_pointer(). That makes it necessary to revert it when locking the buffer failed. Instead, we can advance it just before sending the buffer. Also, this is necessary for the following commit. In the commit, it needs to release the zoned_meta_io_lock to allow IOs to come in and wait for them to fill the currently active block group. If we advance the meta_write_pointer before locking the extent buffer, the following extent buffer can pass the meta_write_pointer check, resulting in an unaligned write failure. Advancing the pointer is still thread-safe as the extent buffer is locked. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: zoned: return int from btrfs_check_meta_write_pointerNaohiro Aota
Now that we have writeback_control passed to btrfs_check_meta_write_pointer(), we can move the wbc condition in submit_eb_page() to btrfs_check_meta_write_pointer() and return int. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: zoned: introduce block group context to btrfs_eb_write_contextNaohiro Aota
For metadata write out on the zoned mode, we call btrfs_check_meta_write_pointer() to check if an extent buffer to be written is aligned to the write pointer. We look up a block group containing the extent buffer for every extent buffer, which takes unnecessary effort as the writing extent buffers are mostly contiguous. Introduce "zoned_bg" to cache the block group working on. Also, while at it, rename "cache" to "block_group". Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: introduce struct to consolidate extent buffer write contextNaohiro Aota
Introduce btrfs_eb_write_context to consolidate writeback_control and the exntent buffer context. This will help adding a block group context as well. While at it, move the eb context setting before btrfs_check_meta_write_pointer(). We can set it here because we anyway need to skip pages in the same eb if that eb is rejected by btrfs_check_meta_write_pointer(). Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: refactor main loop in memmove_extent_buffer()Qu Wenruo
[BACKGROUND] Currently memove_extent_buffer() does a loop where it strop at any page boundary inside [dst_offset, dst_offset + len) or [src_offset, src_offset + len). This is mostly allowing us to do copy_pages(), but if we're going to use folios we will need to handle multi-page (the old behavior) or single folio (the new optimization). The current code would be a burden for future changes. [ENHANCEMENT] Instead of sticking with copy_pages(), here we utilize the new __write_extent_buffer() helper to handle the writes. Unlike the refactoring in memcpy_extent_buffer(), we can not just rely on the write_extent_buffer() and only handle page boundaries inside src range. The function write_extent_buffer() itself is still doing forward writing, thus it cannot handle the following case: (already in the extent buffer memory operation tests, cross page overlapping run 2) Src Page boundary |///////| |///|////| Dst In the above case, if we just follow page boundary in the src range, we have no need to do any split, just one __write_extent_buffer() with use_memmove = true. But __write_extent_buffer() would split the dst range into two, so it first copies the beginning part of the src range into the first half of the dst range. After this operation, the beginning of the dst range is already updated, causing corruption. So we have to follow the old behavior of handling both page boundaries. And since we're the last caller of copy_pages(), we can remove it completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: refactor main loop in memcpy_extent_buffer()Qu Wenruo
[BACKGROUND] Currently memcpy_extent_buffer() does a loop where it would stop at any page boundary inside [dst_offset, dst_offset + len) or [src_offset, src_offset + len). This is mostly allowing us to do copy_pages(), but if we're going to use folios we will need to handle multi-page (the old behavior) or single folio (the new optimization). The current code would be a burden for future changes. [ENHANCEMENT] There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike regular memcpy(), this function can handle overlapping ranges. So here we extract write_extent_buffer() into a new internal helper, __write_extent_buffer(), and add a new parameter @use_memmove, to indicate whether we should use memmove() or regular memcpy(). Now we can go __write_extent_buffer() to handle writing into the dst range, with proper overlapping detection. This has a tiny change to the chance of calling memmove(). As the split only happens at the source range page boundaries, the memcpy/memmove() range would be slightly larger than the old code, thus slightly increase the chance we call memmove() other than memcopy(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: copy all pages at once at the end of btrfs_clone_extent_buffer()Qu Wenruo
btrfs_clone_extent_buffer() calls copy_page() at each iteration but we can copy all pages at the end in one go if there were no errors. This would make later conversion to folios easier. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: refactor main loop in copy_extent_buffer_full()Qu Wenruo
[BACKGROUND] copy_extent_buffer_full() currently does different handling for regular and subpage cases, for regular cases it does a page by page copying. For subpage cases, it just copies the content. This is fine for the page based extent buffer code, but for the incoming folio conversion, it can be a burden to add a new branch just to handle all the different combinations (subpage vs regular, one single folio vs multi pages). [ENHANCE] Instead of handling the different combinations, just go one single handling for all cases, utilizing write_extent_buffer() to do the copying. Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use write_extent_buffer() to implement write_extent_buffer_*id()Qu Wenruo
Helpers write_extent_buffer_chunk_tree_uuid() and write_extent_buffer_fsid(), they can be implemented by write_extent_buffer(). These two helpers are not that frequently used, they only get called during initialization of a new tree block. There is not much need for those slightly optimized versions. And since they can be easily converted to one write_extent_buffer() call, define them as inline helpers. This would make later page/folio switch much easier, as all change only need to happen in write_extent_buffer(). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: refactor extent buffer bitmaps operationsQu Wenruo
[BACKGROUND] Currently we handle extent bitmaps manually in extent_buffer_bitmap_set() and extent_buffer_bitmap_clear(). Although with various helpers like eb_bitmap_offset() it's still a little messy to read. The code seems to be a copy of bitmap_set(), but with all the cross-page handling embedded into the code. [ENHANCEMENT] This patch would enhance the readability by introducing two helpers: - memset_extent_buffer() To handle the byte aligned range, thus all the cross-page handling is done there. - extent_buffer_get_byte() This for the first and the last byte operations, which only need to grab one byte, thus no need for any cross-page handling. So we can split both extent_buffer_bitmap_set() and extent_buffer_bitmap_clear() into 3 parts: - Handle the first byte If the range fits inside the first byte, we can exit early. - Handle the byte aligned part This is the part which can have cross-page operations, and it would be handled by memset_extent_buffer(). - Handle the last byte This refactoring does not only make the code a little easier to read, but also makes later folio/page switch much easier, as the switch only needs to be done inside memset_extent_buffer() and extent_buffer_get_byte(). Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't redirty locked_page in run_delalloc_zonedChristoph Hellwig
extent_write_locked_range currently expects that either all or no pages are dirty when it is called. Bur run_delalloc_zoned is called directly in the writepages path, and has the dirty bit cleared only for locked_page and which the extent_write_cache_pages currently operates. It currently works around this by redirtying locked_page, but that is a bit inefficient and cumbersome. Pass a locked_page argument to run_delalloc_zoned so that clearing the dirty bit can be skipped on just that page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: don't redirty pages in compress_file_rangeChristoph Hellwig
compress_file_range needs to clear the dirty bit before handing off work to the compression worker threads to prevent processes coming in through mmap and changing the file contents while the compression is accessing the data (See commit 4adaa611020f ("Btrfs: fix race between mmap writes and compression"). But when compress_file_range decides to not compress the data, it falls back to submit_uncompressed_range which uses extent_write_locked_range to write the uncompressed data. extent_write_locked_range currently expects all pages to be marked dirty so that it can clear the dirty bit itself, and thus compress_file_range has to redirty the page range. Redirtying the page range is rather inefficient and also pointless, so instead pass a pages_dirty parameter to extent_write_locked_range and skip the redirty game entirely. Note that compress_file_range was even redirtying the locked_page twice given that extent_range_clear_dirty_for_io already redirties all pages in the range, which must include locked_page if there is one. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: reduce the number of arguments to btrfs_run_delalloc_rangeChristoph Hellwig
Instead of a separate page_started argument that tells the callers that btrfs_run_delalloc_range already started writeback by itself, overload the return value with a positive 1 in additio to 0 and a negative error code to indicate that is has already started writeback, and remove the nr_written argument as that caller can calculate it directly based on the range, and in fact already does so for the case where writeback wasn't started yet. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: improve the delalloc_to_write calculation in writepage_delallocChristoph Hellwig
Currently writepage_delalloc adds to delalloc_to_write in every loop operation. That is not only more work than doing it once after the loop, but can also over-increment the counter due to rounding errors when a new loop iteration starts with an offset into a page. Add a new page_start variable instead of recaculation that value over and over, move the delalloc_to_write calculation out of the loop, use the DIV_ROUND_UP helper instead of open coding it and remove the pointless found local variable. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove the return value from extent_write_locked_rangeChristoph Hellwig
The return value from extent_write_locked_range is ignored, and that's fine because the error reporting happens through the mapping and ordered_extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove end_extent_writepageChristoph Hellwig
end_extent_writepage is a small helper that combines a call to btrfs_mark_ordered_io_finished with conditional error-only calls to btrfs_page_clear_uptodate and mapping_set_error with a somewhat unfortunate calling convention that passes and inclusive end instead of the len expected by the underlying functions. Remove end_extent_writepage and open code it in the 4 callers. Out of those two already are error-only and thus don't need the extra conditional, and one already has the mapping_set_error, so a duplicate call can be avoided. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: remove btrfs_writepage_endio_finish_orderedChristoph Hellwig
btrfs_writepage_endio_finish_ordered is a small wrapper around btrfs_mark_ordered_io_finished that just changs the argument passing slightly, and adds a tracepoint. Move the tracpoint to btrfs_mark_ordered_io_finished, which means it now also covers the error handling in btrfs_cleanup_ordered_extent and switch all callers to just call btrfs_mark_ordered_io_finished directly. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: split page locking out of __process_pages_contigChristoph Hellwig
There is a lot of complexity in __process_pages_contig to deal with the PAGE_LOCK case that can return an error unlike all the other actions. Open code the page iteration for page locking in lock_delalloc_pages and remove all the now unused code from __process_pages_contig. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: move eb subpage preallocation out of the loopQu Wenruo
Initially we preallocate btrfs_subpage structure in the main loop of alloc_extent_buffer(). But later commit fbca46eb46ec ("btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine") has made sure we only go subpage routine if our nodesize is smaller than PAGE_SIZE. This means for that case, we only need to allocate the subpage structure once anyway. So this patch would make the preallocation out of the main loop. This would slightly reduce the workload when we hold the page lock, and make code a little easier to read. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-21btrfs: use folio_next_index() helper in extent_write_cache_pagesMinjie Du
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Minjie Du <duminjie@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17btrfs: only subtract from len_to_oe_boundary when it is tracking an extentChris Mason
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone as we submit bios for writes. Every time we add a page to the bio, we decrement those bytes from len_to_oe_boundary, and then we submit the bio if we happen to hit zero. Most of the time, len_to_oe_boundary gets set to U32_MAX. submit_extent_page() adds pages into our bio, and the size of the bio ends up limited by: - Are we contiguous on disk? - Does bio_add_page() allow us to stuff more in? - is len_to_oe_boundary > 0? The len_to_oe_boundary math starts with U32_MAX, which isn't page or sector aligned, and subtracts from it until it hits zero. In the non-zoned case, the last IO we submit before we hit zero is going to be unaligned, triggering BUGs. This is hard to trigger because bio_add_page() isn't going to make a bio of U32_MAX size unless you give it a perfect set of pages and fully contiguous extents on disk. We can hit it pretty reliably while making large swapfiles during provisioning because the machine is freshly booted, mostly idle, and the disk is freshly formatted. It's also possible to trigger with reads when read_ahead_kb is set to 4GB. The code has been clean up and shifted around a few times, but this flaw has been lurking since the counter was added. I think the commit 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended up exposing the bug. The fix used here is to skip doing math on len_to_oe_boundary unless we've changed it from the default U32_MAX value. bio_add_page() is the real limit we want, and there's no reason to do extra math when block layer is doing it for us. Sample reproducer, note you'll need to change the path to the bdi and device: SUBVOL=/btrfs/swapvol SWAPFILE=$SUBVOL/swapfile SZMB=8192 mkfs.btrfs -f /dev/vdb mount /dev/vdb /btrfs btrfs subvol create $SUBVOL chattr +C $SUBVOL dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB sync echo 4 > /proc/sys/vm/drop_caches echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb while true; do echo 1 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/drop_caches dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock done Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: don't wait for writeback on clean pages in extent_write_cache_pagesChristoph Hellwig
__extent_writepage could have started on more pages than the one it was called for. This happens regularly for zoned file systems, and in theory could happen for compressed I/O if the worker thread was executed very quickly. For such pages extent_write_cache_pages waits for writeback to complete before moving on to the next page, which is highly inefficient as it blocks the flusher thread. Port over the PageDirty check that was added to write_cache_pages in commit 515f4a037fb ("mm: write_cache_pages optimise page cleaning") to fix this. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-10btrfs: don't stop integrity writeback too earlyChristoph Hellwig
extent_write_cache_pages stops writing pages as soon as nr_to_write hits zero. That is the right thing for opportunistic writeback, but incorrect for data integrity writeback, which needs to ensure that no dirty pages are left in the range. Thus only stop the writeback for WB_SYNC_NONE if nr_to_write hits 0. This is a port of write_cache_pages changes in commit 05fe478dd04e ("mm: write_cache_pages integrity fix"). Note that I've only trigger the problem with other changes to the btrfs writeback code, but this condition seems worthwhile fixing anyway. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ updated comment ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use btrfs_finish_ordered_extent to complete buffered writesChristoph Hellwig
Use the btrfs_finish_ordered_extent helper to complete compressed writes using the bbio->ordered pointer instead of requiring an rbtree lookup in the otherwise equivalent btrfs_mark_ordered_io_finished called from btrfs_writepage_endio_finish_ordered. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code end_extent_writepage in end_bio_extent_writepageChristoph Hellwig
This prepares for switching to more efficient ordered_extent processing and already removes the forth and back conversion from len to end back to len. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: add an ordered_extent pointer to struct btrfs_bioChristoph Hellwig
Add a pointer to the ordered_extent to the existing union in struct btrfs_bio, so all code dealing with data write bios can just use a pointer dereference to retrieve the ordered_extent instead of doing multiple rbtree lookups per I/O. The reference to this ordered_extent is dropped at end I/O time, which implies that an extra one must be acquired when the bio is split. This also requires moving the btrfs_extract_ordered_extent call into btrfs_split_bio so that the invariant of always having a valid ordered_extent reference for the btrfs_bio is kept. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: limit write bios to a single ordered extentChristoph Hellwig
Currently buffered writeback bios are allowed to span multiple ordered_extents, although that basically never actually happens since commit 4a445b7b6178 ("btrfs: don't merge pages into bio if their page offset is not contiguous"). Supporting bios than span ordered_extents complicates the file checksumming code, and prevents us from adding an ordered_extent pointer to the btrfs_bio structure. Use the existing code to limit a bio to single ordered_extent for zoned device writes for all writes. This allows to remove the REQ_BTRFS_ONE_ORDERED flags, and the handling of multiple ordered_extents in btrfs_csum_one_bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: don't treat zoned writeback as being from an async helper threadChristoph Hellwig
When extent_write_locked_range was originally added, it was only used writing back compressed pages from an async helper thread. But it is now also used for writing back pages on zoned devices, where it is called directly from the ->writepage context. In this case we want to be able to pass on the writeback_control instead of creating a new one, and more importantly want to use all the normal cgroup interaction instead of potentially deferring writeback to another helper. Fixes: 898793d992c2 ("btrfs: zoned: write out partially allocated region") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: only call __extent_writepage_io from extent_write_locked_rangeChristoph Hellwig
__extent_writepage does a lot of things that make no sense for extent_write_locked_range, given that extent_write_locked_range itself is called from __extent_writepage either directly or through a workqueue, and all this work has already been done in the first invocation and the pages haven't been unlocked since. Call __extent_writepage_io directly instead and open code the logic tracked in btrfs_bio_ctrl::extent_locked. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: move writeback_control::nr_to_write update to __extent_writepageChristoph Hellwig
Move the nr_to_write accounting from __extent_writepage_io to __extent_writepage_io as we'll grow another __extent_writepage_io that doesn't want this accounting soon. Also drop the obsolete comment - decrementing a counter in the on-stack writeback_control data structure doesn't need the page lock. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove non-standard extent handling in __extent_writepage_ioChristoph Hellwig
__extent_writepage_io is never called for compressed or inline extents, or holes. Remove the not quite working code for them and replace it with asserts that these cases don't happen. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: remove PAGE_SET_ERRORChristoph Hellwig
Now that the btrfs writeback code has stopped using PageError, using PAGE_SET_ERROR to just set the per-address_space error flag is confusing. Open code the mapping_set_error calls in the callers and remove the PAGE_SET_ERROR flag. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: stop setting PageError in the data I/O pathChristoph Hellwig
PageError is not used by the VFS/MM and deprecated because it uses up a page bit and has no coherent rules. Instead read errors are usually propagated by not setting or clearing the uptodate bit, and write errors are propagated through the address_space. Btrfs now only sets the flag and never clears it for data pages, so just remove all places setting it, and the subpage error bit. Note that the error propagation for superblock writes that work on the block device mapping still uses PageError for now, but that will be addressed in a separate series. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: don't check PageError in __extent_writepageChristoph Hellwig
__extent_writepage currenly sets PageError whenever any error happens, and the also checks for PageError to decide if to call error handling. This leads to very unclear responsibility for cleaning up on errors. In the VM and generic writeback helpers the basic idea is that once I/O is fired off all error handling responsibility is delegated to the end I/O handler. But if that end I/O handler sets the PageError bit, and the submitter checks it, the bit could in some cases leak into the submission context for fast enough I/O. Fix this by simply not checking PageError and just using the local ret variable to check for submission errors. This also fundamentally solves the long problem documented in a comment in __extent_writepage by never leaking the error bit into the submission context. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: don't check PageError in btrfs_verify_pageChristoph Hellwig
btrfs_verify_page is called from the readpage completion handler, which is only used to read pages, or parts of pages that aren't uptodate yet. The only case where PageError could be set on a page in btrfs is if we had a previous writeback error, but in that case we won't called readpage on it, as it has previously been marked uptodate. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: fix fsverify read error handling in end_page_readChristoph Hellwig
Also clear the uptodate bit to make sure the page isn't seen as uptodate in the page cache if fsverity verification fails. Fixes: 146054090b08 ("btrfs: initial fsverity support") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: factor out a btrfs_verify_page helperChristoph Hellwig
Split all the conditionals for the fsverity calls in end_page_read into a btrfs_verify_page helper to keep the code readable and make additional refactoring easier. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: fix range_end calculation in extent_write_locked_rangeChristoph Hellwig
The range_end field in struct writeback_control is inclusive, just like the end parameter passed to extent_write_locked_range. Not doing this could cause extra writeout, which is harmless but suboptimal. Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads") CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: use the same uptodate variable for end_bio_extent_readpage()Qu Wenruo
In function end_bio_extent_readpage() we call endio_readpage_release_extent() to unlock the extent io tree. However we pass PageUptodate(page) as @uptodate parameter for it, while for previous end_page_read() call, we use a dedicated @uptodate local variable. This is not a big deal, as even for subpage cases, either the bio only covers part of the page, then the @uptodate is always false, and the subpage ranges can still be merged. But for the sake of consistency, always use @uptodate variable when possible. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>