path: root/fs/btrfs/extent_io.c
2025-07-22  btrfs: use clear_and_wake_up_bit() where open coded  (David Sterba)

There are two cases open coding the clear and wake up pattern; we can use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
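For illustration, a minimal sketch of the pattern and its replacement (EXTENT_BUFFER_READING is used here only as an example bit, not necessarily one touched by the patch):

    /* Open-coded clear and wake up: */
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
    smp_mb__after_atomic();
    wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);

    /* Same thing with the helper from <linux/wait_bit.h>: */
    clear_and_wake_up_bit(EXTENT_BUFFER_READING, &eb->bflags);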
2025-07-22  btrfs: index buffer_tree using node size  (Daniel Vacek)

So far we've been deriving the buffer tree index using the sector size. But each extent buffer covers multiple sectors. This makes the buffer tree rather sparse.

For example the typical and quite common configuration uses sector size of 4KiB and node size of 16KiB. In this case it means the buffer tree is using up to the maximum of 25% of its slots. Or in other words at least 75% of the tree slots are wasted as never used.

We can score significant memory savings on the required tree nodes by indexing the tree using the node size instead. As a result far less slots are wasted and the tree can now use up to all 100% of its slots this way.

Note: This works even with unaligned tree blocks as we can still get a unique index by doing eb->start >> nodesize_shift.

Getting some stats from running fio write test, there is a bit of variance. The values presented in the table below are medians from 5 test runs. The numbers are:

- # of allocated ebs in the tree
- # of leaf tree nodes
- highest index in the tree (radix tree width)

    ebs / leaves / Index | bare for-next      | with fix
    ---------------------+--------------------+-------------------
    post mount           | 16 / 11 / 10e5c    | 16 / 10 / 4240
    post test            | 5810 / 891 / 11cfc | 4420 / 252 / 473a
    post rm              | 574 / 300 / 10ef0  | 540 / 163 / 46e9

In this case (10GiB filesystem) the height of the tree is still 3 levels but the 4x width reduction is clearly visible as expected. But since the tree is more dense we can see the 54-72% reduction of leaf nodes. That's very close to ideal with this test. It means the tree is getting really dense with this kind of workload.

Also, the fio results show no performance change.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
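A minimal sketch of the index derivation before and after (the nodesize_shift field name is taken from the note above; treat the helper as illustrative):

    /* Hypothetical helper computing the buffer_tree slot for an eb. */
    static unsigned long eb_index(const struct extent_buffer *eb)
    {
        /*
         * Before: eb->start >> eb->fs_info->sectorsize_bits, one slot
         * per sector, so a 16KiB eb on a 4KiB sector fs used only 1 of
         * every 4 slots.
         */
        return eb->start >> eb->fs_info->nodesize_shift;  /* one slot per eb */
    }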
2025-07-22  btrfs: use readahead_expand() on compressed extents  (Boris Burkov)

We recently received a report of poor performance doing sequential buffered reads of a file with compressed extents. With bs=128k, a naive sequential dd ran as fast on a compressed file as on an uncompressed (1.2GB/s on my reproducing system) while with bs<32k, this performance tanked down to ~300MB/s. i.e.,

slow:
    dd if=some-compressed-file of=/dev/null bs=4k count=X
vs fast:
    dd if=some-compressed-file of=/dev/null bs=128k count=Y

The cause of this slowness is overhead to do with looking up extent_maps to enable readahead pre-caching on compressed extents (add_ra_bio_pages()), as well as some overhead in the generic VFS readahead code we hit more in the slow case. Notably, the main difference between the two read sizes is that in the large sized request case, we call btrfs_readahead() relatively rarely while in the smaller request we call it for every compressed extent. So the fast case stays in the btrfs readahead loop:

    while ((folio = readahead_folio(rac)) != NULL)
        btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);

where the slower one breaks out of that loop every time. This results in calling add_ra_bio_pages a lot, doing lots of extent_map lookups, extent_map locking, etc.

This happens because although add_ra_bio_pages() does add the appropriate un-compressed file pages to the cache, it does not communicate back to the ractl in any way. To solve this, we should be using readahead_expand() to signal to readahead to expand the readahead window.

This change passes the readahead_control into the btrfs_bio_ctrl and in the case of compressed reads sets the expansion to the size of the extent_map we already looked up anyway. It skips the subpage case as that one already doesn't do add_ra_bio_pages().

With this change, whether we use bs=4k or bs=128k, btrfs expands the readahead window up to the largest compressed extent we have seen so far (in the trivial example: 128k) and the call stacks of the two modes look identical. Notably, we barely call add_ra_bio_pages at all. And the performance becomes identical as well. So this change certainly "fixes" this performance problem. Of course, it does seem to beg a few questions:

1. Will this waste too much page cache with a too large ra window?
2. Will this somehow cause bugs prevented by the more thoughtful checking in add_ra_bio_pages?
3. Should we delete add_ra_bio_pages?

My stabs at some answers:

1. Hard to say. See attempts at generic performance testing below. Is there a "readahead_shrink" we should be using? Should we expand more slowly, by half the remaining em size each time?
2. I don't think so. Since the new behavior is indistinguishable from reading the file with a larger read size passed in, I don't see why one would be safe but not the other.
3. Probably! I tested that and it was fine in fstests, and it seems like the pages would get re-used just as well in the readahead case. However, it is possible some reads that use page cache but not btrfs_readahead() could suffer. I will investigate this further as a follow up.

I tested the performance implications of this change in 3 ways (using compress-force=zstd:3 for compression):

Directly test the affected workload of small sequential reads on a compressed file (improved from ~250MB/s to ~1.2GB/s)

==========for-next==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.02983 s, 712 MB/s
dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 5.92403 s, 725 MB/s
dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 17.8832 s, 240 MB/s
dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.71001 s, 1.2 GB/s

==========ra-expand==========
dd /mnt/lol/non-cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.09001 s, 705 MB/s
dd /mnt/lol/non-cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.07664 s, 707 MB/s
dd /mnt/lol/cmpr 4k
1048576+0 records in
1048576+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.79531 s, 1.1 GB/s
dd /mnt/lol/cmpr 128k
32768+0 records in
32768+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 3.69533 s, 1.2 GB/s

Built the linux kernel from clean (no change)

Ran fsperf. Mostly neutral results with some improvements and regressions here and there.

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Link: https://lore.kernel.org/linux-btrfs/34601559-6c16-6ccc-1793-20a97ca0dbba@gmx.net/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
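To make the mechanism concrete, here is a sketch of the expansion call (the btrfs_bio_ctrl member holding the ractl and the exact condition are assumptions, not the verbatim patch):

    /* On finding a compressed extent map during (non-subpage) readahead: */
    if (bio_ctrl->ractl && extent_map_is_compressed(em)) {
        /*
         * Grow the readahead window to the end of the compressed extent
         * so the following folios are served by the ractl loop instead
         * of repeated add_ra_bio_pages() extent_map lookups.
         */
        readahead_expand(bio_ctrl->ractl, em->start, em->len);
    }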
2025-07-21  btrfs: use pgoff_t for page index variables  (David Sterba)

Any conversion of offsets in the logical or the physical mapping space of the pages is done by a shift and the target type should be pgoff_t (type of struct page::index). Fix the locations where it's still unsigned long.

Signed-off-by: David Sterba <dsterba@suse.com>
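For example, a sketch of the intended typing:

    /* A byte offset shifted down by PAGE_SHIFT is a page index: */
    pgoff_t index = start >> PAGE_SHIFT;    /* was: unsigned long index */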
2025-07-21  btrfs: use our message helpers instead of pr_err/pr_warn/pr_info  (David Sterba)

Our message helpers accept NULL for the fs_info in contexts that do not provide one, and print the common header of the message. The use of pr_* helpers is only for special reasons, like module loading, device scanning or multi-line output (print-tree).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
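Illustrative call sites (the messages here are made up):

    /* With an fs_info, the helper prints the per-filesystem header: */
    btrfs_warn(fs_info, "failed to read tree block %llu", start);

    /* Without one, pass NULL and still get the common message prefix: */
    btrfs_err(NULL, "invalid mount option");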
2025-07-21  btrfs: make extent_buffer_test_bit() return a boolean instead  (Filipe Manana)

All the callers want is to determine if a bit is set and all of them call the function and do a double negation (!!) on its result to get a boolean. So change it to return a boolean and simplify callers.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
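Sketch of a caller before and after (variable names are hypothetical):

    /* Before (int return): bool is_set = !!extent_buffer_test_bit(eb, off, nr);
     * After (bool return), the double negation goes away: */
    bool is_set = extent_buffer_test_bit(eb, bit_offset, bit_nr);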
2025-07-21  btrfs: use folio_end() where appropriate  (David Sterba)

Simplify folio_pos() + folio_size() and use the new helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
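The helper is presumably just the sum (sketch):

    static inline u64 folio_end(struct folio *folio)
    {
        /* First byte after the folio in the file: */
        return folio_pos(folio) + folio_size(folio);
    }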
2025-07-21  btrfs: use btrfs_root_id() where not done yet  (David Sterba)

A few more remaining cases where we can use the helper.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
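Example conversion (the caller is hypothetical):

    static bool is_data_reloc_root(const struct btrfs_root *root)
    {
        /* was: return root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID; */
        return btrfs_root_id(root) == BTRFS_DATA_RELOC_TREE_OBJECTID;
    }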
2025-07-21  btrfs: use refcount_t type for the extent buffer reference counter  (Filipe Manana)

Instead of using a bare atomic, use the refcount_t type, which despite being a structure that contains only an atomic, has an API that checks for underflows and other hazards. This doesn't change the size of the extent_buffer structure.

This removes the need to do things like this:

    WARN_ON(atomic_read(&eb->refs) == 0);
    if (atomic_dec_and_test(&eb->refs)) {
        (...)
    }

And do just:

    if (refcount_dec_and_test(&eb->refs)) {
        (...)
    }

Since refcount_dec_and_test() already triggers a warning when we decrement a ref count that has a value of 0 (or below zero).

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21  btrfs: add comment for optimization in free_extent_buffer()  (Filipe Manana)

There's this special atomic compare and exchange logic which serves to avoid locking the extent buffer's refs_lock spinlock and therefore reduce lock contention, so add a comment to make it more obvious.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21  btrfs: reorganize logic at free_extent_buffer() for better readability  (Filipe Manana)

It's hard to read the logic to break out of the while loop since it's a very long expression consisting of a logical or of two composite expressions, each one composed by a logical and. Further each one is also testing for the EXTENT_BUFFER_UNMAPPED bit, making it more verbose than necessary.

So change from this:

    if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3) ||
        (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs == 1))
        break;

To this:

    if (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags)) {
        if (refs == 1)
            break;
    } else if (refs <= 3) {
        break;
    }

At least on x86_64 using gcc 9.3.0, this doesn't change the object size.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21  btrfs: rename btrfs_subpage structure  (Qu Wenruo)

With the incoming large data folios support, the structure name btrfs_subpage is no longer correct, as we can have multiple blocks inside a large folio even when the block size is still page size.

So to follow the schema of iomap, rename btrfs_subpage to btrfs_folio_state, along with involved enums.

There are still exported functions with "btrfs_subpage_" prefix, and I believe for metadata the name "subpage" will stay forever as we will never allocate a folio larger than nodesize anyway. The full cleanup of the word "subpage" will happen in much smaller steps in the future.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19  btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb  (Filipe Manana)

If we break out of the loop because an extent buffer doesn't have the bit EXTENT_BUFFER_TREE_REF set, we end up unlocking the xarray twice, once when we test for the bit and break out of the loop, and once again after the loop.

Fix this by testing the bit and exiting before unlocking the xarray. The time spent testing the bit is negligible and it's not worth trying to do that outside the critical section delimited by the xarray lock due to the code complexity required to avoid it (like using a local boolean variable to track whether the xarray is locked or not). The xarray unlock only needs to be done before calling release_extent_buffer(), as that needs to lock the xarray (through xa_cmpxchg_irq()) and does a more significant amount of work.

Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/linux-btrfs/aDRNDU0GM1_D4Xnw@stanley.mountain/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-27  btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails  (Josef Bacik)

In the zoned mode there's a bug in the extent buffer tree conversion to xarray. The reference for the eb is dropped on failure even though the code continues, and the references are dropped again when the batch is released.

Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 19d7f65f032f ("btrfs: convert the buffer_radix to an xarray")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: fix broken drop_caches on extent buffer folios  (Boris Burkov)

The (correct) commit e41c81d0d30e ("mm/truncate: Replace page_mapped() call in invalidate_inode_page()") replaced the page_mapped(page) check with a refcount check. However, this refcount check does not work as expected with drop_caches for btrfs's metadata pages.

Btrfs has a per-sb metadata inode with cached pages, and when not in active use by btrfs, they have a refcount of 3. One from the initial call to alloc_pages(), one (nr_pages == 1) from filemap_add_folio(), and one from folio_attach_private(). We would expect such pages to get dropped by drop_caches. However, drop_caches calls into mapping_evict_folio() via mapping_try_invalidate() which gets a reference on the folio with find_lock_entries(). As a result, these pages have a refcount of 4, and fail this check.

For what it's worth, such pages do get reclaimed under memory pressure, so I would say that while this behavior is surprising, it is not really dangerously broken.

When I asked the mm folks about the expected refcount in this case, I was told that the correct thing to do is to donate the refcount from the original allocation to the page cache after inserting it.

Therefore, attempt to fix this by adding a folio_put() to the critical spot in alloc_extent_buffer() where we are sure that we have really allocated and attached new pages. We must also adjust the code around folio_detach_private() to properly handle being the last reference to the folio and not do a use-after-free after folio_detach_private().

extent_buffers allocated by clone_extent_buffer() and alloc_dummy_extent_buffer() are unmapped, so this transfer of ownership from allocation to insertion in the mapping does not apply to them. However, we can still folio_put() them safely once they are finished being allocated and have called folio_attach_private().

Finally, removing the generic folio_put() for the allocation from btrfs_detach_extent_buffer_folios() means we need to be careful to do the appropriate folio_put() in allocation failure paths in alloc_extent_buffer(), clone_extent_buffer() and alloc_dummy_extent_buffer().

Link: https://lore.kernel.org/linux-mm/ZrwhTXKzgDnCK76Z@casper.infradead.org/
Tested-by: Klara Modin <klarasmodin@gmail.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
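A sketch of the ownership-donation pattern described above (the helper is hypothetical and error handling is reduced to the minimum):

    static int attach_eb_folio(struct address_space *mapping, pgoff_t index,
                               struct extent_buffer *eb)
    {
        struct folio *folio = filemap_alloc_folio(GFP_NOFS, 0); /* +1 alloc ref */

        if (!folio)
            return -ENOMEM;
        if (filemap_add_folio(mapping, folio, index, GFP_NOFS)) { /* +1 cache ref */
            folio_put(folio);
            return -EEXIST;
        }
        folio_attach_private(folio, eb);  /* +1 private ref */
        /*
         * Donate the allocation reference to the page cache: an idle,
         * unmapped metadata folio now sits at the refcount that
         * mapping_evict_folio() (and thus drop_caches) expects.
         */
        folio_put(folio);
        return 0;
    }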
2025-05-15  btrfs: get rid of goto in alloc_test_extent_buffer()  (Daniel Vacek)

The `free_eb` label is used only once. Simplify by moving the code in place.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use buffer xarray for extent buffer writeback operations  (Josef Bacik)

Currently we have this ugly back and forth with the btree writeback where we find the folio, find the eb associated with that folio, and then attempt to writeback. This results in two different paths for subpage ebs and >= page size ebs.

Clean this up by adding our own infrastructure around looking up tagged ebs and writing the ebs out directly. This allows us to unify the subpage and >= pagesize IO paths, resulting in a much cleaner writeback path for extent buffers.

I ran this through fsperf on a VM with 8 CPUs and 16GiB of RAM. I used smallfiles100k, but reduced the files to 1k to make it run faster. The results are as follows, with the statistically significant improvements marked with *; there were no regressions. fsperf was run with -n 10 for both runs, so the baseline is the average of 10 runs and the test is the average of 10 runs.

smallfiles100k results

    metric               baseline       current        stdev          diff
    ========================================================================
    avg_commit_ms        68.58          58.44          3.35           -14.79% *
    commits              270.60         254.70         16.24          -5.88%
    dev_read_iops        48             48             0              0.00%
    dev_read_kbytes      1044           1044           0              0.00%
    dev_write_iops       866117.90      850028.10      14292.20       -1.86%
    dev_write_kbytes     10939976.40    10605701.20    351330.32      -3.06%
    elapsed              49.30          33             1.64           -33.06% *
    end_state_mount_ns   41251498.80    35773220.70    2531205.32     -13.28% *
    end_state_umount_ns  1.90e+09       1.50e+09       14186226.85    -21.38% *
    max_commit_ms        139            111.60         9.72           -19.71% *
    sys_cpu              4.90           3.86           0.88           -21.29%
    write_bw_bytes       42935768.20    64318451.10    1609415.05     49.80% *
    write_clat_ns_mean   366431.69      243202.60      14161.98       -33.63% *
    write_clat_ns_p50    49203.20       20992          264.40         -57.34% *
    write_clat_ns_p99    827392         653721.60      65904.74       -20.99% *
    write_io_kbytes      2035940        2035940        0              0.00%
    write_iops           10482.37       15702.75       392.92         49.80% *
    write_lat_ns_max     1.01e+08       90516129       3910102.06     -10.29% *
    write_lat_ns_mean    366556.19      243308.48      14154.51       -33.62% *

As you can see we get about a 33% decrease in runtime, with a 50% throughput increase, which is pretty significant.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: set DIRTY and WRITEBACK tags on the buffer_tree  (Josef Bacik)

In preparation for changing how we do writeout of extent buffers, start tagging the extent buffer xarray with DIRTY and WRITEBACK to make it easier to find extent buffers that are in either state.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
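For illustration, this is roughly what xarray tagging enables (the actual tag names and wrappers in btrfs may differ; XA_MARK_0/XA_MARK_1 are stand-ins for DIRTY/WRITEBACK):

    /* Tag a slot when its eb becomes dirty: */
    xa_set_mark(&fs_info->buffer_tree, index, XA_MARK_0);

    /* On writeback start, move it from DIRTY to WRITEBACK: */
    xa_clear_mark(&fs_info->buffer_tree, index, XA_MARK_0);
    xa_set_mark(&fs_info->buffer_tree, index, XA_MARK_1);

    /* Writeback then visits only the tagged entries: */
    xa_for_each_marked(&fs_info->buffer_tree, index, eb, XA_MARK_0) {
        /* queue @eb for IO */
    }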
2025-05-15  btrfs: convert the buffer_radix to an xarray  (Josef Bacik)

In order to fully utilize xarray tagging to improve writeback we need to convert the buffer_radix to a proper xarray. This conversion is relatively straightforward as the radix code uses the xarray underneath. Using xarray directly allows for quite a lot less code.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: subpage: reject tree blocks which are not nodesize aligned  (Qu Wenruo)

When btrfs subpage support (fs block < page size) was introduced, a subpage filesystem would only reject tree blocks which cross page boundaries. This used to be a compromise to simplify the tree block handling while still allowing subpage cases to read some old converted filesystems which did not have proper chunk alignment.

But in practice, suppose we have the following unaligned tree block on a 64K page sized system:

    0          32K   44K           60K  64K
    |                |/////////////|    |

Although btrfs has no problem reading the tree block at [44K, 60K), if the extent allocator is allocating another tree block, it may choose the range [60K, 74K), as the extent allocator has no awareness if it's a subpage metadata request or not. Then we'd get -EINVAL from the following sequence:

    btrfs_alloc_tree_block()
    |- btrfs_reserve_extent()
    |  Which returned range [60K, 74K)
    |- btrfs_init_new_buffer()
       |- btrfs_find_create_tree_block()
          |- alloc_extent_buffer()
             |- check_eb_alignment()
                Which returned -EINVAL, because the range
                crosses page boundary.

This situation will not fix itself and should mostly mark the fs read-only.

Thankfully we didn't really get such reports in the real world because:

- The original unaligned tree block is only caused by older btrfs-convert
  It's before the btrfs-convert rework was done in v4.6, where converted btrfs filesystems can have metadata block groups which are not aligned to nodesize nor stripe size (64K). But after btrfs-progs v4.6, all chunks allocated will be stripe (64K) aligned, thus no more such problem.

Considering how old the fix is (v4.6 was released almost 10 years ago) and that subpage support for btrfs was introduced in v5.15, it should be safe to reject those unaligned tree blocks.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
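The rejection itself is presumably a nodesize alignment check along these lines (a sketch, not the literal hunk):

    /* In check_eb_alignment(): reject tree blocks that are not nodesize
     * aligned instead of only those crossing a page boundary. */
    if (!IS_ALIGNED(eb->start, fs_info->nodesize)) {
        btrfs_err(fs_info,
                  "tree block is not nodesize aligned, start %llu nodesize %u",
                  eb->start, fs_info->nodesize);
        return -EINVAL;
    }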
2025-05-15  btrfs: move folio initialization to one place in attach_eb_folio_to_filemap()  (Daniel Vacek)

This is just a trivial change. The code looks a bit more readable this way, IMO. Move initialization of existing_folio to the beginning of the retry loop so it's set to NULL at one place.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: convert ASSERT(0) with handled errors to DEBUG_WARN()  (David Sterba)

The use of ASSERT(0) may be useful for some cases but is more like a notice for developers. Assertions can be compiled in independently, so convert it to a debugging helper. The difference is that it's just a warning and will not end up in BUG(). The converted cases are in connection with proper error handling.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: convert WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG)) to DEBUG_WARN  (David Sterba)

Use the conditional warning instead of typing the whole condition. Optional message is printed where it seems clear what could be the problem. Conversion is left out in btree_csum_one_bio() because of the additional condition.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
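A plausible shape for the helper implied by these two commits (a sketch only; the real definition in the btrfs headers may differ):

    /* Warns only on debug builds; never escalates to BUG(): */
    #define DEBUG_WARN(fmt, ...) \
        WARN(IS_ENABLED(CONFIG_BTRFS_DEBUG), fmt, ##__VA_ARGS__)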
2025-05-15  btrfs: track the next file offset in struct btrfs_bio_ctrl  (Christoph Hellwig)

The bio implementation is not something we should really mess around with, and we shouldn't recalculate the pos from the folio over and over. Instead just track the end of the current bio in logical file offsets in the btrfs_bio_ctrl, which is much simpler and easier to read.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
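Schematically, with a member name suggested by the commit text (simplified, not the verbatim patch):

    struct btrfs_bio_ctrl {
        struct btrfs_bio *bbio;
        /* ... */
        u64 next_file_offset;   /* logical file offset right after the bio */
    };

    /* The contiguity test becomes a plain compare instead of
     * recomputing the position from the folio: */
    if (bio_ctrl->bbio && file_offset != bio_ctrl->next_file_offset)
        submit_one_bio(bio_ctrl);
    /* ... after adding @len bytes to the bio: */
    bio_ctrl->next_file_offset = file_offset + len;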
2025-05-15  btrfs: remove the alignment checks in end_bbio_data_read()  (Christoph Hellwig)

end_bbio_data_read() checks that each iterated folio fragment is aligned and justifies that with block drivers advancing the bio. But block drivers only advance bi_iter, while end_bbio_data_read() uses bio_for_each_folio_all() to iterate the immutable bi_io_vec array that can't be changed by drivers at all.

Furthermore btrfs has already done the alignment check of the file offset inside submit_one_sector(), and the size is fixed to the fs block size, so there is no need to re-do the alignment check again inside the endio function.

So just remove the unnecessary alignment check along with the incorrect comment.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename remaining exported extent map functions  (Filipe Manana)

Rename all the exported functions from extent_map.h that don't have a 'btrfs_' prefix in their names, so that they are consistent with all the other functions, to make it clear they are btrfs specific functions and to avoid potential name collisions in the future with functions defined elsewhere in the kernel.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename functions to allocate and free extent maps  (Filipe Manana)

These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename extent map functions to get block start, end and check if in tree  (Filipe Manana)

These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename exported extent map compression functions  (Filipe Manana)

These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename free_extent_state() to include a btrfs prefix  (Filipe Manana)

This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. Rename the function to add 'btrfs_' prefix to it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename the functions to count, test and get bit ranges in io trees  (Filipe Manana)

These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their names to make it clear they are from btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: rename the functions to clear bits for an extent range  (Filipe Manana)

These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove the double underscore prefix where applicable and add a 'btrfs_' prefix to their names to make it clear they are from btrfs.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: add btrfs prefix to main lock, try lock and unlock extent functions  (Filipe Manana)

These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use folio_contains() for EOF detection  (Qu Wenruo)

Currently we use the following pattern to detect if the folio contains the end of a file:

    if (folio->index == end_index)
        folio_zero_range();

But that only works if the folio is page sized.

For the following case, it will not work and leave the range beyond EOF uninitialized. The page size is 4K, and the fs block size is also 4K:

    16K          20K    22K    24K
    |            |      |      |
                        EOF at 22K

And we have a large folio sized 8K at file offset 16K. In that case, the old "folio->index == end_index" check will not work, thus the range [22K, 24K) will not be zeroed out.

Fix the following call sites which use the above pattern:

- add_ra_bio_pages()
- extent_writepage()

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
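Sketch of the conversion at such a call site (isize and the offset handling are illustrative, not the literal hunk):

    /* Old check, only correct for page sized folios:
     *     if (folio->index == end_index)
     * New check, correct for any folio size: */
    if (folio_contains(folio, end_index)) {
        size_t zero_start = offset_in_folio(folio, isize);

        folio_zero_range(folio, zero_start, folio_size(folio) - zero_start);
    }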
2025-05-15  btrfs: remove unnecessary early exits in delalloc folio lock and unlock  (Qu Wenruo)

Inside functions unlock_delalloc_folio() and lock_delalloc_folios(), we have the following early exits:

    if (index == locked_folio->index && end_index == index)
        return;

This allows us to exit early if the range is inside the same locked folio. However the current check relies on page sized folios; if we have a large folio that contains @index while its folio->index is not @index, then the early exit will no longer trigger.

Furthermore without the above early check, the existing code can handle it well, as both __process_folios_contig() and lock_delalloc_folios() will skip any folio page lock/unlock if it's on the locked folio.

Here we remove the early exits and let the existing code handle the same index case, to make the code a little simpler.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: prepare end_bbio_data_write() for large data folios  (Qu Wenruo)

The function is doing an ASSERT() checking the folio order, but all later functions are handling large folios properly, thus we can safely remove that ASSERT().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: use clear_extent_bit() at try_release_extent_state()  (Filipe Manana)

Instead of using __clear_extent_bit() we can use clear_extent_bit() since we pass a NULL value for the changeset argument.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: pass a pointer to get_range_bits() to cache first search result  (Filipe Manana)

Allow get_range_bits() to take an extent state pointer to pointer argument so that we can cache the first extent state record in the target range, so that a caller can use it for subsequent operations without doing a full tree search. Currently the only user is try_release_extent_state(), which then does a call to __clear_extent_bit() which can use such a cached state record.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: allow folios to be released while ordered extent is finishing  (Filipe Manana)

When the release_folio callback (from struct address_space_operations) is invoked we don't allow the folio to be released if its range is currently locked in the inode's io_tree, as it may indicate the folio may be needed by the task that locked the range.

However if the range is locked because an ordered extent is finishing, then we can safely allow the folio to be released because ordered extent completion doesn't need to use the folio at all.

When we are under memory pressure, the kernel starts writeback of dirty pages (folios) with the goal of releasing the pages from the page cache after writeback completes, however this often is not possible on btrfs because:

* Once the writeback completes we queue the ordered extent completion;

* Once the ordered extent completion starts, we lock the range in the inode's io_tree (at btrfs_finish_one_ordered());

* If the release_folio callback is called while the folio's range is locked in the inode's io_tree, we don't allow the folio to be released, so the kernel has to try to release memory elsewhere, which may result in triggering more writeback or releasing other pages from the page cache which may be more useful to have around for applications.

In contrast, when the release_folio callback is invoked after writeback finishes and before ordered extent completion starts or locks the range, we allow the folio to be released, as well as when the release_folio callback is invoked after ordered extent completion unlocks the range.

Improve on this by detecting if the range is locked for ordered extent completion and if it is, allow the folio to be released. This detection is achieved by adding a new extent flag in the io_tree that is set when the range is locked during ordered extent completion.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: update comment for try_release_extent_state()  (Filipe Manana)

Drop reference to pages from the comment since the function is fully folio aware and works regardless of how many pages are in the folio. Also while at it, capitalize the first word and make it more explicit that release_folio is a callback from struct address_space_operations.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: remove unused flag EXTENT_BUFFER_IN_TREE  (Daniel Vacek)

This flag is set after inserting the eb to the buffer tree and cleared on its removal. It was added in commit 34b41acec1ccc0 ("Btrfs: use a bit to track if we're in the radix tree") and commit faa2dbf004e89e ("Btrfs: add sanity tests for new qgroup accounting code") wanted to make use of it. Both are 10+ years old, we can remove the flag.

Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15  btrfs: remove unused flag EXTENT_BUFFER_READ_ERR  (Daniel Vacek)

This flag was added by commit 656f30dba7ab ("Btrfs: be aware of btree inode write errors to avoid fs corruption") but it stopped being used after commit 046b562b20a5 ("btrfs: use a separate end_io handler for read_extent_buffer").

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-02  btrfs: open code folio_index() in btree_clear_folio_dirty_tag()  (Kairui Song)

The folio_index() helper is only needed for mixed usage of page cache and swap cache; for pure page cache usage, the caller can just use folio->index instead.

It can't be a swap cache folio here. Swap mapping may only call into fs through 'swap_rw' but btrfs does not use that method for swap.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-23  btrfs: adjust subpage bit start based on sectorsize  (Josef Bacik)

When running machines with 64k page size and a 16k nodesize we started seeing tree log corruption in production. This turned out to be because we were not writing out dirty blocks sometimes, so this in fact affects all metadata writes.

When writing out a subpage EB we scan the subpage bitmap for a dirty range. If the range isn't dirty we do bit_start++; to move onto the next bit. The problem is the bitmap is based on the number of sectors that an EB has. So in this case, we have a 64k pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4 bits for every node. With a 64k page size we end up with 4 nodes per page.

To make this easier this is how everything looks:

    [0        16k       32k       48k        ]  logical address
    [0        4         8         12         ]  radix tree offset
    [               64k page                 ]  folio
    [ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ]    extent buffers
    [ | | | | | | | | | | | | | | | | ]         bitmap

Now we use all of our addressing based on fs_info->sectorsize_bits, so as you can see above, our 16k eb->start turns into radix entry 4.

When we find a dirty range for our eb, we correctly do bit_start += sectors_per_node, because if we start at bit 0, the next bit for the next eb is 4, to correspond to eb->start 16k.

However if our range is clean, we will do bit_start++, which will now put us offset from our radix tree entries. In our case, assume that the first time we check the bitmap the block is not dirty, we increment bit_start so now it == 1, and then we loop around and check again. This time it is dirty, and we go to find that start using the following equation:

    start = folio_start + bit_start * fs_info->sectorsize;

so in the case above, eb->start 0 is now dirty, and we calculate start as

    0 + 1 * fs_info->sectorsize = 4096
    4096 >> 12 = 1

Now we're looking up the radix tree for 1, and we won't find an eb. What's worse is now we're using bit_start == 1, so we do bit_start += sectors_per_node, which is now 5. If that eb is dirty we will run into the same thing, we will look at an offset that is not populated in the radix tree, and now we're skipping the writeout of dirty extent buffers.

The best fix for this is to not use sectorsize_bits to address nodes, but that's a larger change. Since this is a fs corruption problem fix it simply by always using sectors_per_node to increment the start bit.

Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
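The fix, schematically (loop bounds and variable names are assumed from the description above):

    for (bit_start = 0; bit_start < blocks_per_folio; ) {
        if (!test_bit(bit_start, dirty_bitmap)) {
            /*
             * was: bit_start++; which desynchronized the bit offset
             * from the nodesize-spaced radix tree entries.
             */
            bit_start += sectors_per_node;
            continue;
        }
        /* ... look up and write out the dirty eb at this offset ... */
        bit_start += sectors_per_node;
    }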
2025-03-18  btrfs: prepare extent_io.c for future large folio support  (Qu Wenruo)

When we're handling folios from filemap, we can no longer assume all folios are page sized. Thus for call sites assuming the folio is page sized, change the PAGE_SIZE usage to folio_size() instead.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
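For instance (illustrative only):

    /* Before, assuming a page sized folio:
     *     u64 range_end = folio_pos(folio) + PAGE_SIZE;
     * After, correct for any folio order: */
    u64 range_end = folio_pos(folio) + folio_size(folio);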
2025-03-18  btrfs: add a size parameter to btrfs_alloc_subpage()  (Qu Wenruo)

Since we can no longer assume page sized folios for data filemap folios, allow btrfs_alloc_subpage() to accept a new parameter, @fsize, indicating the folio size.

This doesn't follow the regular behavior of passing a folio directly, because this function is shared by both data and metadata folios, and for metadata folios we have an extra allocation policy to ensure no large folios whose sizes are larger than nodesize (unless it's page sized).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
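The resulting signature is presumably along these lines (a sketch based purely on the description above):

    /* @fsize carries the folio size since no folio is passed in: */
    struct btrfs_subpage *btrfs_alloc_subpage(const struct btrfs_fs_info *fs_info,
                                              size_t fsize,
                                              enum btrfs_subpage_type type);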
2025-03-18  btrfs: subpage: make btrfs_is_subpage() check against a folio  (Qu Wenruo)

To support large data folios, we can no longer assume every filemap folio is page sized. So the btrfs_is_subpage() check must be done against a folio.

Thankfully for metadata folios, we have the full control and ensure a large folio will not be larger than nodesize, so btrfs_meta_is_subpage() doesn't need this change.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
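Sketch of the per-folio check (the exact form in the tree may differ; the point is that the comparison uses the folio's own size, not PAGE_SIZE):

    static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
                                        struct folio *folio)
    {
        /* Subpage handling is needed when a block is smaller than
         * the folio that contains it: */
        return fs_info->sectorsize < folio_size(folio);
    }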
2025-03-18  btrfs: avoid linker error in btrfs_find_create_tree_block()  (Mark Harmstone)

The inline function btrfs_is_testing() is hardcoded to return 0 if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on the compiler optimizing out the call to alloc_test_extent_buffer() in btrfs_find_create_tree_block(), as it's not been defined (it's behind an #ifdef).

Add a stub version of alloc_test_extent_buffer() to avoid linker errors on non-standard optimization levels. This problem was seen on GCC 14 with -O0, and it helps to expose symbols that would be otherwise optimized out.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
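The stub pattern, sketched (the signature matches how the function is described; details may differ):

    #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
    struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
                                                   u64 start);
    #else
    static inline struct extent_buffer *
    alloc_test_extent_buffer(struct btrfs_fs_info *fs_info, u64 start)
    {
        /*
         * Never reached: btrfs_is_testing() is hardcoded to false here,
         * but having a definition keeps -O0 builds linking.
         */
        return NULL;
    }
    #endif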
2025-03-18  btrfs: reject out-of-band dirty folios during writeback  (Qu Wenruo)

[OUT-OF-BAND DIRTY FOLIOS]
An out-of-band folio means the folio is marked dirty but without notifying the filesystem. This can lead to various problems, not limited to:

- No folio::private to track per block status
- No proper space reserved for such a dirty folio

[HISTORY IN BTRFS]
This used to be a problem related to get_user_page(), but with the introduction of pin_user_pages*(), we should no longer hit such cases anymore.

In btrfs, we have a long history of catching such out-of-band dirty folios by:

- Mark the folio ordered during delayed allocation
- Check the folio ordered flag during writeback

If the folio has no ordered flag, it means it doesn't go through delayed allocation, thus it's definitely an out-of-band one.

If we got one, we go through COW fixup, which will re-dirty the folio with proper handling in another workqueue.

[PROBLEMS OF COW-FIXUP]
Such workaround is a blockage for us to migrate to iomap (it requires extra flags to trace if a folio is dirtied by the fs or not) and I'd argue it's not data checksum safe, since if a folio can be marked dirty without informing the fs, the content can also change at any time.

But with the introduction of pin_user_pages*() during the v5.8 merge window, such out-of-band dirty folios should be treated as a bug. Ext4 has treated such cases by warning and erroring out even before pin_user_pages*().

Furthermore, there are already proofs that such folio ordered flag tracking can be screwed up by incorrect error handling; check the commit messages of the following commits:

06f364284794 ("btrfs: do proper folio cleanup when cow_file_range() failed")
c2b47df81c8e ("btrfs: do proper folio cleanup when run_delalloc_nocow() failed")

[FIXES]
Unlike btrfs, ext4 and xfs (iomap) never bother handling such out-of-band dirty folios.

- Ext4 just warns and errors out
- Iomap always follows the folio/block dirty flags

And there is nothing really COW specific, xfs also supports COW too.

Here we take one step towards ext4 by doing warning and erroring out. But since the cow fixup thing has been there from the beginning, we keep the old behavior for non-experimental builds, and only do the new warning for experimental builds before we're 100% sure and remove cow fixup.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18  btrfs: allow buffered write to avoid full page read if it's block aligned  (Qu Wenruo)

[BUG]
Since the support of block size (sector size) < page size for btrfs, test case generic/563 fails with 4K block size and 64K page size:

    --- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
    +++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
    @@ -3,7 +3,8 @@
     read is in range
     write is in range
     write -> read/write
    -read is in range
    +read has value of 8388608
    +read is NOT in range -33792 .. 33792
     write is in range
     ...

[CAUSE]
The test case creates a 8MiB file, then does buffered write into the 8MiB using 4K block size, to overwrite the whole file.

On 4K page sized systems, since the write range covers the full block and page, btrfs will not bother reading the page, just like what XFS and EXT4 do.

But on 64K page sized systems, although the 4K sized write is still block aligned, it's not page aligned anymore, thus btrfs will read the full page, which will be accounted by cgroup and fail the test. As the test case itself expects that such 4K block aligned writes do not trigger any read.

Such expected behavior is an optimization to reduce folio reads when possible, and unfortunately btrfs does not implement such optimization.

[FIX]
To skip the full page read, we need to do the following modification:

- Do not trigger full page read as long as the buffered write is block aligned
  This is pretty simple by modifying the check inside prepare_uptodate_page().

- Skip already uptodate blocks during full page read
  Or we can lead to the following data corruption:

    0       32K      64K
    |///////|        |

  Where the file range [0, 32K) is dirtied by buffered write, the remaining range [32K, 64K) is not.

  When reading the full page, since [0,32K) is only dirtied but not written back, there is no data extent map for it, but a hole covering [0, 64k). If we continue reading the full page range [0, 64K), the dirtied range will be filled with 0 (since there is only a hole covering the whole range). This causes the dirtied range to get lost.

With this optimization, btrfs can pass generic/563 even if the page size is larger than the fs block size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
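A sketch of the first half of the fix, the block-alignment check in prepare_uptodate_page() (variable names assumed; not the literal hunk):

    /* Skip the full folio read when the write covers whole fs blocks: */
    if (IS_ALIGNED(pos, fs_info->sectorsize) &&
        IS_ALIGNED(pos + len, fs_info->sectorsize))
        return 0;   /* the blocks will be fully overwritten anyway */

    /*
     * Otherwise fall through and read the folio, but skip blocks that
     * are already uptodate so dirty-but-unwritten data is not lost.
     */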