summaryrefslogtreecommitdiff
path: root/fs/btrfs/zoned.c
AgeCommit message (Collapse)Author
2023-07-20btrfs: zoned: do not enable async discardNaohiro Aota
The zoned mode need to reset a zone before using it. We rely on btrfs's original discard functionality (discarding unused block group range) to do the resetting. While the commit 63a7cb130718 ("btrfs: auto enable discard=async when possible") made the discard done in an async manner, a zoned reset do not need to be async, as it is fast enough. Even worth, delaying zone rests prevents using those zones again. So, let's disable async discard on the zoned mode. Fixes: 63a7cb130718 ("btrfs: auto enable discard=async when possible") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update message text ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code btrfs_map_sblockChristoph Hellwig
btrfs_map_sblock just hard codes three arguments and calls btrfs_map_sblock. Remove it as it doesn't provide any real value, but makes following the btrfs_map_block call chains harder. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: pass the new logical address to split_extent_mapChristoph Hellwig
split_extent_map splits off the first chunk of an extent map into a new one. One of the two users is the zoned I/O completion code that wants to rewrite the logical block start address right after this split. Pass in the logical address to be set in the split off first extent_map as an argument to avoid an extra extent tree lookup for this case. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: defer splitting of ordered extents until I/O completionChristoph Hellwig
The btrfs zoned completion code currently needs an ordered_extent and extent_map per bio so that it can account for the non-predictable write location from Zone Append. To archive that it currently splits the ordered_extent and extent_map at I/O submission time, and then records the actual physical address in the ->physical field of the ordered_extent. This patch instead switches to record the "original" physical address that the btrfs allocator assigned in spare space in the btrfs_bio, and then rewrites the logical address in the btrfs_ordered_sum structure at I/O completion time. This allows the ordered extent completion handler to simply walk the list of ordered csums and split the ordered extent as needed. This removes an extra ordered extent and extent_map lookup and manipulation during the I/O submission path, and instead batches it in the I/O completion path where we need to touch these anyway. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: optimize the logical to physical mapping for zoned writesChristoph Hellwig
The current code to store the final logical to physical mapping for a zone append write in the extent tree is rather inefficient. It first has to split the ordered extent so that there is one ordered extent per bio, so that it can look up the ordered extent on I/O completion in btrfs_record_physical_zoned and store the physical LBA returned by the block driver in the ordered extent. btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to see what physical address the logical address for this bio / ordered extent is mapped to, and then rewrite it in the extent tree. To optimize this process, we can store the physical address assigned in the chunk tree to the original logical address and a pointer to btrfs_ordered_sum structure the in the btrfs_bio structure, and then use this information to rewrite the logical address in the btrfs_ordered_sum structure directly at I/O completion time in btrfs_record_physical_zoned. btrfs_rewrite_logical_zoned then simply updates the logical address in the extent tree and the ordered_extent itself. The code in btrfs_rewrite_logical_zoned now runs for all data I/O completions in zoned file systems, which is fine as there is no remapping to do for non-append writes to conventional zones or for relocation, and the overhead for quickly breaking out of the loop is very low. Because zoned file systems now need the ordered_sums structure to record the actual write location returned by zone append, allocate dummy structures without the csum array for them when the I/O doesn't use checksums, and free them when completing the ordered_extent. Note that the btrfs_bio doesn't grow as the new field are places into a union that is so far not used for data writes and has plenty of space left in it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: rename the bytenr field in struct btrfs_ordered_sum to logicalChristoph Hellwig
btrfs_ordered_sum::bytendr stores a logical address. Make that clear by renaming it to ->logical. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: drop gfp from parameter extent state helpersDavid Sterba
Now that all extent state bit helpers effectively take the GFP_NOFS mask (and GFP_NOWAIT is encoded in the bits) we can remove the parameter. This reduces stack consumption in many functions and simplifies a lot of code. Net effect on module on a release build: text data bss dec hex filename 1250432 20985 16088 1287505 13a551 pre/btrfs.ko 1247074 20985 16088 1284147 139833 post/btrfs.ko DELTA: -3358 Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: pass NOWAIT for set/clear extent bits as another bitDavid Sterba
The only flags we now pass to set_extent_bit/__clear_extent_bit are GFP_NOFS and GFP_NOWAIT (a few functions handling mappings). This requires an extra parameter to be passed everywhere but is almost always the same. Encode the GFP_NOWAIT as an artificial extent bit and extract the real bits and gfp mask in the lowest level helpers. Now the passed gfp mask is not actually used and can be removed. Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: open code set_extent_bits_nowaitDavid Sterba
The helper only passes GFP_NOWAIT as gfp flags and is used two times. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: don't hold an extra reference for redirtied buffersChristoph Hellwig
When btrfs_redirty_list_add redirties a buffer, it also acquires an extra reference that is released on transaction commit. But this is not required as buffers that are dirty or under writeback are never freed (look for calls to extent_buffer_under_io())). Remove the extra reference and the infrastructure used to drop it again. History behind redirty logic: In the first place, it used releasing_list to hold all the to-be-released extent buffers, and decided which buffers to re-dirty at the commit time. Then, in a later version, the behaviour got changed to re-dirty a necessary buffer and add re-dirtied one to the list in btrfs_free_tree_block(). In short, the list was there mostly for the patch series' historical reason. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [ add Naohiro's comment regarding history ] Signed-off-by: David Sterba <dsterba@suse.com>
2023-06-19btrfs: export bitmap_test_range_all_{set,zero}Naohiro Aota
bitmap_test_range_all_{set,zero} defined in subpage.c are useful for other components. Move them to misc.h and use them in zoned.c. Also, as find_next{,_zero}_bit take/return "unsigned long" instead of "unsigned int", convert the type to "unsigned long". While at it, also rewrite the "if (...) return true; else return false;" pattern and add const to the input bitmap. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_addChristoph Hellwig
btrfs_redirty_list_add zeroes the buffer data and sets the EXTENT_BUFFER_NO_CHECK to make sure writeback is fine with a bogus header. But it does that after already marking the buffer dirty, which means that writeback could already be looking at the buffer. Switch the order of operations around so that the buffer is only marked dirty when we're ready to write it. Fixes: d3575156f662 ("btrfs: zoned: redirty released extent buffers") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-05-10btrfs: zoned: fix full zone super block reading on ZNSNaohiro Aota
When both of the superblock zones are full, we need to check which superblock is newer. The calculation of last superblock position is wrong as it does not consider zone_capacity and uses the length. Fixes: 9658b72ef300 ("btrfs: zoned: locate superblock position using zone capacity") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-28btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zonesNaohiro Aota
find_next_bit and find_next_zero_bit take @size as the second parameter and @offset as the third parameter. They are specified opposite in btrfs_ensure_empty_zones(). Thanks to the later loop, it never failed to detect the empty zones. Fix them and (maybe) return the result a bit faster. Note: the naming is a bit confusing, size has two meanings here, bitmap and our range size. Fixes: 1cd6121f2a38 ("btrfs: zoned: implement zoned chunk allocator") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-04-17btrfs: introduce btrfs_bio::fs_info memberQu Wenruo
Currently we're doing a lot of work for btrfs_bio: - Checksum verification for data read bios - Bio splits if it crosses stripe boundary - Read repair for data read bios However for the incoming scrub patches, we don't want this extra functionality at all, just plain logical + mirror -> physical mapping ability. Thus here we do the following changes: - Introduce btrfs_bio::fs_info This is for the new scrub specific btrfs_bio, which would not populate btrfs_bio::inode. Thus we need such new member to grab a fs_info This new member will always be populated. - Replace @inode argument with @fs_info for btrfs_bio_init() and its caller Since @inode is no longer a mandatory member, replace it with @fs_info, and let involved users populate @inode. - Skip checksum verification and generation if @bbio->inode is NULL - Add extra ASSERT()s To make sure: * bbio->inode is properly set for involved read repair path * if @file_offset is set, bbio->inode is also populated - Grab @fs_info from @bbio directly We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be NULL. This involves: * btrfs_simple_end_io() * should_async_write() * btrfs_wq_submit_bio() * btrfs_use_zone_append() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15btrfs: zoned: drop space_info->active_total_bytesNaohiro Aota
The space_info->active_total_bytes is no longer necessary as we now count the region of newly allocated block group as zone_unusable. Drop its usage. Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15btrfs: zoned: count fresh BG region as zone unusableNaohiro Aota
The naming of space_info->active_total_bytes is misleading. It counts not only active block groups but also full ones which are previously active but now inactive. That confusion results in a bug not counting the full BGs into active_total_bytes on mount time. For a background, there are three kinds of block groups in terms of activation. 1. Block groups never activated 2. Block groups currently active 3. Block groups previously active and currently inactive (due to fully written or zone finish) What we really wanted to exclude from "total_bytes" is the total size of BGs #1. They seem empty and allocatable but since they are not activated, we cannot rely on them to do the space reservation. And, since BGs #1 never get activated, they should have no "used", "reserved" and "pinned" bytes. OTOH, BGs #3 can be counted in the "total", since they are already full we cannot allocate from them anyway. For them, "total_bytes == used + reserved + pinned + zone_unusable" should hold. Tracking #2 and #3 as "active_total_bytes" (current implementation) is confusing. And, tracking #1 and subtract that properly from "total_bytes" every time you need space reservation is cumbersome. Instead, we can count the whole region of a newly allocated block group as zone_unusable. Then, once that block group is activated, release [0 .. zone_capacity] from the zone_unusable counters. With this, we can eliminate the confusing ->active_total_bytes and the code will be common among regular and the zoned mode. Also, no additional counter is needed with this approach. Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15btrfs: rename BTRFS_FS_NO_OVERCOMMIT to BTRFS_FS_ACTIVE_ZONE_TRACKINGJosef Bacik
This flag only gets set when we're doing active zone tracking, and we're going to need to use this flag for things related to this behavior. Rename the flag to represent what it actually means for the file system so it can be used in other ways and still make sense. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-03-15btrfs: zoned: fix btrfs_can_activate_zone() to support DUP profileNaohiro Aota
btrfs_can_activate_zone() returns true if at least one device has one zone available for activation. This is OK for the single profile, but not OK for DUP profile. We need two zones to create a DUP block group. Fix it by properly handling the case with the profile flags. Fixes: 265f7237dd25 ("btrfs: zoned: allow DUP on meta-data block groups") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: don't rely on unchanging ->bi_bdev for zone append remapsChristoph Hellwig
btrfs_record_physical_zoned relies on a bio->bi_bdev samples in the bio_end_io handler to find the reverse map for remapping the zone append write, but stacked block device drivers can and usually do change bi_bdev when sending on the bio to a lower device. This can happen e.g. with the nvme-multipath driver when a NVMe SSD sets the shared namespace bit. But there is no real need for the bdev in btrfs_record_physical_zoned, as it is only passed to btrfs_rmap_block, which uses it to pick the mapping to report if there are multiple reverse mappings. As zone writes can only do simple non-mirror writes right now, and anything more complex will use the stripe tree there is no chance of the multiple mappings case actually happening. Instead open code the subset of btrfs_rmap_block in btrfs_record_physical_zoned, which also removes a memory allocation and remove the bdev field in the ordered extent. Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: never return true for reads in btrfs_use_zone_appendChristoph Hellwig
Using Zone Append only makes sense for writes to the device, so check that in btrfs_use_zone_append. This avoids the possibility of artificially limited read size on zoned file systems. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: pass a btrfs_bio to btrfs_use_appendChristoph Hellwig
struct btrfs_bio has all the information needed for btrfs_use_append, so pass that instead of a btrfs_inode and file_offset. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: split zone append bios in btrfs_submit_bioChristoph Hellwig
The current btrfs zoned device support is a little cumbersome in the data I/O path as it requires the callers to not issue I/O larger than the supported ZONE_APPEND size of the underlying device. This leads to a lot of extra accounting. Instead change btrfs_submit_bio so that it can take write bios of arbitrary size and form from the upper layers, and just split them internally to the ZONE_APPEND queue limits. Then remove all the upper layer warts catering to limited write sized on zoned devices, including the extra refcount in the compressed_bio. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: calculate file system wide queue limit for zoned modeChristoph Hellwig
To be able to split a write into properly sized zone append commands, we need a queue_limits structure that contains the least common denominator suitable for all devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-15btrfs: handle recording of zoned writes in the storage layerChristoph Hellwig
Move the code that splits the ordered extents and records the physical location for them to the storage layer so that the higher level consumers don't have to care about physical block numbers at all. This will also allow to eventually remove accounting for the zone append write sizes in the upper layer with a little bit more block layer work. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: zoned: fix uninitialized variable warning in btrfs_get_dev_zonesNaohiro Aota
Fix an uninitialized warning we get with -Wmaybe-uninitialized where it thought zno may have been uninitialized, in both cases it depends on zinfo->zone_cache but we know the value won't change between checks. Reported-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/linux-btrfs/af6c527cbd8bdc782e50bd33996ee83acc3a16fb.1671221596.git.josef@toxicpanda.com/ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13btrfs: fix uninitialized variable warning in btrfs_sb_log_locationJosef Bacik
We only have 3 possible mirrors, and we have ASSERT()'s to make sure we're not passing in an invalid super mirror into this function, so technically this value isn't uninitialized. However -Wmaybe-uninitialized will complain, so set it to U64_MAX so if we don't have ASSERT()'s turned on it'll error out later on when it see's the zone is beyond our maximum zones. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-01-11btrfs: zoned: enable metadata over-commit for non-ZNS setupNaohiro Aota
The commit 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned") disabled the metadata over-commit to track active zones properly. However, it also introduced a heavy overhead by allocating new metadata block groups and/or flushing dirty buffers to release the space reservations. Specifically, a workload (write only without any sync operations) worsen its performance from 343.77 MB/sec (v5.19) to 182.89 MB/sec (v6.0). The performance is still bad on current misc-next which is 187.95 MB/sec. And, with this patch applied, it improves back to 326.70 MB/sec (+73.82%). This patch introduces a new fs_info->flag BTRFS_FS_NO_OVERCOMMIT to indicate it needs to disable the metadata over-commit. The flag is enabled when a device with max active zones limit is loaded into a file-system. Fixes: 79417d040f4f ("btrfs: zoned: disable metadata overcommit for zoned") CC: stable@vger.kernel.org # 6.0+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: convert btrfs_block_group::seq_zone to runtime flagDavid Sterba
In zoned mode the sequential status of zone can be also tracked in the runtime flags of block group. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: zoned: use helper to check a power of two zone sizeDavid Sterba
We have a 64bit compatible helper to check if a value is a power of two, use it instead of open coding it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update function commentsDavid Sterba
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move accessor helpers into accessors.hJosef Bacik
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move fs wide helpers out of ctree.hJosef Bacik
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-23btrfs: use kvcalloc in btrfs_get_dev_zone_infoChristoph Hellwig
Otherwise the kernel memory allocator seems to be unhappy about failing order 6 allocations for the zones array, that cause 100% reproducible mount failures in my qemu setup: [26.078981] mount: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null) [26.079741] CPU: 0 PID: 2965 Comm: mount Not tainted 6.1.0-rc5+ #185 [26.080181] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [26.080950] Call Trace: [26.081132] <TASK> [26.081291] dump_stack_lvl+0x56/0x6f [26.081554] warn_alloc+0x117/0x140 [26.081808] ? __alloc_pages_direct_compact+0x1b5/0x300 [26.082174] __alloc_pages_slowpath.constprop.0+0xd0e/0xde0 [26.082569] __alloc_pages+0x32a/0x340 [26.082836] __kmalloc_large_node+0x4d/0xa0 [26.083133] ? trace_kmalloc+0x29/0xd0 [26.083399] kmalloc_large+0x14/0x60 [26.083654] btrfs_get_dev_zone_info+0x1b9/0xc00 [26.083980] ? _raw_spin_unlock_irqrestore+0x28/0x50 [26.084328] btrfs_get_dev_zone_info_all_devices+0x54/0x80 [26.084708] open_ctree+0xed4/0x1654 [26.084974] btrfs_mount_root.cold+0x12/0xde [26.085288] ? lock_is_held_type+0xe2/0x140 [26.085603] legacy_get_tree+0x28/0x50 [26.085876] vfs_get_tree+0x1d/0xb0 [26.086139] vfs_kern_mount.part.0+0x6c/0xb0 [26.086456] btrfs_mount+0x118/0x3a0 [26.086728] ? lock_is_held_type+0xe2/0x140 [26.087043] legacy_get_tree+0x28/0x50 [26.087323] vfs_get_tree+0x1d/0xb0 [26.087587] path_mount+0x2ba/0xbe0 [26.087850] ? _raw_spin_unlock_irqrestore+0x38/0x50 [26.088217] __x64_sys_mount+0xfe/0x140 [26.088506] do_syscall_64+0x35/0x80 [26.088776] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 5b316468983d ("btrfs: get zone information of zoned block devices") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-21btrfs: zoned: fix missing endianness conversion in sb_write_pointerChristoph Hellwig
generation is an on-disk __le64 value, so use btrfs_super_generation to convert it to host endian before comparing it. Fixes: 12659251ca5d ("btrfs: implement log-structured superblock for ZONED mode") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-11-07btrfs: zoned: clone zoned device info when cloning a deviceJohannes Thumshirn
When cloning a btrfs_device, we're not cloning the associated btrfs_zoned_device_info structure of the device in case of a zoned filesystem. Later on this leads to a NULL pointer dereference when accessing the device's zone_info for instance when setting a zone as active. This was uncovered by fstests' testcase btrfs/161. CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: zoned: refactor device checks in btrfs_check_zoned_modeChristoph Hellwig
btrfs_check_zoned_mode is really hard to follow, mostly due to the fact that a lot of the checks use duplicate conditions after support for zone emulation for conventional devices on file systems with the ZONED flag was added. Fix this by factoring out the check for host managed devices for !ZONED file systems into a separate helper and then simplifying the rest of the code. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: get rid of block group caching progress logicOmar Sandoval
struct btrfs_caching_ctl::progress and struct btrfs_block_group::last_byte_to_unpin were previously needed to ensure that unpin_extent_range() didn't return a range to the free space cache before the caching thread had a chance to cache that range. However, the commit "btrfs: fix space cache corruption and potential double allocations" made it so that we always synchronously cache the block group at the time that we pin the extent, so this machinery is no longer necessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: convert block group bit field to use bit helpersJosef Bacik
We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a properly flags setup so we can utilize the bit helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-13btrfs: zoned: wait for extent buffer IOs before finishing a zoneNaohiro Aota
Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that ongoing IOs already finished. Or, we will see a "Zone Is Full" error for the IOs, as the ZONE_FINISH command makes the zone full. We ensure that with btrfs_wait_block_group_reservations() and btrfs_wait_ordered_roots() for a data block group. And, for a metadata block group, the comparison of alloc_offset vs meta_write_pointer mostly ensures IOs for the allocated region already sent. However, there still can be a little time frame where the IOs are sent but not yet completed. Introduce wait_eb_writebacks() to ensure such IOs are completed for a metadata block group. It walks the buffer_radix to find extent buffers in the block group and calls wait_on_extent_buffer_writeback() on them. Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") CC: stable@vger.kernel.org # 5.19+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-05btrfs: zoned: fix mounting with conventional zonesJohannes Thumshirn
Since commit 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes"), we're only counting the bytes of a block group on an active zone as usable for metadata writes. But on a SMR drive, we don't have active zones and short circuit some of the logic. This leads to an error on mount, because we cannot reserve space for metadata writes. Fix this by also setting the BLOCK_GROUP_FLAG_ZONE_IS_ACTIVE bit in the block-group's runtime flag if the zone is a conventional zone. Fixes: 6a921de58992 ("btrfs: zoned: introduce space_info->active_total_bytes") Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-05btrfs: zoned: set pseudo max append zone limit in zone emulation modeShin'ichiro Kawasaki
The commit 7d7672bc5d10 ("btrfs: convert count_max_extents() to use fs_info->max_extent_size") introduced a division by fs_info->max_extent_size. This max_extent_size is initialized with max zone append limit size of the device btrfs runs on. However, in zone emulation mode, the device is not zoned then its zone append limit is zero. This resulted in zero value of fs_info->max_extent_size and caused zero division error. Fix the error by setting non-zero pseudo value to max append zone limit in zone emulation mode. Set the pseudo value based on max_segments as suggested in the commit c2ae7b772ef4 ("btrfs: zoned: revive max_zone_append_bytes"). Fixes: 7d7672bc5d10 ("btrfs: convert count_max_extents() to use fs_info->max_extent_size") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-05btrfs: zoned: fix API misuse of zone finish waitingNaohiro Aota
The commit 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress") implemented a zone finish waiting mechanism to the write path of zoned mode. However, using wait_var_event()/wake_up_all() on fs_info->zone_finish_wait is wrong and wait_var_event() just hangs because no one ever wakes it up once it goes into sleep. Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit() on fs_info->flags with a proper barrier installed. Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress") CC: stable@vger.kernel.org # 5.16+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: wait until zone is finished when allocation didn't progressNaohiro Aota
When the allocated position doesn't progress, we cannot submit IOs to finish a block group, but there should be ongoing IOs that will finish a block group. So, in that case, we wait for a zone to be finished and retry the allocation after that. Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to indicate we need a zone finish to have proceeded. The flag is set when the allocator detected it cannot activate a new block group. And, it is cleared once a zone is finished. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: activate metadata block group on flush_spaceNaohiro Aota
For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE} means we don't have enough space left in the active_total_bytes. Before allocating a new chunk, we can try to activate an existing block group in this case. Also, allocating a chunk is not enough to grant a ticket for metadata space on zoned filesystem we need to activate the block group to increase the active_total_bytes. btrfs_zoned_activate_one_bg() implements the activation feature. It will activate a block group by (maybe) finishing a block group. It will give up activating a block group if it cannot finish any block group. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: introduce space_info->active_total_bytesNaohiro Aota
The active_total_bytes, like the total_bytes, accounts for the total bytes of active block groups in the space_info. With an introduction of active_total_bytes, we can check if the reserved bytes can be written to the block groups without activating a new block group. The check is necessary for metadata allocation on zoned filesystem. We cannot finish a block group, which may require waiting for the current transaction, from the metadata allocation context. Instead, we need to ensure the ongoing allocation (reserved bytes) fits in active block groups. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: finish least available block group on data bg allocationNaohiro Aota
When we run out of active zones and no sufficient space is left in any block groups, we need to finish one block group to make room to activate a new block group. However, we cannot do this for metadata block groups because we can cause a deadlock by waiting for a running transaction commit. So, do that only for a data block group. Furthermore, the block group to be finished has two requirements. First, the block group must not have reserved bytes left. Having reserved bytes means we have an allocated region but did not yet send bios for it. If that region is allocated by the thread calling btrfs_zone_finish(), it results in a deadlock. Second, the block group to be finished must not be a SYSTEM block group. Finishing a SYSTEM block group easily breaks further chunk allocation by nullifying the SYSTEM free space. In a certain case, we cannot find any zone finish candidate or btrfs_zone_finish() may fail. In that case, we fall back to split the allocation bytes and fill the last spaces left in the block groups. CC: stable@vger.kernel.org # 5.16+ Fixes: afba2bc036b0 ("btrfs: zoned: implement active zone tracking") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_sizeNaohiro Aota
On zoned filesystem, data write out is limited by max_zone_append_size, and a large ordered extent is split according the size of a bio. OTOH, the number of extents to be written is calculated using BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the metadata bytes to update and/or create the metadata items. The metadata reservation is done at e.g, btrfs_buffered_write() and then released according to the estimation changes. Thus, if the number of extent increases massively, the reserved metadata can run out. The increase of the number of extents easily occurs on zoned filesystem if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the following warning on a small RAM environment with disabling metadata over-commit (in the following patch). [75721.498492] ------------[ cut here ]------------ [75721.505624] BTRFS: block rsv 1 returned -28 [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs] [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109 [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021 [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs] [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286 [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000 [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7 [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28 [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a [75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000 [75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0 [75721.730499] Call Trace: [75721.735166] <TASK> [75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs] [75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs] [75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs] [75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs] [75721.769520] ? push_leaf_left+0x420/0x620 [btrfs] [75721.776431] ? memcpy+0x4e/0x60 [75721.781931] split_leaf+0x433/0x12d0 [btrfs] [75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs] [75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs] [75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs] [75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs] [75721.818300] ? lock_downgrade+0x7c0/0x7c0 [75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs] [75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs] [75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs] [75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs] [75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs] [75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs] [75721.869313] ? rcu_read_lock_sched_held+0x16/0x80 [75721.876085] ? lock_release+0x552/0xf80 [75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs] [75721.888886] ? __kasan_check_write+0x14/0x20 [75721.895152] ? do_raw_read_unlock+0x44/0x80 [75721.901323] ? _raw_write_lock_irq+0x60/0x80 [75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs] [75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs] [75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs] [75721.929166] ? _raw_write_unlock+0x23/0x40 [75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs] [75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs] [75721.949906] ? try_to_wake_up+0x30/0x14a0 [75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs] [75721.962661] ? rcu_read_lock_sched_held+0x16/0x80 [75721.969111] ? lock_acquire+0x41b/0x4c0 [75721.974982] finish_ordered_fn+0x15/0x20 [btrfs] [75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs] [75721.988184] ? _raw_spin_unlock_irq+0x28/0x50 [75721.994643] process_one_work+0x815/0x1460 [75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250 [75722.006643] ? do_raw_spin_trylock+0xbb/0x190 [75722.013086] worker_thread+0x59a/0xeb0 [75722.018511] kthread+0x2ac/0x360 [75722.023428] ? process_one_work+0x1460/0x1460 [75722.029431] ? kthread_complete_and_exit+0x30/0x30 [75722.036044] ret_from_fork+0x22/0x30 [75722.041255] </TASK> [75722.045047] irq event stamp: 0 [75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0 [75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0 [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0 [75722.085335] ---[ end trace 0000000000000000 ]--- To fix the estimation, we need to introduce fs_info->max_extent_size to replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for regular vs zoned filesystem. Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned filesystem, it is set to fs_info->max_zone_append_size. CC: stable@vger.kernel.org # 5.12+ Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: revive max_zone_append_bytesNaohiro Aota
This patch is basically a revert of commit 5a80d1c6a270 ("btrfs: zoned: remove max_zone_append_size logic"), but without unnecessary ASSERT and check. The max_zone_append_size will be used as a hint to estimate the number of extents to cover delalloc/writeback region in the later commits. The size of a ZONE APPEND bio is also limited by queue_max_segments(), so this commit considers it to calculate max_zone_append_size. Technically, a bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE" as an upper limit of an extent size to calculate the number of extents needed to write data. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-25btrfs: zoned: fix comment description for sb_write_pointer logicPankaj Raghav
Fix the comment to represent the actual logic used for sb_write_pointer - Empty[0] && In use[1] should be an invalid state instead of returning zone 0 wp - Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp - In use[0] && Empty[1] should be returning zone 0 wp instead of being an invalid state - In use[0] && Full[1] should be returning zone 0 wp instead of returning zone 1 wp - Full[0] && Empty[1] should be returning zone 1 wp instead of returning zone 0 wp - Full[0] && In use[1] should be returning zone 1 wp instead of returning zone 0 wp Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>