summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-05-13blk-throttle: Introduce flag "BIO_TG_BPS_THROTTLED"Zizhi Wo
Subsequent patches will split the single queue into separate bps and iops queues. To prevent IO that has already passed through the bps queue at a single tg level from being counted toward bps wait time again, we introduce "BIO_TG_BPS_THROTTLED" flag. Since throttle and QoS operate at different levels, we reuse the value as "BIO_QOS_THROTTLED". We set this flag when charge bps and clear it when charge iops, as the bio will move to the upper-level tg or be dispatched. This patch does not involve functional changes. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com> Link: https://lore.kernel.org/r/20250506020935.655574-5-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13blk-throttle: Split throtl_charge_bio() into bps and iops functionsZizhi Wo
Split throtl_charge_bio() to facilitate subsequent patches that will separately charge bps and iops after queue separation. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com> Link: https://lore.kernel.org/r/20250506020935.655574-4-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13blk-throttle: Refactor tg_dispatch_time by extracting tg_dispatch_bps/iops_timeZizhi Wo
tg_dispatch_time() contained both bps and iops throttling logic. We now split its internal logic into tg_dispatch_bps/iops_time() to improve code consistency for future separation of the bps and iops queues. Besides, merge time_before() from caller into throtl_extend_slice() to make code cleaner. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com> Link: https://lore.kernel.org/r/20250506020935.655574-3-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13blk-throttle: Rename tg_may_dispatch() to tg_dispatch_time()Zizhi Wo
tg_may_dispatch() can directly indicate whether bio can be dispatched by returning the time to wait, without the need for the redundant "wait" parameter. Remove it and modify the function's return type accordingly. Since we have determined by the return time whether bio can be dispatched, rename tg_may_dispatch() to tg_dispatch_time(). Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com> Link: https://lore.kernel.org/r/20250506020935.655574-2-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13Merge tag 'md-6.16-20250513' of ↵Jens Axboe
https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.16/block Pull MD changes from Yu Kuai: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters. * tag 'md-6.16-20250513' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md: clean up accounting for issued sync IO md: fix is_mddev_idle() md: add a new api sync_io_depth md: record dm-raid gendisk in mddev block: export API to get the number of bdev inflight IO block: clean up blk_mq_in_flight_rw() block: WARN if bdev inflight counter is negative block: reuse part_in_flight_rw for part_in_flight blk-mq: remove blk_mq_in_flight()
2025-05-12block: unfreeze queue if realloc tag set fails during nr_hw_queues updateNilay Shroff
In __blk_mq_update_nr_hw_queues(), the current sequence involves: 1. unregistering sysfs/debugfs attributes 2. freeze the queue 3. reallocating the tag set 4. updating the queue map 5. reallocating hardware contexts 6. updating the elevator (which unfreeze the queue again) 7. re-register sysfs/debugfs attributes If tag set reallocation fails at step 3, the function skips steps 4–6 and proceeds directly to step 7, re-registering the sysfs/debugfs attributes without unfreezing the queue first. This is incorrect and can lead to a system hang or lockdep splat, as the queue remains frozen and is never properly unfrozen. This patch addresses the issue by explicitly unfreezing the queue before re-registering the sysfs/debugfs attributes in the event of a tag set reallocation failure. Fixes: 9dc7a882ce96 ("block: move hctx debugfs/sysfs registering out of freezing queue") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250512092952.135887-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-10md: clean up accounting for issued sync IOYu Kuai
It's no longer used and can be removed, also remove the field 'gendisk->sync_io'. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-10-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10md: fix is_mddev_idle()Yu Kuai
If sync_speed is above speed_min, then is_mddev_idle() will be called for each sync IO to check if the array is idle, and inflight sync_io will be limited if the array is not idle. However, while mkfs.ext4 for a large raid5 array while recovery is in progress, it's found that sync_speed is already above speed_min while lots of stripes are used for sync IO, causing long delay for mkfs.ext4. Root cause is the following checking from is_mddev_idle(): t1: submit sync IO: events1 = completed IO - issued sync IO t2: submit next sync IO: events2 = completed IO - issued sync IO if (events2 - events1 > 64) For consequence, the more sync IO issued, the less likely checking will pass. And when completed normal IO is more than issued sync IO, the condition will finally pass and is_mddev_idle() will return false, however, last_events will be updated hence is_mddev_idle() can only return false once in a while. Fix this problem by changing the checking as following: 1) mddev doesn't have normal IO completed; 2) mddev doesn't have normal IO inflight; 3) if any member disks is partition, and all other partitions doesn't have IO completed. Also change rdev->last_events to unsigned long to cleanup type casting. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10md: add a new api sync_io_depthYu Kuai
Currently if sync speed is above speed_min and below speed_max, md_do_sync() will wait for all sync IOs to be done before issuing new sync IO, means sync IO depth is limited to just 1. This limit is too low, in order to prevent sync speed drop conspicuously after fixing is_mddev_idle() in the next patch, add a new api for limiting sync IO depth, the default value is 32. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10md: record dm-raid gendisk in mddevYu Kuai
Following patch will use gendisk to check if there are normal IO completed or inflight, to fix a problem in mdraid that foreground IO can be starved by background sync IO in later patches. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-7-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10block: export API to get the number of bdev inflight IOYu Kuai
- rename part_in_{flight, flight_rw} to bdev_count_{inflight, inflight_rw} - export bdev_count_inflight, to fix a problem in mdraid that foreground IO can be starved by background sync IO in later patches Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-10block: clean up blk_mq_in_flight_rw()Yu Kuai
Also add comment for part_inflight_show() for the difference between bio-based and rq-based device. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-10block: WARN if bdev inflight counter is negativeYu Kuai
Which means there is a bug for related bio-based disk driver, or blk-mq for rq-based disk, it's better not to hide the bug. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-3-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-05-10block: reuse part_in_flight_rw for part_in_flightYu Kuai
They are almost identical, to make code cleaner. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-2-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-10blk-mq: remove blk_mq_in_flight()Yu Kuai
After commit 7be835694dae ("block: fix that util can be greater than 100%"), it's not used and can be removed. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-08block: move removing elevator after deleting disk->queue_kobjMing Lei
When blk_unregister_queue() is called from add_disk() failure path, there is race in registering/unregistering elevator queue kobject from the two code paths, because commit 559dc11143eb ("block: move elv_register[unregister]_queue out of elevator_lock") moves elevator queue register/unregister out of elevator lock. Fix the race by removing elevator after deleting disk->queue_kobj, because kobject_del(&disk->queue_kobj) drains in-progress sysfs show()/store() of all attributes. Fixes: 559dc11143eb ("block: move elv_register[unregister]_queue out of elevator_lock") Reported-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250508085807.3175112-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-08block: don't quiesce queue for calling elevator_set_none()Ming Lei
blk_mq_freeze_queue() can't be called on quiesced queue, otherwise it may never return if there is any queued requests. Fix it by removing quiesce queue around elevator_set_none() because elevator_switch() does quiesce queue in case that we need to switch to none really. Fixes: 1e44bedbc921 ("block: unifying elevator change") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250508085807.3175112-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07fs: aio: initialize .ki_write_stream of read-write requestMing Lei
AIO needs to initialize .ki_write_stream explicitly for read/write request, otherwise random .ki_write_stream is used, and cause -EINVAL returned for aio write randomly. Cc: Christoph Hellwig <hch@lst.de> Cc: Keith Busch <kbusch@kernel.org> Cc: Kanchan Joshi <joshi.k@samsung.com> Fixes: c27683da6406 ("block: expose write streams for block device nodes") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250507133328.3040255-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07hfsplus: use bdev_rw_virt in hfsplus_submit_bioChristoph Hellwig
Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Yangtao Li <frank.li@vivo.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07btrfs: use bdev_rw_virt in scrub_one_superChristoph Hellwig
Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07xfs: simplify building the bio in xlog_write_iclogChristoph Hellwig
Use the bio_add_virt_nofail and bio_add_vmalloc helpers to abstract away the details of the memory allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07xfs: simplify xfs_rw_bdevChristoph Hellwig
Delegate to bdev_rw_virt when operating on non-vmalloc memory and use bio_add_vmalloc_chunk to insulate xfs from the details of adding vmalloc memory to a bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07xfs: simplify xfs_buf_submit_bioChristoph Hellwig
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it and use bio_add_vmalloc to insulate xfs from the details of adding vmalloc memory to a bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07dm-integrity: use bio_add_virt_nofailChristoph Hellwig
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it, and do the same for the similar pattern using bio_add_page for adding the first segment after a bio allocation as that can't fail either. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07dm-bufio: use bio_add_virt_nofailChristoph Hellwig
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07PM: hibernate: split and simplify hib_submit_ioChristoph Hellwig
Split hib_submit_io into a sync and async version. The sync version is a small wrapper around bdev_rw_virt which implements all the logic to add a kernel direct mapping range to a bio and synchronously submits it, while the async version is slightly simplified using the bio_add_virt_nofail for adding the single range. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07zonefs: use bdev_rw_virt in zonefs_read_superChristoph Hellwig
Switch zonefs_read_super to allocate the superblock buffer using kmalloc which falls back to the page allocator for PAGE_SIZE allocation but gives us a kernel virtual address and then use bdev_rw_virt to perform the synchronous read into it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07gfs2: use bdev_rw_virt in gfs2_read_superChristoph Hellwig
Switch gfs2_read_super to allocate the superblock buffer using kmalloc which falls back to the page allocator for PAGE_SIZE allocation but gives us a kernel virtual address and then use bdev_rw_virt to perform the synchronous read into it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07rnbd-srv: use bio_add_virt_nofailChristoph Hellwig
Use the bio_add_virt_nofail to add a single kernel virtual address to a bio as that can't fail. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07bcache: use bio_add_virt_nofailChristoph Hellwig
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: simplify bio_map_kernChristoph Hellwig
Rewrite bio_map_kern using the new bio_add_* helpers and drop the kerneldoc comment that is superfluous for an internal helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: pass the operation to bio_{map,copy}_kernChristoph Hellwig
That way the bio can be allocated with the right operation already set and there is no need to pass the separated 'reading' argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: remove the q argument from blk_rq_map_kernChristoph Hellwig
Remove the q argument from blk_rq_map_kern and the internal helpers called by it as the queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: add a bio_add_vmalloc helpersChristoph Hellwig
Add a helper to add a vmalloc region to a bio, abstracting away the vmalloc addresses from the underlying pages and another one wrapping it for the simple case where all data fits into a single bio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: add a bio_add_max_vecs helperChristoph Hellwig
Add a helper to check how many bio_vecs are needed to add a kernel virtual address range to a bio, accounting for the always contiguous direct mapping and vmalloc mappings that usually need a bio_vec per page sized chunk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: add a bdev_rw_virt helperChristoph Hellwig
Add a helper to perform synchronous I/O on a kernel direct map range. Currently this is implemented in various places in usually not very efficient ways, so provide a generic helper instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: add a bio_add_virt_nofail helperChristoph Hellwig
Add a helper to add a directly mapped kernel virtual address to a bio so that callers don't have to convert to pages or folios. For now only the _nofail variant is provided as that is what all the obvious callers want. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07block: fix warning on 'make htmldocs'Ming Lei
Fix the following warning when running 'make htmldocs': +WARNING: include/linux/blk-mq.h:532 struct member 'update_nr_hwq_lock' not described in 'blk_mq_tag_set' Fixes: 98e68f67020c ("block: prevent adding/deleting disk during updating nr_hw_queues") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/all/20250507163220.00141d77@canb.auug.org.au/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250507092537.3009112-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: fix incorrect sizeofKanchan Joshi
The plid array, head->plids, is meant to store placement IDs, each of type u16. But its size has been incorrectly calculated, as the size of the pointer is being used instead of the size of the object it points to. Use the sizeof(*head->plids) in kcalloc so that we don't allocate extra. Fixes: 38e8397dde63 ("nvme: use fdp streams if write stream is provided") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: fix write_stream_granularity initializationCaleb Sander Mateos
write_stream_granularity is set to max(info->runs, U32_MAX), which means that any RUNS value less than 2 ** 32 becomes U32_MAX, and any larger value is silently truncated to an unsigned int. Use min() instead to provide the correct semantics, capping RUNS values at U32_MAX. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer") Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250506175413.1936110-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: use fdp streams if write stream is providedKeith Busch
Maps a user requested write stream to an FDP placement ID if possible. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-12-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: register fdp parameters with the block layerKeith Busch
Register the device data placement limits if supported. This is just registering the limits with the block layer. Nothing beyond reporting these attributes is happening in this patch. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-11-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: add FDP definitionsChristoph Hellwig
Add the config feature result, config log page, and management receive commands needed for FDP. Partially based on a patch from Kanchan Joshi <joshi.k@samsung.com>. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-10-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: pass a void pointer to nvme_get/set_features for the resultChristoph Hellwig
That allows passing in structures instead of the u32 result, and thus reduce the amount of bit shifting and masking required to parse the result. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-9-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: add a nvme_get_log_lsi helperChristoph Hellwig
For log pages that need to pass in a LSI value, while at the same time not touching all the existing nvme_get_log callers. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-8-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06io_uring: enable per-io write streamsKeith Busch
Allow userspace to pass a per-I/O write stream in the SQE: __u8 write_stream; The __u8 type matches the size the filesystems and block layer support. Application can query the supported values from the block devices max_write_streams sysfs attribute. Unsupported values are ignored by file operations that do not support write streams or rejected with an error by those that support them. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-7-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: expose write streams for block device nodesChristoph Hellwig
Use the per-kiocb write stream if provided, or map temperature hints to write streams (which is a bit questionable, but this shows how it is done). Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> [kbusch: removed statx reporting] Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-6-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: introduce a write_stream_granularity queue limitChristoph Hellwig
Export the granularity that write streams should be discarded with, as it is essential for making good use of them. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-5-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: introduce max_write_streams queue limitKeith Busch
Drivers with hardware that support write streams need a way to export how many are available so applications can generically query this. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> [hch: renamed hints to streams, removed stacking] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-4-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: add a bi_write_stream fieldChristoph Hellwig
Add the ability to pass a write stream for placement control in the bio. The new field fits in an existing hole, so does not change the size of the struct. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-3-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>