2025-05-14 block: Remove obsolete configs BLK_MQ_{PCI,VIRTIO} (Lukas Bulwahn)

Commit 9bc1e897a821 ("blk-mq: remove unused queue mapping helpers") makes the two config options, BLK_MQ_PCI and BLK_MQ_VIRTIO, have no remaining effect. Remove the two obsolete config options.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250514065513.463941-1-lukas.bulwahn@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 block: remove the same_page output argument to bvec_try_merge_page (Christoph Hellwig)

bvec_try_merge_page currently returns whether the added page fragment is within the same page as the last fragment in the current last bio_vec. This information is used by __bio_iov_iter_get_pages so that we always have a single folio pin per page, even when the page is split over multiple __bio_iov_iter_get_pages calls.

Threading this through the entire low-level add-page-to-bio logic is annoying and inefficient, and leads to less code sharing than otherwise possible. Instead, add code to __bio_iov_iter_get_pages that checks whether the bio_vecs did not change, in which case a merge into the last segment must have happened, and whether there is an offset into the page for the currently added fragment, because if so we must already have had a previous fragment of the same page in the last bio_vec. While this is still a bit ugly, it keeps the logic in the one place that needs it and allows for more code sharing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250512042354.514329-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
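A sketch of the new check described above; the surrounding variable names and the unpin call are illustrative, not the literal kernel code:

    unsigned short nr_vecs = bio->bi_vcnt;  /* vector count before adding */

    __bio_add_page(bio, page, len, offset);

    /*
     * An unchanged bi_vcnt means the fragment was merged into the
     * last bio_vec. If it also starts at a non-zero offset into the
     * page, an earlier fragment of the same page already holds the
     * folio pin, so drop the pin taken for this fragment.
     */
    if (bio->bi_vcnt == nr_vecs && offset)
        unpin_user_page(page);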
2025-05-13 blk-throttle: Prevent bps-restricted IO from entering the bps queue again (Zizhi Wo)

[BUG]
There is an issue of delayed IO dispatch caused by IO splitting. Consider the following scenario:

1) If we set a BPS limit of 1MB/s and restrict the maximum IO size per dispatch to 4KB, submitting two 1MB IO requests results in completion times of 1s and 2s, which is expected.

2) However, if we additionally set an IOPS limit of 1,000,000/s with the same BPS limit of 1MB/s, submitting two 1MB IO requests again results in both completing in 2s, even though the IOPS constraint is being met.

[CAUSE]
This issue arises because BPS and IOPS currently share the same queue in the blkthrotl mechanism:

1) The issue does not occur when only BPS is limited, because the split IOs return false in blk_should_throtl() and do not go through throttling again.

2) For split IOs, even if they have been tagged with BIO_BPS_THROTTLED, they still get queued alternately in the same list due to continuous splitting and reordering. As a result, the two IO requests both complete at the 2-second mark, causing an unintended delay.

3) It is not difficult to see that in this scenario, if N 1MB IOs are issued at once, all IOs will eventually complete together in N seconds.

[FIX]
With the queue separation introduced in the previous patches, we now have separate BPS and IOPS queues. IOs that have already passed the BPS limit do not need to re-enter the BPS queue and can be placed directly into the IOPS queue.

Since the queues are split, when the IOPS queue was previously empty and a new bio is added to the first qnode->bios_iops list in the service_queue, the disptime also needs to be updated. This patch introduces the THROTL_TG_IOPS_WAS_EMPTY flag to mark that case.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-8-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
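A sketch of the fixed dispatch decision; assume tg, sq and qn are the throttle group, service queue and qnode in scope, with helper usage illustrative rather than literal:

    if (bio_flagged(bio, BIO_BPS_THROTTLED)) {
        /* Already charged against bps: go straight to the iops queue. */
        bool was_empty = !sq->nr_queued_iops;

        throtl_qnode_add_bio(bio, qn, &qn->bios_iops);
        /*
         * disptime was computed while the iops queue was still
         * empty, so flag the group to have it updated.
         */
        if (was_empty)
            tg->flags |= THROTL_TG_IOPS_WAS_EMPTY;
    }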
2025-05-13 blk-throttle: Split the service queue (Zizhi Wo)

This patch splits throtl_service_queue->nr_queued into "nr_queued_bps" and "nr_queued_iops", allowing separate accounting of BPS and IOPS queued bios. This prepares for future changes that need to check whether the BPS or IOPS queue is empty.

To facilitate updating the number of IOs in the BPS and IOPS queues, the addition logic is moved from throtl_add_bio_tg() to throtl_qnode_add_bio(), and similarly, the removal logic is moved from tg_dispatch_one_bio() to throtl_pop_queued(). Also introduce sq_queued() to calculate the total of sq->nr_queued.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-7-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
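A sketch of the new helper, per the description (the real counters may additionally be split per IO direction; that detail is elided here):

    /* Total number of bios queued in a service queue, bps + iops. */
    static unsigned int sq_queued(struct throtl_service_queue *sq)
    {
        return sq->nr_queued_bps + sq->nr_queued_iops;
    }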
2025-05-13 blk-throttle: Split the blkthrotl queue (Zizhi Wo)

This patch splits the single queue into separate bps and iops queues. Now, an IO request must first pass through the bps queue, then the iops queue, and finally be dispatched. Due to the queue splitting, the throtl add/peek/pop functions need to be modified.

Additionally, the patch modifies the logic related to tg_dispatch_time(). If a bio needs to wait for bps, the function directly returns the bps wait time; otherwise, it charges bps and returns the iops wait time so that the bio can be placed directly into the iops queue afterward. Note that this may lead to more frequent updates to disptime, but the overhead is negligible for the slow path.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-6-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
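A sketch of the reworked flow, using the helper names introduced by the preceding patches in this series (shown further down this log); exact signatures may differ:

    static unsigned long tg_dispatch_time(struct throtl_grp *tg, struct bio *bio)
    {
        unsigned long wait = tg_dispatch_bps_time(tg, bio);

        if (wait)
            return wait;    /* bio keeps waiting in the bps queue */

        /*
         * bps budget is available: charge it now so the bio never
         * re-enters the bps queue, then report the iops wait time.
         */
        throtl_charge_bps_bio(tg, bio);
        return tg_dispatch_iops_time(tg, bio);
    }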
2025-05-13 blk-throttle: Introduce flag "BIO_TG_BPS_THROTTLED" (Zizhi Wo)

Subsequent patches will split the single queue into separate bps and iops queues. To prevent IO that has already passed through the bps queue at a single tg level from being counted toward bps wait time again, introduce the "BIO_TG_BPS_THROTTLED" flag. Since throttle and QoS operate at different levels, its value reuses that of "BIO_QOS_THROTTLED".

The flag is set when charging bps and cleared when charging iops, as the bio will move to the upper-level tg or be dispatched. This patch does not involve functional changes.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-5-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 blk-throttle: Split throtl_charge_bio() into bps and iops functions (Zizhi Wo)

Split throtl_charge_bio() to facilitate subsequent patches that will separately charge bps and iops after the queue separation.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-4-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
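A sketch of the split, based on what throtl_charge_bio() historically did (bytes counted for bps, IOs counted for iops); names are assumptions consistent with the series:

    static void throtl_charge_bps_bio(struct throtl_grp *tg, struct bio *bio)
    {
        tg->bytes_disp[bio_data_dir(bio)] += throtl_bio_data_size(bio);
    }

    static void throtl_charge_iops_bio(struct throtl_grp *tg, struct bio *bio)
    {
        tg->io_disp[bio_data_dir(bio)]++;
    }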
2025-05-13 blk-throttle: Refactor tg_dispatch_time by extracting tg_dispatch_bps/iops_time (Zizhi Wo)

tg_dispatch_time() contained both bps and iops throttling logic. Split its internal logic into tg_dispatch_bps_time() and tg_dispatch_iops_time() to improve code consistency for the future separation of the bps and iops queues. Besides, merge the time_before() check from the caller into throtl_extend_slice() to make the code cleaner.

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-3-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-13 blk-throttle: Rename tg_may_dispatch() to tg_dispatch_time() (Zizhi Wo)

tg_may_dispatch() can directly indicate whether a bio can be dispatched by returning the time to wait, without the need for the redundant "wait" output parameter. Remove it and modify the function's return type accordingly. Since the return value now tells us whether the bio can be dispatched, rename tg_may_dispatch() to tg_dispatch_time().

Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-2-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
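The signature change, sketched from the description:

    /* Before: boolean verdict plus a separate output parameter. */
    static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
                                unsigned long *wait);

    /* After: return the wait time directly; 0 means dispatch now. */
    static unsigned long tg_dispatch_time(struct throtl_grp *tg, struct bio *bio);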
2025-05-13 Merge tag 'md-6.16-20250513' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.16/block (Jens Axboe)

Pull MD changes from Yu Kuai:

- Fix that normal IO can be starved by sync IO, found by running mkfs on a newly created large raid5, with some cleanup patches for the bdev inflight counters.

* tag 'md-6.16-20250513' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md: clean up accounting for issued sync IO
  md: fix is_mddev_idle()
  md: add a new api sync_io_depth
  md: record dm-raid gendisk in mddev
  block: export API to get the number of bdev inflight IO
  block: clean up blk_mq_in_flight_rw()
  block: WARN if bdev inflight counter is negative
  block: reuse part_in_flight_rw for part_in_flight
  blk-mq: remove blk_mq_in_flight()
2025-05-12 block: unfreeze queue if realloc tag set fails during nr_hw_queues update (Nilay Shroff)

In __blk_mq_update_nr_hw_queues(), the current sequence involves:

1. unregistering sysfs/debugfs attributes
2. freezing the queue
3. reallocating the tag set
4. updating the queue map
5. reallocating hardware contexts
6. updating the elevator (which unfreezes the queue again)
7. re-registering sysfs/debugfs attributes

If tag set reallocation fails at step 3, the function skips steps 4-6 and proceeds directly to step 7, re-registering the sysfs/debugfs attributes without unfreezing the queue first. This is incorrect and can lead to a system hang or lockdep splat, as the queue remains frozen and is never properly unfrozen.

This patch addresses the issue by explicitly unfreezing the queue before re-registering the sysfs/debugfs attributes in the event of a tag set reallocation failure.

Fixes: 9dc7a882ce96 ("block: move hctx debugfs/sysfs registering out of freezing queue")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250512092952.135887-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
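A sketch of the corrected error path; the label and unfreeze call are illustrative of the flow described above, not the literal kernel code:

    /* step 3: reallocating the tag set */
    if (blk_mq_realloc_tag_set_tags(set, nr_hw_queues) < 0) {
        /*
         * Steps 4-6 are skipped, so nothing will unfreeze the
         * queues: unfreeze here before re-registering attributes.
         */
        list_for_each_entry(q, &set->tag_list, tag_set_list)
            blk_mq_unfreeze_queue(q);
        goto reregister;    /* step 7 */
    }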
2025-05-10 md: clean up accounting for issued sync IO (Yu Kuai)

It's no longer used and can be removed; also remove the field 'gendisk->sync_io'.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 md: fix is_mddev_idle() (Yu Kuai)

If sync_speed is above speed_min, then is_mddev_idle() will be called for each sync IO to check if the array is idle, and inflight sync IO will be limited if the array is not idle. However, when running mkfs.ext4 on a large raid5 array while recovery is in progress, it's found that sync_speed is already above speed_min while lots of stripes are used for sync IO, causing a long delay for mkfs.ext4.

The root cause is the following check in is_mddev_idle():

  t1: submit sync IO:      events1 = completed IO - issued sync IO
  t2: submit next sync IO: events2 = completed IO - issued sync IO
  if (events2 - events1 > 64)

As a consequence, the more sync IO issued, the less likely the check will pass. And when completed normal IO exceeds issued sync IO, the condition will finally pass and is_mddev_idle() will return false; however, last_events is updated at that point, hence is_mddev_idle() can only return false once in a while.

Fix this problem by changing the check to the following:

1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disk is a partition, all other partitions don't have IO completed.

Also change rdev->last_events to unsigned long to clean up type casting.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
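A sketch of the reworked check following the conditions above; condition 3 (per-partition accounting) is omitted, the completed-IO helper is illustrative, and bdev_count_inflight() is the API exported later in this series:

    static bool is_mddev_idle(struct mddev *mddev)
    {
        struct gendisk *disk = mddev->gendisk;
        unsigned long events = disk_completed_ios(disk);  /* illustrative */
        bool idle = true;

        /* 1) normal IO completed since the last check? */
        if (events != mddev->last_events)
            idle = false;
        mddev->last_events = events;

        /* 2) normal IO still inflight? */
        if (bdev_count_inflight(disk->part0))
            idle = false;

        return idle;
    }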
2025-05-10 md: add a new api sync_io_depth (Yu Kuai)

Currently, if the sync speed is above speed_min and below speed_max, md_do_sync() will wait for all sync IOs to be done before issuing new sync IO, meaning the sync IO depth is limited to just 1. This limit is too low; in order to prevent the sync speed from dropping conspicuously after fixing is_mddev_idle() in the next patch, add a new api for limiting sync IO depth. The default value is 32.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
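A sketch of the effect in md_do_sync(); the sync_io_depth() accessor is an illustrative assumption:

    /* Old: wait until all issued sync IO completes, i.e. depth 1. */
    wait_event(mddev->recovery_wait,
               !atomic_read(&mddev->recovery_active));

    /* New: allow up to sync_io_depth (default 32) sync IOs inflight. */
    wait_event(mddev->recovery_wait,
               atomic_read(&mddev->recovery_active) < sync_io_depth(mddev));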
2025-05-10 md: record dm-raid gendisk in mddev (Yu Kuai)

The following patches will use the gendisk to check whether there is normal IO completed or inflight, to fix a problem in mdraid where foreground IO can be starved by background sync IO.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 block: export API to get the number of bdev inflight IO (Yu Kuai)

- rename part_in_{flight,flight_rw} to bdev_count_{inflight,inflight_rw}
- export bdev_count_inflight, to fix a problem in mdraid where foreground IO can be starved by background sync IO in later patches

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-10 block: clean up blk_mq_in_flight_rw() (Yu Kuai)

Also add a comment to part_inflight_show() about the difference between bio-based and rq-based devices.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-05-10 block: WARN if bdev inflight counter is negative (Yu Kuai)

A negative inflight counter means there is a bug in the related bio-based disk driver, or in blk-mq for rq-based disks; it's better not to hide the bug.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-05-10 block: reuse part_in_flight_rw for part_in_flight (Yu Kuai)

The two functions are almost identical; reuse one for the other to make the code cleaner.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-10 blk-mq: remove blk_mq_in_flight() (Yu Kuai)

After commit 7be835694dae ("block: fix that util can be greater than 100%"), it's not used and can be removed.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-08 block: move removing elevator after deleting disk->queue_kobj (Ming Lei)

When blk_unregister_queue() is called from the add_disk() failure path, there is a race in registering/unregistering the elevator queue kobject from the two code paths, because commit 559dc11143eb ("block: move elv_register[unregister]_queue out of elevator_lock") moves elevator queue register/unregister out of the elevator lock.

Fix the race by removing the elevator after deleting disk->queue_kobj, because kobject_del(&disk->queue_kobj) drains in-progress sysfs show()/store() of all attributes.

Fixes: 559dc11143eb ("block: move elv_register[unregister]_queue out of elevator_lock")
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250508085807.3175112-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-08 block: don't quiesce queue for calling elevator_set_none() (Ming Lei)

blk_mq_freeze_queue() can't be called on a quiesced queue, otherwise it may never return if there are any queued requests. Fix it by removing the quiesce around elevator_set_none(), because elevator_switch() already quiesces the queue when we really need to switch to none.

Fixes: 1e44bedbc921 ("block: unifying elevator change")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250508085807.3175112-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 fs: aio: initialize .ki_write_stream of read-write request (Ming Lei)

AIO needs to initialize .ki_write_stream explicitly for read/write requests, otherwise a random .ki_write_stream value is used, causing -EINVAL to be returned randomly for aio writes.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Kanchan Joshi <joshi.k@samsung.com>
Fixes: c27683da6406 ("block: expose write streams for block device nodes")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250507133328.3040255-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 hfsplus: use bdev_rw_virt in hfsplus_submit_bio (Christoph Hellwig)

Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 btrfs: use bdev_rw_virt in scrub_one_super (Christoph Hellwig)

Replace the code building a bio from a kernel direct map address and submitting it synchronously with the bdev_rw_virt helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 xfs: simplify building the bio in xlog_write_iclog (Christoph Hellwig)

Use the bio_add_virt_nofail and bio_add_vmalloc helpers to abstract away the details of the memory allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 xfs: simplify xfs_rw_bdev (Christoph Hellwig)

Delegate to bdev_rw_virt when operating on non-vmalloc memory, and use bio_add_vmalloc_chunk to insulate xfs from the details of adding vmalloc memory to a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 xfs: simplify xfs_buf_submit_bio (Christoph Hellwig)

Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it, and use bio_add_vmalloc to insulate xfs from the details of adding vmalloc memory to a bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 dm-integrity: use bio_add_virt_nofail (Christoph Hellwig)

Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it, and do the same for the similar pattern using bio_add_page for adding the first segment after a bio allocation, as that can't fail either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 dm-bufio: use bio_add_virt_nofail (Christoph Hellwig)

Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 PM: hibernate: split and simplify hib_submit_io (Christoph Hellwig)

Split hib_submit_io into a sync and an async version. The sync version is a small wrapper around bdev_rw_virt, which implements all the logic to add a kernel direct mapping range to a bio and synchronously submit it, while the async version is slightly simplified by using bio_add_virt_nofail for adding the single range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 zonefs: use bdev_rw_virt in zonefs_read_super (Christoph Hellwig)

Switch zonefs_read_super to allocate the superblock buffer using kmalloc, which falls back to the page allocator for a PAGE_SIZE allocation but gives us a kernel virtual address, and then use bdev_rw_virt to perform the synchronous read into it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 gfs2: use bdev_rw_virt in gfs2_read_super (Christoph Hellwig)

Switch gfs2_read_super to allocate the superblock buffer using kmalloc, which falls back to the page allocator for a PAGE_SIZE allocation but gives us a kernel virtual address, and then use bdev_rw_virt to perform the synchronous read into it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250507120451.4000627-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 rnbd-srv: use bio_add_virt_nofail (Christoph Hellwig)

Use the bio_add_virt_nofail helper to add a single kernel virtual address to a bio, as that can't fail.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 bcache: use bio_add_virt_nofail (Christoph Hellwig)

Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 block: simplify bio_map_kern (Christoph Hellwig)

Rewrite bio_map_kern using the new bio_add_* helpers and drop the kerneldoc comment that is superfluous for an internal helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 block: pass the operation to bio_{map,copy}_kern (Christoph Hellwig)

That way the bio can be allocated with the right operation already set, and there is no need to pass the separate 'reading' argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 block: remove the q argument from blk_rq_map_kern (Christoph Hellwig)

Remove the q argument from blk_rq_map_kern and the internal helpers called by it, as the queue can trivially be derived from the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 block: add bio_add_vmalloc helpers (Christoph Hellwig)

Add a helper to add a vmalloc region to a bio, abstracting away the vmalloc addresses from the underlying pages, and another one wrapping it for the simple case where all data fits into a single bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 block: add a bio_add_max_vecs helper (Christoph Hellwig)

Add a helper to check how many bio_vecs are needed to add a kernel virtual address range to a bio, accounting for the always contiguous direct mapping and for vmalloc mappings, which usually need a bio_vec per page-sized chunk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
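A sketch of what the helper computes, per the description (one segment covers any direct-map range; vmalloc needs one per page-sized chunk):

    unsigned int bio_add_max_vecs(void *kaddr, unsigned int len)
    {
        if (is_vmalloc_addr(kaddr))
            return DIV_ROUND_UP(offset_in_page(kaddr) + len, PAGE_SIZE);
        return 1;    /* direct mapping is physically contiguous */
    }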
2025-05-07 block: add a bdev_rw_virt helper (Christoph Hellwig)

Add a helper to perform synchronous I/O on a kernel direct map range. Currently this is implemented in various places, usually in not very efficient ways, so provide a generic helper instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
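A hedged usage sketch, assuming the helper takes the bdev, start sector, a direct-mapped buffer, its length, and the request op:

    void *buf = kmalloc(SECTOR_SIZE, GFP_KERNEL);
    int error;

    if (!buf)
        return -ENOMEM;
    /* Synchronously read one sector into the direct-mapped buffer. */
    error = bdev_rw_virt(bdev, 0, buf, SECTOR_SIZE, REQ_OP_READ);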
2025-05-07 block: add a bio_add_virt_nofail helper (Christoph Hellwig)

Add a helper to add a directly mapped kernel virtual address to a bio, so that callers don't have to convert to pages or folios. For now only the _nofail variant is provided, as that is what all the obvious callers want.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
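The conversion pattern applied by the follow-on patches, sketched (the helper is assumed to mirror __bio_add_page minus the page/offset bookkeeping):

    /* Before: callers converted the kernel address themselves. */
    __bio_add_page(bio, virt_to_page(data), len, offset_in_page(data));

    /* After: pass the directly mapped virtual address as-is. */
    bio_add_virt_nofail(bio, data, len);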
2025-05-07 block: fix warning on 'make htmldocs' (Ming Lei)

Fix the following warning when running 'make htmldocs':

  WARNING: include/linux/blk-mq.h:532 struct member 'update_nr_hwq_lock' not described in 'blk_mq_tag_set'

Fixes: 98e68f67020c ("block: prevent adding/deleting disk during updating nr_hw_queues")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/all/20250507163220.00141d77@canb.auug.org.au/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250507092537.3009112-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 nvme: fix incorrect sizeof (Kanchan Joshi)

The plid array, head->plids, is meant to store placement IDs, each of type u16. But its size has been incorrectly calculated, as the size of the pointer is being used instead of the size of the object it points to.

Use sizeof(*head->plids) in kcalloc so that we don't allocate extra.

Fixes: 38e8397dde63 ("nvme: use fdp streams if write stream is provided")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
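The bug and fix, sketched (the element count nr is illustrative):

    /* Buggy: reserves nr * sizeof(u16 *) bytes, more than needed. */
    head->plids = kcalloc(nr, sizeof(head->plids), GFP_KERNEL);

    /* Fixed: reserves nr * sizeof(u16), the intended size. */
    head->plids = kcalloc(nr, sizeof(*head->plids), GFP_KERNEL);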
2025-05-06 nvme: fix write_stream_granularity initialization (Caleb Sander Mateos)

write_stream_granularity is set to max(info->runs, U32_MAX), which means that any RUNS value less than 2^32 becomes U32_MAX, and any larger value is silently truncated to an unsigned int. Use min() instead to provide the correct semantics, capping RUNS values at U32_MAX.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250506175413.1936110-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
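The one-operator bug, sketched; the lvalue is illustrative, the rest follows the commit text:

    /* Buggy: any RUNS value below 2^32 is inflated to U32_MAX. */
    lim->write_stream_granularity = max(info->runs, U32_MAX);

    /* Fixed: cap at U32_MAX instead. */
    lim->write_stream_granularity = min(info->runs, U32_MAX);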
2025-05-06 nvme: use fdp streams if write stream is provided (Keith Busch)

Map a user-requested write stream to an FDP placement ID if possible.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-12-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 nvme: register fdp parameters with the block layer (Keith Busch)

Register the device's data placement limits if supported. This is just registering the limits with the block layer; nothing beyond reporting these attributes is happening in this patch.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-11-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 nvme: add FDP definitions (Christoph Hellwig)

Add the config feature result, config log page, and management receive commands needed for FDP. Partially based on a patch from Kanchan Joshi <joshi.k@samsung.com>.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-10-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 nvme: pass a void pointer to nvme_get/set_features for the result (Christoph Hellwig)

That allows passing in structures instead of the u32 result, and thus reduces the amount of bit shifting and masking required to parse the result.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-9-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 nvme: add a nvme_get_log_lsi helper (Christoph Hellwig)

Add a helper for log pages that need to pass in an LSI value, while at the same time not touching all the existing nvme_get_log callers.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-8-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>