2025-09-20ublk: don't access ublk_queue in ublk_register_io_buf()Caleb Sander Mateos
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_register_io_buf() is a frequent cache miss. Get the flags from the ublk_device instead, which is accessed earlier in ublk_ch_uring_cmd_local(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: pass ublk_device to ublk_register_io_buf()Caleb Sander Mateos
Avoid repeating the 2 dereferences to get the ublk_device from the io_uring_cmd by passing it from ublk_ch_uring_cmd_local() to ublk_register_io_buf(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't dereference ublk_queue in ublk_check_and_get_req()Caleb Sander Mateos
For ublk servers with many ublk queues, accessing the ublk_queue in ublk_ch_{read,write}_iter() is a frequent cache miss. Get the flags and queue depth from the ublk_device instead, which is accessed just before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't dereference ublk_queue in ublk_ch_uring_cmd_local()Caleb Sander Mateos
For ublk servers with many ublk queues, accessing the ublk_queue to handle a ublk command is a frequent cache miss. Get the queue depth from the ublk_device instead, which is accessed just before. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: add helpers to check ublk_device flagsCaleb Sander Mateos
Introduce ublk_device analogues of the ublk_queue flag helpers:
- ublk_support_zero_copy()    -> ublk_dev_support_zero_copy()
- ublk_support_auto_buf_reg() -> ublk_dev_support_auto_buf_reg()
- ublk_support_user_copy()    -> ublk_dev_support_user_copy()
- ublk_need_map_io()          -> ublk_dev_need_map_io()
- ublk_need_req_ref()         -> ublk_dev_need_req_ref()
- ublk_need_get_data()        -> ublk_dev_need_get_data()
These will be used in subsequent changes to avoid accessing the ublk_queue just for the flags, and instead use the ublk_device. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
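A minimal standalone sketch of the pattern described above: the per-queue helper reads the (possibly cold) ublk_queue, while the device-level analogue reads the ublk_device the command path has already touched. The struct layouts are simplified stand-ins rather than the real driver definitions; only the UBLK_F_USER_COPY bit value is taken from the UAPI header.

#include <stdbool.h>
#include <stdio.h>

#define UBLK_F_USER_COPY	(1ULL << 7)	/* bit value from include/uapi/linux/ublk_cmd.h */

struct ublk_queue  { unsigned long long flags; };	/* simplified: per-queue copy of the flags */
struct ublk_device { unsigned long long flags; };	/* simplified: device-wide flags */

/* per-queue helper: touches the ublk_queue, a likely cache miss */
static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
{
	return ubq->flags & UBLK_F_USER_COPY;
}

/* device-level analogue: reads the ublk_device that is already hot */
static inline bool ublk_dev_support_user_copy(const struct ublk_device *ub)
{
	return ub->flags & UBLK_F_USER_COPY;
}

int main(void)
{
	struct ublk_device ub = { .flags = UBLK_F_USER_COPY };

	printf("user copy supported: %d\n", ublk_dev_support_user_copy(&ub));
	return 0;
}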
2025-09-20ublk: don't pass ublk_queue to __ublk_fail_req()Caleb Sander Mateos
__ublk_fail_req() only uses the ublk_queue to get the ublk_device, which its caller already has. So just pass the ublk_device directly. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-20ublk: don't pass q_id to ublk_queue_cmd_buf_size()Caleb Sander Mateos
ublk_queue_cmd_buf_size() only needs the queue depth, which is the same for all queues. Get the queue depth from the ublk_device instead so the q_id parameter can be dropped. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
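A small userspace sketch of the idea: the command buffer size depends only on the queue depth, which is the same for every queue, so no q_id is needed. PAGE_SIZE and the descriptor size are illustrative constants, not taken from the driver.

#include <stdio.h>

#define PAGE_SIZE	4096u
#define IO_DESC_SIZE	24u	/* stand-in for sizeof(struct ublksrv_io_desc) */

static unsigned int round_up(unsigned int x, unsigned int align)
{
	return (x + align - 1) & ~(align - 1);	/* align must be a power of two */
}

/* depends only on the device-wide queue depth, so no q_id parameter */
static unsigned int ublk_queue_cmd_buf_size(unsigned int queue_depth)
{
	return round_up(queue_depth * IO_DESC_SIZE, PAGE_SIZE);
}

int main(void)
{
	printf("%u\n", ublk_queue_cmd_buf_size(128));	/* 3072 bytes rounded up to one 4k page */
	return 0;
}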
2025-09-20ublk: remove ubq check in ublk_check_and_get_req()Caleb Sander Mateos
ublk_get_queue() never returns a NULL pointer, so there's no need to check its return value in ublk_check_and_get_req(). Drop the check. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19selftests: ublk: add test to verify that feat_map is completeUday Shankar
Add a test that verifies that the currently running kernel does not report support for any features that are unrecognized by kublk. This should catch cases where features are added without updating kublk's feat_map accordingly, which has happened multiple times in the past (see [1], [2]). Note that this new test may fail if the test suite is older than the kernel, and the newer kernel contains a newly introduced feature. I believe this is not a use case we currently care about - we only care about newer test suites passing on older kernels. [1] https://lore.kernel.org/linux-block/20250606214011.2576398-1-csander@purestorage.com/t/#u [2] https://lore.kernel.org/linux-block/2a370ab1-d85b-409d-b762-f9f3f6bdf705@nvidia.com/t/#m1c520a058448d594fd877f07804e69b28908533f Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
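A rough userspace sketch of the check the test performs, assuming the kernel reports its features as a 64-bit mask and that feat_map currently covers bits 0x1 through 0x2000; the real test lives in the ublk selftests and works against the actual feat_map table.

#include <stdint.h>
#include <stdio.h>

/* OR of every feature bit currently present in feat_map (0x1 .. 0x2000) */
static const uint64_t known_features = 0x3fff;

int main(void)
{
	uint64_t kernel_features = 0x7fff;	/* mask returned by the ublk GET_FEATURES control command */
	uint64_t unknown = kernel_features & ~known_features;

	if (unknown) {
		/* the kernel advertises bits kublk doesn't know: feat_map is stale */
		fprintf(stderr, "unknown feature bits: 0x%llx\n",
			(unsigned long long)unknown);
		return 1;
	}
	return 0;
}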
2025-09-19selftests: ublk: kublk: add UBLK_F_BUF_REG_OFF_DAEMON to feat_mapUday Shankar
When UBLK_F_BUF_REG_OFF_DAEMON was added, we missed updating kublk's feat_map, which results in the feature being reported as "unknown." Add UBLK_F_BUF_REG_OFF_DAEMON to feat_map to fix this. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-19selftests: ublk: kublk: simplify feat_map definitionUday Shankar
Simplify the definition of feat_map by introducing a helper macro FEAT_NAME to avoid having to type the feature name twice. As a side effect, this changes the names in the feature list to be the full macro name instead of the abbreviated names that were used before, but this is a good change for clarity. Using the full feature macro names ruins the alignment of the output, so change the output format to put each feature's hex value before its name, as this is easier to align nicely. The output now looks as follows:

root# ./kublk features
ublk_drv features: 0x7fff
	0x1    : UBLK_F_SUPPORT_ZERO_COPY
	0x2    : UBLK_F_URING_CMD_COMP_IN_TASK
	0x4    : UBLK_F_NEED_GET_DATA
	0x8    : UBLK_F_USER_RECOVERY
	0x10   : UBLK_F_USER_RECOVERY_REISSUE
	0x20   : UBLK_F_UNPRIVILEGED_DEV
	0x40   : UBLK_F_CMD_IOCTL_ENCODE
	0x80   : UBLK_F_USER_COPY
	0x100  : UBLK_F_ZONED
	0x200  : UBLK_F_USER_RECOVERY_FAIL_IO
	0x400  : UBLK_F_UPDATE_SIZE
	0x800  : UBLK_F_AUTO_BUF_REG
	0x1000 : UBLK_F_QUIESCE
	0x2000 : UBLK_F_PER_IO_DAEMON
	0x4000 : unknown

Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
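A minimal sketch of the FEAT_NAME trick: the macro stringizes the feature name so it only has to be typed once per entry. The struct and field names here are illustrative; the real feat_map lives in the kublk selftest sources.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define UBLK_F_SUPPORT_ZERO_COPY	(1ULL << 0)
#define UBLK_F_USER_COPY		(1ULL << 7)

struct feat { uint64_t bit; const char *name; };

/* one mention of the macro name yields both the value and its string */
#define FEAT_NAME(f)	{ f, #f }

static const struct feat feat_map[] = {
	FEAT_NAME(UBLK_F_SUPPORT_ZERO_COPY),
	FEAT_NAME(UBLK_F_USER_COPY),
};

int main(void)
{
	for (size_t i = 0; i < sizeof(feat_map) / sizeof(feat_map[0]); i++)
		printf("\t0x%-6llx: %s\n",
		       (unsigned long long)feat_map[i].bit, feat_map[i].name);
	return 0;
}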
2025-09-17blk-throttle: fix throtl_data leak during disk releaseYu Kuai
Tightening the throttle activation check in blk_throtl_activated() to require both q->td presence and the policy bit being set introduced a memory leak during disk release: blkg_destroy_all() clears the policy bit first during queue deactivation, causing the subsequent blk_throtl_exit() to skip throtl_data cleanup when blk_throtl_activated() fails the policy check. Ideally we should avoid modifying the blk_throtl_exit() activation check, because it's intuitive that blk-throtl starts in blk_throtl_init() and ends in blk_throtl_exit(). However, calling blk_throtl_exit() before blkg_destroy_all() would make a long-standing deadlock problem easier to trigger[1]. Hence fix this problem by checking whether q->td is NULL in blk_throtl_exit(), and remove the policy deactivation as well since it's useless. [1] https://lore.kernel.org/all/CAHj4cs9p9H5yx+ywsb3CMUdbqGPhM+8tuBvhW=9ADiCjAqza9w@mail.gmail.com/#t Fixes: bd9fd5be6bc0 ("blk-throttle: fix access race during throttle policy activation") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs-p-ZwBEKigBj7T6hQCOo-H68-kVwCrV6ZvRovrr9Z+HA@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-17blk-mq: Fix the blk_mq_tagset_busy_iter() documentationBart Van Assche
Commit 2dd6532e9591 ("blk-mq: Drop 'reserved' arg of busy_tag_iter_fn") removed the 'reserved' argument from tag iteration callback functions. Bring the blk_mq_tagset_busy_iter() documentation in sync with that change. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: John Garry <john.g.garry@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-16block: relax atomic write boundary vs chunk size checkJohn Garry
blk_validate_atomic_write_limits() ensures that any boundary fits into and is aligned to any chunk size. However, it should also be possible to fit the chunk size into any boundary. That check is already made in blk_stack_atomic_writes_boundary_head(). Relax the check in blk_validate_atomic_write_limits() by reusing (and renaming) blk_stack_atomic_writes_boundary_head(). Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
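A hedged sketch of the relaxed compatibility rule: it is enough that either the boundary divides the chunk size or the chunk size divides the boundary. The real kernel check also deals with unset limits and power-of-two constraints, so treat this as an approximation.

#include <stdbool.h>
#include <stdio.h>

/* approximation: boundary fits in chunk, or chunk fits in boundary */
static bool atomic_boundary_chunk_compatible(unsigned int boundary_sectors,
					     unsigned int chunk_sectors)
{
	if (!boundary_sectors || !chunk_sectors)
		return true;				/* nothing to validate */
	return chunk_sectors % boundary_sectors == 0 ||
	       boundary_sectors % chunk_sectors == 0;
}

int main(void)
{
	printf("%d\n", atomic_boundary_chunk_compatible(16, 128));	/* boundary fits in chunk */
	printf("%d\n", atomic_boundary_chunk_compatible(128, 16));	/* chunk fits in boundary: now allowed */
	printf("%d\n", atomic_boundary_chunk_compatible(24, 128));	/* neither divides the other */
	return 0;
}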
2025-09-16block: fix stacking of atomic writes when atomics are not supportedJohn Garry
Atomic writes support may not always be possible when stacking devices that support atomic writes. Such a case is a different atomic write boundary between stacked devices (which is not supported). In the case that atomic writes cannot be supported, the top device queue HW limits are set to 0. However, in blk_stack_atomic_writes_limits(), we detect that we are stacking the first bottom device by checking that the top device's atomic_write_hw_max value == 0. This gets confused with the case of atomic writes not being supported, above. Make the distinction between stacking the first bottom device and no atomic write support by initializing the stacked device's atomic_write_hw_max to UINT_MAX and checking for that value when stacking the first bottom device. Fixes: d7f36dc446e8 ("block: Support atomic writes limits for stacked devices") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
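A standalone sketch of the sentinel idea: UINT_MAX means "no bottom device stacked yet", while 0 means "atomic writes not supported". The struct is a simplified stand-in for queue_limits, and the mismatch handling is reduced to the bare minimum.

#include <limits.h>
#include <stdio.h>

struct limits { unsigned int atomic_write_hw_max; };

static void stack_atomic_limits(struct limits *top, const struct limits *bottom)
{
	if (top->atomic_write_hw_max == UINT_MAX) {	/* first bottom device */
		top->atomic_write_hw_max = bottom->atomic_write_hw_max;
		return;
	}
	if (!top->atomic_write_hw_max)			/* already marked unsupported */
		return;
	if (top->atomic_write_hw_max != bottom->atomic_write_hw_max)
		top->atomic_write_hw_max = 0;		/* mismatch: turn atomics off */
}

int main(void)
{
	struct limits top = { .atomic_write_hw_max = UINT_MAX };
	struct limits b1 = { 65536 }, b2 = { 4096 };

	stack_atomic_limits(&top, &b1);
	stack_atomic_limits(&top, &b2);
	printf("stacked atomic_write_hw_max = %u\n", top.atomic_write_hw_max);
	return 0;
}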
2025-09-16block: update validation of atomic writes boundary for stacked devicesJohn Garry
In commit 63d092d1c1b1 ("block: use chunk_sectors when evaluating stacked atomic write limits"), a chunk_sectors limit check was missed in blk_stack_atomic_writes_boundary_head(), so update that function to do the proper check. Fixes: 63d092d1c1b1 ("block: use chunk_sectors when evaluating stacked atomic write limits") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-15block/mq-deadline: Remove redundant rb_entry_rq() in deadline_from_pos()chengkaitao
In commit fde02699c242, the "if (blk_rq_is_seq_zoned_write(rq))" check was removed, but the "rb_entry_rq(node)" call and some other code were inadvertently left behind. This patch fixes that. Signed-off-by: chengkaitao <chengkaitao@kylinos.cn> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Li Nan <linan122@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/md-llbitmap: Use DIV_ROUND_UP_SECTOR_TNathan Chancellor
When building for 32-bit platforms, there are several link (if builtin) or modpost (if a module) errors due to dividends of type 'sector_t' in DIV_ROUND_UP:

arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_resize':
drivers/md/md-llbitmap.c:1017:(.text+0xae8): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.c:1020:(.text+0xb10): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_end_discard':
drivers/md/md-llbitmap.c:1114:(.text+0xf14): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_start_discard':
drivers/md/md-llbitmap.c:1097:(.text+0x1808): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o: in function `llbitmap_read_sb':
drivers/md/md-llbitmap.c:867:(.text+0x2080): undefined reference to `__aeabi_uldivmod'
arm-linux-gnueabi-ld: drivers/md/md-llbitmap.o:drivers/md/md-llbitmap.c:895: more undefined references to `__aeabi_uldivmod' follow

Use DIV_ROUND_UP_SECTOR_T instead of DIV_ROUND_UP, which exists to handle this exact situation. Fixes: 5ab829f1971d ("md/md-llbitmap: introduce new lockless bitmap") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
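A userspace sketch of why the plain division is a problem: dividing a 64-bit sector_t with the `/` operator makes the compiler emit a libgcc call (__aeabi_uldivmod on 32-bit ARM) that the kernel does not provide, hence the link errors above. DIV_ROUND_UP_SECTOR_T routes the division through the kernel's 64-bit division helpers instead; div_round_up_sector_t() below is a stand-in, not the real macro.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t sector_t;			/* sector_t is 64-bit */

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))	/* plain division: needs libgcc on arm32 */

static uint64_t div_round_up_sector_t(sector_t n, uint32_t d)
{
	return DIV_ROUND_UP(n, d);		/* the real macro uses the kernel's 64-bit division helpers here */
}

int main(void)
{
	sector_t sectors = (1ULL << 33) + 7;	/* wider than 32 bits, forces 64-bit division */

	printf("%llu\n", (unsigned long long)div_round_up_sector_t(sectors, 8));
	return 0;
}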
2025-09-10blk-mq: fix stale nr_requests documentationYu Kuai
The nr_requests documentation still describes the removed single-queue code; remove that text and update the documentation for current blk-mq. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: remove blk_mq_tag_update_depth()Yu Kuai
This helper is no longer used. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: fix potential deadlock while nr_requests grownYu Kuai
Allocating and freeing sched_tags while the queue is frozen can deadlock[1]. This is a long-standing problem; hence allocate the memory before freezing the queue and free it after the queue is unfrozen. [1] https://lore.kernel.org/all/0659ea8d-a463-47c8-9180-43c719e106eb@linux.ibm.com/ Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
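A sketch of the resulting allocate-outside-freeze pattern, with userspace stand-ins for the blk-mq helpers; freeze_queue(), unfreeze_queue() and alloc_sched_tags() here are placeholders, not the real kernel functions.

#include <stdlib.h>

struct sched_tags { unsigned int nr; };

static struct sched_tags *alloc_sched_tags(unsigned int nr)
{
	struct sched_tags *t = malloc(sizeof(*t));

	if (t)
		t->nr = nr;
	return t;
}

static void freeze_queue(void)   { /* placeholder for freezing the queue */ }
static void unfreeze_queue(void) { /* placeholder for unfreezing the queue */ }

static int grow_nr_requests(struct sched_tags **tags, unsigned int nr)
{
	/* 1) allocate while the queue is still live: may block or reclaim */
	struct sched_tags *new = alloc_sched_tags(nr);
	struct sched_tags *old = *tags;

	if (!new)
		return -1;

	/* 2) only the pointer switch happens under the freeze */
	freeze_queue();
	*tags = new;
	unfreeze_queue();

	/* 3) free the old tags after the queue is unfrozen */
	free(old);
	return 0;
}

int main(void)
{
	struct sched_tags *tags = alloc_sched_tags(128);

	grow_nr_requests(&tags, 256);
	free(tags);
	return 0;
}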
2025-09-10blk-mq-sched: add new parameter nr_requests in blk_mq_alloc_sched_tags()Yu Kuai
This helper only supports allocating the default number of requests; add a new parameter to support a specific number of requests. This prepares to fix a potential deadlock in the case where nr_requests grows. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: split bitmap grow and resize case in blk_mq_update_nr_requests()Yu Kuai
No functional changes are intended; this makes the code cleaner and prepares to fix the grow case in the following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: cleanup shared tags case in blk_mq_update_nr_requests()Yu Kuai
For shared tags case, all hctx->sched_tags/tags are the same, it doesn't make sense to call into blk_mq_tag_update_depth() multiple times for the same tags. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: convert to serialize updating nr_requests with update_nr_hwq_lockYu Kuai
request_queue->nr_requests can be changed by:
a) switching the elevator by updating nr_hw_queues;
b) switching the elevator via the elevator sysfs attribute;
c) configuring the nr_requests queue sysfs attribute.
The current lock order is:
1) update_nr_hwq_lock, cases a and b
2) freeze_queue
3) elevator_lock, cases a, b and c
Updating nr_requests is already serialized by elevator_lock; however, in case c we'll have to allocate new sched_tags if nr_requests grows, and doing this with elevator_lock held and the queue frozen risks deadlock. Hence use update_nr_hwq_lock instead, making it possible to allocate memory if the tags grow, while also preventing nr_requests from being changed concurrently. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: check invalid nr_requests in queue_requests_store()Yu Kuai
queue_requests_store() is the only caller of blk_mq_update_nr_requests(), and blk_mq_update_nr_requests() is the only caller of blk_mq_tag_update_depth(); however, they all check the nr_requests input by the user. Make the code cleaner by moving all the checks to the top-level function:
1) nr_requests > reserved tags;
2) if there is an elevator, 4 <= nr_requests <= 2048;
3) if the elevator is none, 4 <= nr_requests <= tag_set->queue_depth.
Meanwhile, case 2 is the only case where tags can grow and -ENOMEM might be returned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
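A hedged sketch of the consolidated checks listed above; the helper and parameter names and the 4/2048 bounds follow the changelog rather than the actual blk-mq code.

#include <stdbool.h>
#include <stdio.h>

static bool nr_requests_valid(unsigned long nr, unsigned int reserved_tags,
			      bool has_elevator, unsigned int set_queue_depth)
{
	if (nr <= reserved_tags)		/* check 1: must exceed reserved tags */
		return false;
	if (nr < 4)				/* checks 2 and 3: common lower bound */
		return false;
	if (has_elevator)
		return nr <= 2048;		/* check 2: tags may grow, -ENOMEM possible */
	return nr <= set_queue_depth;		/* check 3: capped by tag_set->queue_depth */
}

int main(void)
{
	printf("%d\n", nr_requests_valid(256, 1, true, 128));	/* 1: valid, sched tags grow */
	printf("%d\n", nr_requests_valid(256, 1, false, 128));	/* 0: exceeds tag set depth */
	return 0;
}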
2025-09-10blk-mq: remove useless checkings in blk_mq_update_nr_requests()Yu Kuai
1) queue_requests_store() is the only caller of blk_mq_update_nr_requests(), and there the queue is already frozen, so there is no need to check mq_freeze_depth;
2) q->tag_set must be set for a request-based device, and queue_is_mq() is already checked in blk_mq_queue_attr_visible(), so there is no need to check q->tag_set.
Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: remove useless checking in queue_requests_store()Yu Kuai
blk_mq_queue_attr_visible() already checked queue_is_mq(), no need to check this again in queue_requests_store(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10ublk: consolidate nr_io_ready and nr_queues_readyCaleb Sander Mateos
ublk_mark_io_ready() tracks whether all the ublk_device's I/Os have been fetched by incrementing ublk_queue's nr_io_ready count and incrementing ublk_device's nr_queues_ready count if the whole queue is ready. Simplify the logic by just tracking the total number of fetched I/Os on each ublk_device. When this count reaches nr_hw_queues * queue_depth, the ublk_device is ready to receive I/O. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
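A standalone sketch of the consolidated counter: one per-device count of fetched I/Os replaces the per-queue nr_io_ready plus per-device nr_queues_ready pair. Field names are illustrative, and the real driver serializes this differently (it does not necessarily use C11 atomics).

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct ublk_device {
	unsigned int nr_hw_queues;
	unsigned int queue_depth;
	atomic_uint nr_io_ready;	/* total fetched I/Os across all queues */
};

/* called once per successfully fetched I/O */
static bool ublk_mark_io_ready(struct ublk_device *ub)
{
	unsigned int ready = atomic_fetch_add(&ub->nr_io_ready, 1) + 1;

	return ready == ub->nr_hw_queues * ub->queue_depth;	/* whole device ready */
}

int main(void)
{
	struct ublk_device ub = { .nr_hw_queues = 2, .queue_depth = 2 };
	bool ready = false;

	for (int i = 0; i < 4; i++)
		ready = ublk_mark_io_ready(&ub);
	printf("device ready: %d\n", ready);
	return 0;
}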
2025-09-10md/raid0: convert raid0_make_request() to use bio_submit_split_bioset()Yu Kuai
Currently, raid0_make_request() remaps the original bio to the underlying disks to prevent reordered IO. Now that bio_submit_split_bioset() puts the original bio at the head of current->bio_list, it's safe to convert to this helper, and bios will still be ordered. CC: Jan Kara <jack@suse.cz> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10block: fix ordering of recursive split IOYu Kuai
Currently, a split bio will be chained to the original bio, and the original bio will be resubmitted to the tail of current->bio_list, waiting for the split bio to be issued. However, if the split bio gets split again, the IO order will be messed up. This problem, on the one hand, will cause performance degradation, especially for mdraid with large IO sizes; on the other hand, it will cause write errors for zoned block devices[1].

For example, in raid456 IO will first be split by max_sectors from md_submit_bio(), and then later be split again by chunksize for internal handling. Assume max_sectors is 1M and chunksize is 512k:

1) issue a 2M IO:
   bio issuing: 0+2M
   current->bio_list: NULL
2) md_submit_bio() splits by max_sectors:
   bio issuing: 0+1M
   current->bio_list: 1M+1M
3) chunk_aligned_read() splits by chunksize:
   bio issuing: 0+512k
   current->bio_list: 1M+1M -> 512k+512k
4) after the first bio is issued, __submit_bio_noacct() will continue issuing the next bio:
   bio issuing: 1M+1M
   current->bio_list: 512k+512k
   bio issued: 0+512k
5) chunk_aligned_read() splits by chunksize:
   bio issuing: 1M+512k
   current->bio_list: 512k+512k -> 1536k+512k
   bio issued: 0+512k
6) no split afterwards; finally the issue order is:
   0+512k -> 1M+512k -> 512k+512k -> 1536k+512k

This behaviour causes a large IO read on raid456 to end up as small discontinuous IO in the underlying disks. Fix this problem by placing the split bio at the head of current->bio_list.

Test script: test on an 8-disk raid5 with 64k chunksize
dd if=/dev/md0 of=/dev/null bs=4480k iflag=direct

Test results:

Before this patch:
1) iostat results:
Device  r/s       rMB/s    rrqm/s   %rrqm  r_await  rareq-sz  aqu-sz  %util
md0     52430.00  3276.87  0.00     0.00   0.62     64.00     32.60   80.10
sd*     4487.00   409.00   2054.00  31.40  0.82     93.34     3.68    71.20
2) blktrace G stage:
8,0  0  486445 11.357392936  843  G  R 14071424 + 128 [dd]
8,0  0  486451 11.357466360  843  G  R 14071168 + 128 [dd]
8,0  0  486454 11.357515868  843  G  R 14071296 + 128 [dd]
8,0  0  486468 11.357968099  843  G  R 14072192 + 128 [dd]
8,0  0  486474 11.358031320  843  G  R 14071936 + 128 [dd]
8,0  0  486480 11.358096298  843  G  R 14071552 + 128 [dd]
8,0  0  486490 11.358303858  843  G  R 14071808 + 128 [dd]
3) io seek for sdx:
Note: io seek is the result from the blktrace D stage, the statistic of
ABS((offset of next IO) - (offset + len of previous IO))
Read|Write seek
cnt 55175, zero cnt 25079
>=(KB) .. <(KB)  : count  ratio |distribution                            |
     0 .. 1      : 25079  45.5% |########################################|
     1 .. 2      : 0       0.0% |                                        |
     2 .. 4      : 0       0.0% |                                        |
     4 .. 8      : 0       0.0% |                                        |
     8 .. 16     : 0       0.0% |                                        |
    16 .. 32     : 0       0.0% |                                        |
    32 .. 64     : 12540  22.7% |#####################                   |
    64 .. 128    : 2508    4.5% |#####                                   |
   128 .. 256    : 0       0.0% |                                        |
   256 .. 512    : 10032  18.2% |#################                       |
   512 .. 1024   : 5016    9.1% |#########                               |

After this patch:
1) iostat results:
Device  r/s       rMB/s    rrqm/s   %rrqm  r_await  rareq-sz  aqu-sz  %util
md0     87965.00  5271.88  0.00     0.00   0.16     61.37     14.03   90.60
sd*     6020.00   658.44   5117.00  45.95  0.44     112.00    2.68    86.50
2) blktrace G stage:
8,0  0  206296 5.354894072  664  G  R 7156992 + 128 [dd]
8,0  0  206305 5.355018179  664  G  R 7157248 + 128 [dd]
8,0  0  206316 5.355204438  664  G  R 7157504 + 128 [dd]
8,0  0  206319 5.355241048  664  G  R 7157760 + 128 [dd]
8,0  0  206333 5.355500923  664  G  R 7158016 + 128 [dd]
8,0  0  206344 5.355837806  664  G  R 7158272 + 128 [dd]
8,0  0  206353 5.355960395  664  G  R 7158528 + 128 [dd]
8,0  0  206357 5.356020772  664  G  R 7158784 + 128 [dd]
3) io seek for sdx:
Read|Write seek
cnt 28644, zero cnt 21483
>=(KB) .. <(KB)  : count  ratio |distribution                            |
     0 .. 1      : 21483  75.0% |########################################|
     1 .. 2      : 0       0.0% |                                        |
     2 .. 4      : 0       0.0% |                                        |
     4 .. 8      : 0       0.0% |                                        |
     8 .. 16     : 0       0.0% |                                        |
    16 .. 32     : 0       0.0% |                                        |
    32 .. 64     : 7161   25.0% |##############                          |

BTW, this looks like a long-standing problem from day one, and large sequential reads are a pretty common case, e.g. video playing. Even with this patch, IO in this test case is merged to at most 128k because of the block layer plug limit BLK_PLUG_FLUSH_SIZE; increasing that limit can yield even better performance. However, we'll figure out how to do that properly later.

[1] https://lore.kernel.org/all/e40b076d-583d-406b-b223-005910a9f46f@acm.org/

Fixes: d89d87965dcb ("When stacked block devices are in-use (e.g. md or dm), the recursive calls") Reported-by: Tie Ren <tieren@fnnas.com> Closes: https://lore.kernel.org/all/7dro5o7u5t64d6bgiansesjavxcuvkq5p2pok7dtwkav7b7ape@3isfr44b6352/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
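A toy simulation of the ordering described above: a 2M read is split first by max_sectors (1M) and then by chunksize (512k). Queuing the split-off remainder at the tail of current->bio_list reproduces the interleaved order from the changelog, while queuing it at the head keeps the issue order sequential. This only models the bio_list as a simple queue; it is not the block-layer implementation.

#include <stdbool.h>
#include <stdio.h>

struct bio { unsigned int start_kb, len_kb; };

#define MAXQ 16
static struct bio list[MAXQ];	/* models current->bio_list */
static int nr;

static void add_tail(struct bio b)
{
	list[nr++] = b;
}

static void add_head(struct bio b)
{
	for (int i = nr; i > 0; i--)
		list[i] = list[i - 1];
	list[0] = b;
	nr++;
}

static struct bio pop_head(void)
{
	struct bio b = list[0];

	for (int i = 1; i < nr; i++)
		list[i - 1] = list[i];
	nr--;
	return b;
}

/* split @b at @at_kb; the remainder is requeued at the head or the tail */
static struct bio split(struct bio b, unsigned int at_kb, bool to_head)
{
	struct bio rest = { b.start_kb + at_kb, b.len_kb - at_kb };

	if (to_head)
		add_head(rest);
	else
		add_tail(rest);
	b.len_kb = at_kb;
	return b;
}

static void run(bool to_head)
{
	nr = 0;
	add_tail((struct bio){ 0, 2048 });		/* the original 2M read */
	while (nr) {
		struct bio b = pop_head();

		if (b.len_kb > 1024)			/* md_submit_bio(): max_sectors */
			b = split(b, 1024, to_head);
		if (b.len_kb > 512)			/* chunk_aligned_read(): chunksize */
			b = split(b, 512, to_head);
		printf(" %uk+%uk", b.start_kb, b.len_kb);
	}
	printf("\n");
}

int main(void)
{
	printf("tail:"); run(false);	/* 0k+512k 1024k+512k 512k+512k 1536k+512k */
	printf("head:"); run(true);	/* 0k+512k 512k+512k 1024k+512k 1536k+512k */
	return 0;
}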
2025-09-10block: skip unnecessary checks for split bioYu Kuai
Lots of checks are already done when this bio is submitted the first time, and there is no need to check them again when the bio is resubmitted after a split. Hence open code should_fail_bio() and blk_throtl_bio(), which are still necessary, in bio_submit_split_bioset(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-crypto: convert to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, prepare to fix ordering of split IO. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/md-linear: convert to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, prepare to fix reordered split IO. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid5: convert to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, prepare to fix ordering of split IO. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid10: convert read/write to use bio_submit_split_bioset()Yu Kuai
Unify bio split code and prepare to fix the ordering of split IO. The error path is modified a bit, but no functional changes are intended:
- bio_submit_split_bioset() can fail the original bio directly on a split error; set R10BIO_Uptodate in this case to notify raid_end_bio_io() that the original bio has already been returned.
- setting R10BIO_Uptodate and setting the error value to -EIO is useless now; for an r10_bio without R10BIO_Uptodate, -EIO will be returned for the original bio.
Discard is not handled, because discard is only split for an unaligned head and tail, which can be considered a slow path; the reordering there does not matter much. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid10: add a new r10bio flag R10BIO_ReturnedYu Kuai
The new helper bio_submit_split_bioset() can fail the original bio on split errors; prepare to handle this case in raid_end_bio_io(). The flag name follows the corresponding r1bio flag name. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid1: convert to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, and prepare to fix the ordering of split IO. Note that bio_submit_split_bioset() can fail the original bio directly on a split error; set R1BIO_Returned in this case to notify raid_end_bio_io() that the original bio has already been returned. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()Yu Kuai
Unify bio split code, and prepare to fix the ordering of split IO. Note that commit 319ff40a5427 ("md/raid0: Fix performance regression for large sequential writes") already fixed the ordering of split IO by remapping the bio to the underlying disks before resubmitting it, given that md_submit_bio() already splits it by sectors and raid0_make_request() will split at most once for unaligned IO. This is a bit hacky, and we'll convert it to the general solution later. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10block: factor out a helper bio_submit_split_bioset()Yu Kuai
No functional changes are intended. Some drivers like mdraid split bios as part of their internal processing; prepare to unify the bio split code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-crypto: fix missing blktrace bio split eventsYu Kuai
The trace_block_split() call is missing, so blktrace is unable to catch BIO split events, making it harder to analyze the BIO sequence. Cc: stable@vger.kernel.org Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10md: fix missing blktrace bio split eventsYu Kuai
If a bio is split by internal handling such as chunksize or badblocks, the corresponding trace_block_split() call is missing, so blktrace is unable to catch BIO split events, making it harder to analyze the BIO sequence. Cc: stable@vger.kernel.org Fixes: 4b1faf931650 ("block: Kill bio_pair_split()") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10blk-mq: add QUEUE_FLAG_BIO_ISSUE_TIMEYu Kuai
bio->issue_time_ns is initialized for every bio; however, it's only used by blk-iolatency. Add a new queue_flag and set it only when blk-iolatency is enabled, so that the extra blk_time_get_ns() call can be skipped for disks where blk-iolatency is not enabled. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
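A small sketch of the gating described above; the flag name QUEUE_FLAG_BIO_ISSUE_TIME follows the changelog, while the structs, the bit value and blk_time_get_ns() here are simplified userspace stand-ins.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define QUEUE_FLAG_BIO_ISSUE_TIME	(1u << 0)	/* illustrative bit value */

struct queue { unsigned int flags; };
struct bio   { uint64_t issue_time_ns; };

static uint64_t blk_time_get_ns(void)	/* stand-in for the kernel helper */
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void submit(struct queue *q, struct bio *bio)
{
	/* only pay for the clock read when blk-iolatency enabled the flag */
	if (q->flags & QUEUE_FLAG_BIO_ISSUE_TIME)
		bio->issue_time_ns = blk_time_get_ns();
}

int main(void)
{
	struct queue q = { .flags = QUEUE_FLAG_BIO_ISSUE_TIME };
	struct bio bio = { 0 };

	submit(&q, &bio);
	printf("issue time: %llu ns\n", (unsigned long long)bio.issue_time_ns);
	return 0;
}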
2025-09-10block: initialize bio issue time in blk_mq_submit_bio()Yu Kuai
bio->issue_time_ns is only used by blk-iolatency, which can only be enabled for rq-based disks, hence it's not necessary to initialize the time for bio-based disks. Meanwhile, if a bio is split by blk_crypto_fallback_split_bio_if_needed(), the issue time is not initialized for the new split bio; this is fixed as well. Note that the next patch will optimize this further so that the bio issue time is only recorded when blk-iolatency is actually enabled for the disk. Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-10block: cleanup bio_issueYu Kuai
Now that bio->bi_issue is only used by blk-iolatency to get the bio issue time, replace struct bio_issue with a u64 timestamp directly and remove bio_issue to make the code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09Merge tag 'md-6.18-20250909' of ↵Jens Axboe
gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.18/block

Pull MD changes from Yu Kuai:

"Redundant data is used to enhance data fault tolerance, and the storage method for redundant data varies depending on the RAID level. It's important to maintain the consistency of redundant data. A bitmap is used to record which data blocks have been synchronized and which ones need to be resynchronized or recovered. Each bit in the bitmap represents a segment of data in the array. When a bit is set, it indicates that the multiple redundant copies of that data segment may not be consistent. Data synchronization can be performed based on the bitmap after a power failure or after re-adding a disk. If there is no bitmap, a full disk synchronization is required.

Due to known performance issues with md-bitmap and its unreasonable implementation:
- self-managed IO submitting like filemap_write_page();
- global spin_lock;
I have decided not to continue optimizing based on the current bitmap implementation. This new bitmap is designed without locking in the IO fast path and can be used with fast disks.

Key features of the new bitmap:
- the IO fast path is lockless; if the user issues lots of write IO to the same bitmap bit in a short time, only the first write has the additional overhead of updating the bitmap bit, with no additional overhead for the following writes;
- support resyncing or recovering only written data, which means that when creating a new array or replacing a disk with a new one, there is no need to do a full disk resync/recovery."

* tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits)
  md/md-llbitmap: introduce new lockless bitmap
  md/md-bitmap: make method bitmap_ops->daemon_work optional
  md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  md/md-bitmap: add a new sysfs api bitmap_type
  md: add a new mddev field 'bitmap_id'
  md/md-bitmap: support discard for bitmap ops
  md: factor out a helper raid_is_456()
  md: add a new parameter 'offset' to md_super_write()
  md/md-bitmap: introduce CONFIG_MD_BITMAP
  md: check before referencing mddev->bitmap_ops
  md/dm-raid: check before referencing mddev->bitmap_ops
  md/raid5: check before referencing mddev->bitmap_ops
  md/raid10: check before referencing mddev->bitmap_ops
  md/raid1: check before referencing mddev->bitmap_ops
  md/raid1: check bitmap before behind write
  md/md-bitmap: handle the case bitmap is not enabled before end_sync()
  md/md-bitmap: handle the case bitmap is not enabled before start_sync()
  ...
2025-09-09blk-map: provide the bdev to bio if one existsKeith Busch
We can now safely provide a block device when extracting user pages for driver and user passthrough commands. Set the bdev so the caller doesn't have to do that later. This has an additional benefit of being able to extract P2P pages in the passthrough path. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09blk-mq-dma: bring back p2p request flagsKeith Busch
We only need to consider data and metadata dma mapping types separately. The request and bio integrity payload have enough flag bits to internally track the mapping type for each. Use these so the caller doesn't need to track them, and provide separate request and integrity helpers to the common code. This will make it easier to scale new mappings, like the proposed MMIO attribute, without burdening the caller to track such things. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09blk-integrity: enable p2p source and destinationKeith Busch
Set the extraction flags to allow p2p pages for the metadata buffer if the block device allows it. Similar to data payloads, ensure the bio does not use merging if we see a p2p page. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-09iov_iter: remove iov_iter_is_alignedKeith Busch
No more callers. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>