Age | Commit message | Author
2025-05-06fs: add a write stream field to the kiocbChristoph Hellwig
Prepare for io_uring passthrough of write streams. The write stream field in the kiocb structure fits into an existing 2-byte hole, so its size is not changed. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
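A minimal userspace sketch of the hole-packing point above, assuming a hypothetical layout rather than the real struct kiocb: a 2-byte field dropped into an existing padding hole leaves sizeof() unchanged.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* hypothetical stand-ins for struct kiocb before/after the change */
struct demo_kiocb_before {
        void            *ki_filp;       /* 8 bytes on x86-64 */
        uint64_t        ki_pos;         /* 8 bytes */
        int             ki_flags;       /* 4 bytes */
        uint16_t        ki_ioprio;      /* 2 bytes, followed by a 2-byte hole */
};                                      /* padded to 24 bytes */

struct demo_kiocb_after {
        void            *ki_filp;
        uint64_t        ki_pos;
        int             ki_flags;
        uint16_t        ki_ioprio;
        uint16_t        ki_write_stream;        /* fills the existing hole */
};

int main(void)
{
        printf("before=%zu after=%zu\n",
               sizeof(struct demo_kiocb_before),
               sizeof(struct demo_kiocb_after));
        /* the 2-byte field fits in the hole, so the size does not grow */
        assert(sizeof(struct demo_kiocb_before) ==
               sizeof(struct demo_kiocb_after));
        return 0;
}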
2025-05-06block: only update request sector if neededJohannes Thumshirn
In case of a ZONE APPEND write, regardless of native ZONE APPEND or the emulation layer in the zone write plugging code, the sector the data got written to by the device needs to be updated in the bio. At the moment, this is done for every native ZONE APPEND write and every request that is flagged with 'BIO_ZONE_WRITE_PLUGGING'. But this superfluously updates the sector for regular writes to a zoned block device. Check if a bio is a native ZONE APPEND write or if the bio is flagged as 'BIO_EMULATES_ZONE_APPEND', meaning the block layer's zone write plugging code handles the ZONE APPEND and translates it into a regular write and back. Only if one of these two criteria is met, update the sector in the bio upon completion. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/dea089581cb6b777c1cd1500b38ac0b61df4b2d1.1746530748.git.jth@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
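A hedged sketch of the completion-path check described above; the surrounding request-completion context is simplified, and only the flag/opcode test reflects the message.

#include <linux/blk-mq.h>

/* sketch: update the bio sector only for real or emulated ZONE APPEND */
static inline void demo_zone_append_update_sector(struct request *req,
                                                  struct bio *bio)
{
        /*
         * Native ZONE APPEND writes and writes that the zone write plugging
         * code emulates as ZONE APPEND need the sector the device wrote to;
         * regular writes to a zoned device are left untouched.
         */
        if (req_op(req) == REQ_OP_ZONE_APPEND ||
            bio_flagged(bio, BIO_EMULATES_ZONE_APPEND))
                bio->bi_iter.bi_sector = req->__sector;
}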
2025-05-06block: move wbt_enable_default() out of queue freezing from sched ->exit()Ming Lei
The scheduler's ->exit() is called with the queue frozen and the elevator lock held, and wbt_enable_default() can't be called with the queue frozen, otherwise the following lockdep warning is triggered:
#6 (&q->rq_qos_mutex){+.+.}-{4:4}:
#5 (&eq->sysfs_lock){+.+.}-{4:4}:
#4 (&q->elevator_lock){+.+.}-{4:4}:
#3 (&q->q_usage_counter(io)#3){++++}-{0:0}:
#2 (fs_reclaim){+.+.}-{0:0}:
#1 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}:
#0 (&q->debugfs_mutex){+.+.}-{4:4}:
Fix the issue by moving wbt_enable_default() out of bfq's exit(), and call it from elevator_change_done(). Meanwhile, add disk->rqos_state_mutex to cover wbt state changes, which matches the purpose better than ->elevator_lock. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-26-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move hctx cpuhp add/del out of queue freezingMing Lei
Move hctx cpuhp add/del out of queue freezing so that the freeze lock isn't connected with the cpuhp locks, which avoids the lockdep warning. This is safe because neither operation needs the queue to be frozen, and a scheduler switch isn't allowed here, for the same reason that hctx debugfs/sysfs registration was moved out of queue freezing. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-25-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: don't acquire ->elevator_lock in blk_mq_map_swqueue and blk_mq_realloc_hw_ctxsMing Lei
Both blk_mq_map_swqueue() and blk_mq_realloc_hw_ctxs() are called before the request queue is added to the tagset list, so the two won't run concurrently with blk_mq_update_nr_hw_queues(). Since the two functions are only called from queue initialization or blk_mq_update_nr_hw_queues(), an elevator switch can't happen. So remove the ->elevator_lock uses from the two functions. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-24-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move hctx debugfs/sysfs registering out of freezing queueMing Lei
Move hctx debugfs/sysfs registration out of queue freezing in __blk_mq_update_nr_hw_queues(), so that the following lockdep dependency can be killed:
#2 (&q->q_usage_counter(io)#16){++++}-{0:0}:
#1 (fs_reclaim){+.+.}-{0:0}:
#0 (&sb->s_type->i_mutex_key#3){+.+.}-{4:4}: //debugfs
Registering/un-registering hctx debugfs/sysfs does not require the queue to be frozen:
- hctx sysfs attribute show() methods are drained when removing the kobject, and there is no store() implementation for hctx sysfs attributes
- debugfs entry read() is drained too when removing the debugfs directory, and there is no write() implementation for hctx debugfs either
- so it is safe to register/unregister hctx sysfs/debugfs without freezing the queue, because the code paths change nothing and we just need to keep the hctx alive
Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-23-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move elv_register[unregister]_queue out of elevator_lockMing Lei
Move elv_register[unregister]_queue out of ->elevator_lock & queue freezing, so we can kill many lockdep warnings. elv_register[unregister]_queue() is serialized and only deals with sysfs/debugfs things, so it doesn't need to be done with the queue frozen:
- when it is called from adding a disk, an elevator switch isn't possible because ->queue_kobj isn't added yet
- when it is called from deleting a disk, disable_elv_switch() is responsible for preventing a new elevator switch and draining the old elevator switch
- when it is called from blk_mq_update_nr_hw_queues(), adding/removing a disk and an elevator switch can't be allowed or in progress
With this change, the elevator's ->exit() is called before elv_unregister_queue(), so the user may still call into ->show()/->store() of the elevator's sysfs attributes; this issue is covered by adding `ELEVATOR_FLAG_DYING`. For blk-mq debugfs, hctx->sched_tags is always checked with ->elevator_lock by the debugfs code, and hctx->sched_tags is updated with ->elevator_lock held, so there is no such issue there. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-22-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: add new helper for disabling elevator switch when deleting diskMing Lei
Add a new helper disable_elv_switch() and a new flag QUEUE_FLAG_NO_ELV_SWITCH for disabling elevator switch before deleting a disk:
- originally the flag QUEUE_FLAG_REGISTERED was added to prevent elevator switch while removing a disk, but this flag has since been used widely for other purposes, so add one new flag for disabling elevator switch only
- to avoid deadlock risk, we have to move elevator queue register/unregister out of the elevator lock and queue freeze, which will be done in the next patch. However, this adds a small race window between elevator switch and deleting ->queue_kobj, in which elevator queue register/unregister could run concurrently. The added helper will be used to avoid that race in the following patch.
- drain any in-progress elevator switch before deleting the disk
Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Link: https://lore.kernel.org/r/20250505141805.2751237-21-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: fail to show/store elevator sysfs attribute if elevator is dyingMing Lei
Prepare for moving elv_register[unregister]_queue out of elevator_lock & queue freezing: we may have to call elv_unregister_queue() after the elevator's ->exit() is called, which leaves a small window for the user to call into ->show()/->store(), and a use-after-free can be caused. Fail to show/store an elevator sysfs attribute if the elevator is dying, by adding one new flag, ELEVATOR_FLAG_DYING, which is protected by the elevator's ->sysfs_lock. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20250505141805.2751237-20-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
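A minimal sketch of the guard this change describes, assuming the existing elevator_queue fields (kobj, sysfs_lock, flags) and the elv_fs_entry attribute type; the attribute handling is simplified and the flag spelling follows the corrected name ELEVATOR_FLAG_DYING.

/* sketch: bail out of sysfs show once the elevator is dying */
static ssize_t demo_elv_attr_show(struct kobject *kobj, struct attribute *attr,
                                  char *page)
{
        struct elv_fs_entry *entry = container_of(attr, struct elv_fs_entry, attr);
        struct elevator_queue *e = container_of(kobj, struct elevator_queue, kobj);
        ssize_t error = -ENODEV;

        mutex_lock(&e->sysfs_lock);
        /* ->exit() may already have run; don't touch torn-down data */
        if (!test_bit(ELEVATOR_FLAG_DYING, &e->flags) && entry->show)
                error = entry->show(e, page);
        mutex_unlock(&e->sysfs_lock);
        return error;
}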
2025-05-06block: remove elevator queue's type check in elv_attr_show/store()Ming Lei
The elevator queue's type is assigned at allocation time and never gets cleared until the queue is released, so its ->type is never NULL; remove the unnecessary check. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-19-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: pass elevator_queue to elv_register_queue & unregister_queueMing Lei
Pass an elevator_queue reference to elv_register_queue() & elv_unregister_queue(). No functional change; this prepares for moving the two out of the elevator lock & queue freezing, when we need to store the old & new elevator queues in a `struct elv_change_ctx` instance. The two can then co-exist for a short while, so we have to pass the exact elevator_queue instance to elv_register_queue & elv_unregister_queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-18-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: unifying elevator changeMing Lei
Elevator change is one well-defined behavior:
- tear down the current elevator if it exists
- set up the new elevator
It is supposed to cover any case of changing the elevator with a single internal API, typically the following cases:
- set up the default elevator in add_disk()
- switch to none in del_disk()
- reset the elevator in blk_mq_update_nr_hw_queues()
- switch the elevator in the sysfs `store` elevator attribute
This patch uses elevator_change() to cover all the above cases:
- every elevator switch is serialized with each other: add_disk/del_disk/store elevator are serialized already, and blk_mq_update_nr_hw_queues() uses srcu for syncing with the other three cases
- for both add_disk()/del_disk(), queue freeze works in atomic mode or the queue has already been frozen, so the freeze in elevator_change() won't add extra delay
- a `struct elv_change_ctx` instance holds all the info for changing the elevator
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-17-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: add `struct elv_change_ctx` for unifying elevator changeMing Lei
Add `struct elv_change_ctx` and prepare for unifying elevator change via elevator_change(). This way, any input & output parameters can be provided & observed in the top-level helper, which helps move kobject add/delete & debugfs register/unregister out of ->elevator_lock & queue freezing. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-16-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
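A hedged sketch of how such a context could thread through a unified elevator_change(); the field names and the exact freeze/lock calls are illustrative rather than taken verbatim from the series.

/* sketch: all inputs/outputs of an elevator change in one place */
struct demo_elv_change_ctx {
        const char              *name;          /* target scheduler, or "none" */
        bool                    uevent;         /* whether to emit a change uevent */
        /* outputs, so kobject/debugfs work can happen after unlock/unfreeze */
        struct elevator_queue   *old_eq;
        struct elevator_queue   *new_eq;
};

static int demo_elevator_change(struct request_queue *q,
                                struct demo_elv_change_ctx *ctx)
{
        unsigned int memflags;
        int ret;

        memflags = blk_mq_freeze_queue(q);
        mutex_lock(&q->elevator_lock);
        ret = 0;        /* tear down the old elevator, set up ctx->name (omitted) */
        mutex_unlock(&q->elevator_lock);
        blk_mq_unfreeze_queue(q, memflags);

        /* register/unregister kobject + debugfs for ctx->old_eq/new_eq here,
         * outside ->elevator_lock and queue freezing */
        return ret;
}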
2025-05-06block: move queue freezing & elevator_lock into elevator_change()Ming Lei
Move queue freezing & elevator_lock into elevator_change(), and prepare for using elevator_change() for setting up & tearing down default elevator too. Also add lockdep_assert_held() in __elevator_change() because either read or write lock is required for changing elevator. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-15-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: simplify elevator reattachment for updating nr_hw_queuesMing Lei
In blk_mq_update_nr_hw_queues(), nr_hw_queues changes and the elevator data depends on it, so the elevator has to be reattached; call elevator_switch() to force the attachment. Add elv_update_nr_hw_queues() simply for blk_mq_update_nr_hw_queues() to reattach the elevator, since an elevator switch isn't likely while blk_mq_update_nr_hw_queues() is running. This removes the current 'switch to none and switch back' code. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-14-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move blk_queue_registered() check into elv_iosched_store()Ming Lei
Move blk_queue_registered() check into elv_iosched_store() and prepare for using elevator_change() for covering any kind of elevator change in adding/deleting disk and updating nr_hw_queue. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-13-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: fold elevator_disable into elevator_switchChristoph Hellwig
This removes duplicate code, and keeps the callers tidy. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-12-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: look up the elevator type in elevator_switchChristoph Hellwig
That makes the function nicely self-contained and can be used to avoid code duplication. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-11-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: don't allow to switch elevator if updating nr_hw_queues is in-progressMing Lei
The elevator switch code is another `nr_hw_queue` reader in a non-fast-IO code path, so it can't run while an update of `nr_hw_queues` is in progress. Take the same approach as for disallowing add/del disk while updating nr_hw_queues is in progress, by grabbing the read lock of set->update_nr_hwq_sema. Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/linux-block/aAWv3NPtNIKKvJZc@fedora/ [1] Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Closes: https://lore.kernel.org/linux-block/mz4t4tlwiqjijw3zvqnjb7ovvvaegkqganegmmlc567tt5xj67@xal5ro544cnc/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-10-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: prevent adding/deleting disk during updating nr_hw_queuesMing Lei
Both the adding and deleting disk code are readers of `nr_hw_queues`, so we can't allow them to be in progress while nr_hw_queues is being updated; a kernel panic and a KASAN report have been seen in [1]. Prevent adding/deleting a disk while updating nr_hw_queues by adding an rw_semaphore to the tagset: the write lock is grabbed in blk_mq_update_nr_hw_queues(), and the read lock is acquired when adding/deleting a disk. Also mark a GFP_NOIO allocation scope for adding/deleting a disk, because blk_mq_update_nr_hw_queues() is part of some drivers' error handlers. This avoids a lot of trouble. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/a5896cdb-a59a-4a37-9f99-20522f5d2987@linux.ibm.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-9-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
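A hedged sketch of the locking scheme described above; the semaphore field name follows the wording in these messages (update_nr_hwq_sema) and the add-disk wrapper is purely illustrative.

#include <linux/blk-mq.h>
#include <linux/sched/mm.h>

static int demo_add_disk(struct blk_mq_tag_set *set)
{
        unsigned int noio_flags;
        int ret;

        /* readers: adding/deleting a disk */
        down_read(&set->update_nr_hwq_sema);
        /*
         * blk_mq_update_nr_hw_queues() can run from a driver's error
         * handler, so keep allocations under a GFP_NOIO scope here.
         */
        noio_flags = memalloc_noio_save();
        ret = 0;        /* the real add_disk() work goes here */
        memalloc_noio_restore(noio_flags);
        up_read(&set->update_nr_hwq_sema);
        return ret;
}

static void demo_update_nr_hw_queues(struct blk_mq_tag_set *set,
                                     unsigned int nr)
{
        /* writer: excludes add/del disk while nr_hw_queues changes */
        down_write(&set->update_nr_hwq_sema);
        /* ... __blk_mq_update_nr_hw_queues(set, nr) ... */
        up_write(&set->update_nr_hwq_sema);
}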
2025-05-06block: add helper add_disk_final()Ming Lei
Add the helper add_disk_final() for scanning partitions, announcing the disk and handling the last steps of adding a disk. No functional change; this prepares for preventing disk addition while nr_hw_queues is being updated. Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250505141805.2751237-8-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move sched debugfs register into elevator_register_queueMing Lei
The sched debugfs entries share the same lifetime as the scheduler's kobject, and the same lock (the elevator lock), so move sched debugfs register/unregister into elevator_register_queue() and elevator_unregister_queue(). Then blk_mq_debugfs_register() is no longer needed for registering sched debugfs. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-7-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: add two helpers for registering/un-registering sched debugfsMing Lei
Add blk_mq_sched_reg_debugfs()/blk_mq_sched_unreg_debugfs() to clean up the sched init/exit code a bit. The order of registering & unregistering debugfs for sched & sched_hctx changes slightly, but it is safe because sched & sched_hctx are guaranteed to be ready when exported via debugfs. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-6-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: use q->elevator with ->elevator_lock held in elv_iosched_show()Ming Lei
Use q->elevator with ->elevator_lock held in elv_iosched_show(), since the local cached elevator reference may become stale after getting ->elevator_lock. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: don't call freeze queue in elevator_switch() and elevator_disable()Ming Lei
Both elevator_switch() and elevator_disable() are only called from the two code paths, in which queue is guaranteed to be frozen. So don't call freeze queue in the two functions, also add asserts for queue freeze. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250505141805.2751237-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06block: move ELEVATOR_FLAG_DISABLE_WBT to a request queue flagMing Lei
ELEVATOR_FLAG_DISABLE_WBT is only used by BFQ to disallow wbt when BFQ is in use. The flag is set in BFQ's init(), and cleared in BFQ's exit(). Make it a request queue flag, so that we can avoid dealing with the elevator switch race. Also, it isn't graceful to check a scheduler flag in wbt_enable_default(). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
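A hedged sketch of the flag move; the queue flag name QUEUE_FLAG_DISABLE_WBT_DEF is an assumption, not confirmed by the message, and the wbt body is elided.

/* sketch: BFQ toggles a request queue flag instead of an elevator flag */
static void demo_bfq_init_flag(struct request_queue *q)
{
        blk_queue_flag_set(QUEUE_FLAG_DISABLE_WBT_DEF, q);
}

static void demo_bfq_exit_flag(struct request_queue *q)
{
        blk_queue_flag_clear(QUEUE_FLAG_DISABLE_WBT_DEF, q);
}

static void demo_wbt_enable_default(struct gendisk *disk)
{
        struct request_queue *q = disk->queue;

        /* no elevator/scheduler state is consulted any more */
        if (test_bit(QUEUE_FLAG_DISABLE_WBT_DEF, &q->queue_flags))
                return;
        /* ... enable wbt ... */
}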
2025-05-06block: move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue()Ming Lei
Move blk_mq_add_queue_tag_set() after blk_mq_map_swqueue(), and publish this request queue to tagset after everything is setup. This way is safe because BLK_MQ_F_TAG_QUEUE_SHARED isn't used by blk_mq_map_swqueue(), and this flag is mainly checked in fast IO code path. Prepare for removing ->elevator_lock from blk_mq_map_swqueue() which is supposed to be called when elevator switch can't be done. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reported-by: Nilay Shroff <nilay@linux.ibm.com> Closes: https://lore.kernel.org/linux-block/567cb7ab-23d6-4cee-a915-c8cdac903ddd@linux.ibm.com/ Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505141805.2751237-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06brd: fix discard end sectorYu Kuai
brd_do_discard() only aligned the start sector to a page boundary, which can only work if the discard size is at least one page. For example:
blkdiscard /dev/ram0 -o 5120 -l 1024
In this case, size = (1024 - (8192 - 5120)), which underflows to a huge value. Fix the problem by also applying round_down() to the end sector. Fixes: 9ead7efc6f3f ("brd: implement discard support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250506061756.2970934-4-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
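A userspace walk-through of the arithmetic in this fix and the next one, using simplified power-of-two versions of the kernel's round_up()/round_down(); brd itself works in sectors, but bytes are used here to match the numbers quoted above.

#include <stdio.h>

#define DEMO_PAGE_SIZE  4096UL
#define demo_round_down(x, y)   ((x) & ~((y) - 1))
#define demo_round_up(x, y)     demo_round_down((x) + (y) - 1, (y))

int main(void)
{
        unsigned long start = 5120, len = 1024; /* blkdiscard -o 5120 -l 1024 */
        unsigned long aligned_start = demo_round_up(start, DEMO_PAGE_SIZE);     /* 8192 */

        /* old computation: only the start is aligned, the size underflows */
        unsigned long bad_size = len - (aligned_start - start);
        printf("bad size  = %lu\n", bad_size);  /* 1024 - 3072 wraps to a huge value */

        /* fixed computation: round_down() the end as well */
        unsigned long end = demo_round_down(start + len, DEMO_PAGE_SIZE);       /* 4096 */
        unsigned long good_size = end > aligned_start ? end - aligned_start : 0;
        printf("good size = %lu\n", good_size); /* 0: no whole page to discard */
        return 0;
}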
2025-05-06brd: fix aligned_sector from brd_do_discard()Yu Kuai
The calculation is just wrong, fix it by round_up(). Fixes: 9ead7efc6f3f ("brd: implement discard support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250506061756.2970934-3-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06brd: protect page with rcuYu Kuai
Currently, after fetching the page with xa_load() in the IO path, there is no protection and the page can be freed concurrently by discard:
cpu0: brd_submit_bio -> brd_do_bvec -> page = brd_lookup_page
cpu1: brd_submit_bio -> brd_do_discard -> page = __xa_erase() -> __free_page()
cpu0: // page UAF
Fix the problem by protecting the page with RCU. Meanwhile, if the page is already freed, also prevent the BUG_ON() by skipping the write; the user will get zero data later if there is no page. Fixes: 9ead7efc6f3f ("brd: implement discard support") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250506061756.2970934-2-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
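A hedged sketch of the RCU protection described above; brd's real data structures and the exact free path may differ (e.g. deferring the free with an RCU callback instead of a synchronous grace period).

#include <linux/rcupdate.h>
#include <linux/xarray.h>
#include <linux/mm.h>

/* sketch: reader side, the normal read/write path */
static void demo_brd_rw_page(struct xarray *pages, pgoff_t idx)
{
        struct page *page;

        rcu_read_lock();
        page = xa_load(pages, idx);
        if (page) {
                /* copy to/from the page; the RCU read section keeps a
                 * concurrently-discarded page from being freed under us */
        }
        /* if the page is gone, skip the write; reads return zeroes */
        rcu_read_unlock();
}

/* sketch: discard side */
static void demo_brd_discard_page(struct xarray *pages, pgoff_t idx)
{
        struct page *page;

        xa_lock(pages);
        page = __xa_erase(pages, idx);
        xa_unlock(pages);
        if (page) {
                synchronize_rcu();      /* wait out readers before freeing */
                __free_page(page);
        }
}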
2025-05-06ublk: consolidate UBLK_IO_FLAG_OWNED_BY_SRV checksCaleb Sander Mateos
Every ublk I/O command except UBLK_IO_FETCH_REQ checks that the ublk_io has UBLK_IO_FLAG_OWNED_BY_SRV set. Consolidate the separate checks into a single one in __ublk_ch_uring_cmd(), analogous to those for UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_NEED_GET_DATA. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250505172624.1121839-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
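A simplified sketch of the consolidated check; the dispatcher below only illustrates the hoisted test and is not the driver's actual switch.

/* sketch: one OWNED_BY_SRV check for every command except FETCH */
static int demo_ublk_ch_uring_cmd(struct ublk_io *io, u32 cmd_op)
{
        if (cmd_op != UBLK_IO_FETCH_REQ &&
            !(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
                return -EINVAL;

        switch (cmd_op) {
        case UBLK_IO_FETCH_REQ:
        case UBLK_IO_COMMIT_AND_FETCH_REQ:
        case UBLK_IO_NEED_GET_DATA:
                /* per-command handling, now without their own
                 * OWNED_BY_SRV checks */
                break;
        }
        return 0;
}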
2025-05-05blk-throttle: Add an additional overflow check to the call calculate_bytes/io_allowedZizhi Wo
Now the tg->[bytes/io]_disp type is signed, and the calculate_bytes/io_allowed return type is unsigned. Even if the bps/iops limit is not set to max, the return value of the function may still exceed INT_MAX or LLONG_MAX, which can cause overflow in the outer variables. In such cases, we can add additional checks accordingly. And in throtl_trim_slice(), if the BPS/IOPS limit is set to max, there's no need to call calculate_bytes/io_allowed(). Introduce the helper functions throtl_trim_bps/iops to simplify the process. For cases where the calculated trim value exceeds INT_MAX (causing an overflow), we reset tg->[bytes/io]_disp to zero and return the original tg->[bytes/io]_disp, because that is the size that is actually trimmed. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250417132054.2866409-4-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
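A hedged sketch of the kind of trim helper the message describes; the name follows the message, while the types and the exact clamping policy are assumptions.

/* sketch: trim dispatched bytes without letting the unsigned
 * "allowed" value overflow the signed counter */
static long long demo_throtl_trim_bps(long long *bytes_disp,
                                      unsigned long long bytes_allowed)
{
        long long trimmed;

        if (*bytes_disp <= 0 ||
            bytes_allowed >= (unsigned long long)*bytes_disp) {
                /* allowed covers everything dispatched (or would overflow):
                 * reset to 0 and report what was actually trimmed */
                trimmed = *bytes_disp > 0 ? *bytes_disp : 0;
                *bytes_disp = 0;
                return trimmed;
        }

        *bytes_disp -= (long long)bytes_allowed;
        return (long long)bytes_allowed;
}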
2025-05-05blk-throttle: Delete unnecessary carryover-related fields from throtl_grpZizhi Wo
We no longer need carryover_[bytes/ios] in tg, so remove them. The related comments about carryover in tg are merged into the [bytes/io]_disp comments, and other related comments are updated. Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250417132054.2866409-3-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05blk-throttle: Fix wrong tg->[bytes/io]_disp update in __tg_update_carryover()Zizhi Wo
In commit 6cc477c36875 ("blk-throttle: carry over directly"), the carryover bytes/ios were carried over to [bytes/io]_disp. However, the update mechanism has some issues. In __tg_update_carryover(), we calculate "bytes" and "ios" to represent the carryover, but the computation when updating [bytes/io]_disp is incorrect. And if sq->nr_queued is empty, we may not update tg->[bytes/io]_disp to 0 in tg_update_carryover(); it should be set to 0 in the non-carryover case. This patch fixes the issue. Fixes: 6cc477c36875 ("blk-throttle: carry over directly") Signed-off-by: Zizhi Wo <wozizhi@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250417132054.2866409-2-wozizhi@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05selftests: ublk: kublk: fix include pathUday Shankar
Building kublk currently fails (with a "could not find linux/ublk_cmd.h" error message) if kernel headers are not installed in a system-global location (i.e. somewhere in the compiler's default include search path). This failure is unnecessary, as make kselftest installs kernel headers in the build tree - kublk's build just isn't looking for them properly. There is an include path in kublk's CFLAGS which is probably intended to find the kernel headers installed in the build tree; fix it so that it can actually find them. This introduces some macro redefinition issues between glibc-provided headers and kernel headers; fix those by eliminating one include in kublk. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250429-ublk_selftests-v2-3-e970b6d9e4f4@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05selftests: ublk: make test_generic_06 silent on successUday Shankar
Convention dictates that tests should not log anything on success. Make test_generic_06 follow this convention. Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250429-ublk_selftests-v2-2-e970b6d9e4f4@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05selftests: ublk: kublk: build with -Werror iff WERROR!=0Uday Shankar
Compiler warnings can catch bugs at compile time; thus, heeding them is usually a good idea. Turn warnings into errors by default for the kublk build so that anyone making changes is forced to heed them. Compiler warnings can also sometimes produce annoying false positives, so provide a flag WERROR that the developer can use as follows to have the build and selftests run go through even if there are warnings: make WERROR=0 TARGETS=ublk kselftest Signed-off-by: Uday Shankar <ushankar@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250429-ublk_selftests-v2-1-e970b6d9e4f4@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05mm: remove NR_BOUNCE zone statChristoph Hellwig
The stat is always 0 now, so remove it and hardwire the user visible output to 0. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05block: remove bounce buffering supportChristoph Hellwig
The block layer bounce buffering support is unused now, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05scsi: remove the no_highmem flag in the hostChristoph Hellwig
All users are gone now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05usb-storage: reject probe of device on non-DMA HCDs when using highmemChristoph Hellwig
usb-storage is the last user of the block layer bounce buffering now, and only uses it for HCDs that do not support DMA on highmem configs. Remove this support and fail the probe so that the block layer bounce buffering can go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Alan Stern <stern@rowland.harvard.edu> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05scsi: make ppa depend on !HIGHMEMChristoph Hellwig
This is one of the last drivers depending on the block layer bounce buffering code. Restrict it to run on non-highmem configs so that the bounce buffering code can be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05scsi: make imm depend on !HIGHMEMChristoph Hellwig
This is one of the last drivers depending on the block layer bounce buffering code. Restrict it to run on non-highmem configs so that the bounce buffering code can be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05scsi: make aha152x depend on !HIGHMEMChristoph Hellwig
This is one of the last drivers depending on the block layer bounce buffering code. Restrict it to run on non-highmem configs so that the bounce buffering code can be removed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250505081138.3435992-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05Merge branch 'block-6.15' into for-6.16/blockJens Axboe
Merge 6.15 block fixes in, once again, to resolve conflicts with the fixes for ublk that went into mainline and the 6.16 ublk updates. * block-6.15: nvmet-auth: always free derived key data nvmet-tcp: don't restore null sk_state_change nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS nvme-tcp: fix premature queue removal and I/O failover nvme-pci: add quirks for WDC Blue SN550 15b7:5009 nvme-pci: add quirks for device 126f:1001 nvme-pci: fix queue unquiesce check on slot_reset ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req ublk: enhance check for register/unregister io buffer command ublk: decouple zero copy from user copy selftests: ublk: fix UBLK_F_NEED_GET_DATA Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02block: use writeback_iterChristoph Hellwig
Use writeback_iter instead of the deprecated write_cache_pages wrapper in blkdev_writepages. This removes an indirect call per folio. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20250424082752.1967679-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
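A hedged sketch of the writeback_iter() loop pattern this refers to; the per-folio writeout helper is a hypothetical stand-in, not the real blkdev implementation.

/* hypothetical per-folio writeout helper */
static int demo_write_folio(struct folio *folio, struct writeback_control *wbc);

static int demo_blkdev_writepages(struct address_space *mapping,
                                  struct writeback_control *wbc)
{
        struct folio *folio = NULL;
        int error = 0;

        /*
         * writeback_iter() hands back each dirty folio directly, so there
         * is no indirect ->writepage-style callback per folio.
         */
        while ((folio = writeback_iter(mapping, wbc, folio, &error)))
                error = demo_write_folio(folio, wbc);
        return error;
}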
2025-05-02ublk: store request pointer in ublk_ioCaleb Sander Mateos
A ublk_io is converted to a request in several places in the I/O path by using blk_mq_tag_to_rq() to look up the (qid, tag) on the ublk device's tagset. This involves a bunch of dereferences and a tag bounds check. To make this conversion cheaper, store the request pointer in ublk_io. Overlap this storage with the io_uring_cmd pointer. This is safe because the io_uring_cmd pointer is only valid if UBLK_IO_FLAG_ACTIVE is set on the ublk_io, the request pointer is valid if UBLK_IO_FLAG_OWNED_BY_SRV, and these flags are mutually exclusive. Suggested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-10-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
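A short sketch of the overlapping storage described above; unrelated ublk_io fields are omitted.

/* sketch: the two pointers are valid in mutually exclusive states,
 * so they can share storage */
struct demo_ublk_io {
        unsigned int flags;
        union {
                /* valid while UBLK_IO_FLAG_ACTIVE is set */
                struct io_uring_cmd *cmd;
                /* valid while UBLK_IO_FLAG_OWNED_BY_SRV is set */
                struct request *req;
        };
};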
2025-05-02ublk: check UBLK_IO_FLAG_OWNED_BY_SRV in ublk_abort_queue()Caleb Sander Mateos
ublk_abort_queue() currently checks whether the UBLK_IO_FLAG_ACTIVE flag is cleared to tell whether to abort each ublk_io in the queue. But it's possible for a ublk_io to not be ACTIVE but also not have a request in flight, such as when no fetch request has yet been submitted for a tag or when a fetch request is cancelled. So ublk_abort_queue() must additionally check for an inflight request. Simplify this code by checking for UBLK_IO_FLAG_OWNED_BY_SRV instead, which indicates precisely whether a request is currently inflight. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-9-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02ublk: don't call ublk_dispatch_req() for NEED_GET_DATACaleb Sander Mateos
ublk_dispatch_req() currently handles 3 different cases: incoming ublk requests that don't need to wait for a data buffer, incoming requests that do need to wait for a buffer, and resuming those requests once the buffer is provided. But the call site that provides a data buffer (UBLK_IO_NEED_GET_DATA) is separate from those for incoming requests. So simplify the function by splitting the UBLK_IO_NEED_GET_DATA case into its own function ublk_get_data(). This avoids several redundant checks in the UBLK_IO_NEED_GET_DATA case, and streamlines the incoming request cases. Don't call ublk_fill_io_cmd() for UBLK_IO_NEED_GET_DATA, as it's no longer necessary to set io->cmd or the UBLK_IO_FLAG_ACTIVE flag for ublk_dispatch_req(). Since UBLK_IO_NEED_GET_DATA no longer relies on ublk_dispatch_req() calling io_uring_cmd_done(), return the UBLK_IO_RES_OK status directly from the ->uring_cmd() handler. If ublk_start_io() fails, don't complete the UBLK_IO_NEED_GET_DATA command, matching the existing behavior. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-8-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-02ublk: factor out ublk_start_io() helperCaleb Sander Mateos
In preparation for calling it from outside ublk_dispatch_req(), factor out the code responsible for setting up an incoming ublk I/O request. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250430225234.2676781-7-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>