path: root/block/blk-mq.c
Age  Commit message  Author
2023-06-26  Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux  Linus Torvalds
Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Various cleanups all around (Irvin, Chaitanya, Christophe) - Better struct packing (Christophe JAILLET) - Reduce controller error logs for optional commands (Keith) - Support for >=64KiB block sizes (Daniel Gomez) - Fabrics fixes and code organization (Max, Chaitanya, Daniel Wagner) - bcache updates via Coly: - Fix a race at init time (Mingzhe Zou) - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye) - use page pinning in the block layer for dio (David) - convert old block dio code to page pinning (David, Christoph) - cleanups for pktcdvd (Andy) - cleanups for rnbd (Guoqing) - use the unchecked __bio_add_page() for the initial single page additions (Johannes) - fix overflows in the Amiga partition handling code (Michael) - improve mq-deadline zoned device support (Bart) - keep passthrough requests out of the IO schedulers (Christoph, Ming) - improve support for flush requests, making them less special to deal with (Christoph) - add bdev holder ops and shutdown methods (Christoph) - fix the name_to_dev_t() situation and use cases (Christoph) - decouple the block open flags from fmode_t (Christoph) - ublk updates and cleanups, including adding user copy support (Ming) - BFQ sanity checking (Bart) - convert brd from radix to xarray (Pankaj) - constify various structures (Thomas, Ivan) - more fine grained persistent reservation ioctl capability checks (Jingbo) - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan, Jordy, Li, Min, Yu, Zhong, Waiman) * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits) scsi/sg: don't grab scsi host module reference ext4: Fix warning in blkdev_put() block: don't return -EINVAL for not found names in devt_from_devname cdrom: Fix spectre-v1 gadget block: Improve kernel-doc headers blk-mq: don't insert passthrough request into sw queue bsg: make bsg_class a static const structure ublk: make ublk_chr_class a static const structure aoe: 
make aoe_class a static const structure block/rnbd: make all 'class' structures const block: fix the exclusive open mask in disk_scan_partitions block: add overflow checks for Amiga partition support block: change all __u32 annotations to __be32 in affs_hardblocks.h block: fix signed int overflow in Amiga partition support block: add capacity validation in bdev_add_partition() block: fine-granular CAP_SYS_ADMIN for Persistent Reservation block: disallow Persistent Reservation on partitions reiserfs: fix blkdev_put() warning from release_journal_dev() block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions() block: document the holder argument to blkdev_get_by_path ...
2023-06-25  blk-mq: fix two misuses on RQF_USE_SCHED  Ming Lei
Request allocated from sched tags can't be issued via ->queue_rqs() directly, since driver tag isn't allocated yet. This is the 1st misuse of RQF_USE_SCHED for figuring out plug->has_elevator. Request allocated from sched tags can't be ended by blk_mq_end_request_batch() too, fix the 2nd RQF_USE_SCHED misuse in blk_mq_add_to_batch(). Without this patch, NVMe uring cmd passthrough IO workload can run into hang easily with real io scheduler. Fixes: dd6216bb16e8 ("blk-mq: make sure elevator callbacks aren't called for passthrough request") Reported-by: Guangwu Zhang <guazhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGS2=YrBjpLPOKa-gzcKuuOG60AGth5794PNCDwatdnnscB9ug@mail.gmail.com/ Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230624130105.1443879-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-21  blk-mq: don't insert passthrough request into sw queue  Ming Lei
In case of real io scheduler, q->elevator is set, so blk_mq_run_hw_queue() may just check if scheduler queue has request to dispatch, see __blk_mq_sched_dispatch_requests(). Then IO hang may be caused because all passthrough requests may stay in sw queue. And any passthrough request should have been inserted to hctx->dispatch always. Reported-by: Guangwu Zhang <guazhang@redhat.com> Fixes: d97217e7f024 ("blk-mq: don't queue plugged passthrough requests into scheduler") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230621132208.1142318-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-16  blk-mq: fix NULL dereference on q->elevator in blk_mq_elv_switch_none  Ming Lei
After grabbing q->sysfs_lock, q->elevator may become NULL because of elevator switch. Fix the NULL dereference on q->elevator by checking it with lock. Reported-by: Guangwu Zhang <guazhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230616132354.415109-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
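The check-under-lock pattern described above can be sketched in userspace C. This is only an illustration under stated assumptions: `toy_queue`, `toy_elevator` and `toy_elv_switch_none()` are hypothetical names, and a pthread mutex stands in for q->sysfs_lock. It is not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the elevator and the request queue. */
struct toy_elevator {
	const char *name;
};

struct toy_queue {
	pthread_mutex_t sysfs_lock;	/* models q->sysfs_lock */
	struct toy_elevator *elevator;	/* may be cleared by a concurrent switch */
};

/*
 * Return the name of the elevator that was attached, or NULL if there
 * was none. The pointer is inspected only after the lock is taken, so
 * a concurrent switch cannot clear it between the check and the
 * dereference.
 */
static const char *toy_elv_switch_none(struct toy_queue *q)
{
	const char *saved = NULL;

	assert(q != NULL);
	pthread_mutex_lock(&q->sysfs_lock);
	if (q->elevator) {		/* checked with the lock held */
		saved = q->elevator->name;
		q->elevator = NULL;	/* models switching to "none" */
	}
	pthread_mutex_unlock(&q->sysfs_lock);
	return saved;
}
```

Calling the helper a second time on the same queue returns NULL instead of dereferencing a stale pointer, which is the behavior the fix is after.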
2023-06-14  blk-mq: check on cpu id when there is only one ctx mapping  Ed Tsai
See commit f168420c62e7 ("blk-mq: don't redirect completion for hctx withs only one ctx mapping"). When nvme applies a 1:1 mapping of hctx and ctx, there will be no remote request. But for ufs, the submission and completion queues could be asymmetric (e.g. multiple SQs share one CQ). Therefore, with a 1:1 mapping of hctx and ctx, a request may not complete on the submission cpu. In this situation, the nr_ctx check could violate QUEUE_FLAG_SAME_FORCE. As a result, check on cpu id when there is only one ctx mapping. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Signed-off-by: Po-Wen Kao <powen.kao@mediatek.com> Suggested-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230614002529.6636-1-ed.tsai@mediatek.com [axboe: fixed up indentation] Signed-off-by: Jens Axboe <axboe@kernel.dk>
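A minimal sketch of the decision this message describes, assuming the shortcut is a predicate over the request's ctx count and submitting CPU. The struct and function names are hypothetical, and the "old" variant is a reading of the commit message rather than a copy of kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, flattened view of the fields the check looks at. */
struct toy_rq {
	int submit_cpu;		/* models rq->mq_ctx->cpu */
	unsigned int nr_ctx;	/* models rq->mq_hctx->nr_ctx */
	bool polled;		/* models a polled request */
};

/* Before: a single ctx mapping alone was treated as "already local". */
static bool toy_skip_redirect_old(const struct toy_rq *rq, int this_cpu)
{
	assert(rq);
	(void)this_cpu;		/* the old check ignored the current CPU */
	return rq->nr_ctx == 1 || rq->polled;
}

/* After: the single-ctx shortcut also requires the submitting CPU. */
static bool toy_skip_redirect_new(const struct toy_rq *rq, int this_cpu)
{
	assert(rq);
	return (rq->nr_ctx == 1 && rq->submit_cpu == this_cpu) || rq->polled;
}
```

With one ctx per hctx and a completion arriving on a different CPU (the asymmetric SQ/CQ case), the old predicate skips redirection while the new one does not.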
2023-06-03  blk-mq: fix blk_mq_hw_ctx active request accounting  Tian Lan
The nr_active counter continues to increase over time which causes the blk_mq_get_tag to hang until the thread is rescheduled to a different core despite there are still tags available. kernel-stack INFO: task inboundIOReacto:3014879 blocked for more than 2 seconds Not tainted 6.1.15-amd64 #1 Debian 6.1.15~debian11 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:inboundIOReacto state:D stack:0 pid:3014879 ppid:4557 flags:0x00000000 Call Trace: <TASK> __schedule+0x351/0xa20 scheduler+0x5d/0xe0 io_schedule+0x42/0x70 blk_mq_get_tag+0x11a/0x2a0 ? dequeue_task_stop+0x70/0x70 __blk_mq_alloc_requests+0x191/0x2e0 kprobe output showing RQF_MQ_INFLIGHT bit is not cleared before __blk_mq_free_request being called. 320 320 kworker/29:1H __blk_mq_free_request rq_flags 0x220c0 in-flight 1 b'__blk_mq_free_request+0x1 [kernel]' b'bt_iter+0x50 [kernel]' b'blk_mq_queue_tag_busy_iter+0x318 [kernel]' b'blk_mq_timeout_work+0x7c [kernel]' b'process_one_work+0x1c4 [kernel]' b'worker_thread+0x4d [kernel]' b'kthread+0xe6 [kernel]' b'ret_from_fork+0x1f [kernel]' Signed-off-by: Tian Lan <tian.lan@twosigma.com> Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter") Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230513221227.497327-1-tilan7663@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-24  block: introduce block_io_start/block_io_done tracepoints  Hengqi Chen
Currently, several BCC ([0]) tools (biosnoop/biostacks/biotop) use kprobes to blk_account_io_start/blk_account_io_done to implement their functionalities. This is fragile because the target kernel functions may be renamed ([1]) or inlined ([2]). So introduce two new tracepoints for such use cases. [0]: https://github.com/iovisor/bcc [1]: https://github.com/iovisor/bcc/issues/3954 [2]: https://github.com/iovisor/bcc/issues/4261 Tested-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com> Tested-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230520084057.1467003-1-hengqi.chen@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19  blk-mq: don't use the requeue list to queue flush commands  Christoph Hellwig
Currently both requeues of commands that were already sent to the driver and flush commands submitted from the flush state machine share the same requeue_list in struct request_queue, despite requeues doing head insertions and flushes not. Switch to using two separate lists instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19  blk-mq: use the I/O scheduler for writes from the flush state machine  Bart Van Assche
Send write requests issued by the flush state machine through the normal I/O submission path including the I/O scheduler (if present) so that I/O scheduler policies are applied to writes with the FUA flag set. Separate the I/O scheduler members from the flush members in struct request since now a request may pass through both an I/O scheduler and the flush machinery. Note that the actual flush requests, which have no bio attached to the request still bypass the I/O schedulers. Signed-off-by: Bart Van Assche <bvanassche@acm.org> [hch: rebased] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19  blk-mq: defer to the normal submission path for non-flush flush commands  Christoph Hellwig
If blk_insert_flush decides that a command does not need to use the flush state machine, return false and let blk_mq_submit_bio handle it the normal way (including using an I/O scheduler) instead of doing a bypass insert. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18  blk-mq: make sure elevator callbacks aren't called for passthrough request  Christoph Hellwig
In case of q->elevator, passthrough request can still be marked as RQF_ELV, so some elevator callbacks will be called for them. Fix this by splitting RQF_SCHED_TAGS, which is set for all requests that are issued on a queue that uses an I/O scheduler, and RQF_USE_SCHED for non-flush, non-passthrough requests on such a queue. Roughly based on two different patches from Ming Lei <ming.lei@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230518053101.760632-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
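The flag split described above can be sketched as a small pure function. The bit values and the `toy_` names are illustrative assumptions only; the kernel's flag values and the place where the flags are assigned differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only, not the kernel's. */
#define TOY_RQF_SCHED_TAGS	(1u << 0)	/* queue uses an I/O scheduler */
#define TOY_RQF_USE_SCHED	(1u << 1)	/* request goes through the scheduler */

static unsigned int toy_rq_flags(bool has_elevator, bool is_flush,
				 bool is_passthrough)
{
	unsigned int flags = 0;

	if (has_elevator) {
		/* Set for every request issued on a queue with a scheduler. */
		flags |= TOY_RQF_SCHED_TAGS;
		/* Set only for non-flush, non-passthrough requests. */
		if (!is_flush && !is_passthrough)
			flags |= TOY_RQF_USE_SCHED;
	}

	/* Invariant: USE_SCHED never appears without SCHED_TAGS. */
	assert(!(flags & TOY_RQF_USE_SCHED) || (flags & TOY_RQF_SCHED_TAGS));
	return flags;
}
```

A passthrough request on a queue with a scheduler therefore carries only the SCHED_TAGS bit in this model, so code that gates elevator callbacks on USE_SCHED never sees it.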
2023-05-18  blk-mq: remove RQF_ELVPRIV  Christoph Hellwig
RQF_ELVPRIV is set for all non-flush requests that have RQF_ELV set. Expand this condition in the two users of the flag and remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230518053101.760632-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18  blk-mq: don't queue plugged passthrough requests into scheduler  Ming Lei
Passthrough requests should never be queued to the I/O scheduler, as scheduling these opaque requests doesn't make sense, and I/O schedulers might require req->bio to be always valid. We never let passthrough requests insert into the scheduler before commit 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging"), restore this behavior even for passthrough requests issued under a plug. [hch: use blk_mq_insert_requests for passthrough requests, fix up the commit message and comments] Reported-by: Guangwu Zhang <guazhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGS2=YosaYaUTEMU3uaf+y=8MqSrhL7sYsJn8EwbaM=76p_4Qg@mail.gmail.com/ Investigated-by: Yu Kuai <yukuai1@huaweicloud.com> Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230518053101.760632-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26  Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux  Linus Torvalds
Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. 
It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...
2023-04-20  Revert "block: Merge bio before checking ->cached_rq"  Ming Lei
This reverts commit 23f3e3272e7a4d9fb870485cd6df1e4f9539282c. blk-mq sched bio merge still needs request to grab queue usage counter, so we can't simply call blk_mq_attempt_bio_merge() when queue usage counter isn't held. Fixes: 23f3e3272e7a ("block: Merge bio before checking ->cached_rq") Cc: Xiao Ni <xni@redhat.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230420112018.1108058-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: remove __blk_mq_run_hw_queue  Christoph Hellwig
__blk_mq_run_hw_queue just contains a WARN_ON_ONCE for calls from interrupt context and a blk_mq_run_dispatch_ops-protected call to blk_mq_sched_dispatch_requests. Open code the call to blk_mq_sched_dispatch_requests in both callers, and move the WARN_ON_ONCE to blk_mq_run_hw_queue where it can be extended to all !async calls, while the other call is from workqueue context and thus obviously does not need the assert. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: move the !async handling out of __blk_mq_delay_run_hw_queue  Christoph Hellwig
Only blk_mq_run_hw_queue can call __blk_mq_delay_run_hw_queue with async=false, so move the handling there. With this __blk_mq_delay_run_hw_queue can be merged into blk_mq_delay_run_hw_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: move the blk_mq_hctx_stopped check in __blk_mq_delay_run_hw_queue  Christoph Hellwig
For the in-context dispatch, blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection. For the async dispatch case having a check before scheduling the work still makes sense to avoid needless workqueue scheduling, so just keep it for that case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: remove the blk_mq_hctx_stopped check in blk_mq_run_work_fn  Christoph Hellwig
blk_mq_hctx_stopped is already checked in blk_mq_sched_dispatch_requests under blk_mq_run_dispatch_ops() protection, so remove the duplicate check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413060651.694656-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: pass a flags argument to blk_mq_add_to_requeue_list  Christoph Hellwig
Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: pass a flags argument to elevator_type->insert_requests  Christoph Hellwig
Instead of passing a bool at_head, pass down the full flags from the blk_mq_insert_request interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: pass a flags argument to blk_mq_request_bypass_insert  Christoph Hellwig
Replace the boolean at_head argument with the same flags that are already passed to blk_mq_insert_request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-19-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: pass a flags argument to blk_mq_insert_request  Christoph Hellwig
Replace the at_head bool with a flags argument that so far only contains a single BLK_MQ_INSERT_AT_HEAD value. This makes it much easier to grep for head insertions into the blk-mq dispatch queues. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
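The bool-to-flags change can be illustrated with a toy queue. This is a sketch under stated assumptions: `toy_insert_t`, `TOY_INSERT_AT_HEAD` and the array-backed queue are hypothetical stand-ins that only mirror the idea of a named flag replacing an anonymous `true`/`false` argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag type modeled on the BLK_MQ_INSERT_AT_HEAD idea. */
typedef unsigned int toy_insert_t;
#define TOY_INSERT_AT_HEAD	0x01u

#define TOY_QUEUE_CAP	8

struct toy_queue {
	int tags[TOY_QUEUE_CAP];	/* index 0 is the head */
	size_t len;
};

/* Insert at the tail by default, or at the head when the flag is set. */
static void toy_insert(struct toy_queue *q, int tag, toy_insert_t flags)
{
	assert(q->len < TOY_QUEUE_CAP);
	if (flags & TOY_INSERT_AT_HEAD) {
		/* Shift existing entries one slot toward the tail. */
		for (size_t i = q->len; i > 0; i--)
			q->tags[i] = q->tags[i - 1];
		q->tags[0] = tag;
	} else {
		q->tags[q->len] = tag;
	}
	q->len++;
}
```

A call such as `toy_insert(&q, 3, TOY_INSERT_AT_HEAD)` is self-describing at the call site and easy to grep for, which is the benefit the commit message points out.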
2023-04-13  blk-mq: don't kick the requeue_list in blk_mq_add_to_requeue_list  Christoph Hellwig
blk_mq_add_to_requeue_list takes a bool parameter to control how to kick the requeue list at the end of the function. Move the call to blk_mq_kick_requeue_list to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: don't run the hw_queue from blk_mq_request_bypass_insert  Christoph Hellwig
blk_mq_request_bypass_insert takes a bool parameter to control how to run the queue at the end of the function. Move the blk_mq_run_hw_queue call to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: don't run the hw_queue from blk_mq_insert_request  Christoph Hellwig
blk_mq_insert_request takes two bool parameters to control how to run the queue at the end of the function. Move the blk_mq_run_hw_queue call to the callers that want it instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: fold __blk_mq_try_issue_directly into its two callers  Christoph Hellwig
Due to the wildly different behavior based on the bypass_insert argument, not a whole lot of code in __blk_mq_try_issue_directly is actually shared between blk_mq_try_issue_directly and blk_mq_request_issue_directly. Remove __blk_mq_try_issue_directly and fold the code into the two callers instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: factor out a blk_mq_get_budget_and_tag helper  Christoph Hellwig
Factor out a helper from __blk_mq_try_issue_directly in preparation of folding that function into its two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: refactor the DONTPREP/SOFTBARRIER andling in blk_mq_requeue_work  Christoph Hellwig
Split the RQF_DONTPREP and RQF_SOFTBARRIER handling into separate branches to make the code more readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: refactor passthrough vs flush handling in blk_mq_insert_request  Christoph Hellwig
While both passthrough and flush requests call directly into blk_mq_request_bypass_insert, the parameters aren't the same. Split the handling into two separate conditionals and turn the whole function into an if/elif/elif/else flow instead of the gotos. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
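The if/else-if chain described above can be sketched as a routing function. The first two branches follow the commit message. The last two (scheduler when an elevator is attached, software queue otherwise) are an assumption drawn from neighbouring commits in this log, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Where a request ends up in this toy model. */
enum toy_dest {
	TOY_DEST_BYPASS_PASSTHROUGH,	/* bypass insert, passthrough parameters */
	TOY_DEST_BYPASS_FLUSH,		/* bypass insert, flush parameters */
	TOY_DEST_SCHEDULER,		/* handed to the I/O scheduler */
	TOY_DEST_SW_QUEUE,		/* per-CPU software queue */
};

struct toy_rq {
	bool passthrough;
	bool flush;
};

/* One branch per case, replacing a goto-based flow. */
static enum toy_dest toy_insert_dest(const struct toy_rq *rq, bool has_elevator)
{
	assert(rq);
	if (rq->passthrough)
		return TOY_DEST_BYPASS_PASSTHROUGH;
	else if (rq->flush)
		return TOY_DEST_BYPASS_FLUSH;
	else if (has_elevator)
		return TOY_DEST_SCHEDULER;
	else
		return TOY_DEST_SW_QUEUE;
}
```

Keeping passthrough and flush in separate branches lets each pass its own parameters to the bypass insert, even though both land on the same underlying helper.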
2023-04-13  blk-mq: fold __blk_mq_insert_req_list into blk_mq_insert_request  Christoph Hellwig
Remove this very small helper and fold it into the only caller. Note that this moves the trace_block_rq_insert out of ctx->lock, matching the other calls to this tracepoint. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: fold __blk_mq_insert_request into blk_mq_insert_request  Christoph Hellwig
There is no good point in keeping the __blk_mq_insert_request around for two function calls and a single caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: move blk_mq_sched_insert_request to blk-mq.c  Christoph Hellwig
blk_mq_sched_insert_request is the main request insert helper and not directly I/O scheduler related. Move blk_mq_sched_insert_request to blk-mq.c, rename it to blk_mq_insert_request and mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: fold blk_mq_sched_insert_requests into blk_mq_dispatch_plug_list  Christoph Hellwig
blk_mq_dispatch_plug_list is the only caller of blk_mq_sched_insert_requests, and it makes sense to just fold it there as blk_mq_sched_insert_requests isn't specific to I/O schedulers despite the name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: move more logic into blk_mq_insert_requests  Christoph Hellwig
Move all logic related to the direct insert (including the call to blk_mq_run_hw_queue) into blk_mq_insert_requests to streamline the code flow a bit, and to allow marking blk_mq_try_issue_list_directly static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: include <linux/blk-mq.h> in block/blk-mq.h  Christoph Hellwig
block/blk-mq.h needs various definitions from <linux/blk-mq.h>, include it there instead of relying on the source files to include both. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: remove blk-mq-tag.h  Christoph Hellwig
blk-mq-tag.h is always included by blk-mq.h, and causes recursive inclusion hell with further changes. Just merge it into blk-mq.h instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-13  blk-mq: don't plug for head insertions in blk_execute_rq_nowait  Christoph Hellwig
Plugs never insert at head, so don't plug for head insertions. Fixes: 1c2d2fff6dc0 ("block: wire-up support for passthrough plugging") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230413064057.707578-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-04  blk-mq: directly poll requests  Keith Busch
Polling needs a bio with a valid bi_bdev, but neither of those is guaranteed for polled driver requests. Make request based polling directly use blk-mq's polling function instead. When executing a request from a polled hctx, we know the request's cookie, and that it's from a live blk-mq queue that supports polling, so we can safely skip everything that bio_poll provides. Cc: stable@kernel.org Reported-by: Martin Belanger <Martin.Belanger@dell.com> Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Link: https://lore.kernel.org/r/20230331180056.1155862-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-27  block: open code __blk_account_io_done()  Chaitanya Kulkarni
There is only one caller for __blk_account_io_done(), and the function is small enough to fit in its caller blk_account_io_done(). Remove the function and open-code it in its caller blk_account_io_done(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-27  block: open code __blk_account_io_start()  Chaitanya Kulkarni
There is only one caller for __blk_account_io_start(), and the function is small enough to fit in its caller blk_account_io_start(). Remove the function and open-code it in its caller blk_account_io_start(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230327073427.4403-2-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-20  blk-mq: remove hybrid polling  Keith Busch
io_uring provides the only way user space can poll completions, and that always sets BLK_POLL_NOSLEEP. This effectively makes hybrid polling dead code, so remove it and everything supporting it. Hybrid polling was effectively killed off with 9650b453a3d4b1, "block: ignore RWF_HIPRI hint for sync dio", but still potentially reachable through io_uring until d729cf9acb93119, "io_uring: don't sleep when polling for I/O", but hybrid polling probably should not have been reachable through that async interface from the beginning. Fixes: 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio") Fixes: d729cf9acb93 ("io_uring: don't sleep when polling for I/O") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20230320194926.3353144-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-16  blk-mq: return actual keyslot error in blk_insert_cloned_request()  Eric Biggers
To avoid hiding information, pass on the error code from blk_crypto_rq_get_keyslot() instead of always using BLK_STS_IOERR. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-16  blk-crypto: remove blk_crypto_insert_cloned_request()  Eric Biggers
blk_crypto_insert_cloned_request() is the same as blk_crypto_rq_get_keyslot(), so just use that directly. Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-16  blk-mq: release crypto keyslot before reporting I/O complete  Eric Biggers
Once all I/O using a blk_crypto_key has completed, filesystems can call blk_crypto_evict_key(). However, the block layer currently doesn't call blk_crypto_put_keyslot() until the request is being freed, which happens after upper layers have been told (via bio_endio()) the I/O has completed. This causes a race condition where blk_crypto_evict_key() can see 'slot_refs != 0' without there being an actual bug. This makes __blk_crypto_evict_key() hit the 'WARN_ON_ONCE(atomic_read(&slot->slot_refs) != 0)' and return without doing anything, eventually causing a use-after-free in blk_crypto_reprogram_all_keys(). (This is a very rare bug and has only been seen when per-file keys are being used with fscrypt.) There are two options to fix this: either release the keyslot before bio_endio() is called on the request's last bio, or make __blk_crypto_evict_key() ignore slot_refs. Let's go with the first solution, since it preserves the ability to report bugs (via WARN_ON_ONCE) where a key is evicted while still in-use. Fixes: a892c8d52c02 ("block: Inline encryption support for blk-mq") Cc: stable@vger.kernel.org Reviewed-by: Nathan Huckleberry <nhuck@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230315183907.53675-2-ebiggers@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
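The ordering chosen by this fix (release the keyslot first, then report completion) can be modeled in a few lines of userspace C. All names here are hypothetical, and the completion callback simply records the reference count that an upper layer would observe if it tried to evict the key immediately.

```c
#include <assert.h>
#include <stddef.h>

struct toy_slot {
	int slot_refs;			/* models slot->slot_refs */
};

struct toy_rq {
	struct toy_slot *slot;		/* keyslot held by the request, or NULL */
	void (*end_io)(void *ctx);	/* models notifying upper layers */
	void *end_io_ctx;
};

/* Records what an evicting caller would see at completion time. */
struct toy_evict_probe {
	struct toy_slot *slot;
	int refs_seen;
};

static void toy_probe_end_io(void *ctx)
{
	struct toy_evict_probe *probe = ctx;

	probe->refs_seen = probe->slot->slot_refs;
}

static void toy_put_keyslot(struct toy_rq *rq)
{
	if (rq->slot) {
		assert(rq->slot->slot_refs > 0);
		rq->slot->slot_refs--;
		rq->slot = NULL;
	}
}

/* Fixed ordering: drop the keyslot reference, then report completion. */
static void toy_complete(struct toy_rq *rq)
{
	toy_put_keyslot(rq);
	if (rq->end_io)
		rq->end_io(rq->end_io_ctx);
}
```

Because the reference is dropped before the callback runs, the probe sees zero references. With the calls in the opposite order it would see a nonzero count, which is the spurious 'slot_refs != 0' condition the message describes.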
2023-03-14  block: do not reverse request order when flushing plug list  Jan Kara
Commit 26fed4ac4eab ("block: flush plug based on hardware and software queue order") changed flushing of plug list to submit requests one device at a time. However while doing that it also started using list_add_tail() instead of list_add() used previously thus effectively submitting requests in reverse order. Also when forming a rq_list with remaining requests (in case two or more devices are used), we effectively reverse the ordering of the plug list for each device we process.

Submitting requests in reverse order has negative impact on performance for rotational disks (when BFQ is not in use). We observe 10-25% regression in random 4k write throughput, as well as ~20% regression in MariaDB OLTP benchmark on rotational storage on btrfs filesystem.

Fix the problem by preserving ordering of the plug list when inserting requests into the queuelist as well as by appending to requeue_list instead of prepending to it.

Fixes: 26fed4ac4eab ("block: flush plug based on hardware and software queue order")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
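Why the choice between head and tail insertion flips the submission order can be shown with a toy singly linked list. The sketch below is userspace C, not the kernel's list API. It assumes the plug list behaves like a stack (new requests are pushed at the head), and all `toy_*` names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request with a sequence number recording submission order. */
struct toy_rq {
    int seq;
    struct toy_rq *next;
};

/* Insert at the head of the list. */
static void toy_add_head(struct toy_rq **list, struct toy_rq *rq)
{
    assert(rq != NULL);
    rq->next = *list;
    *list = rq;
}

/* Insert at the tail of the list (O(n) walk; fine for a sketch). */
static void toy_add_tail(struct toy_rq **list, struct toy_rq *rq)
{
    assert(rq != NULL);
    while (*list)
        list = &(*list)->next;
    rq->next = NULL;
    *list = rq;
}

/* Remove and return the first entry, or NULL if the list is empty. */
static struct toy_rq *toy_pop(struct toy_rq **list)
{
    struct toy_rq *rq = *list;

    if (rq) {
        *list = rq->next;
        rq->next = NULL;
    }
    return rq;
}

/* Move every request from plug to out, inserting at the chosen end. */
static void toy_flush(struct toy_rq **plug, struct toy_rq **out, int use_tail)
{
    struct toy_rq *rq;

    while ((rq = toy_pop(plug)) != NULL) {
        if (use_tail)
            toy_add_tail(out, rq);
        else
            toy_add_head(out, rq);
    }
}

/* Sequence number of the idx-th entry, or -1 if the list is shorter. */
static int toy_seq_at(const struct toy_rq *list, int idx)
{
    while (list && idx-- > 0)
        list = list->next;
    return list ? list->seq : -1;
}
```

Because the modeled plug list already holds the newest request first, popping from it and inserting at the head of the destination restores submission order, while inserting at the tail keeps the reversed order.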
2023-03-03  Merge tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux  Linus Torvalds
Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - Don't access released socket during error recovery (Akinobu Mita)
     - Bring back auto-removal of deleted namespaces during sequential scan (Christoph Hellwig)
     - Fix an error code in nvme_auth_process_dhchap_challenge (Dan Carpenter)
     - Show well known discovery name (Daniel Wagner)
     - Add a missing endianness conversion in effects masking (Keith Busch)
 - Fix for a regression introduced in blk-rq-qos during init in this merge window (Breno)
 - Reorder a few fields in struct blk_mq_tag_set, eliminating a few holes and shrinking it (Christophe)
 - Remove redundant bdev_get_queue() NULL checks (Juhyung)
 - Add sed-opal single user mode support flag (Luca)
 - Remove SQE128 check in ublk as it isn't needed, saving some memory (Ming)
 - Op specific segment checking for cloned requests (Uday)
 - Exclusive open partition scan fixes (Yu)
 - Loop offset/size checking before assigning them in the device (Zhong)
 - Bio polling fixes (me)

* tag 'block-6.3-2023-03-03' of git://git.kernel.dk/linux:
  blk-mq: enforce op-specific segment limits in blk_insert_cloned_request
  nvme-fabrics: show well known discovery name
  nvme-tcp: don't access released socket during error recovery
  nvme-auth: fix an error code in nvme_auth_process_dhchap_challenge()
  nvme: bring back auto-removal of deleted namespaces during sequential scan
  blk-iocost: Pass gendisk to ioc_refresh_params
  nvme: fix sparse warning on effects masking
  block: be a bit more careful in checking for NULL bdev while polling
  block: clear bio->bi_bdev when putting a bio back in the cache
  loop: loop_set_status_from_info() check before assignment
  ublk: remove check IO_URING_F_SQE128 in ublk_ch_uring_cmd
  block: remove more NULL checks after bdev_get_queue()
  blk-mq: Reorder fields in 'struct blk_mq_tag_set'
  block: fix scan partition for exclusively open device again
  block: Revert "block: Do not reread partition table on exclusively open device"
  sed-opal: add support flag for SUM in status ioctl
2023-03-02  blk-mq: enforce op-specific segment limits in blk_insert_cloned_request  Uday Shankar
The block layer might merge together discard requests up until the max_discard_segments limit is hit, but blk_insert_cloned_request checks the segment count against max_segments regardless of the req op. This can result in errors like the following when discards are issued through a DM device and max_discard_segments exceeds max_segments for the queue of the chosen underlying device.

  blk_insert_cloned_request: over max segments limit. (256 > 129)

Fix this by looking at the req_op and enforcing the appropriate segment limit - max_discard_segments for REQ_OP_DISCARDs and max_segments for everything else.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230301000655.48112-1-ushankar@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
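The check described above amounts to selecting the limit by operation type before comparing. The following is a minimal userspace sketch of that idea, not the kernel implementation: the `toy_*` types and functions are hypothetical, and only the two limit names and the 256 versus 129 figures come from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation codes; the kernel uses REQ_OP_* values. */
enum toy_req_op { TOY_OP_READ, TOY_OP_WRITE, TOY_OP_DISCARD };

/* Hypothetical subset of a queue's limits. */
struct toy_queue_limits {
    unsigned short max_segments;
    unsigned short max_discard_segments;
};

/* Pick the segment limit that applies to this operation. */
static unsigned int toy_max_segments(const struct toy_queue_limits *lim,
                                     enum toy_req_op op)
{
    assert(lim != NULL);
    if (op == TOY_OP_DISCARD)
        return lim->max_discard_segments;
    return lim->max_segments;
}

/* Return 0 if the request fits the op-specific limit, -1 otherwise. */
static int toy_check_cloned_request(const struct toy_queue_limits *lim,
                                    enum toy_req_op op,
                                    unsigned int nr_segments)
{
    return nr_segments > toy_max_segments(lim, op) ? -1 : 0;
}
```

With `max_segments` of 129 and `max_discard_segments` of 256, a 256-segment discard passes this check, whereas comparing it against `max_segments` alone would reject it.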
2023-02-20  Merge tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux  Linus Torvalds
Pull block updates from Jens Axboe:

 - NVMe updates via Christoph:
     - Small improvements to the logging functionality (Amit Engel)
     - Authentication cleanups (Hannes Reinecke)
     - Cleanup and optimize the DMA mapping code in the PCIe driver (Keith Busch)
     - Work around the command effects for Format NVM (Keith Busch)
     - Misc cleanups (Keith Busch, Christoph Hellwig)
     - Fix and cleanup freeing single sgl (Keith Busch)
 - MD updates via Song:
     - Fix a rare crash during the takeover process
     - Don't update recovery_cp when curr_resync is ACTIVE
     - Free writes_pending in md_stop
     - Change active_io to percpu
 - Updates to drbd, inching us closer to unifying the out-of-tree driver with the in-tree one (Andreas, Christoph, Lars, Robert)
 - BFQ update adding support for multi-actuator drives (Paolo, Federico, Davide)
 - Make brd compliant with REQ_NOWAIT (me)
 - Fix for IOPOLL and queue entering, fixing stalled IO waiting on timeouts (me)
 - Fix for REQ_NOWAIT with multiple bios (me)
 - Fix memory leak in blktrace cleanup (Greg)
 - Clean up sbitmap and fix a potential hang (Kemeng)
 - Clean up some bits in BFQ, and fix a bug in the request injection (Kemeng)
 - Clean up the request allocation and issue code, and fix some bugs related to that (Kemeng)
 - ublk updates and fixes:
     - Add support for unprivileged ublk (Ming)
     - Improve device deletion handling (Ming)
     - Misc (Liu, Ziyang)
 - s390 dasd fixes (Alexander, Qiheng)
 - Improve utility of request caching and fixes (Anuj, Xiao)
 - zoned cleanups (Pankaj)
 - More constification for kobjs (Thomas)
 - blk-iocost cleanups (Yu)
 - Remove bio splitting from drivers that don't need it (Christoph)
 - Switch blk-cgroups to use struct gendisk. Some of this is now incomplete as select late reverts were done. (Christoph)
 - Add bvec initialization helpers, and convert callers to use that rather than open-coding it (Christoph)
 - Misc fixes and cleanups (Jinke, Keith, Arnd, Bart, Li, Martin, Matthew, Ulf, Zhong)

* tag 'for-6.3/block-2023-02-16' of git://git.kernel.dk/linux: (169 commits)
  brd: use radix_tree_maybe_preload instead of radix_tree_preload
  block: use proper return value from bio_failfast()
  block: bio-integrity: Copy flags when bio_integrity_payload is cloned
  block: Fix io statistics for cgroup in throttle path
  brd: mark as nowait compatible
  brd: check for REQ_NOWAIT and set correct page allocation mask
  brd: return 0/-error from brd_insert_page()
  block: sync mixed merged request's failfast with 1st bio's
  Revert "blk-cgroup: pin the gendisk in struct blkcg_gq"
  Revert "blk-cgroup: pass a gendisk to blkg_lookup"
  Revert "blk-cgroup: delay blk-cgroup initialization until add_disk"
  Revert "blk-cgroup: delay calling blkcg_exit_disk until disk_release"
  Revert "blk-cgroup: move the cgroup information to struct gendisk"
  nvme-pci: remove iod use_sgls
  nvme-pci: fix freeing single sgl
  block: ublk: check IO buffer based on flag need_get_data
  s390/dasd: Fix potential memleak in dasd_eckd_init()
  s390/dasd: sort out physical vs virtual pointers usage
  block: Remove the ALLOC_CACHE_SLACK constant
  block: make kobj_type structures constant
  ...
2023-02-09  block: Merge bio before checking ->cached_rq  Xiao Ni
The submission path currently checks whether plug->cached_rq is empty before trying to merge the bio. But merging is unrelated to plug->cached_rq: it tries to merge the bio with requests in plug->mq_list. When ->cached_rq is empty, the merge attempt is skipped and the chance to merge is missed. So move the merge function before the ->cached_rq check.

Signed-off-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230209031930.27354-1-xni@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
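The effect of reordering the two steps can be seen in a toy model of the submission path. This is a sketch under stated assumptions, not the blk-mq code: the `toy_*` names are hypothetical, the plug holds at most one queued request, and only a simple back merge (the bio starts where the queued request ends) is modeled.

```c
#include <assert.h>
#include <stddef.h>

enum toy_result { TOY_MERGED, TOY_USED_CACHED_RQ, TOY_ALLOCATED_RQ };

/* Hypothetical, heavily simplified view of a plug. */
struct toy_plug {
    int  has_queued_rq;  /* stand-in for a non-empty plug->mq_list */
    long queued_end;     /* sector just past the queued request */
    int  has_cached_rq;  /* stand-in for a non-empty plug->cached_rq */
};

/* Back merge only: succeeds when the bio starts where the request ends. */
static int toy_attempt_merge(struct toy_plug *plug, long sector, long nr_sectors)
{
    assert(plug != NULL);
    if (!plug->has_queued_rq || sector != plug->queued_end)
        return 0;
    plug->queued_end += nr_sectors;
    return 1;
}

/* Order described as the problem: the cached_rq check gates the merge. */
static enum toy_result toy_submit_check_first(struct toy_plug *plug,
                                              long sector, long nr_sectors)
{
    if (!plug->has_cached_rq)
        return TOY_ALLOCATED_RQ;  /* merge was never attempted */
    if (toy_attempt_merge(plug, sector, nr_sectors))
        return TOY_MERGED;
    return TOY_USED_CACHED_RQ;
}

/* Order after the change: try to merge, then look at cached_rq. */
static enum toy_result toy_submit_merge_first(struct toy_plug *plug,
                                              long sector, long nr_sectors)
{
    if (toy_attempt_merge(plug, sector, nr_sectors))
        return TOY_MERGED;
    if (!plug->has_cached_rq)
        return TOY_ALLOCATED_RQ;
    return TOY_USED_CACHED_RQ;
}
```

For a plug with a queued request ending at sector 8 and no cached request, a bio starting at sector 8 ends up in a newly allocated request under the first ordering but is merged under the second.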