summaryrefslogtreecommitdiff
path: root/include/linux/blk-mq.h
AgeCommit message (Collapse)Author
2024-01-17blk-mq: Remove the hctx 'run' debugfs attributeBart Van Assche
Nobody uses the debugfs hctx 'run' attribute. Hence remove this attribute and also the code that updates the corresponding member variable. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Gabriel Ryan <gabe@cs.columbia.edu> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240117203609.4122520-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-22block: skip start/end time stamping for passthrough IOKundan Kumar
commit 41fa722239b4 ("blk-mq: do not include passthrough requests in I/O accounting")' disables I/O accounting for passthrough requests. Since tools like 'iostat' do not show anything useful for passthrough I/O, it's wasteful to do start/end time-stamping. So do away with that. Avoiding the time-stamping improves the I/O performance by ~7% Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20231222101707.6921-1-kundan.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-22blk-mq: remove RQF_MQ_INFLIGHTChengming Zhou
Since the previous patch change to only account active requests when we really allocate the driver tag, the RQF_MQ_INFLIGHT can be removed and no double account problem. 1. none elevator: flush request will use the first pending request's driver tag, won't double account. 2. other elevator: flush request will be accounted when allocate driver tag when issue, and will be unaccounted when it put the driver tag. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230913151616.3164338-3-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-29Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...
2023-07-20blk-mq: delete dead struct blk_mq_hw_ctx->queued fieldChengming Zhou
This counter is not used anywhere, so delete it. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230720095512.1403123-1-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17blk-flush: reuse rq queuelist in flush state machineChengming Zhou
Since we don't need to maintain inflight flush_data requests list anymore, we can reuse rq->queuelist for flush pending list. Note in mq_flush_data_end_io(), we need to re-initialize rq->queuelist before reusing it in the state machine when end, since the rq->rq_next also reuse it, may have corrupted rq->queuelist by the driver. This patch decrease the size of struct request by 16 bytes. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230717040058.3993930-5-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-17blk-mq: use percpu csd to remote complete instead of per-rq csdChengming Zhou
If request need to be completed remotely, we insert it into percpu llist, and smp_call_function_single_async() if llist is empty previously. We don't need to use per-rq csd, percpu csd is enough. And the size of struct request is decreased by 24 bytes. This way is cleaner, and looks correct, given block softirq is guaranteed to be scheduled to consume the list if one new request is added to this percpu list, either smp_call_function_single_async() returns -EBUSY or 0. Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230717040058.3993930-2-chengming.zhou@linux.dev Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-07-10block: remove dead struc request->completion_data fieldJens Axboe
It's no longer used. While in there, also update the comment as to why it can coexist with the rb_node. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-28block: add request polling helperKeith Busch
Provide a direct request polling will for drivers. The interface does not require a bio, and can skip the overhead associated with polling those. The biggest gain from skipping the relatively expensive xarray lookup unnecessary when you already have the request. With this, the simple rq/qc conversion functions have only one caller each, so open code this and remove the helpers. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230612190343.2087040-2-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-25blk-mq: fix two misuses on RQF_USE_SCHEDMing Lei
Request allocated from sched tags can't be issued via ->queue_rqs() directly, since driver tag isn't allocated yet. This is the 1st misuse of RQF_USE_SCHED for figuring out plug->has_elevator. Request allocated from sched tags can't be ended by blk_mq_end_request_batch() too, fix the 2nd RQF_USE_SCHED misuse in blk_mq_add_to_batch(). Without this patch, NVMe uring cmd passthrough IO workload can run into hang easily with real io scheduler. Fixes: dd6216bb16e8 ("blk-mq: make sure elevator callbacks aren't called for passthrough request") Reported-by: Guangwu Zhang <guazhang@redhat.com> Closes: https://lore.kernel.org/linux-block/CAGS2=YrBjpLPOKa-gzcKuuOG60AGth5794PNCDwatdnnscB9ug@mail.gmail.com/ Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230624130105.1443879-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-06-12blk-mq: fix potential io hang by wrong 'wake_batch'Yu Kuai
In __blk_mq_tag_busy/idle(), updating 'active_queues' and calculating 'wake_batch' is not atomic: t1: t2: _blk_mq_tag_busy blk_mq_tag_busy inc active_queues // assume 1->2 inc active_queues // 2 -> 3 blk_mq_update_wake_batch // calculate based on 3 blk_mq_update_wake_batch /* calculate based on 2, while active_queues is actually 3. */ Fix this problem by protecting them wih 'tags->lock', this is not a hot path, so performance should not be concerned. And now that all writers are inside the lock, switch 'actives_queues' from atomic to unsigned int. Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230610023043.2559121-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-23block: remove redundant req_op in blk_rq_is_passthroughLi Nan
op &= REQ_OP_MASK in blk_op_is_passthrough() is exactly what req_op() do. Therefore, it is redundant to call req_op() for blk_op_is_passthrough(). Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20230522085355.1740772-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19blk-mq: don't use the requeue list to queue flush commandsChristoph Hellwig
Currently both requeues of commands that were already sent to the driver and flush commands submitted from the flush state machine share the same requeue_list struct request_queue, despite requeues doing head insertions and flushes not. Switch to using two separate lists instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-19blk-mq: use the I/O scheduler for writes from the flush state machineBart Van Assche
Send write requests issued by the flush state machine through the normal I/O submission path including the I/O scheduler (if present) so that I/O scheduler policies are applied to writes with the FUA flag set. Separate the I/O scheduler members from the flush members in struct request since now a request may pass through both an I/O scheduler and the flush machinery. Note that the actual flush requests, which have no bio attached to the request still bypass the I/O schedulers. Signed-off-by: Bart Van Assche <bvanassche@acm.org> [hch: rebased] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230519044050.107790-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18block: Introduce blk_rq_is_seq_zoned_write()Bart Van Assche
Introduce the function blk_rq_is_seq_zoned_write(). This function will be used in later patches to preserve the order of zoned writes that require write serialization. This patch includes an optimization: instead of using rq->q->disk->part0->bd_queue to check whether or not the queue is associated with a zoned block device, use rq->q->disk->queue. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20230517174230.897144-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18blk-mq: make sure elevator callbacks aren't called for passthrough requestChristoph Hellwig
In case of q->elevator, passthrough request can still be marked as RQF_ELV, so some elevator callbacks will be called for them. Fix this by splitting RQF_SCHED_TAGS, which is set for all requests that are issued on a queue that uses an I/O scheduler, and RQF_USE_SCHED for non-flush, non-passthrough requests on such a queue. Roughly based on two different patches from Ming Lei <ming.lei@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230518053101.760632-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-18blk-mq: remove RQF_ELVPRIVChristoph Hellwig
RQF_ELVPRIV is set for all non-flush requests that have RQF_ELV set. Expand this condition in the two users of the flag and remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20230518053101.760632-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-04-26Merge tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - drbd patches, bringing us closer to unifying the out-of-tree version and the in tree one (Andreas, Christoph) - support for auto-quiesce for the s390 dasd driver (Stefan) - MD pull request via Song: - md/bitmap: Optimal last page size (Jon Derrick) - Various raid10 fixes (Yu Kuai, Li Nan) - md: add error_handlers for raid0 and linear (Mariusz Tkaczyk) - NVMe pull request via Christoph: - Drop redundant pci_enable_pcie_error_reporting (Bjorn Helgaas) - Validate nvmet module parameters (Chaitanya Kulkarni) - Fence TCP socket on receive error (Chris Leech) - Fix async event trace event (Keith Busch) - Minor cleanups (Chaitanya Kulkarni, zhenwei pi) - Fix and cleanup nvmet Identify handling (Damien Le Moal, Christoph Hellwig) - Fix double blk_mq_complete_request race in the timeout handler (Lei Yin) - Fix irq locking in nvme-fcloop (Ming Lei) - Remove queue mapping helper for rdma devices (Sagi Grimberg) - use structured request attribute checks for nbd (Jakub) - fix blk-crypto race conditions between keyslot management (Eric) - add sed-opal support for reading read locking range attributes (Ondrej) - make fault injection configurable for null_blk (Akinobu) - clean up the request insertion API (Christoph) - clean up the queue running API (Christoph) - blkg config helper cleanups (Tejun) - lazy init support for blk-iolatency (Tejun) - various fixes and tweaks to ublk (Ming) - remove hybrid polling. It hasn't really been useful since we got async polled IO support, and these days we don't support sync polled IO at all (Keith) - misc fixes, cleanups, improvements (Zhong, Ondrej, Colin, Chengming, Chaitanya, me) * tag 'for-6.4/block-2023-04-21' of git://git.kernel.dk/linux: (118 commits) nbd: fix incomplete validation of ioctl arg ublk: don't return 0 in case of any failure sed-opal: geometry feature reporting command null_blk: Always check queue mode setting from configfs block: ublk: switch to ioctl command encoding blk-mq: fix the blk_mq_add_to_requeue_list call in blk_kick_flush block, bfq: Fix division by zero error on zero wsum fault-inject: fix build error when FAULT_INJECTION_CONFIGFS=y and CONFIGFS_FS=m block: store bdev->bd_disk->fops->submit_bio state in bdev block: re-arrange the struct block_device fields for better layout md/raid5: remove unused working_disks variable md/raid10: don't call bio_start_io_acct twice for bio which experienced read error md/raid10: fix memleak of md thread md/raid10: fix memleak for 'conf->bio_split' md/raid10: fix leak of 'r10bio->remaining' for recovery md/raid10: don't BUG_ON() in raise_barrier() md: fix soft lockup in status_resync md: add error_handlers for raid0 and linear md: Use optimal I/O size for last bitmap page md: Fix types in sb writer ...
2023-03-20blk-mq: remove hybrid pollingKeith Busch
io_uring provides the only way user space can poll completions, and that always sets BLK_POLL_NOSLEEP. This effectively makes hybrid polling dead code, so remove it and everything supporting it. Hybrid polling was effectively killed off with 9650b453a3d4b1, "block: ignore RWF_HIPRI hint for sync dio", but still potentially reachable through io_uring until d729cf9acb93119, "io_uring: don't sleep when polling for I/O", but hybrid polling probably should not have been reachable through that async interface from the beginning. Fixes: 9650b453a3d4 ("block: ignore RWF_HIPRI hint for sync dio") Fixes: d729cf9acb93 ("io_uring: don't sleep when polling for I/O") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20230320194926.3353144-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-03-14block: do not reverse request order when flushing plug listJan Kara
Commit 26fed4ac4eab ("block: flush plug based on hardware and software queue order") changed flushing of plug list to submit requests one device at a time. However while doing that it also started using list_add_tail() instead of list_add() used previously thus effectively submitting requests in reverse order. Also when forming a rq_list with remaining requests (in case two or more devices are used), we effectively reverse the ordering of the plug list for each device we process. Submitting requests in reverse order has negative impact on performance for rotational disks (when BFQ is not in use). We observe 10-25% regression in random 4k write throughput, as well as ~20% regression in MariaDB OLTP benchmark on rotational storage on btrfs filesystem. Fix the problem by preserving ordering of the plug list when inserting requests into the queuelist as well as by appending to requeue_list instead of prepending to it. Fixes: 26fed4ac4eab ("block: flush plug based on hardware and software queue order") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-02-17blk-mq: Reorder fields in 'struct blk_mq_tag_set'Christophe JAILLET
Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of 'struct blk_mq_tag_set' from 304 to 296 bytes. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/6f249f9b02a3490283ef0278096556de41aa0cf0.1676626130.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-12-13Merge tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - Support some passthrough commands without CAP_SYS_ADMIN (Kanchan Joshi) - Refactor PCIe probing and reset (Christoph Hellwig) - Various fabrics authentication fixes and improvements (Sagi Grimberg) - Avoid fallback to sequential scan due to transient issues (Uday Shankar) - Implement support for the DEAC bit in Write Zeroes (Christoph Hellwig) - Allow overriding the IEEE OUI and firmware revision in configfs for nvmet (Aleksandr Miloserdov) - Force reconnect when number of queue changes in nvmet (Daniel Wagner) - Minor fixes and improvements (Uros Bizjak, Joel Granados, Sagi Grimberg, Christoph Hellwig, Christophe JAILLET) - Fix and cleanup nvme-fc req allocation (Chaitanya Kulkarni) - Use the common tagset helpers in nvme-pci driver (Christoph Hellwig) - Cleanup the nvme-pci removal path (Christoph Hellwig) - Use kstrtobool() instead of strtobool (Christophe JAILLET) - Allow unprivileged passthrough of Identify Controller (Joel Granados) - Support io stats on the mpath device (Sagi Grimberg) - Minor nvmet cleanup (Sagi Grimberg) - MD pull requests via Song: - Code cleanups (Christoph) - Various fixes - Floppy pull request from Denis: - Fix a memory leak in the init error path (Yuan) - Series fixing some batch wakeup issues with sbitmap (Gabriel) - Removal of the pktcdvd driver that was deprecated more than 5 years ago, and subsequent removal of the devnode callback in struct block_device_operations as no users are now left (Greg) - Fix for partition read on an exclusively opened bdev (Jan) - Series of elevator API cleanups (Jinlong, Christoph) - Series of fixes and cleanups for blk-iocost (Kemeng) - Series of fixes and cleanups for blk-throttle (Kemeng) - Series adding concurrent support for sync queues in BFQ (Yu) - Series bringing drbd a bit closer to the out-of-tree maintained version (Christian, Joel, Lars, Philipp) - Misc drbd fixes (Wang) - blk-wbt fixes and tweaks for enable/disable (Yu) - Fixes for mq-deadline for zoned devices (Damien) - Add support for read-only and offline zones for null_blk (Shin'ichiro) - Series fixing the delayed holder tracking, as used by DM (Yu, Christoph) - Series enabling bio alloc caching for IRQ based IO (Pavel) - Series enabling userspace peer-to-peer DMA (Logan) - BFQ waker fixes (Khazhismel) - Series fixing elevator refcount issues (Christoph, Jinlong) - Series cleaning up references around queue destruction (Christoph) - Series doing quiesce by tagset, enabling cleanups in drivers (Christoph, Chao) - Series untangling the queue kobject and queue references (Christoph) - Misc fixes and cleanups (Bart, David, Dawei, Jinlong, Kemeng, Ye, Yang, Waiman, Shin'ichiro, Randy, Pankaj, Christoph) * tag 'for-6.2/block-2022-12-08' of git://git.kernel.dk/linux: (247 commits) blktrace: Fix output non-blktrace event when blk_classic option enabled block: sed-opal: Don't include <linux/kernel.h> sed-opal: allow using IOC_OPAL_SAVE for locking too blk-cgroup: Fix typo in comment block: remove bio_set_op_attrs nvmet: don't open-code NVME_NS_ATTR_RO enumeration nvme-pci: use the tagset alloc/free helpers nvme: add the Apple shared tag workaround to nvme_alloc_io_tag_set nvme: only set reserved_tags in nvme_alloc_io_tag_set for fabrics controllers nvme: consolidate setting the tagset flags nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set block: bio_copy_data_iter nvme-pci: split out a nvme_pci_ctrl_is_dead helper nvme-pci: return early on ctrl state mismatch in nvme_reset_work nvme-pci: rename nvme_disable_io_queues nvme-pci: cleanup nvme_suspend_queue nvme-pci: remove nvme_pci_disable nvme-pci: remove nvme_disable_admin_queue nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl nvme: use nvme_wait_ready in nvme_shutdown_ctrl ...
2022-11-02blk-mq: add tagset quiesce interfaceChao Leng
Drivers that have shared tagsets may need to quiesce potentially a lot of request queues that all share a single tagset (e.g. nvme). Add an interface to quiesce all the queues on a given tagset. This interface is useful because it can speedup the quiesce by doing it in parallel. Because some queues should not need to be quiesced (e.g. the nvme connect_q) when quiescing the tagset, introduce a QUEUE_FLAG_SKIP_TAGSET_QUIESCE flag to allow this new interface to ski quiescing a particular queue. Signed-off-by: Chao Leng <lengchao@huawei.com> [hch: simplify for the per-tag_set srcu_struct] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02blk-mq: pass a tagset to blk_mq_wait_quiesce_doneChristoph Hellwig
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just pass the tagset, and move the non-mq check into the only caller that needs it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-11-02blk-mq: move the srcu_struct used for quiescing to the tagsetChristoph Hellwig
All I/O submissions have fairly similar latencies, and a tagset-wide quiesce is a fairly common operation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Chao Leng <lengchao@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-12-hch@lst.de [axboe: fix whitespace] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-27blk-mq: don't add non-pt request with ->end_io to batchMing Lei
dm-rq implements ->end_io callback for request issued to underlying queue, and it isn't passthrough request. Commit ab3e1d3bbab9 ("block: allow end_io based requests in the completion batch handling") doesn't clear rq->bio and rq->__data_len for request with ->end_io in blk_mq_end_request_batch(), and this way is actually dangerous, but so far it is only for nvme passthrough request. dm-rq needs to clean up remained bios in case of partial completion, and req->bio is required, then use-after-free is triggered, so the underlying clone request can't be completed in blk_mq_end_request_batch. Fix panic by not adding such request into batch list, and the issue can be triggered simply by exposing nvme pci to dm-mpath simply. Fixes: ab3e1d3bbab9 ("block: allow end_io based requests in the completion batch handling") Cc: dm-devel@redhat.com Cc: Mike Snitzer <snitzer@kernel.org> Reported-by: Changhui Zhong <czhong@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20221027085709.513175-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-10-25block: Remove request.write_hintBart Van Assche
Commit c75e707fe1aa ("block: remove the per-bio/request write hint") removed all code that uses the struct request write_hint member. Hence also remove 'write_hint' itself. Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20221025191755.1711437-2-bvanassche@acm.org Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30block: add blk_rq_map_user_ioAnuj Gupta
Create a helper blk_rq_map_user_io for mapping of vectored as well as non-vectored requests. This will help in saving dupilcation of code at few places in scsi and nvme. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-4-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30block: allow end_io based requests in the completion batch handlingJens Axboe
With end_io handlers now being able to potentially pass ownership of the request upon completion, we can allow requests with end_io handlers in the batch completion handling. Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Co-developed-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30block: change request end_io handler to pass back a return valueJens Axboe
Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30Merge branch 'for-6.1/io_uring' into for-6.1/passthroughJens Axboe
* for-6.1/io_uring: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-09-21block: export blk_rq_is_pollKanchan Joshi
This is in preparation to support iopoll for nvme passthrough. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220823161443.49436-4-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21block: Fix the enum blk_eh_timer_return documentationBart Van Assche
The documentation of the blk_eh_timer_return enumeration values does not reflect correctly how e.g. the SCSI core uses these values. Fix the documentation. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Fixes: 88b0cfad2888 ("block: document the blk_eh_timer_return values") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220920200626.3422296-1-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22block: shrink rq_map_data a bitJens Axboe
We don't need full ints for several of these members. Change the page_order and nr_entries to unsigned shorts, and the true/false from_user and null_mapped to booleans. This shrinks the struct from 32 to 24 bytes on 64-bit archs. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-22block: Change the return type of blk_mq_map_queues() into voidBart Van Assche
Since blk_mq_map_queues() and the .map_queues() callbacks always return 0, change their return type into void. Most callers ignore the returned value anyway. Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Wang <jasowang@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: John Garry <john.garry@huawei.com> Acked-by: Md Haris Iqbal <haris.iqbal@ionos.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Link: https://lore.kernel.org/r/20220815170043.19489-3-bvanassche@acm.org [axboe: fold in fix from Bart] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-08-18blk-mq: remove unused function blk_mq_queue_stopped()Yu Kuai
blk_mq_queue_stopped() doesn't have any caller, which was found by code coverage test, thus remove it. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20220818063555.3741222-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: Use the new blk_opf_t typeBart Van Assche
Use the new blk_opf_t type for arguments and variables that represent request flags or a bitwise combination of a request operation and request flags. Rename the function arguments and also a structure member that hold a request operation and flags from 'rw' into 'opf'. This patch does not change any functionality. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-14block: Change the type of req_op() and bio_op() into enum req_opBart Van Assche
Improve static type checking by changing the type of the value returned by req_op() and bio_op() from unsigned int into enum req_op. Insert 'default: break;' in switch statements on the enum req_op type to prevent that the compiler warns about these switch statements. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Tim Waugh <tim@cyberelk.net> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220714180729.1065367-5-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06block: move zone related fields to struct gendiskChristoph Hellwig
Move the zone related fields that are currently stored in struct request_queue to struct gendisk as these are part of the highlevel block layer API and are only used for non-passthrough I/O that requires the gendisk. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20220706070350.1703384-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06blk-mq: Drop 'reserved' arg of busy_tag_iter_fnJohn Garry
We no longer use the 'reserved' arg in busy_tag_iter_fn for any iter function so it may be dropped. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> #nvme Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/1657109034-206040-6-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06blk-mq: Drop blk_mq_ops.timeout 'reserved' argJohn Garry
With new API blk_mq_is_reserved_rq() we can tell if a request is from the reserved pool, so stop passing 'reserved' arg. There is actually only a single user of that arg for all the callback implementations, which can use blk_mq_is_reserved_rq() instead. This will also allow us to stop passing the same 'reserved' around the blk-mq iter functions next. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # For MMC Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/1657109034-206040-4-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-07-06blk-mq: Add a flag for reserved requestsJohn Garry
Add a flag for reserved requests so that drivers may know this for any special handling. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/1657109034-206040-3-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28blk-mq: cleanup disk sysfs registrationChristoph Hellwig
Pass a gendisk to the sysfs register/unregister functions and give them descriptive names. Also move the unregistration helper next to the one doing the registration. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20220628171850.1313069-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-06-28block: simplify disk shutdownChristoph Hellwig
Set the queue dying flag and call blk_mq_exit_queue from del_gendisk for all disks that do not have separately allocated queues, and thus remove the need to call blk_cleanup_queue for them. Rename blk_cleanup_disk to blk_mq_destroy_queue to make it clear that this function is intended only for separately allocated blk-mq queues. This saves an extra queue freeze for devices without a separately allocated queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20220619060552.1850436-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-28blk-mq: remove the done argument to blk_execute_rq_nowaitChristoph Hellwig
Let the caller set it together with the end_io_data instead of passing a pointless argument. Note the the target code did in fact already set it and then just overrode it again by calling blk_execute_rq_nowait. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20220524121530.943123-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-05-08blk-mq: remove the error_count from struct requestWilly Tarreau
The last two users were floppy.c and ataflop.c respectively, it was verified that no other drivers makes use of this, so let's remove it. Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Minh Yuan <yuanmingbuaa@gmail.com> Cc: Denis Efremov <efremov@linux.com>, Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-08blk-mq: manage hctx map via xarrayMing Lei
First code becomes more clean by switching to xarray from plain array. Second use-after-free on q->queue_hw_ctx can be fixed because queue_for_each_hw_ctx() may be run when updating nr_hw_queues is in-progress. With this patch, q->hctx_table is defined as xarray, and this structure will share same lifetime with request queue, so queue_for_each_hw_ctx() can use q->hctx_table to lookup hctx reliably. Reported-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220308073219.91173-7-ming.lei@redhat.com [axboe: fix blk_mq_hw_ctx forward declaration] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-16blk-mq: remove the request_queue argument to blk_insert_cloned_requestChristoph Hellwig
The request must be submitted to the queue it was allocated for, so remove the extra request_queue argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Link: https://lore.kernel.org/r/20220215100540.3892965-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-09block: fix old-style declarationYang Li
Move the 'inline' keyword to the front of 'void'. Remove a warning found by clang(make W=1 LLVM=1) ./include/linux/blk-mq.h:259:1: warning: ‘inline’ is not at beginning of declaration Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Link: https://lore.kernel.org/r/20220107005228.103927-1-yang.lee@linux.alibaba.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-01-05block: introduce rq_list_moveKeith Busch
When iterating a list, a particular request may need to be moved for special handling. Provide a helper function to achieve that so drivers don't need to reimplement rqlist manipulation. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220105170518.3181469-4-kbusch@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>