summaryrefslogtreecommitdiff
path: root/include/linux/blk-mq.h
AgeCommit message (Collapse)Author
2020-06-24blk-mq: move failure injection out of blk_mq_complete_requestChristoph Hellwig
Move the call to blk_should_fake_timeout out of blk_mq_complete_request and into the drivers, skipping call sites that are obvious error handlers, and remove the now superflous blk_mq_force_complete_rq helper. This ensures we don't keep injecting errors into completions that just terminate the Linux request after the hardware has been reset or the command has been aborted. Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: drain I/O when all CPUs in a hctx are offlineMing Lei
Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup up queue mapping. Thomas mentioned the following point[1]: "That was the constraint of managed interrupts from the very beginning: The driver/subsystem has to quiesce the interrupt line and the associated queue _before_ it gets shutdown in CPU unplug and not fiddle with it until it's restarted by the core when the CPU is plugged in again." However, current blk-mq implementation doesn't quiesce hw queue before the last CPU in the hctx is shutdown. Even worse, CPUHP_BLK_MQ_DEAD is a cpuhp state handled after the CPU is down, so there isn't any chance to quiesce the hctx before shutting down the CPU. Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs where the last CPU goes away, and wait for completion of in-flight requests. This guarantees that there is no inflight I/O before shutting down the managed IRQ. Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need to wait for completion of in-flight requests from these drivers to avoid a potential dead-lock. It is safe to do this for stacking drivers as those do not use interrupts at all and their I/O completions are triggered by underlying devices I/O completion. [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/ [hch: different retry mechanism, merged two patches, minor cleanups] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29blk-mq: blk-mq: provide forced completion methodKeith Busch
Drivers may need to bypass error injection for error recovery. Rename __blk_mq_complete_request() to blk_mq_force_complete_rq() and export that function so drivers may skip potential fake timeouts after they've reclaimed lost requests. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-25block: bypass ->make_request_fn for blk-mq driversChristoph Hellwig
Call blk_mq_make_request when no ->make_request_fn is set. This is safe now that blk_alloc_queue always sets up the pointer for make_request based drivers. This avoids an indirect call in the blk-mq driver I/O fast path, which is rather expensive due to spectre mitigations. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-20blk-mq: Add blk_mq_delay_run_hw_queues() API callDouglas Anderson
We have: * blk_mq_run_hw_queue() * blk_mq_delay_run_hw_queue() * blk_mq_run_hw_queues() ...but not blk_mq_delay_run_hw_queues(), presumably because nobody needed it before now. Since we need it for a later patch in this series, add it. Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-18blk-mq: Replace zero-length array with flexible-array memberGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
2020-03-27block: add a blk_mq_init_queue_data helperChristoph Hellwig
This allows a driver to pass a queuedata member before ->init_hctx is called. null_blk currently open codes this logic, but I'd rather have it in the core to ease future maintainance. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10blk-mq: Fix a comment in include/linux/blk-mq.hBart Van Assche
The 'hctx_list' member of struct blk_mq_hw_ctx is not a list head but instead an entry in q->unused_hctx_list. Fix the comment above this struct member. Fixes: d386732bc142 ("blk-mq: fill header with kernel-doc") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-13blk-mq: Delete blk_mq_has_free_tags() and blk_mq_can_queue()John Garry
These functions are not referenced, so delete them. Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01blk-mq: Make blk_mq_run_hw_queue() return voidJohn Garry
Since commit 97889f9ac24f ("blk-mq: remove synchronize_rcu() from blk_mq_del_queue_tag_set()"), the return value of blk_mq_run_hw_queue() is never checked, so make it return void, which very marginally simplifies the code. Reviewed-by: Bob Liu <bob.liu@oracle.com> Signed-off-by: John Garry <john.garry@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25blk-mq: fill header with kernel-docAndré Almeida
Insert documentation for structs, enums and functions at header file. Format existing and new comments at struct blk_mq_ops as kernel-doc comments. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07blk-mq: Inline status checkersPavel Begunkov
blk_mq_request_completed() and blk_mq_request_started() are short, inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-07block: Document all members of blk_mq_tag_set and bkl_mq_queue_mapBart Van Assche
The meaning of several member variables of these two data structures is nontrivial. Hence document all member variables using the kernel-doc syntax. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-05block: Delay default elevator initializationDamien Le Moal
When elevator_init_mq() is called from blk_mq_init_allocated_queue(), the only information known about the device is the number of hardware queues as the block device scan by the device driver is not completed yet for most drivers. The device type and elevator required features are not set yet, preventing to correctly select the default elevator most suitable for the device. This currently affects all multi-queue zoned block devices which default to the "none" elevator instead of the required "mq-deadline" elevator. These drives currently include host-managed SMR disks connected to a smartpqi HBA and null_blk block devices with zoned mode enabled. Upcoming NVMe Zoned Namespace devices will also be affected. Fix this by adding the boolean elevator_init argument to blk_mq_init_allocated_queue() to control the execution of elevator_init_mq(). Two cases exist: 1) elevator_init = false is used for calls to blk_mq_init_allocated_queue() within blk_mq_init_queue(). In this case, a call to elevator_init_mq() is added to __device_add_disk(), resulting in the delayed initialization of the queue elevator after the device driver finished probing the device information. This effectively allows elevator_init_mq() access to more information about the device. 2) elevator_init = true preserves the current behavior of initializing the elevator directly from blk_mq_init_allocated_queue(). This case is used for the special request based DM devices where the device gendisk is created before the queue initialization and device information (e.g. queue limits) is already known when the queue initialization is executed. Additionally, to make sure that the elevator initialization is never done while requests are in-flight (there should be none when the device driver calls device_add_disk()), freeze and quiesce the device request queue before calling blk_mq_init_sched() in elevator_init_mq(). Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-27block: Remove blk_mq_register_dev()Bart Van Assche
This function has no callers. Hence remove it. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04blk-mq: add callback of .cleanup_rqMing Lei
SCSI maintains its own driver private data hooked off of each SCSI request, and the pridate data won't be freed after scsi_queue_rq() returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has fully dispatched them, due to a lower level SCSI driver's resource limitation identified in scsi_queue_rq(). Currently SCSI's per-request private data is leaked when the upper layer driver (dm-rq) frees and then retries these requests in response to BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE returns from scsi_queue_rq(). This usecase is so specialized that it doesn't warrant training an existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to account for freeing its driver private data -- doing so would add an extra branch for handling a special case that all other consumers of SCSI (and blk-mq) won't ever need to worry about. So the most pragmatic way forward is to delegate freeing SCSI driver private data to the upper layer driver (dm-rq). Do so by adding new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method from dm-rq. A following commit will implement the .cleanup_rq() hook in scsi_mq_ops. Cc: Ewan D. Milne <emilne@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: <stable@vger.kernel.org> Fixes: 396eaf21ee17 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04blk-mq: remove blk_mq_complete_request_syncMing Lei
blk_mq_tagset_wait_completed_request() has been applied for waiting for completed request's fn, so not necessary to use blk_mq_complete_request_sync() any more. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04blk-mq: introduce blk_mq_tagset_wait_completed_request()Ming Lei
blk-mq may schedule to call queue's complete function on remote CPU via IPI, but doesn't provide any way to synchronize the request's complete fn. The current queue freeze interface can't provide the synchonization because aborted requests stay at blk-mq queues during EH. In some driver's EH(such as NVMe), hardware queue's resource may be freed & re-allocated. If the completed request's complete fn is run finally after the hardware queue's resource is released, kernel crash will be triggered. Prepare for fixing this kind of issue by introducing blk_mq_tagset_wait_completed_request(). Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04blk-mq: introduce blk_mq_request_completed()Ming Lei
NVMe needs this function to decide if one request to be aborted has been completed in normal IO path already. So introduce it. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-06-20block: remove the bi_phys_segments field in struct bioChristoph Hellwig
We only need the number of segments in the blk-mq submission path. Remove the field from struct bio, and return it from a variant of blk_queue_split instead of that it can passed as an argument to those functions that need the value. This also means we stop recounting segments except for cloning and partial segments. To keep the number of arguments in this how path down remove pointless struct request_queue arguments from any of the functions that had it and grew a nr_segs argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-05-04blk-mq: always free hctx after request queue is freedMing Lei
In normal queue cleanup path, hctx is released after request queue is freed, see blk_mq_release(). However, in __blk_mq_update_nr_hw_queues(), hctx may be freed because of hw queues shrinking. This way is easy to cause use-after-free, because: one implicit rule is that it is safe to call almost all block layer APIs if the request queue is alive; and one hctx may be retrieved by one API, then the hctx can be freed by blk_mq_update_nr_hw_queues(); finally use-after-free is triggered. Fixes this issue by always freeing hctx after releasing request queue. If some hctxs are removed in blk_mq_update_nr_hw_queues(), introduce a per-queue list to hold them, then try to resuse these hctxs if numa node is matched. Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: James Smart <james.smart@broadcom.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: linux-scsi@vger.kernel.org, Cc: Martin K . Petersen <martin.petersen@oracle.com>, Cc: Christoph Hellwig <hch@lst.de>, Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>, Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: James Smart <james.smart@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-10blk-mq: introduce blk_mq_complete_request_sync()Ming Lei
In NVMe's error handler, follows the typical steps of tearing down hardware for recovering controller: 1) stop blk_mq hw queues 2) stop the real hw queues 3) cancel in-flight requests via blk_mq_tagset_busy_iter(tags, cancel_request, ...) cancel_request(): mark the request as abort blk_mq_complete_request(req); 4) destroy real hw queues However, there may be race between #3 and #4, because blk_mq_complete_request() may run q->mq_ops->complete(rq) remotelly and asynchronously, and ->complete(rq) may be run after #4. This patch introduces blk_mq_complete_request_sync() for fixing the above race. Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Bart Van Assche <bvanassche@acm.org> Cc: James Smart <james.smart@broadcom.com> Cc: linux-nvme@lists.infradead.org Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-20block: Unexport blk_mq_add_to_requeue_list()Bart Van Assche
This function is not used outside the block layer core. Hence unexport it. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-19blk-mq: remove unused 'nr_expired' from blk_mq_hw_ctxDongli Zhang
There is no usage of 'nr_expired'. The 'nr_expired' was introduced by commit 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme"). Its usage was removed since commit 12f5b9314545 ("blk-mq: Remove generation seqeunce"). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-15block: kill BLK_MQ_F_SG_MERGEMing Lei
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-18block: make request_to_qc_t publicSagi Grimberg
block consumers will need it for polling requests that are sent with blk_execute_rq_nowait. Also, get rid of blk_tag_to_qc_t and open-code it instead. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-17blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()Jens Axboe
There's a single user of this function, dm, and dm just wants to check if IO is inflight, not that it's just allocated. This fixes a hang with srp/002 in blktests with dm, where it tries to suspend but waits for inflight IO to finish first. As it checks for just allocated requests, this fails. Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-04block: move queues types to the block layerChristoph Hellwig
Having another indirect all in the fast path doesn't really help in our post-spectre world. Also having too many queue type is just going to create confusion, so I'd rather manage them centrally. Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-29blk-mq: add mq_ops->commit_rqs()Jens Axboe
blk-mq passes information to the hardware about any given request being the last that we will issue in this sequence. The point is that hardware can defer costly doorbell type writes to the last request. But if we run into errors issuing a sequence of requests, we may never send the request with bd->last == true set. For that case, we need a hook that tells the hardware that nothing else is coming right now. For failures returned by the drivers ->queue_rq() hook, the driver is responsible for flushing pending requests, if it uses bd->last to optimize that part. This works like before, no changes there. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-26blk-mq: Simplify request completion stateKeith Busch
There are no more users relying on blk-mq request states to prevent double completions, so replace the relatively expensive cmpxchg operation with WRITE_ONCE. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-26blk-mq: Return true if request was completedKeith Busch
A driver may have internal state to cleanup if we're pretending a request didn't complete. Return 'false' if the command wasn't actually completed due to the timeout error injection, and true otherwise. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-26blk-mq: remove 'tag' parameter from mq_ops->poll()Jens Axboe
We always pass in -1 now and none of the callers use the tag value, remove the parameter. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-08blk-mq: provide a helper to check if a queue is busyJens Axboe
Returns true if the queue currently has requests pending, false if not. DM can use this to replace the atomic_inc/dec they do per device to see if a device is busy. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-08blk-mq-tag: change busy_iter_fn to return whether to continue or notJens Axboe
We have this functionality in sbitmap, but we don't export it in blk-mq for users of the tags busy iteration. This can be useful for stopping the iteration, if the caller doesn't need to find more requests. Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07nvme: add separate poll queue mapJens Axboe
Adds support for defining a variable number of poll queues, currently configurable with the 'poll_queues' module parameter. Defaults to a single poll queue. And now we finally have poll support without triggering interrupts! Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07blk-mq: initial support for multiple queue mapsJens Axboe
Add a queue offset to the tag map. This enables users to map iteratively, for each queue map type they support. Bump maximum number of supported maps to 2, we're now fully able to support more than 1 map. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07blk-mq: support multiple hctx mapsJens Axboe
Add support for the tag set carrying multiple queue maps, and for the driver to inform blk-mq how many it wishes to support through setting set->nr_maps. This adds an mq_ops helper for drivers that support more than 1 map, mq_ops->rq_flags_to_type(). The function takes request/bio flags and CPU, and returns a queue map index for that. We then use the type information in blk_mq_map_queue() to index the map set. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07blk-mq: allow software queue to map to multiple hardware queuesJens Axboe
The mapping used to be dependent on just the CPU location, but now it's a tuple of (type, cpu) instead. This is a prep patch for allowing a single software queue to map to multiple hardware queues. No functional changes in this patch. This changes the software queue count to an unsigned short to save a bit of space. We can still support 64K-1 CPUs, which should be enough. Add a check to catch a wrap. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07blk-mq: abstract out queue mapJens Axboe
This is in preparation for allowing multiple sets of maps per queue, if so desired. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07block: kill request ->cpu memberJens Axboe
This was used for completion placement for the legacy path, but for mq we have rq->mq_ctx->cpu for that. Add a helper to get the request CPU assignment, as the mq_ctx type is private to blk-mq. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07block: get rid of q->softirq_done_fn()Jens Axboe
With the legacy path gone, all we do is funnel it through the mq_ops->complete() operation. Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-07blk-mq: provide mq_ops->busy() hookJens Axboe
We'll hook into this from blk_lld_busy(), allowing blk-mq to also return whether or not a given queue currently has requests in progress. Reviewed-by: Hannes Reinecke <hare@suse.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-15blk-mq: provide helper for setting up an SQ queue and tag setJens Axboe
This pattern is repeated throughout all the blk-mq conversions. Provide a basic helper to get it done. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-14Merge tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: "First pull request for this merge window, there will also be a followup request with some stragglers. This pull request contains: - Fix for a thundering heard issue in the wbt block code (Anchal Agarwal) - A few NVMe pull requests: * Improved tracepoints (Keith) * Larger inline data support for RDMA (Steve Wise) * RDMA setup/teardown fixes (Sagi) * Effects log suppor for NVMe target (Chaitanya Kulkarni) * Buffered IO suppor for NVMe target (Chaitanya Kulkarni) * TP4004 (ANA) support (Christoph) * Various NVMe fixes - Block io-latency controller support. Much needed support for properly containing block devices. (Josef) - Series improving how we handle sense information on the stack (Kees) - Lightnvm fixes and updates/improvements (Mathias/Javier et al) - Zoned device support for null_blk (Matias) - AIX partition fixes (Mauricio Faria de Oliveira) - DIF checksum code made generic (Max Gurtovoy) - Add support for discard in iostats (Michael Callahan / Tejun) - Set of updates for BFQ (Paolo) - Removal of async write support for bsg (Christoph) - Bio page dirtying and clone fixups (Christoph) - Set of bcache fix/changes (via Coly) - Series improving blk-mq queue setup/teardown speed (Ming) - Series improving merging performance on blk-mq (Ming) - Lots of other fixes and cleanups from a slew of folks" * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits) blkcg: Make blkg_root_lookup() work for queues in bypass mode bcache: fix error setting writeback_rate through sysfs interface null_blk: add lock drop/acquire annotation Blk-throttle: reduce tail io latency when iops limit is enforced block: paride: pd: mark expected switch fall-throughs block: Ensure that a request queue is dissociated from the cgroup controller block: Introduce blk_exit_queue() blkcg: Introduce blkg_root_lookup() block: Remove two superfluous #include directives blk-mq: count the hctx as active before allocating tag block: bvec_nr_vecs() returns value for wrong slab bcache: trivial - remove tailing backslash in macro BTREE_FLAG bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section bcache: set max writeback rate when I/O request is idle bcache: add code comments for bset.c bcache: fix mistaken comments in request.c bcache: fix mistaken code comments in bcache.h bcache: add a comment in super.c bcache: avoid unncessary cache prefetch bch_btree_node_get() bcache: display rate debug parameters to 0 when writeback is not running ...
2018-07-24blk-mq: export setting request completion stateKeith Busch
This is preparing for drivers that want to directly alter the state of their requests. No functional change here. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09blk-mq: dequeue request one by one from sw queue if hctx is busyMing Lei
It won't be efficient to dequeue request one by one from sw queue, but we have to do that when queue is busy for better merge performance. This patch takes the Exponential Weighted Moving Average(EWMA) to figure out if queue is busy, then only dequeue request one by one from sw queue when queue is busy. Fixes: b347689ffbca ("blk-mq-sched: improve dispatching from sw queue") Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Reported-by: Kashyap Desai <kashyap.desai@broadcom.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09blk-mq: introduce new lock for protecting hctx->dispatch_waitMing Lei
Now hctx->lock is only acquired when adding hctx->dispatch_wait to one wait queue, but not held when removing it from the wait queue. IO hang can be observed easily if SCHED RESTART is disabled, that means now RESTART exits just for fixing the issue in blk_mq_mark_tag_wait(). This patch fixes the issue by introducing hctx->dispatch_wait_lock and holding it for removing hctx->dispatch_wait in blk_mq_dispatch_wake(), since we need to avoid acquiring hctx->lock in irq context. Fixes: eb619fdb2d4cb8b3d3419 ("blk-mq: fix issue with shared tag queue re-running") Cc: Christoph Hellwig <hch@lst.de> Cc: Omar Sandoval <osandov@fb.com> Cc: Bart Van Assche <bart.vanassche@wdc.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-14blk-mq: remove blk_mq_tagset_iterChristoph Hellwig
Unused now that nvme stopped using it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk>
2018-05-30blk-mq: abstract out blk-mq-sched rq list iteration bio merge helperJens Axboe
No functional changes in this patch, just a prep patch for utilizing this in an IO scheduler. Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Omar Sandoval <osandov@fb.com>
2018-04-25block: mq: Add some minor doc for core structsLinus Walleij
As it came up in discussion on the mailing list that the semantic meaning of 'blk_mq_ctx' and 'blk_mq_hw_ctx' isn't completely obvious to everyone, let's add some minimal kerneldoc for a starter. Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>