summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2019-04-12block: genhd: remove async_events fieldMartin Wilck
The async_events field, intended to be used for drivers that support asynchronous notifications about disk events (aka media change events), isn't currently used by any driver, and apparently that has been that way for a long time (if not forever). Remove it. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12block: only allow contiguous page structs in a bio_vecChristoph Hellwig
We currently have to call nth_page when iterating over pages inside a bio_vec. Jens complained a while ago that this is fairly expensive. To mitigate this we can check that that the actual page structures are contiguous when adding them to the bio, and just do check pointer arithmetics later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12block: change how we get page references in bio_iov_iter_get_pagesChristoph Hellwig
Instead of needing a special macro to iterate over all pages in a bvec just do a second passs over the whole bio. This also matches what we do on the release side. The release side helper is moved up to where we need the get helper to clearly express the symmetry. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12block: don't allow multiple bio_iov_iter_get_pages calls per bioChristoph Hellwig
No caller uses bio_iov_iter_get_pages multiple times on a given bio, and that funtionality isn't all that useful. Removing it will make some future changes a little easier and also simplifies the function a bit. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12block: refactor __bio_iov_bvec_add_pagesChristoph Hellwig
Return early on error, and add an unlikely annotation for that case. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-12block: rewrite blk_bvec_map_sg to avoid a nth_page callChristoph Hellwig
The offset in scatterlists is allowed to be larger than the page size, so don't go to great length to avoid that case and simplify the arithmetics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-10block: do not leak memory in bio_copy_user_iov()Jérôme Glisse
When bio_add_pc_page() fails in bio_copy_user_iov() we should free the page we just allocated otherwise we are leaking it. Cc: linux-block@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Jérôme Glisse <jglisse@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-10blk-mq: introduce blk_mq_complete_request_sync()Ming Lei
In NVMe's error handler, follows the typical steps of tearing down hardware for recovering controller: 1) stop blk_mq hw queues 2) stop the real hw queues 3) cancel in-flight requests via blk_mq_tagset_busy_iter(tags, cancel_request, ...) cancel_request(): mark the request as abort blk_mq_complete_request(req); 4) destroy real hw queues However, there may be race between #3 and #4, because blk_mq_complete_request() may run q->mq_ops->complete(rq) remotelly and asynchronously, and ->complete(rq) may be run after #4. This patch introduces blk_mq_complete_request_sync() for fixing the above race. Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Bart Van Assche <bvanassche@acm.org> Cc: James Smart <james.smart@broadcom.com> Cc: linux-nvme@lists.infradead.org Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-10block, bfq: fix use after free in bfq_bfqq_expirePaolo Valente
The function bfq_bfqq_expire() invokes the function __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue. If this happens, then no other instruction of bfq_bfqq_expire() must be executed, or a use-after-free will occur. Basing on the assumption that __bfq_bfqq_expire() invokes bfq_put_queue() on the in-service bfq-queue exactly once, the queue is assumed to be freed if its refcounter is equal to one right before invoking __bfq_bfqq_expire(). But, since commit 9dee8b3b057e ("block, bfq: fix queue removal from weights tree") this assumption is false. __bfq_bfqq_expire() may also invoke bfq_weights_tree_remove() and, since commit 9dee8b3b057e ("block, bfq: fix queue removal from weights tree"), also the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire() may invoke bfq_put_queue() twice, and this is the actual case where the in-service queue may happen to be freed. To address this issue, this commit moves the check on the refcounter of the queue right around the last bfq_put_queue() that may be invoked on the queue. Fixes: 9dee8b3b057e ("block, bfq: fix queue removal from weights tree") Reported-by: Dmitrii Tcvetkov <demfloro@demfloro.ru> Reported-by: Douglas Anderson <dianders@chromium.org> Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru> Tested-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-08block: fix build warning in merging bvecsMing Lei
Commit f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable") changes bvec merge by only considering two bvecs from different bios. However, if the former bio doesn't inlcude any io bvec, then the following warning may be triggered: warning: ‘bvec.bv_offset’ may be used uninitialized in this function [-Wmaybe-uninitialized] In practice, it shouldn't be triggered. Fixes it by adding check on former bio, the check shouldn't add any cost given 'bio->bi_iter' can be hit in cache. Reported-by: Jens Axboe <axboe@kernel.dk> Fixes: f6970f83ef79 ("block: don't check if adjacent bvecs in one bio can be mergeable") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-08block, bfq: fix some typos in commentsAngelo Ruocco
Some of the comments in the bfq files had typos. This patch fixes them. Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-08block: remove unused variable 'def'Hisao Tanabe
The 'def' local variable became unused after commit f382fb0bcef4 ("block: remove legacy IO schedulers"), let's remove it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Hisao Tanabe <xtanabe@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: rename next to execute_stepsDavid Kozub
As the function is responsible for executing the individual steps supplied in the steps argument, execute_steps is a more descriptive name than the rather generic next. Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: don't repeat opal_discovery0 in each steps arrayDavid Kozub
Originally each of the opal functions that call next include opal_discovery0 in the array of steps. This is superfluous and can be done always inside next. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: pass steps via argument rather than via opal_devDavid Kozub
The steps argument is only read by the next function, so it can be passed directly as an argument rather than via opal_dev. Normally, the steps is an array on the stack, so the pointer stops being valid then the function that set opal_dev.steps returns. If opal_dev.steps was not set to NULL before return it would become a dangling pointer. When the steps are passed as argument this becomes easier to see and more difficult to misuse. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: use named Opal tokens instead of integer literalsDavid Kozub
Replace integer literals by Opal tokens defined in opal_proto.h where possible. Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: unify retrieval of table columnsDavid Kozub
Instead of having multiple places defining the same argument list to get a specific column of a sed-opal table, provide a generic version and call it from those functions. Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: add token for OPAL_LIFECYCLEDavid Kozub
Define OPAL_LIFECYCLE token and use it instead of literals in get_lsp_lifecycle. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: split generation of bytestring header and contentJonas Rabenstein
Split the header generation from the (normal) memcpy part if a bytestring is copied into the command buffer. This allows in-place generation of the bytestring content. For example, copy_from_user may be used without an intermediate buffer. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: print failed function addressJonas Rabenstein
Add function address (and if available its symbol) to the message if a step function fails. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: reuse response_get_token to decrease code duplicationDavid Kozub
response_get_token had already been in place, its functionality had been duplicated within response_get_{u64,bytestring} with the same error handling. Unify the handling by reusing response_get_token within the other functions. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: unify error handling of responsesDavid Kozub
response_get_{string,u64} include error handling for argument resp being NULL but response_get_token does not handle this. Make all three of response_get_{string,u64,token} handle NULL resp in the same way. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: unify cmd startDavid Kozub
Every step starts with resetting the cmd buffer as well as the comid and constructs the appropriate OPAL_CALL command. Consequently, those actions may be combined into one generic function. On should take care that the opening and closing tokens for the argument list are already emitted by cmd_start and cmd_finalize respectively and thus must not be additionally added. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: close parameter list in cmd_finalizeDavid Kozub
Every step ends by calling cmd_finalize (via finalize_and_send) yet every step adds the token OPAL_ENDLIST on its own. Moving this into cmd_finalize decreases code duplication. Co-authored-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: unify space check in add_token_*Jonas Rabenstein
All add_token_* functions have a common set of conditions that have to be checked. Use a common function for those checks in order to avoid different behaviour as well as code duplication. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Co-authored-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: use correct macro for method lengthJonas Rabenstein
Also the values of OPAL_UID_LENGTH and OPAL_METHOD_LENGTH are the same, it is weird to use OPAL_UID_LENGTH for the definition of the methods. Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: fix typos and formattingDavid Kozub
This should make no change in functionality. The formatting changes were triggered by checkpatch.pl. Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBRDavid Kozub
The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value opal_mbr_data.enable_disable incorrectly: enable_disable is expected to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable was passed directly to set_mbr_done and set_mbr_enable_disable where is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE actually disabled the shadow MBR and vice versa. This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to OPAL_FALSE/TRUE. The change affects existing programs using IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when setting up an Opal drive. Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-06block: remove CONFIG_LBDAFChristoph Hellwig
Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit architectures. These types are required to support block device and/or file sizes larger than 2 TiB, and have generally defaulted to on for a long time. Enabling the option only increases the i386 tinyconfig size by 145 bytes, and many data structures already always use 64-bit values for their in-core and on-disk data structures anyway, so there should not be a large change in dynamic memory usage either. Dropping this option removes a somewhat weird non-default config that has cause various bugs or compiler warnings when actually used. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-05block: Revert v5.0 blk_mq_request_issue_directly() changesBart Van Assche
blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that have been queued. If that happens when blk_mq_try_issue_directly() is called by the dm-mpath driver then dm-mpath will try to resubmit a request that is already queued and a kernel crash follows. Since it is nontrivial to fix blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly() changes that went into kernel v5.0. This patch reverts the following commits: * d6a51a97c0b2 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0. * 5b7a6f128aad ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0. * 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: James Smart <james.smart@broadcom.com> Cc: Dongli Zhang <dongli.zhang@oracle.com> Cc: Laurence Oberman <loberman@redhat.com> Cc: <stable@vger.kernel.org> Reported-by: Laurence Oberman <loberman@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com> Fixes: 7f556a44e61d ("blk-mq: refactor the code of issue request directly") # v5.0. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-04block: bio: ensure newly added bio flags don't override BVEC_POOL_IDXJohannes Thumshirn
With the introduction of BIO_NO_PAGE_REF we've used up all available bits in bio::bi_flags. Convert the defines of the flags to an enum and add a BUILD_BUG_ON() call to make sure no-one adds a new one and thus overrides the BVEC_POOL_IDX causing crashes. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-04blk-mq: do not reset plug->rq_count before the list is sortedDongli Zhang
We would never be able to sort the list if we first reset plug->rq_count which is used in conditional check later. Fixes: ce5b009cff19 ("block: improve logic around when to sort a plug list") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-02blk-mq: add trace block plug and unplug for multiple queuesYufen Yu
For now, we just trace plug for single queue device or drivers provide .commit_rqs, and have not trace plug for multiple queues device. But, unplug events will be recorded when call blk_mq_flush_plug_list(). Then, trace events will be asymmetrical, just have unplug and without plug. This patch add trace plug and unplug for multiple queues device in blk_mq_make_request(). After that, we can accurately trace plug and unplug for multiple queues. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-02block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctxShenghui Wang
kfree() can leak the hctx->fq->flush_rq field. Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Shenghui Wang <shhuiw@foxmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: don't check if adjacent bvecs in one bio can be mergeableMing Lei
Now both passthrough and FS IO have supported multi-page bvec, and bvec merging has been handled actually when adding page to bio, then adjacent bvecs won't be mergeable any more if they belong to same bio. So only try to merge bvecs if they are from different bios. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: reuse __blk_bvec_map_sg() for mapping page sized bvecMing Lei
Inside __blk_segment_map_sg(), page sized bvec mapping is optimized a bit with one standalone branch. So reuse __blk_bvec_map_sg() to do that. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: remove argument of 'request_queue' from __blk_bvec_map_sgMing Lei
The argument of 'request_queue' isn't used by __blk_bvec_map_sg(), so remove it. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: enable multi-page bvec for passthrough IOMing Lei
Now block IO stack is basically ready for supporting multi-page bvec, however it isn't enabled on passthrough IO. One reason is that passthrough IO is dispatched to LLD directly and bio split is bypassed, so the bio has to be built correctly for dispatch to LLD from the beginning. Implement multi-page support for passthrough IO by limitting each bvec as block device's segment and applying all kinds of queue limit in blk_add_pc_page(). Then we don't need to calculate segments any more for passthrough IO any more, turns out code is simplified much. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: put the same page when adding it to bioMing Lei
When the added page is merged to last same page in bio_add_pc_page(), the user may need to put this page for avoiding page leak. bio_map_user_iov() needs this kind of handling, and now it deals with it by itself in hack style. Moves the handling of put page into __bio_add_pc_page(), so bio_map_user_iov() may be simplified a bit, and maybe more users can benefit from this change. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: check if page is mergeable in one helperMing Lei
Now the check for deciding if one page is mergeable to current bvec becomes a bit complicated, and we need to reuse the code before adding pc page. So move the check in one dedicated helper. No function change. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: cleanup bio_add_pc_pageMing Lei
REQ_PC is out of date, so replace it with passthrough IO. Also remove the local variable of 'prev' since we can reuse the top local variable of 'bvec'. No function change. Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: don't merge adjacent bvecs to one segment in bio blk_queue_splitMing Lei
For normal filesystem IO, each page is added via blk_add_page(), in which bvec(page) merge has been handled already, and basically not possible to merge two adjacent bvecs in one bio. So not try to merge two adjacent bvecs in blk_queue_split(). Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: avoid to break XEN by multi-page bvecMing Lei
XEN has special page merge requirement, see xen_biovec_phys_mergeable(). We can't merge pages into one bvec simply for XEN. So move XEN's specific check on page merge into __bio_try_merge_page(), then abvoid to break XEN by multi-page bvec. Cc: ris Ostrovsky <boris.ostrovsky@oracle.com> Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block: pass page to xen_biovec_phys_mergeableMing Lei
xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec for checking if the two bvecs can be merged, so pass page to xen_biovec_phys_mergeable() directly. No function change. Cc: ris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: xen-devel@lists.xenproject.org Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: save & resume weight on a queue merge/splitFrancesco Pollicino
bfq saves the state of a queue each time a merge occurs, to be able to resume such a state when the queue is associated again with its original process, on a split. Unfortunately bfq does not save & restore also the weight of the queue. If the weight is not correctly resumed when the queue is recycled, then the weight of the recycled queue could differ from the weight of the original queue. This commit adds the missing save & resume of the weight. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: print SHARED instead of pid for shared queues in logsFrancesco Pollicino
The function "bfq_log_bfqq" prints the pid of the process associated with the queue passed as input. Unfortunately, if the queue is shared, then more than one process is associated with the queue. The pid that gets printed in this case is the pid of one of the associated processes. Which process gets printed depends on the exact sequence of merge events the queue underwent. So printing such a pid is rather useless and above all is often rather confusing because it reports a random pid between those of the associated processes. This commit addresses this issue by printing SHARED instead of a pid if the queue is shared. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: always protect newly-created queues from existing active queuesPaolo Valente
If many bfq_queues belonging to the same group happen to be created shortly after each other, then the processes associated with these queues have typically a common goal. In particular, bursts of queue creations are usually caused by services or applications that spawn many parallel threads/processes. Examples are systemd during boot, or git grep. If there are no other active queues, then, to help these processes get their job done as soon as possible, the best thing to do is to reach a high throughput. To this goal, it is usually better to not grant either weight-raising or device idling to the queues associated with these processes. And this is exactly what BFQ currently does. There is however a drawback: if, in contrast, some other queues are already active, then the newly created queues must be protected from the I/O flowing through the already existing queues. In this case, the best thing to do is the opposite as in the other case: it is much better to grant weight-raising and device idling to the newly-created queues, if they deserve it. This commit addresses this issue by doing so if there are already other active queues. This change also helps eliminating false positives, which occur when the newly-created queues do not belong to an actual large burst of creations, but some background task (e.g., a service) happens to trigger the creation of new queues in the middle, i.e., very close to when the victim queues are created. These false positive may cause total loss of control on process latencies. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: do not tag totally seeky queues as soft rtPaolo Valente
Sync random I/O is likely to be confused with soft real-time I/O, because it is characterized by limited throughput and apparently isochronous arrival pattern. To avoid false positives, this commits prevents bfq_queues containing only random (seeky) I/O from being tagged as soft real-time. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: do not merge queues on flash storage with queueingPaolo Valente
To boost throughput with a set of processes doing interleaved I/O (i.e., a set of processes whose individual I/O is random, but whose merged cumulative I/O is sequential), BFQ merges the queues associated with these processes, i.e., redirects the I/O of these processes into a common, shared queue. In the shared queue, I/O requests are ordered by their position on the medium, thus sequential I/O gets dispatched to the device when the shared queue is served. Queue merging costs execution time, because, to detect which queues to merge, BFQ must maintain a list of the head I/O requests of active queues, ordered by request positions. Measurements showed that this costs about 10% of BFQ's total per-request processing time. Request processing time becomes more and more critical as the speed of the underlying storage device grows. Yet, fortunately, queue merging is basically useless on the very devices that are so fast to make request processing time critical. To reach a high throughput, these devices must have many requests queued at the same time. But, in this configuration, the internal scheduling algorithms of these devices do also the job of queue merging: they reorder requests so as to obtain as much as possible a sequential I/O pattern. As a consequence, with processes doing interleaved I/O, the throughput reached by one such device is likely to be the same, with and without queue merging. In view of this fact, this commit disables queue merging, and all related housekeeping, for non-rotational devices with internal queueing. The total, single-lock-protected, per-request processing time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time measured with simple code instrumentation, and using the throughput-sync.sh script of the S suite [1], in performance-profiling mode). To put this result into context, the total, single-lock-protected, per-request execution time of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). Disabling merging provides a further, remarkable benefit in terms of throughput. Merging tends to make many workloads artificially more uneven, mainly because of shared queues remaining non empty for incomparably more time than normal queues. So, if, e.g., one of the queues in a set of merged queues has a higher weight than a normal queue, then the shared queue may inherit such a high weight and, by staying almost always active, may force BFQ to perform I/O plugging most of the time. This evidently makes it harder for BFQ to let the device reach a high throughput. As a practical example of this problem, and of the benefits of this commit, we measured again the throughput in the nasty scenario considered in previous commit messages: dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any of the other I/O schedulers. As such, this is also likely to be the maximum possible throughput reachable with this workload on this device, because I/O is mostly random, and the other schedulers basically just pass I/O requests to the drive as fast as possible. [1] https://github.com/Algodev-github/S Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Alessio Masola <alessio.masola@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-01block, bfq: tune service injection basing on request service timesPaolo Valente
The processes associated with a bfq_queue, say Q, may happen to generate their cumulative I/O at a lower rate than the rate at which the device could serve the same I/O. This is rather probable, e.g., if only one process is associated with Q and the device is an SSD. It results in Q becoming often empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty, then, during the service of Q, there will be frequent "service holes", i.e., time intervals during which Q gets empty and the device can only consume the I/O already queued in its hardware queues. This easily causes considerable losses of throughput. To counter this problem, BFQ implements a request injection mechanism, which tries to fill the above service holes with I/O requests taken from other bfq_queues. The hard part in this mechanism is finding the right amount of I/O to inject, so as to both boost throughput and not break Q's bandwidth and latency guarantees. To this goal, the current version of this mechanism measures the bandwidth enjoyed by Q while it is being served, and tries to inject the maximum possible amount of extra service that does not cause Q's bandwidth to decrease too much. This solution has an important shortcoming. For bandwidth measurements to be stable and reliable, Q must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not hold with many workloads. This commit addresses this issue by changing the way the amount of injection allowed is dynamically computed. It tunes injection as a function of the service times of single I/O requests of Q, instead of Q's bandwidth. Single-request service times are evidently meaningful even if Q gets very few I/O requests completed while it is in service. As a testbed for this new solution, we measured the throughput reached by BFQ for one of the nastiest workloads and configurations for this scheduler: the workload generated by the dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on a PLEXTOR PX-256M5. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Francesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>