Age | Commit message (Collapse) | Author |
|
Add a separate flag to encode that the transfer is using the small
page sized pool, and use a normal 0..n count for the number of
descriptors.
Contains improvements and suggestions from Kanchan Joshi
<joshi.k@samsung.com> and Leon Romanovsky <leon@kernel.org>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
They are used for both PRPs and SGLs, and we use descriptor elsewhere
when referring to their allocations, so use that name here as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
There is no real point in having a union of two pointer types here, just
use a void pointer as we mix and match types between the arms of the
union between the allocation and freeing side already.
Also rename the nr_allocations field to nr_descriptors to better describe
what it does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[leon: ported forward to include metadata SGL support]
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
|
|
Instead of keeping dedicated "bool aborted" variable, switch to a flags
flags that can be used for other flags as well.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
|
|
No admin command defined in an NVMe specification supports metadata,
but to protect against vendor specific commands using metadata ensure
that we don't try to use SGLs for metadata on the admin queue, as NVMe
does not support SGLs on the admin queue for the PCI transport. Do
this by checking if the data transfer has been setup using SGLs as
that is required for using SGLs for metadata.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
NVMe commands with over 8 KB of discontiguous data allocate PRP list
pages from the per-nvme_device dma_pool prp_page_pool or prp_small_pool.
Each call to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool
spinlock. These device-global spinlocks are a significant source of
contention when many CPUs are submitting to the same NVMe devices. On a
workload issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2
NUMA nodes to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.
Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many NVMe
devices and CPUs.
As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.
Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
nvme_init_hctx() and nvme_admin_init_hctx() are very similar. In
preparation for adding more logic, factor out a nvme_init_hctx-common()
helper.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Introduce dma_pool_create_node(), like dma_pool_create() but taking an
additional NUMA node argument. Allocate struct dma_pool on the desired
node, and store the node on dma_pool for allocating struct dma_page.
Make dma_pool_create() an alias for dma_pool_create_node() with node set
to NUMA_NO_NODE.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The lsrsp object is maintained by the LLDD. The lifetime of the lsrsp
object is implicit. Because there is no explicit cleanup/free call into
the LLDD, it is not safe to assume after xml_rsp_fails, that the lsrsp
is still valid. The LLDD could have freed the object already.
With the recent changes how fcloop tracks the resources, this is the
case. Thus don't access lsrsp after xml_rsp_fails.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The lifetime of the fcloop_lsreq is not tight to the lifetime of the
host or target port, thus there is no need anymore to synchronize the
cleanup path anymore.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Add the missing fcloop_call_host_done calls so that the caller
frees resources when something goes wrong.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Ensure that the tgtport is not going away as long portentry has a
pointer on it.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When nvmet_fc_unregister_targetport is called by the LLDD, it's not
possible to communicate with the host, thus all pending request will not
be process. Thus explicitly free them.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When the target port is gone, the lsrsp pointer is invalid. Thus don't
call the done function anymore instead just drop the response.
This happens when the target sends a disconnect association. After this
the target starts tearing down all resources and doesn't expect any
response.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
fcloop depends on the host or the target to allocate the fcloop_lsreq
object. This means that the lifetime of the fcloop_lsreq is tied to
either the host or the target. Consequently, the host or the target must
cooperate during shutdown.
Unfortunately, this approach does not work well when the target forces a
shutdown, as there are dependencies that are difficult to resolve in a
clean way.
The simplest solution is to decouple the lifetime of the fcloop_lsreq
object by managing them directly within fcloop. Since this is not a
performance-critical path and only a small number of LS objects are used
during setup and cleanup, it does not significantly impact performance
to allocate them during normal operation.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The delete callback can be called either via the unregister function or
from the transport directly. Thus it is necessary ensure resources are
not freed multiple times.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The abort handling logic expects that the state and the fcpreq are only
accessed when holding the reqlock lock.
While at it, only handle the aborts in the abort handler.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Track the lifetime of the in-flight tfcp_req to ensure
the object is not freed too early.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Use the newly introduced fcloop_lport_lookup instead
of the open coded version.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The checks for a valid input values are mixed with the logic to insert a
newly allocated nport. Refactor the function so that first the checks
are done.
This allows to untangle the setup steps into a more linear form which
reduces the complexity of the functions.
Also start tracking lport when a lport is assigned to a nport. This
ensures, that the lport is not going away as long it is still referenced
by a nport.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The nport object has an association with the rport and lport object,
that means we can only remove an nport object from the global nport_list
after the last user of an rport or lport is gone.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
A nport object is always used in association with targerport,
remoteport, tport and rport objects. Add explicit references for any of
the associated object. This ensures that nport is not removed too early
on shutdown sequences.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Use SHASH_DESC_ON_STACK to avoid explicit allocation.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Use SHASH_DESC_ON_STACK to avoid explicit allocation.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Now that a submission queue holds a reference to its completion queue,
there is no need to pass the cq argument to nvmet_req_init(), so remove
it.
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The NVMe PCI transport specification allows for completion queues to be
shared by different submission queues.
This patch allows a submission queue to keep track of the completion queue
it is using with reference counting. As such, it can be ensured that a
completion queue is not deleted while a submission queue is actively
using it.
This patch enables completion queue sharing in the pci-epf target driver.
For fabrics drivers, completion queue sharing is not enabled as it is
not possible as per the fabrics specification. However, this patch
modifies the fabrics drivers to correctly integrate the new API that
supports completion queue sharing.
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
With struct nvmet_cq now having a reference count, this patch amends the
target fabrics call chain to initialize and destroy/put a completion
queue.
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
For the PCI transport, the NVMe specification allows submission queues
to share completion queues, however, this is not supported in the
current NVMe target implementation. This is a preparatory patch to allow
for completion queue (CQ) sharing between different submission queues
(SQ).
To support queue sharing, reference counting completion queues is
required. This patch adds the refcount_t field ref to struct nvmet_cq
coupled with respective nvmet_cq_init(), nvmet_cq_get(), nvmet_cq_put(),
nvmet_cq_is_deletable() and nvmet_cq_destroy() functions.
A CQ reference count is initialized with nvmet_cq_init() when a CQ is
created. Using nvmet_cq_get(), a reference to a CQ is taken when an SQ is
created that uses the respective CQ. Similarly. when an SQ is destroyed,
the reference count to the respective CQ from the SQ being destroyed is
decremented with nvmet_cq_put(). The last reference to a CQ is dropped
on a CQ deletion using nvmet_cq_put(), which invokes nvmet_cq_destroy()
to fully cleanup after the CQ. The helper function nvmet_cq_in_use() is
used to determine if any SQs are still using the CQ pending deletion.
In which case, the CQ must not be deleted. This should protect scenarios
where a bad host may attempt to delete a CQ without first having deleted
SQ(s) using that CQ.
Additionally, this patch adds an array of struct nvmet_cq to the
nvmet_ctrl structure. This allows for the controller to keep track of CQs
as they are created and destroyed, similar to the current tracking done
for SQs. The memory for this array is freed when the controller is freed.
A struct nvmet_ctrl reference is also added to the nvmet_cq structure to
allow for CQs to be removed from the controller whilst keeping the new
API similar to the existing API for SQs.
Sample callchain with CQ refcounting for the PCI endpoint target
(pci-epf):
i. nvmet_execute_create_cq -> nvmet_pci_epf_create_cq
-> nvmet_cq_create -> nvmet_cq_init [cq refcount=1]
ii. nvmet_execute_create_sq -> nvmet_pci_epf_create_sq
-> nvmet_sq_create -> nvmet_sq_init -> nvmet_cq_get [cq refcount=2]
iii. nvmet_execute_delete_sq - > nvmet_pci_epf_delete_sq ->
-> nvmet_sq_destroy -> nvmet_cq_put [cq refcount 1]
iv. nvmet_execute_delete_cq -> nvmet_pci_epf_delete_cq
-> nvmet_cq_put [cq refcount 0]
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This patch adds a new helper function nvmet_check_io_cqid(). It is to be
used when parsing host commands for IO CQ creation/deletion and IO SQ
creation to ensure that the specified IO completion queue identifier
(CQID) is not 0 (Admin queue ID). This is a check that already occurs in
the nvmet_execute_x() functions prior to nvmet_check_cqid.
With the addition of this helper function, the CQ ID checks in the
nvmet_execute_x() function can be removed, and instead simply call
nvmet_check_io_cqid() in place of nvmet_check_cqid().
Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Do not start authentication on I/O queues as it doesn't really add value,
and secure concatenation disallows it anyway. Authentication commands on
I/O queues are not aborted, so the host may still run the authentication
protocol on I/O queues.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When sending 'connect' the queues can figure out from the return code
whether authentication is required or not. But reauthentication doesn't
disconnect the queues, so this check is not available. Rather we need
to check whether the queue had been authenticated initially to figure
out if we need to reauthenticate.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Now that the crc32c() library function directly takes advantage of
architecture-specific optimizations, it is unnecessary to go through the
crypto API. Just use crc32c(). This is much simpler, and it improves
performance due to eliminating the crypto API overhead.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The strncpy() function is deprecated for NUL-terminated strings as
explained in the "strncpy() on NUL-terminated strings" section of
Documentation/process/deprecated.rst.
The key issues are:
- strncpy() fails to guarantee NULL-termination when source > destination
- it unnecessarily zero-pads short strings, causing performance overhead
strscpy() is the proper replacement because:
- it guarantees NULL-termination
- it avoids redundant zero-padding
- it aligns with current kernel string-copying best practice
memcpy() was rejected because:
- NQN buffers (subsysnqn/hostnqn) are treated as NULL-terminated strings:
- strcmp() usage in nvmet_host_allowed() (discovery.c)
- strscpy() to copy subsysnqn in nvmet_execute_disc_identify()
seq_buf wasn't used because:
- this is a simple fixed-size buffer copy
- there is no need for progressive string construction features
Signed-off-by: Marcelo Moreira <marcelomoreira1905@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When handling an R2T PDU we short-circuit nvme_tcp_queue_request()
as we should not attempt to send consecutive PDUs. So open-code
nvme_tcp_queue_request() for R2T and drop the last argument.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
|
|
When checking for secure concatenation we have already validated
that 'ctrl->opts' is set, so we can remove this check.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.
Move the conflicting declaration to the end of the structure. Notice
that `struct nvme_loop_iod` is a flexible structure --a structure
that contains a flexible-array member.
Fix the following warning:
drivers/nvme/target/loop.c:36:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
blk-mq-dma.c was split from blk-merge.c which has no copyright notice,
but except for some boilerplate code and comments left from the old
version this is all my code, so add my copyright.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250513071433.836797-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
While working on the new DMA API I kept getting annoyed how it was placed
right in the middle of the bio splitting code in blk-merge.c.
Split it out into a separate file.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250513071433.836797-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
unregister_sysctl_table() checks for NULL pointers internally.
Remove unneeded NULL check here.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn>
Link: https://lore.kernel.org/lkml/20250514032139.2317578-1-nichen@iscas.ac.cn
Reviewed-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/lkml/aCURuvkmz-fw3Nnp@equinox
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/r/20250514223354.1429-2-phil@philpotter.co.uk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When nr_hw_queues is updated, the elevator needs to be switched to
ensure that we exit elevator and reattach it to ensure that hctx->
sched_tags is correctly allocated for the new hardware queues.
However, elv_update_nr_hw_queues() currently only switches the
elevator if the queue is not registered. This is incorrect, as it
prevents reattaching the elevator after updating nr_hw_queues, which
in turn inhibits allocation of sched_tags.
Fix this by allowing the elevator switch if the queue is registered,
ensuring proper reattachment and resource allocation.
Fixes: 596dce110b7d ("block: simplify elevator reattachment for updating nr_hw_queues")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250515134511.548270-1-nilay@linux.ibm.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
If blk-throttle is enabled but blktrace is not, then the compiler will
notice that the following two variables are unused:
../block/blk-throttle.c: In function 'throtl_pending_timer_fn':
../block/blk-throttle.c:1153:30: warning: unused variable 'bio_cnt_w' [-Wunused-variable]
1153 | unsigned int bio_cnt_w = sq_queued(sq, WRITE);
| ^~~~~~~~~
../block/blk-throttle.c:1152:30: warning: unused variable 'bio_cnt_r' [-Wunused-variable]
1152 | unsigned int bio_cnt_r = sq_queued(sq, READ);
| ^~~~~~~~~
Silence that my annotating them with __maybe_unused.
Fixes: 28ad83b774a6 ("blk-throttle: Split the service queue")
Link: https://lore.kernel.org/all/20250515130830.9671-1-aishwarya.tcv@arm.com/
Reported-by: Aishwarya <aishwarya.tcv@arm.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The xarray can return the previous entry at a location. Use this
fact to simplify the brd code when there is no existing page at
a location. This also slighly improves the handling of racy
discards as we now always have a page under RCU protection by the
time we are ready to copy the data.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250507060700.3929430-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 9bc1e897a821 ("blk-mq: remove unused queue mapping helpers") makes
the two config options, BLK_MQ_PCI and BLK_MQ_VIRTIO, have no remaining
effect.
Remove the two obsolete config options.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250514065513.463941-1-lukas.bulwahn@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
bvec_try_merge_page currently returns if the added page fragment is
within the same page as the last page in the last current bio_vec.
This information is used by __bio_iov_iter_get_pages so that we always
have a single folio pin per page even when the page is split over
multiple __bio_iov_iter_get_pages calls.
Threading this through the entire lowlevel add page to bio logic is
annoying and inefficient and leads to less code sharing than otherwise
possible. Instead add code to __bio_iov_iter_get_pages that checks if
the bio_vecs did not change and thus a merge into the last segment must
have happened, and if there is an offset into the page for the currently
added fragment, because if yes we must have already had a previous
fragment of the same page in the last bio_vec. While this is still a bit
ugly, it keeps the logic in the one place that needs it and allows for
more code sharing.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250512042354.514329-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
[BUG]
There has an issue of io delayed dispatch caused by io splitting. Consider
the following scenario:
1) If we set a BPS limit of 1MB/s and restrict the maximum IO size per
dispatch to 4KB, submitting -two- 1MB IO requests results in completion
times of 1s and 2s, which is expected.
2) However, if we additionally set an IOPS limit of 1,000,000/s with the
same BPS limit of 1MB/s, submitting -two- 1MB IO requests again results in
both completing in 2s, even though the IOPS constraint is being met.
[CAUSE]
This issue arises because BPS and IOPS currently share the same queue in
the blkthrotl mechanism:
1) This issue does not occur when only BPS is limited because the split IOs
return false in blk_should_throtl() and do not go through to throtl again.
2) For split IOs, even if they have been tagged with BIO_BPS_THROTTLED,
they still get queued alternately in the same list due to continuous
splitting and reordering. As a result, the two IO requests are both
completed at the 2-second mark, causing an unintended delay.
3) It is not difficult to imagine that in this scenario, if N 1MB IOs are
issued at once, all IOs will eventually complete together in N seconds.
[FIX]
With the queue separation introduced in the previous patches, we now have
separate BPS and IOPS queues. For IOs that have already passed the BPS
limitation, they do not need to re-enter the BPS queue and can directly
placed to the IOPS queue.
Since we have split the queues, when the IOPS queue is previously empty
and a new bio is added to the first qnode->bios_iops list in the
service_queue, we also need to update the disptime. This patch introduces
"THROTL_TG_IOPS_WAS_EMPTY" flag to mark it.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-8-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch splits throtl_service_queue->nr_queued into "nr_queued_bps" and
"nr_queued_iops", allowing separate accounting of BPS and IOPS queued bios.
This prepares for future changes that need to check whether the BPS or IOPS
queues are empty.
To facilitate updating the number of IOs in the BPS and IOPS queues, the
addition logic will be moved from throtl_add_bio_tg() to
throtl_qnode_add_bio(), and similarly, the removal logic will be moved from
tg_dispatch_one_bio() to throtl_pop_queued().
And introduce sq_queued() to calculate the total sum of sq->nr_queued.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-7-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch splits the single queue into separate bps and iops queues. Now,
an IO request must first pass through the bps queue, then the iops queue,
and finally be dispatched. Due to the queue splitting, we need to modify
the throtl add/peek/pop function.
Additionally, the patch modifies the logic related to tg_dispatch_time().
If bio needs to wait for bps, function directly returns the bps wait time;
otherwise, it charges bps and returns the iops wait time so that bio can be
directly placed into the iops queue afterward. Note that this may lead to
more frequent updates to disptime, but the overhead is negligible for the
slow path.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-6-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Subsequent patches will split the single queue into separate bps and iops
queues. To prevent IO that has already passed through the bps queue at a
single tg level from being counted toward bps wait time again, we introduce
"BIO_TG_BPS_THROTTLED" flag. Since throttle and QoS operate at different
levels, we reuse the value as "BIO_QOS_THROTTLED".
We set this flag when charge bps and clear it when charge iops, as the bio
will move to the upper-level tg or be dispatched.
This patch does not involve functional changes.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-5-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split throtl_charge_bio() to facilitate subsequent patches that will
separately charge bps and iops after queue separation.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-4-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
tg_dispatch_time() contained both bps and iops throttling logic. We now
split its internal logic into tg_dispatch_bps/iops_time() to improve code
consistency for future separation of the bps and iops queues.
Besides, merge time_before() from caller into throtl_extend_slice() to make
code cleaner.
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huaweicloud.com>
Link: https://lore.kernel.org/r/20250506020935.655574-3-wozizhi@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|