Age | Commit message (Collapse) | Author |
|
Commit 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin
io-policy") introduced the creation of the multipath sysfs group under
the NVMe head gendisk device node. However, it also inadvertently added
the same sysfs group under each namespace path device which head node
refers to and that is incorrect.
The multipath sysfs group should only be exposed through the namespace
head gendisk node. This is sufficient, as the head device already
provides symbolic links to the individual namespace paths it manages.
This patch fixes the issue by preventing the creation of the multipath
sysfs group under namespace path devices, ensuring it only appears under
the head disk node.
Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
In the NVMe context, the term "shutdown" has a specific technical
meaning. To avoid confusion, this commit renames the nvme_mpath_
shutdown_disk function to nvme_mpath_remove_disk to better reflect
its purpose (i.e. removing the disk from the system). However,
nvme_mpath_remove_disk was already in use, and its functionality
is related to releasing or putting the head node disk. To resolve
this naming conflict and improve clarity, the existing nvme_mpath_
remove_disk function is also renamed to nvme_mpath_put_disk.
This renaming improves code readability and better aligns function
names with their actual roles.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Currently, a multipath head disk node is not created for single-
ported NVMe adapters or private namespaces with non-unique NSID.
However, creating a head node in these cases can help transparently
handle transient PCIe link failures. Without a head node, features
like delayed removal cannot be leveraged, making it difficult to
tolerate such link failures. To address this, this commit introduces
nvme_core module parameter multipath_always_on.
When multipath_always_on is set to true, it forces the creation of a
multipath head node regardless NVMe disk or namespace type. So this
option allows the use of delayed removal of head node functionality
even for single-ported NVMe disks and private namespaces with a unique
NSID and thus helps transparently handle transient PCIe link failures.
By default multipath_always_on is set to false, thus preserving the
existing behavior. Setting it to true enables improved fault tolerance
in PCIe setups. Moreover, please note that enabling this option would
also implicitly enable nvme_core.multipath.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Currently, the multipath head node of an NVMe disk is removed
immediately as soon as all paths of the disk are removed. However,
this can cause issues in scenarios where:
- The disk hot-removal followed by re-addition.
- Transient PCIe link failures that trigger re-enumeration,
temporarily removing and then restoring the disk.
In these cases, removing the head node prematurely may lead to a head
disk node name change upon re-addition, requiring applications to
reopen their handles if they were performing I/O during the failure.
To address this, introduce a delayed removal mechanism of head disk
node. During transient failure, instead of immediate removal of head
disk node, the system waits for a configurable timeout, allowing the
disk to recover.
During transient disk failure, if application sends any IO then we
queue it instead of failing such IO immediately. If the disk comes back
online within the timeout, the queued IOs are resubmitted to the disk
ensuring seamless operation. In case disk couldn't recover from the
failure then queued IOs are failed to its completion and application
receives the error.
So this way, if disk comes back online within the configured period,
the head node remains unchanged, ensuring uninterrupted workloads
without requiring applications to reopen device handles.
A new sysfs attribute, named "delayed_removal_secs" is added under head
disk blkdev for user who wish to configure time for the delayed removal
of head disk node. The default value of this attribute is set to zero
second ensuring no behavior change unless explicitly configured.
Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/
Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/
Suggested-by: Keith Busch <kbusch@kernel.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
[nilay: reworked based on the original idea/POC from Christoph and Keith]
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Redefine the max segments and max integrity limits based on the limiting
factors. This keeps exactly the same values for 4k PAGE_SIZE systems,
but increases the number of segments for larger page size as it properly
derives the scatterlist allocation based limit for them instead of
assuming a 4k PAGE_SIZE.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
|
|
This avoids open coding the variable size array arithmetics.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
Open coding magic numbers in multiple places is never a good idea.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
|
|
Add a separate flag to encode that the transfer is using the small
page sized pool, and use a normal 0..n count for the number of
descriptors.
Contains improvements and suggestions from Kanchan Joshi
<joshi.k@samsung.com> and Leon Romanovsky <leon@kernel.org>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
They are used for both PRPs and SGLs, and we use descriptor elsewhere
when referring to their allocations, so use that name here as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
There is no real point in having a union of two pointer types here, just
use a void pointer as we mix and match types between the arms of the
union between the allocation and freeing side already.
Also rename the nr_allocations field to nr_descriptors to better describe
what it does.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[leon: ported forward to include metadata SGL support]
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
|
|
Instead of keeping dedicated "bool aborted" variable, switch to a flags
flags that can be used for other flags as well.
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
|
|
No admin command defined in an NVMe specification supports metadata,
but to protect against vendor specific commands using metadata ensure
that we don't try to use SGLs for metadata on the admin queue, as NVMe
does not support SGLs on the admin queue for the PCI transport. Do
this by checking if the data transfer has been setup using SGLs as
that is required for using SGLs for metadata.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
|
|
NVMe commands with over 8 KB of discontiguous data allocate PRP list
pages from the per-nvme_device dma_pool prp_page_pool or prp_small_pool.
Each call to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool
spinlock. These device-global spinlocks are a significant source of
contention when many CPUs are submitting to the same NVMe devices. On a
workload issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2
NUMA nodes to 23 NVMe devices, we observed 2.4% of CPU time spent in
_raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.
Ideally, the dma_pools would be per-hctx to minimize contention. But
that could impose considerable resource costs in a system with many NVMe
devices and CPUs.
As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each
nvme_queue to the set of DMA pools corresponding to its device and its
hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by
about half, to 1.2%. Preventing the sharing of PRP list pages across
NUMA nodes also makes them cheaper to initialize.
Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
nvme_init_hctx() and nvme_admin_init_hctx() are very similar. In
preparation for adding more logic, factor out a nvme_init_hctx-common()
helper.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The lsrsp object is maintained by the LLDD. The lifetime of the lsrsp
object is implicit. Because there is no explicit cleanup/free call into
the LLDD, it is not safe to assume after xml_rsp_fails, that the lsrsp
is still valid. The LLDD could have freed the object already.
With the recent changes how fcloop tracks the resources, this is the
case. Thus don't access lsrsp after xml_rsp_fails.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When sending 'connect' the queues can figure out from the return code
whether authentication is required or not. But reauthentication doesn't
disconnect the queues, so this check is not available. Rather we need
to check whether the queue had been authenticated initially to figure
out if we need to reauthenticate.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When handling an R2T PDU we short-circuit nvme_tcp_queue_request()
as we should not attempt to send consecutive PDUs. So open-code
nvme_tcp_queue_request() for R2T and drop the last argument.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
|
|
When checking for secure concatenation we have already validated
that 'ctrl->opts' is set, so we can remove this check.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- fixes for atomic writes (Alan Adamson)
- fixes for polled CQs in nvmet-epf (Damien Le Moal)
- fix for polled CQs in nvme-pci (Keith Busch)
- fix compile on odd configs that need to be forced to inline
(Kees Cook)
- one more quirk (Ilya Guterman)
- Fix for missing allocation of an integrity buffer for some cases
- Fix for a regression with ublk command cancelation
* tag 'block-6.15-20250515' of git://git.kernel.dk/linux:
ublk: fix dead loop when canceling io command
nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro
nvme: all namespaces in a subsystem must adhere to a common atomic write size
nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing
nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ
nvmet: pci-epf: improve debug message
nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq()
nvmet: pci-epf: do not fall back to using INTX if not supported
nvmet: pci-epf: clear completion queue IRQ flag on delete
nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable
nvme-pci: make nvme_pci_npages_prp() __always_inline
block: always allocate integrity buffer when required
|
|
This commit adds the NVME_QUIRK_NO_DEEPEST_PS quirk for device
[126f:2262], which belongs to device SOLIDIGM P44 Pro SSDPFKKW020X7
The device frequently have trouble exiting the deepest power state (5),
resulting in the entire disk being unresponsive.
Verified by setting nvme_core.default_ps_max_latency_us=10000 and
observing the expected behavior.
Signed-off-by: Ilya Guterman <amfernusus@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The first namespace configured in a subsystem sets the subsystem's
atomic write size based on its AWUPF or NAWUPF. Subsequent namespaces
must have an atomic write size (per their AWUPF or NAWUPF) less than or
equal to the subsystem's atomic write size, or their probing will be
rejected.
Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
[hch: fold in review comments from John Garry]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
|
|
A change to QEMU resulted in all nvme controllers (single and
multi-controller subsystems) to have its CMIC.MCTRS bit set which
indicates the subsystem supports multiple controllers and it is possible
a namespace can be shared between those multiple controllers in a
multipath configuration.
When a namespace of a CMIC.MCTRS enabled subsystem is allocated, a
multipath node is created. The queue limits for this node are inherited
from the namespace being allocated. When inheriting queue limits, the
features being inherited need to be specified. The atomic write feature
(BLK_FEAT_ATOMIC_WRITES) was not specified so the atomic queue limits
were not inherited by the multipath disk node which resulted in the sysfs
atomic write attributes being zeroed. The fix is to include
BLK_FEAT_ATOMIC_WRITES in the list of features to be inherited.
Signed-off-by: Alan Adamson <alan.adamson@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
We need to lock this queue for that condition because the timeout work
executes per-namespace and can poll the poll CQ.
Reported-by: Hannes Reinecke <hare@kernel.org>
Closes: https://lore.kernel.org/all/20240902130728.1999-1-hare@kernel.org/
Fixes: a0fa9647a54e ("NVMe: add blk polling support")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The only reason nvme_pci_npages_prp() could be used as a compile-time
known result in BUILD_BUG_ON() is because the compiler was always choosing
to inline the function. Under special circumstances (sanitizer coverage
functions disabled for __init functions on ARCH=um), the compiler decided
to stop inlining it:
drivers/nvme/host/pci.c: In function 'nvme_init':
include/linux/compiler_types.h:557:45: error: call to '__compiletime_assert_678' declared with attribute error: BUILD_BUG_ON failed: nvme_pci_npages_prp() > NVME_MAX_NR_ALLOCATIONS
557 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:538:25: note: in definition of macro '__compiletime_assert'
538 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:557:9: note: in expansion of macro '_compiletime_assert'
557 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
| ^~~~~~~~~~~~~~~~
drivers/nvme/host/pci.c:3804:9: note: in expansion of macro 'BUILD_BUG_ON'
3804 | BUILD_BUG_ON(nvme_pci_npages_prp() > NVME_MAX_NR_ALLOCATIONS);
| ^~~~~~~~~~~~
Force it to be __always_inline to make sure it is always available for
use with BUILD_BUG_ON().
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202505061846.12FMyRjj-lkp@intel.com/
Fixes: c372cdd1efdf ("nvme-pci: iod npages fits in s8")
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Pull block fixes from Jens Axboe:
- Fix for a regression in this series for loop and read/write iterator
handling
- zone append block update tweak
- remove a broken IO priority test
- NVMe pull request via Christoph:
- unblock ctrl state transition for firmware update (Daniel
Wagner)
* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
block: remove test of incorrect io priority level
nvme: unblock ctrl state transition for firmware update
block: only update request sector if needed
loop: Add sanity check for read/write_iter
|
|
Remove the q argument from blk_rq_map_kern and the internal helpers
called by it as the queue can trivially be derived from the request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The original nvme subsystem design didn't have a CONNECTING state; the
state machine allowed transitions from RESETTING to LIVE directly.
With the introduction of nvme fabrics the CONNECTING state was
introduce. Over time the nvme-pci started to use the CONNECTING state as
well.
Eventually, a bug fix for the nvme-fc started to depend that the only
valid transition to LIVE was from CONNECTING. Though this change didn't
update the firmware update handler which was still depending on
RESETTING to LIVE transition.
The simplest way to address it for the time being is to switch into
CONNECTING state before going to LIVE state.
Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state")
Reported-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Closes: https://lore.kernel.org/all/0134ea15-8d5f-41f7-9e9a-d7e6d82accaa@roeck-us.net
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
|
|
The plid array, head->plids, is meant to store placement IDs, each of
type u16. But its size has been incorrectly calculated, as the size of
the pointer is being used instead of the size of the object it points
to.
Use the sizeof(*head->plids) in kcalloc so that we don't allocate extra.
Fixes: 38e8397dde63 ("nvme: use fdp streams if write stream is provided")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
write_stream_granularity is set to max(info->runs, U32_MAX), which means
that any RUNS value less than 2 ** 32 becomes U32_MAX, and any larger
value is silently truncated to an unsigned int.
Use min() instead to provide the correct semantics, capping RUNS values
at U32_MAX.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer")
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250506175413.1936110-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Maps a user requested write stream to an FDP placement ID if possible.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-12-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Register the device data placement limits if supported. This is just
registering the limits with the block layer. Nothing beyond reporting
these attributes is happening in this patch.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-11-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
That allows passing in structures instead of the u32 result, and thus
reduce the amount of bit shifting and masking required to parse the
result.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-9-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
For log pages that need to pass in a LSI value, while at the same time
not touching all the existing nvme_get_log callers.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250506121732.8211-8-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Merge 6.15 block fixes in, once again, to resolve conflicts with the
fixes for ublk that went into mainline and the 6.16 ublk updates.
* block-6.15:
nvmet-auth: always free derived key data
nvmet-tcp: don't restore null sk_state_change
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
nvme-tcp: fix premature queue removal and I/O failover
nvme-pci: add quirks for WDC Blue SN550 15b7:5009
nvme-pci: add quirks for device 126f:1001
nvme-pci: fix queue unquiesce check on slot_reset
ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
ublk: enhance check for register/unregister io buffer command
ublk: decouple zero copy from user copy
selftests: ublk: fix UBLK_F_NEED_GET_DATA
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Christoph:
- fix queue unquiesce check on PCI slot_reset (Keith Busch)
- fix premature queue removal and I/O failover in nvme-tcp (Michael
Liang)
- don't restore null sk_state_change (Alistair Francis)
- select CONFIG_TLS where needed (Alistair Francis)
- always free derived key data (Hannes Reinecke)
- more quirks (Wentao Guan)
- ublk zero copy fix
- ublk selftest fix for UBLK_F_NEED_GET_DATA
* tag 'block-6.15-20250502' of git://git.kernel.dk/linux:
nvmet-auth: always free derived key data
nvmet-tcp: don't restore null sk_state_change
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
nvme-tcp: fix premature queue removal and I/O failover
nvme-pci: add quirks for WDC Blue SN550 15b7:5009
nvme-pci: add quirks for device 126f:1001
nvme-pci: fix queue unquiesce check on slot_reset
ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
ublk: enhance check for register/unregister io buffer command
ublk: decouple zero copy from user copy
selftests: ublk: fix UBLK_F_NEED_GET_DATA
|
|
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TCP_TLS is enabled. Without this the code compiles, but does
not actually work unless something else enables CONFIG_TLS.
Fixes: be8e82caa68 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This patch addresses a data corruption issue observed in nvme-tcp during
testing.
In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.
However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC’s TX queue). From the
user-space app's perspective, the buffer associated with the I/O is
considered completed since they're acked on the different path and may
be reused for new I/O requests.
Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.
We can reproduce this data corruption by injecting delay on one path and
triggering i/o timeout.
To prevent this issue, this change ensures that all inflight
transmissions are fully completed from host's perspective before
returning from queue stop. To handle concurrent I/O timeout from multiple
namespaces under the same controller, always wait in queue stop
regardless of queue's state.
This aligns with the behavior of queue stopping in other NVMe fabric
transports.
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Add two quirks for the WDC Blue SN550 (PCI ID 15b7:5009) based on user
reports and hardware analysis:
- NVME_QUIRK_NO_DEEPEST_PS:
liaozw talked to me the problem and solved with
nvme_core.default_ps_max_latency_us=0, so add the quirk.
I also found some reports in the following link.
- NVME_QUIRK_BROKEN_MSI:
after get the lspci from Jack Rio.
I think that the disk also have NVME_QUIRK_BROKEN_MSI.
described in commit d5887dc6b6c0 ("nvme-pci: Add quirk for broken MSIs")
as sean said in link which match the MSI 1/32 and MSI-X 17.
Log:
lspci -nn | grep -i memory
03:00.0 Non-Volatile memory controller [0108]: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) [15b7:5009] (rev 01)
lspci -v -d 15b7:5009
03:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD
Flags: bus master, fast devsel, latency 0, IRQ 35, IOMMU group 10
Memory at fe800000 (64-bit, non-prefetchable) [size=16K]
Memory at fe804000 (64-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 3
Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
Capabilities: [c0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [1b8] Latency Tolerance Reporting
Capabilities: [300] Secondary PCI Express
Capabilities: [900] L1 PM Substates
Kernel driver in use: nvme
dmesg | grep nvme
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.059301] Kernel command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.542430] nvme nvme0: pci function 0000:03:00.0
[ 0.560426] nvme nvme0: allocated 32 MiB host memory buffer.
[ 0.562491] nvme nvme0: 16/0/0 default/read/poll queues
[ 0.567764] nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9
[ 6.388726] EXT4-fs (nvme0n1p7): mounted filesystem ro with ordered data mode. Quota mode: none.
[ 6.893421] EXT4-fs (nvme0n1p7): re-mounted r/w. Quota mode: none.
[ 7.125419] Adding 16777212k swap on /dev/nvme0n1p8. Priority:-2 extents:1 across:16777212k SS
[ 7.157588] EXT4-fs (nvme0n1p6): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 7.165021] EXT4-fs (nvme0n1p9): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 8.036932] nvme nvme0: using unchecked data buffer
[ 8.096023] block nvme0n1: No UUID available providing old NGUID
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5887dc6b6c054d0da3cd053afc15b7be1f45ff6
Link: https://lore.kernel.org/all/20240422162822.3539156-1-sean.anderson@linux.dev/
Reported-by: liaozw <hedgehog-002@163.com>
Closes: https://bbs.deepin.org.cn/post/286300
Reported-by: rugk <rugk+github@posteo.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=208123
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:1001].
It is similar to commit e89086c43f05 ("drivers/nvme: Add quirks for
device 126f:2262")
Diff is according the dmesg, use NVME_QUIRK_IGNORE_DEV_SUBNQN.
dmesg | grep -i nvme0:
nvme nvme0: pci function 0000:01:00.0
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme0: 12/0/0 default/read/poll queues
Link:https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e89086c43f0500bc7c4ce225495b73b8ce234c1f
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
A zero return means the reset was successfully scheduled. We don't want
to unquiesce the queues while the reset_work is pending, as that will
just flush out requeued requests to a failed completion.
Fixes: 71a5bb153be104 ("nvme: ensure disabling pairs with unquiesce")
Reported-by: Dhankaran Singh Ajravat <dhankaran@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Pull block fixes from Jens Axboe:
- MD pull via Yu:
- fix raid10 missing discard IO accounting (Yu Kuai)
- fix bitmap stats for bitmap file (Zheng Qixing)
- fix oops while reading all member disks failed during
check/repair (Meir Elisha)
- NVMe pull via Christoph:
- fix scan failure for non-ANA multipath controllers (Hannes
Reinecke)
- fix multipath sysfs links creation for some cases (Hannes
Reinecke)
- PCIe endpoint fixes (Damien Le Moal)
- use NULL instead of 0 in the auth code (Damien Le Moal)
- Various ublk fixes:
- Slew of selftest additions
- Improvements and fixes for IO cancelation
- Tweak to Kconfig verbiage
- Fix for page dirtying for blk integrity mapped pages
- loop fixes:
- buffered IO fix
- uevent fixes
- request priority inheritance fix
- Various little fixes
* tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits)
selftests: ublk: add generic_06 for covering fault inject
ublk: simplify aborting ublk request
ublk: remove __ublk_quiesce_dev()
ublk: improve detection and handling of ublk server exit
ublk: move device reset into ublk_ch_release()
ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
ublk: add ublk_force_abort_dev()
ublk: properly serialize all FETCH_REQs
selftests: ublk: move creating UBLK_TMP into _prep_test()
selftests: ublk: add test_stress_05.sh
selftests: ublk: support user recovery
selftests: ublk: support target specific command line
selftests: ublk: increase max nr_queues and queue depth
selftests: ublk: set queue pthread's cpu affinity
selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
selftests: ublk: add two stress tests for zero copy feature
selftests: ublk: run stress tests in parallel
selftests: ublk: make sure _add_ublk_dev can return in sub-shell
selftests: ublk: cleanup backfile automatically
selftests: ublk: add io_uring uapi header
...
|
|
When rapidly rescanning for new namespaces nvme_mpath_add_sysfs_link() may be
called for a block device not added to sysfs. But NVME_NS_SYSFS_ATTR_LINK
had already been set, so when checking this device a second time we will fail
to create the link.
Fix this by exchanging the order of the block device check and the
NVME_NS_SYSFS_ATTR_LINK bit check.
Fixes: 4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>**
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Commit 62baf70c3274 caused the ANA log page to be re-read, even on
controllers that do not support ANA. While this should generally
harmless, some controllers hang on the unsupported log page and
never finish probing.
Fixes: 62baf70c3274 ("nvme: re-read ANA log page after ns scan completes")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Srikanth Aithal <sraithal@amd.com>
[hch: more detailed commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
|
|
Pull more block fixes from Jens Axboe:
"Apparently my internal clock was off, or perhaps it was just wishful
thinking, but I sent out block fixes yesterday as my brain assumed it
was Friday. Subsequently, that missed the NVMe fixes that should go
into this weeks release as well. Hence, here's a followup with those,
and another simple fix.
- NVMe pull request via Christoph:
- nvmet fc/fcloop refcounting fixes (Daniel Wagner)
- fix missed namespace/ANA scans (Hannes Reinecke)
- fix a use after free in the new TCP netns support (Kuniyuki
Iwashima)
- fix a NULL instead of false review in multipath (Uday Shankar)
- Use strscpy() for null_blk disk name copy"
* tag 'block-6.15-20250411' of git://git.kernel.dk/linux:
null_blk: Use strscpy() instead of strscpy_pad() in null_add_dev()
nvmet-fc: put ref when assoc->del_work is already scheduled
nvmet-fc: take tgtport reference only once
nvmet-fc: update tgtport ref per assoc
nvmet-fc: inline nvmet_fc_free_hostport
nvmet-fc: inline nvmet_fc_delete_assoc
nvmet-fcloop: add ref counting to lport
nvmet-fcloop: replace kref with refcount
nvmet-fcloop: swap list_add_tail arguments
nvme-tcp: fix use-after-free of netns by kernel TCP socket.
nvme: multipath: fix return value of nvme_available_path
nvme: re-read ANA log page after ns scan completes
nvme: requeue namespace scan on missed AENs
|
|
Commit 1be52169c348 ("nvme-tcp: fix selinux denied when calling
sock_sendmsg") converted sock_create() in nvme_tcp_alloc_queue()
to sock_create_kern().
sock_create_kern() creates a kernel socket, which does not hold
a reference to netns. If the code does not manage the netns
lifetime properly, use-after-free could happen.
Also, TCP kernel socket with sk_net_refcnt 0 has a socket leak
problem: it remains FIN_WAIT_1 if it misses FIN after close()
because tcp_close() stops all timers.
To fix such problems, let's hold netns ref by sk_net_refcnt_upgrade().
We had the same issue in CIFS, SMC, etc, and applied the same
solution, see commit ef7134c7fc48 ("smb: client: Fix use-after-free
of network namespace.") and commit 9744d2bf1976 ("smc: Fix
use-after-free in tcp_write_timer_handler().").
Fixes: 1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
The function returns bool so we should return false, not NULL. No
functional changes are expected.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
When scanning for new namespaces we might have missed an ANA AEN.
The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous
Event Information - Notice': Asymmetric Namespace Access Change) states:
A controller shall not send this even if an Attached Namespace
Attribute Changed asynchronous event [...] is sent for the same event.
so we need to re-read the ANA log page after we rescanned the namespace
list to update the ANA states of the new namespaces.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
Scanning for namespaces can take some time, so if the target is
reconfigured while the scan is running we may miss a Attached Namespace
Attribute Changed AEN.
Check if the NVME_AER_NOTICE_NS_CHANGED bit is set once the scan has
finished, and requeue scanning to pick up any missed change.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
|
|
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- PCI endpoint target cleanup (Damien)
- Early import for uring_cmd fixed buffer (Caleb)
- Multipath documentation and notification improvements (John)
- Invalid pci sq doorbell write fix (Maurizio)
- Queue init locking fix
- Remove dead nsegs parameter from blk_mq_get_new_requests()
* tag 'block-6.15-20250403' of git://git.kernel.dk/linux:
block: don't grab elevator lock during queue initialization
nvme-pci: skip nvme_write_sq_db on empty rqlist
nvme-multipath: change the NVME_MULTIPATH config option
nvme: update the multipath warning in nvme_init_ns_head
nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()
nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()
nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer
nvmet: pci-epf: Keep completion queues mapped
block: remove unused nseg parameter
|