Age | Commit message (Collapse) | Author |
|
Pull NVMe fixes from Keith:
"nvme fixes for Linux 6.14
- Concurrent pci error and hotplug handling fix (Keith)
- Endpoint function fixes (Damien)"
* tag 'nvme-6.14-2025-03-13' of git://git.infradead.org/nvme:
nvmet: pci-epf: Do not add an IRQ vector if not needed
nvmet: pci-epf: Set NVMET_PCI_EPF_Q_LIVE when a queue is fully created
nvme-pci: fix stuck reset on concurrent DPC and HP
|
|
Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding
conditions") modified the evaluation criteria for the third argument,
'ioerror', in the blk_mq_add_to_batch() function. Initially, the
function had checked if 'ioerror' equals zero. Following the commit, it
started checking for negative error values, with the presumption that
such values, for instance -EIO, would be passed in.
However, blk_mq_add_to_batch() callers do not pass negative error
values. Instead, they pass status codes defined in various ways:
- NVMe PCI and Apple drivers pass NVMe status code
- virtio_blk driver passes the virtblk request header status byte
- null_blk driver passes blk_status_t
These codes are either zero or positive, therefore the revised check
fails to function as intended. Specifically, with the NVMe PCI driver,
this modification led to the failure of the blktests test case nvme/039.
In this test scenario, errors are artificially injected to the NVMe
driver, resulting in positive NVMe status codes passed to
blk_mq_add_to_batch(), which unexpectedly processes the failed I/O in a
batch. Hence the failure.
To correct the ioerror check within blk_mq_add_to_batch(), make all
callers to uniformly pass the argument as boolean. Modify the callers to
check their specific status codes and pass the boolean value 'is_error'.
Also describe the arguments of blK_mq_add_to_batch as kerneldoc.
Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/20250311104359.1767728-3-shinichiro.kawasaki@wdc.com
[axboe: fold in documentation update]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The PCIe error handling has the nvme driver quiesce the device, attempt
to restart it, then wait for that restart to complete.
A PCIe DPC event also toggles the PCIe link. If the slot doesn't have
out-of-band presence detection, this will trigger a pciehp
re-enumeration.
The error handling that calls nvme_error_resume is holding the device
lock while this happens. This lock blocks pciehp's request to disconnect
the driver from proceeding.
Meanwhile the nvme's reset can't make forward progress because its
device isn't there anymore with outstanding IO, and the timeout handler
won't do anything to fix it because the device is undergoing error
handling.
End result: deadlocked.
Fix this by having the timeout handler short cut the disabling for a
disconnected PCIe device. The downside is that we're relying on an IO
timeout to clean up this mess, which could be a minute by default.
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The PCI P2PDMA code will register the CMB block to the memory
hot-plugging subsystem, which have an alignment requirement. Memory
blocks that do not satisfy this alignment requirement (usually 2MB) will
lead to a WARNING from memory hotplugging.
Verify the CMB block's address and size against the alignment and only
try to send CMB blocks compatible with it to prevent this warning.
Tested on Intel DC D4502 SSD, which has a 512K CMB block that is too
small for memory hotplugging (thus PCI P2PDMA).
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
CMB decoding should get disabled when the CMB block isn't successfully
registered to P2P DMA subsystem.
Clean up the CMBMSC register in this error handling codepath to disable
CMB decoding (and CMBLOC/CMBSZ registers).
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
In order for two Acer FA100 SSDs to work in one PC (in the case of
myself, a Lenovo Legion T5 28IMB05), and not show one drive and not
the other, and sometimes mix up what drive shows up (randomly), these
two lines of code need to be added, and then both of the SSDs will
show up and not conflict when booting off of one of them. If you boot
up your computer with both SSDs installed without this patch, you may
also randomly get into a kernel panic (if the initrd is not set up) or
stuck in the initrd "/init" process, it is set up, however, if you do
apply this patch, there should not be problems with booting or seeing
both contents of the drive. Tested with the btrfs filesystem with a
RAID configuration of having the root drive '/' combined to make two
256GB Acer FA100 SSDs become 512GB in total storage.
Kernel Logs with patch applied (`dmesg -t | grep -i nvm`):
```
...
nvme 0000:04:00.0: platform quirk: setting simple suspend
nvme nvme0: pci function 0000:04:00.0
nvme 0000:05:00.0: platform quirk: setting simple suspend
nvme nvme1: pci function 0000:05:00.0
nvme nvme1: missing or invalid SUBNQN field.
nvme nvme1: allocated 64 MiB host memory buffer.
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme0: allocated 64 MiB host memory buffer.
nvme nvme1: 8/0/0 default/read/poll queues
nvme nvme1: Ignoring bogus Namespace Identifiers
nvme nvme0: 8/0/0 default/read/poll queues
nvme nvme0: Ignoring bogus Namespace Identifiers
nvme0n1: p1 p2
...
```
Kernel Logs with patch not applied (`dmesg -t | grep -i nvm`):
```
...
nvme 0000:04:00.0: platform quirk: setting simple suspend
nvme nvme0: pci function 0000:04:00.0
nvme 0000:05:00.0: platform quirk: setting simple suspend
nvme nvme1: pci function 0000:05:00.0
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme1: missing or invalid SUBNQN field.
nvme nvme0: allocated 64 MiB host memory buffer.
nvme nvme1: allocated 64 MiB host memory buffer.
nvme nvme0: 8/0/0 default/read/poll queues
nvme nvme1: 8/0/0 default/read/poll queues
nvme nvme1: globally duplicate IDs for nsid 1
nvme nvme1: VID:DID 1dbe:5216 model:Acer SSD FA100 256GB firmware:1.Z.J.2X
nvme0n1: p1 p2
...
```
Signed-off-by: Christopher Lentocha <christopherericlentocha@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Pull NVMe fixes from Keith:
"nvme fixes for Linux 6.14
- Connection fixes for fibre channel transport (Daniel)
- Endian fixes (Keith, Christoph)
- Cleanup fix for host memory buffer (Francis)
- Platform specific power quirks (Georg)
- Target memory leak (Sagi)
- Use appropriate controller state accessor (Daniel)"
* tag 'nvme-6.14-2025-01-31' of git://git.infradead.org/nvme:
nvme-fc: use ctrl state getter
nvme: make nvme_tls_attrs_group static
nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
nvmet: the result field in nvmet_alloc_ctrl_args is little endian
nvmet: fix a memory leak in controller identify
nvme-fc: do not ignore connectivity loss during connecting
nvme: handle connectivity loss in nvme_set_queue_count
nvme-fc: go straight to connecting state when initializing
nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
nvme-pci: remove redundant dma frees in hmb
nvmet: fix rw control endian access
|
|
Pull block updates from Jens Axboe:
- NVMe pull requests via Keith:
- Target support for PCI-Endpoint transport (Damien)
- TCP IO queue spreading fixes (Sagi, Chaitanya)
- Target handling for "limited retry" flags (Guixen)
- Poll type fix (Yongsoo)
- Xarray storage error handling (Keisuke)
- Host memory buffer free size fix on error (Francis)
- MD pull requests via Song:
- Reintroduce md-linear (Yu Kuai)
- md-bitmap refactor and fix (Yu Kuai)
- Replace kmap_atomic with kmap_local_page (David Reaver)
- Quite a few queue freeze and debugfs deadlock fixes
Ming introduced lockdep support for this in the 6.13 kernel, and it
has (unsurprisingly) uncovered quite a few issues
- Use const attributes for IO schedulers
- Remove bio ioprio wrappers
- Fixes for stacked device atomic write support
- Refactor queue affinity helpers, in preparation for better supporting
isolated CPUs
- Cleanups of loop O_DIRECT handling
- Cleanup of BLK_MQ_F_* flags
- Add rotational support for null_blk
- Various fixes and cleanups
* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
block: Don't trim an atomic write
block: Add common atomic writes enable flag
md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
block: limit disk max sectors to (LLONG_MAX >> 9)
block: Change blk_stack_atomic_writes_limits() unit_min check
block: Ensure start sector is aligned for stacking atomic writes
blk-mq: Move more error handling into blk_mq_submit_bio()
block: Reorder the request allocation code in blk_mq_submit_bio()
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
md/md-bitmap: move bitmap_{start, end}write to md upper layer
md/raid5: implement pers->bitmap_sector()
md: add a new callback pers->bitmap_sector()
md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
md: Replace deprecated kmap_atomic() with kmap_local_page()
md: reintroduce md-linear
partitions: ldm: remove the initial kernel-doc notation
blk-cgroup: rwstat: fix kernel-doc warnings in header file
blk-cgroup: fix kernel-doc warnings in header file
nbd: fix partial sending
...
|
|
On the TUXEDO InfinityBook Pro Gen9 Intel, a Samsung 990 Evo NVMe leads to
a high power consumption in s2idle sleep (4 watts).
This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with
a lower power consumption, typically around 1.2 watts.
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: stable@vger.kernel.org
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
On the TUXEDO InfinityFlex, a Samsung 990 Evo NVMe leads to a high power
consumption in s2idle sleep (4 watts).
This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with
a lower power consumption, typically around 1.4 watts.
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: stable@vger.kernel.org
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The value of size is 0 when there is no dma buffer allocated. The value
of i also remains 0. So, no need to free the dma buffer in out_free_bufs.
Hence, remove the redundant dma frees.
Signed-off-by: Francis Pravin <francis.p@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
dev->host_mem_size value is updated only after the successful buffer
allocation of hmb descriptor. Otherwise, it may have some undefined value.
So, use the correct size to free the hmb buffer when the hmb descriptor
buffer allocation failed.
Signed-off-by: Francis Pravin <francis.p@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
envent -> event.
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The nvme_poll_cq() function currently returns the number of CQEs
found, However, only one caller, nvme_poll(), requires a boolean
value to check whether any CQE was completed. The other callers do
not use the return value at all.
To better reflect its usage, update the return type of nvme_poll_cq()
from int to bool.
Signed-off-by: Yongsoo Joo <ysjoo@kookmin.ac.kr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Replace all users of blk_mq_pci_map_queues with the more generic
blk_mq_map_hw_queues. This in preparation to retire
blk_mq_pci_map_queues.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Link: https://lore.kernel.org/r/20241202-refactor-blk-affinity-helpers-v6-6-27211e9c2cd5@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
We initially introduced a quick fix limiting the queue depth to 1 as
experimentation showed that it fixed data corruption on 64GB steamdecks.
Further experimentation revealed corruption only happens when the last
PRP data element aligns to the end of the page boundary. The device
appears to treat this as a PRP chain to a new list instead of the data
element that it actually is. This implementation is in violation of the
spec. Encountering this errata with the Linux driver requires the host
request a 128k transfer and coincidently be handed the last small pool
dma buffer within a page.
The QD1 quirk effectly works around this because the last data PRP
always was at a 248 byte offset from the page start, so it never
appeared at the end of the page, but comes at the expense of throttling
IO and wasting the remainder of the PRP page beyond 256 bytes. Also to
note, the MDTS on these devices is small enough that the "large" prp
pool can hold enough PRP elements to never reach the end, so that pool
is not a problem either.
Introduce a new quirk to ensure the small pool is always aligned such
that the last PRP element can't appear a the end of the page. This comes
at the expense of wasting 256 bytes per small pool page allocated.
Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u
Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
Cc: Paweł Anikiel <panikiel@google.com>
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Only call into nvme_alloc_host_mem_single which uses
dma_alloc_noncontiguous when there is non-null dma merge boundary.
Without this we'll call into dma_alloc_noncontiguous for device using
dma-direct, which can work fine as long as the preferred size is below the
MAX_ORDER of the page allocator, but blows up with a warning if it is
too large.
Fixes: 63a5c7a4b4c4 ("nvme-pci: use dma_alloc_noncontigous if possible")
Reported-by: Leon Romanovsky <leon@kernel.org>
Reported-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Leon Romanovsky <leon@kernel.org>
Tested-by: Chaitanya Kumar Borah <chaitanya.kumar.borah@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The quirk was initially used as a signal to set the discard_zeroes_data
queue limit because there were some use cases that relied on that
behavior. The queue limit no longer exists as every user of it has been
converted to use the write zeroes operation instead.
The quirk now means to use a discard command as an alias to a write
zeroes request. Two of the devices previously using the quirk support
the write zeroes command directly, so these don't need or want to use
discard when the desired operation is to write zeroes.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
If the device supports SGLs, use these for all user requests. This
format encodes the expected transfer length so it can catch short buffer
errors in a user command, whether it occurred accidently or maliciously.
For controllers that support SGL data mode, this is a viable mitigation
to CVE-2023-6238. For controllers that don't support SGLs, log a warning
in the passthrough path since not having the capability can corrupt
data if the interface is not used correctly.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Supporting this mode allows creating and merging multi-segment metadata
requests that wouldn't be possible otherwise. It also allows directly
using user space requests that straddle physically discontiguous pages.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add requests to the tail of the list instead of the front so that they
are queued up in submission order.
Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs
and nvme_queue_rqs now that the list is ordered as expected.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Replace the semi-open coded request list helpers with a proper rq_list
type that mirrors the bio_list and has head and tail pointers. Besides
better type safety this actually allows to insert at the tail of the
list, which will be useful soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
blk_mq_flush_plug_list submits requests in the reverse order that they
were submitted, which leads to a rather suboptimal I/O pattern especially
in rotational devices. Fix this by rewriting nvme_queue_rqs so that it
always pops the requests from the passed in request list, and then adds
them to the head of a local submit list. This actually simplifies the
code a bit as it removes the complicated list splicing, at the cost of
extra updates of the rq_next pointer. As that should be cache hot
anyway it should be an easy price to pay.
Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use dma_alloc_noncontigous to allocate a single IOVA-contigous segment
when backed by an IOMMU. This allow to easily use bigger segments and
avoids running into segment limits if we can avoid it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The HMB descriptor table is sized to the maximum number of descriptors
that could be used for a given device, but __nvme_alloc_host_mem could
break out of the loop earlier on memory allocation failure and end up
using less descriptors than planned for, which leads to an incorrect
size passed to dma_free_coherent.
In practice this was not showing up because the number of descriptors
tends to be low and the dma coherent allocator always allocates and
frees at least a page.
Fixes: 87ad72a59a38 ("nvme-pci: implement host memory buffer support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_dev_disable() modifies the dev->online_queues field, therefore
nvme_pci_update_nr_queues() should avoid racing against it, otherwise
we could end up passing invalid values to blk_mq_update_nr_hw_queues().
WARNING: CPU: 39 PID: 61303 at drivers/pci/msi/api.c:347
pci_irq_get_affinity+0x187/0x210
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
RIP: 0010:pci_irq_get_affinity+0x187/0x210
Call Trace:
<TASK>
? blk_mq_pci_map_queues+0x87/0x3c0
? pci_irq_get_affinity+0x187/0x210
blk_mq_pci_map_queues+0x87/0x3c0
nvme_pci_map_queues+0x189/0x460 [nvme]
blk_mq_update_nr_hw_queues+0x2a/0x40
nvme_reset_work+0x1be/0x2a0 [nvme]
Fix the bug by locking the shutdown_lock mutex before using
dev->online_queues. Give up if nvme_dev_disable() is running or if
it has been executed already.
Fixes: 949928c1c731 ("NVMe: Fix possible queue use after freed")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Pull block updates from Jens Axboe:
- MD changes via Song:
- md-bitmap refactoring (Yu Kuai)
- raid5 performance optimization (Artur Paszkiewicz)
- Other small fixes (Yu Kuai, Chen Ni)
- Add a sysfs entry 'new_level' (Xiao Ni)
- Improve information reported in /proc/mdstat (Mateusz Kusiak)
- NVMe changes via Keith:
- Asynchronous namespace scanning (Stuart)
- TCP TLS updates (Hannes)
- RDMA queue controller validation (Niklas)
- Align field names to the spec (Anuj)
- Metadata support validation (Puranjay)
- A syntax cleanup (Shen)
- Fix a Kconfig linking error (Arnd)
- New queue-depth quirk (Keith)
- Add missing unplug trace event (Keith)
- blk-iocost fixes (Colin, Konstantin)
- t10-pi modular removal and fixes (Alexey)
- Fix for potential BLKSECDISCARD overflow (Alexey)
- bio splitting cleanups and fixes (Christoph)
- Deal with folios rather than rather than pages, speeding up how the
block layer handles bigger IOs (Kundan)
- Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike)
- Reduce zoned device overhead in ublk (Ming)
- Add and use sendpages_ok() for drbd and nvme-tcp (Ofir)
- Fix regression in partition error pointer checking (Riyan)
- Add support for write zeroes and rotational status in nbd (Wouter)
- Add Yu Kuai as new BFQ maintainer. The scheduler has been
unmaintained for quite a while.
- Various sets of fixes for BFQ (Yu Kuai)
- Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail,
Yang)
* tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits)
nvme-pci: qdepth 1 quirk
block: fix potential invalid pointer dereference in blk_add_partition
blk_iocost: make read-only static array vrate_adj_pct const
block: unpin user pages belonging to a folio at once
mm: release number of pages of a folio
block: introduce folio awareness and add a bigger size from folio
block: Added folio-ized version of bio_add_hw_page()
block, bfq: factor out a helper to split bfqq in bfq_init_rq()
block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq()
block, bfq: remove local variable 'split' in bfq_init_rq()
block, bfq: remove bfq_log_bfqg()
block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()
block, bfq: fix procress reference leakage for bfqq in merge chain
block, bfq: fix uaf for accessing waker_bfqq after splitting
blk-throttle: support prioritized processing of metadata
blk-throttle: remove last_low_overflow_time
drbd: Add NULL check for net_conf to prevent dereference in state validation
nvme-tcp: fix link failure for TCP auth
blk-mq: add missing unplug trace event
mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init()
...
|
|
Another device has been reported to be unreliable if we have more than
one outstanding command. In this new case, data corruption may occur.
Since we have two devices now needing this quirky behavior, make a
generic quirk flag.
The same Apple quirk is clearly not "temporary", so update the comment
while moving it.
Link: https://lore.kernel.org/linux-nvme/191d810a4e3.fcc6066c765804.973611676137075390@collabora.com/
Reported-by: Robert Beckett <bob.beckett@collabora.com>
Reviewed-by: Christoph Hellwig hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
On some TUXEDO platforms, a Samsung 990 Evo NVMe leads to a high
power consumption in s2idle sleep (2-3 watts).
This patch applies 'Force No Simple Suspend' quirk to achieve a
sleep with a lower power consumption, typically around 0.5 watts.
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
If a drive is unable to create IO queues on the initial probe, a
subsequent reset will need to allocate the tagset if IO queue creation
is successful. Without this, blk_mq_update_nr_hw_queues will crash on a
bad pointer due to the invalid tagset.
Fixes: eac3ef262941f62 ("nvme-pci: split the initial probe from the rest path")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_map_data() is called when request has physical segments, hence
the nvme_unmap_data() should have same condition to avoid dereference.
Fixes: 4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
pcie_aspm=off tells the kernel not to modify the ASPM configuration. This
setting does not guarantee that ASPM (Active State Power Management) is
disabled. Hence add pcie_port_pm=off. This disables power management for
all PCIe ports.
This patch has been tested on a workstation with a Samsung SSD 970 EVO Plus
NVMe SSD.
Fixes: 4641a8e6e145 ("nvme-pci: add trouble shooting steps for timeouts")
Cc: Keith Busch <kbusch@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
There is a hardware power-saving problem with the Lenovo N60z
board. When turn it on and leave it for 10 hours, there is a
20% chance that a nvme disk will not wake up until reboot.
Link: https://lore.kernel.org/all/2B5581C46AC6E335+9c7a81f1-05fb-4fd0-9fbb-108757c21628@uniontech.com
Signed-off-by: hmy <huanglin@uniontech.com>
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
for-6.11/block
Pull NVMe updates from Keith:
"nvme updates for Linux 6.11
- Device initialization memory leak fixes (Keith)
- More constants defined (Weiwen)
- Target debugfs support (Hannes)
- PCIe subsystem reset enhancements (Keith)
- Queue-depth multipath policy (Redhat and PureStorage)
- Implement get_unique_id (Christoph)
- Authentication error fixes (Gaosheng)"
* tag 'nvme-6.11-2024-07-08' of git://git.infradead.org/nvme: (21 commits)
nvmet-auth: fix nvmet_auth hash error handling
nvme: implement ->get_unique_id
nvme-multipath: implement "queue-depth" iopolicy
nvme-multipath: prepare for "queue-depth" iopolicy
nvme-pci: do not directly handle subsys reset fallout
lpfc_nvmet: implement 'host_traddr'
nvme-fcloop: implement 'host_traddr'
nvmet-fc: implement host_traddr()
nvmet-rdma: implement host_traddr()
nvmet-tcp: implement host_traddr()
nvmet: add 'host_traddr' callback for debugfs
nvmet: add debugfs support
mailmap: add entry for Weiwen Hu
nvme: rename CDR/MORE/DNR to NVME_STATUS_*
nvme: fix status magic numbers
nvme: rename nvme_sc_to_pr_err to nvme_status_to_pr_err
nvme: split device add from initialization
nvme: fc: split controller bringup handling
nvme: rdma: split controller bringup handling
nvme: tcp: split controller bringup handling
...
|
|
If we allocate a bio that is larger than NVMe maximum request size,
attach integrity metadata to it and send it to the NVMe subsystem, the
integrity metadata will be corrupted.
Splitting the bio works correctly. The function bio_split will clone the
bio, trim the iterator of the first bio and advance the iterator of the
second bio.
However, the function rq_integrity_vec has a bug - it returns the first
vector of the bio's metadata and completely disregards the metadata
iterator that was advanced when the bio was split. Thus, the second bio
uses the same metadata as the first bio and this leads to metadata
corruption.
This commit changes rq_integrity_vec, so that it calls mp_bvec_iter_bvec
instead of returning the first vector. mp_bvec_iter_bvec reads the
iterator and uses it to build a bvec for the current position in the
iterator.
The "queue_max_integrity_segments(rq->q) > 1" check was removed, because
the updated rq_integrity_vec function works correctly with multiple
segments.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/49d1afaa-f934-6ed2-a678-e0d428c63a65@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Scheduling reset_work after a nvme subsystem reset is expected to fail
on pcie, but this also prevents potential handling the platform's pcie
services may provide that might successfully recovering the link without
re-enumeration. Such examples include AER, DPC, and power's EEH.
Provide a pci specific operation that safely initiates a subsystem
reset, and instead of scheduling reset work, read back the status
register to trigger a pcie read error.
Since this only affects pci, the other fabrics drivers subscribe to a
generic nvmf subsystem reset that is exactly the same as before. The
loop fabric doesn't use it because nvmet doesn't support setting that
property anyway.
And since we're using the magic NSSR value in two places now, provide a
symbolic define for it.
Reported-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Combining both creates an ambiguous cleanup scenario for the caller if
an error is returned: does the device reference need to be dropped or
did the error occur before the device was initialized? If an error
occurs after the device is added, then the existing cleanup routines
will leak memory.
Furthermore, the nvme core is taking it upon itself to free the device's
kobj name under certain conditions rather than go through the core
device API. We shouldn't be peaking into these implementation details.
Split the device initialization from the addition to make it easier to
know the error handling actions, fix the existing memory leaks, and stop
the device layering violations.
Link: https://lore.kernel.org/linux-nvme/c4050a37-ecc9-462c-9772-65e25166f439@grimberg.me/
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
bio_vec start offset may be relatively large particularly when large
folio gets added to the bio. A bigger offset will result in avoiding the
single-segment mapping optimization and end up using expensive
mempool_alloc further.
Rather than using absolute value, adjust bv_offset by
NVME_CTRL_PAGE_SIZE while checking if segment can be fitted into one/two
PRP entries.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kundan Kumar <kundan.kumar@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Sandisk SN530 NVMe drives have broken MSIs. On systems without MSI-X
support, all commands time out resulting in the following message:
nvme nvme0: I/O tag 12 (100c) QID 0 timeout, completion polled
These timeouts cause the boot to take an excessively-long time (over 20
minutes) while the initial command queue is flushed.
Address this by adding a quirk for drives with buggy MSIs. The lspci
output for this device (recorded on a system with MSI-X support) is:
02:00.0 Non-Volatile memory controller: Sandisk Corp Device 5008 (rev 01) (prog-if 02 [NVM Express])
Subsystem: Sandisk Corp Device 5008
Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0
Memory at f7e00000 (64-bit, non-prefetchable) [size=16K]
Memory at f7e04000 (64-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 3
Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
Capabilities: [c0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [1b8] Latency Tolerance Reporting
Capabilities: [300] Secondary PCI Express
Capabilities: [900] L1 PM Substates
Kernel driver in use: nvme
Kernel modules: nvme
Cc: <stable@vger.kernel.org>
Signed-off-by: Sean Anderson <sean.anderson@linux.dev>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
While I/O is running, if the pci bus error occurs then
in-flight I/O can not complete. Worst, if at this time,
user (logically) hot-unplug the nvme disk then the
nvme_remove() code path can't forward progress until
in-flight I/O is cancelled. So these sequence of events
may potentially hang hot-unplug code path indefinitely.
This patch helps cancel the pending/in-flight I/O from the
nvme request timeout handler in case the nvme controller
is in the terminal (DEAD/DELETING/DELETING_NOIO) state and
that helps nvme_remove() code path forward progress and
finish successfully.
Link: https://lore.kernel.org/all/199be893-5dfa-41e5-b6f2-40ac90ebccc4@linux.ibm.com/
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:2262], which appears to be a generic VID:PID pair used for
many SSDs based on the Silicon Motion SM2262/SM2262EN controller.
Two of my SSDs with this VID:PID pair exhibit the same behavior:
* They frequently have trouble exiting the deepest power state (5),
resulting in the entire disk unresponsive.
Verified by setting nvme_core.default_ps_max_latency_us=10000 and
observing them behaving normally.
* They produce all-zero nguid and eui64 with `nvme id-ns` command.
The offending products are:
* HP SSD EX950 1TB
* HIKVISION C2000Pro 2TB
Signed-off-by: Jiawei Fu <i@ibugone.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_opcode_str() currently supports admin, IO, and fabrics commands.
However, fabrics commands aren't allowed for the pci transport.
Currently the pci caller passes 0 as the fctype,
which means any fabrics command would be displayed as "Property Set".
Move fabrics command support into a function nvme_fabrics_opcode_str()
and remove the fctype argument to nvme_opcode_str().
This way, a fabrics command will display as "Unknown" for pci.
Convert the rdma and tcp transports to use nvme_fabrics_opcode_str().
Signed-off-by: Caleb Sander <csander@purestorage.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:-
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-core.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fabrics.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-rdma.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fc.o
WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-tcp.o
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Pull block fixes from Jens Axboe:
- NVMe pull request via Keith:
- tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes,
Christoph)
- discard fixes and improvements (Christoph)
- timeout debug improvements (Keith, Max)
- various cleanups (Daniel, Max, Giuxen)
- trace event string fixes (Arnd)
- shadow doorbell setup on reset fix (William)
- a write zeroes quirk for SK Hynix (Jim)
- MD pull request via Song:
- Sparse warning since v6.0 (Bart)
- /proc/mdstat regression since v6.7 (Yu Kuai)
- Use symbolic error value (Christian)
- IO Priority documentation update (Christian)
- Fix for accessing queue limits without having entered the queue
(Christoph, me)
- Fix for loop dio support (Christoph)
- Move null_blk off deprecated ida interface (Christophe)
- Ensure nbd initializes full msghdr (Eric)
- Fix for a regression with the folio conversion, which is now easier
to hit because of an unrelated change (Matthew)
- Remove redundant check in virtio-blk (Li)
- Fix for a potential hang in sbitmap (Ming)
- Fix for partial zone appending (Damien)
- Misc changes and fixes (Bart, me, Kemeng, Dmitry)
* tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits)
Documentation: block: ioprio: Update schedulers
loop: fix the the direct I/O support check when used on top of block devices
blk-mq: Remove the hctx 'run' debugfs attribute
nbd: always initialize struct msghdr completely
block: Fix iterating over an empty bio with bio_for_each_folio_all
block: bio-integrity: fix kcalloc() arguments order
virtio_blk: remove duplicate check if queue is broken in virtblk_done
sbitmap: remove stale comment in sbq_calc_wake_batch
block: Correct a documentation comment in blk-cgroup.c
null_blk: Remove usage of the deprecated ida_simple_xx() API
block: ensure we hold a queue reference when using queue limits
blk-mq: rename blk_mq_can_use_cached_rq
block: print symbolic error name instead of error code
blk-mq: fix IO hang from sbitmap wakeup race
nvmet-rdma: avoid circular locking dependency on install_queue()
nvmet-tcp: avoid circular locking dependency on install_queue()
nvme-pci: set doorbell config before unquiescing
block: fix partial zone append completion handling in req_bio_endio()
block/iocost: silence warning on 'last_period' potentially being unused
md/raid1: Use blk_opf_t for read and write operations
...
|
|
During resets, if queues are unquiesced first, then the host can submit
IOs to the controller using shadow doorbell logic but the controller
won't be aware. This can lead to necessary MMIO doorbells from being
not issued, causing requests to be delayed and timed-out.
Signed-off-by: William Butler <wab@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Kernel configs don't necessarily have opcode decoding, and some opcodes
are not even decodable. It is still interesting for debugging SSD issues
to know what opcode is timing out, what request type it came from, and
the data size (if applicable).
Also print the command_id along side blk-mq's tag to help match commands
with protocol wire traces and firmware logs,
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
SK Hynix BC901 drive write zero will cause Chromebook takes more than 20 mins to switch to developer mode
"disable write zeroes" can fix this issue and Sk Hynix has been verified.
Signed-off-by: Jim.Lin <jim.lin@siliconmotion.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Some Kingston NV1 and A2000 are wasting a lot of power on specific TUXEDO
platforms in s2idle sleep if 'Simple Suspend' is used.
This patch applies a new quirk 'Force No Simple Suspend' to achieve a
low power sleep without 'Simple Suspend'.
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
A different CPU may be setting the ctrl->state value, so ensure proper
barriers to prevent optimizing to a stale state. Normally it isn't a
problem to observe the wrong state as it is merely advisory to take a
quicker path during initialization and error recovery, but seeing an old
state can report unexpected ENETRESET errors when a reset request was in
fact successful.
Reported-by: Minh Hoang <mh2022@meta.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Hannes Reinecke <hare@suse.de>
|
|
Pull block updates from Jens Axboe:
- Improvements to the queue_rqs() support, and adding null_blk support
for that as well (Chengming)
- Series improving badblocks support (Coly)
- Key store support for sed-opal (Greg)
- IBM partition string handling improvements (Jan)
- Make number of ublk devices supported configurable (Mike)
- Cancelation improvements for ublk (Ming)
- MD pull requests via Song:
- Handle timeout in md-cluster, by Denis Plotnikov
- Cleanup pers->prepare_suspend, by Yu Kuai
- Rewrite mddev_suspend(), by Yu Kuai
- Simplify md_seq_ops, by Yu Kuai
- Reduce unnecessary locking array_state_store(), by Mariusz
Tkaczyk
- Make rdev add/remove independent from daemon thread, by Yu Kuai
- Refactor code around quiesce() and mddev_suspend(), by Yu Kuai
- NVMe pull request via Keith:
- nvme-auth updates (Mark)
- nvme-tcp tls (Hannes)
- nvme-fc annotaions (Kees)
- Misc cleanups and improvements (Jiapeng, Joel)
* tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits)
block: ublk_drv: Remove unused function
md: cleanup pers->prepare_suspend()
nvme-auth: allow mixing of secret and hash lengths
nvme-auth: use transformed key size to create resp
nvme-auth: alloc nvme_dhchap_key as single buffer
nvmet-tcp: use 'spin_lock_bh' for state_lock()
powerpc/pseries: PLPKS SED Opal keystore support
block: sed-opal: keystore access for SED Opal keys
block:sed-opal: SED Opal keystore
ublk: simplify aborting request
ublk: replace monitor with cancelable uring_cmd
ublk: quiesce request queue when aborting queue
ublk: rename mm_lock as lock
ublk: move ublk_cancel_dev() out of ub->mutex
ublk: make sure io cmd handled in submitter task context
ublk: don't get ublk device reference in ublk_abort_queue()
ublk: Make ublks_max configurable
ublk: Limit dev_id/ub_number values
md-cluster: check for timeout while a new disk adding
nvme: rework NVME_AUTH Kconfig selection
...
|