summaryrefslogtreecommitdiff
path: root/drivers/nvme/host/core.c
AgeCommit message (Collapse)Author
7 daysMerge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - MD pull request via Yu: - call del_gendisk synchronously (Xiao) - cleanup unused variable (John) - cleanup workqueue flags (Ryo) - fix faulty rdev can't be removed during resync (Qixing) - NVMe pull request via Christoph: - try PCIe function level reset on init failure (Keith Busch) - log TLS handshake failures at error level (Maurizio Lombardi) - pci-epf: do not complete commands twice if nvmet_req_init() fails (Rick Wertenbroek) - misc cleanups (Alok Tiwari) - Removal of the pktcdvd driver This has been more than a decade coming at this point, and some recently revealed breakages that had it causing issues even for cases where it isn't required made me re-pull the trigger on this one. It's known broken and nobody has stepped up to maintain the code - Series for ublk supporting batch commands, enabling the use of multishot where appropriate - Speed up ublk exit handling - Fix for the two-stage elevator fixing which could leak data - Convert NVMe to use the new IOVA based API - Increase default max transfer size to something more reasonable - Series fixing write operations on zoned DM devices - Add tracepoints for zoned block device operations - Prep series working towards improving blk-mq queue management in the presence of isolated CPUs - Don't allow updating of the block size of a loop device that is currently under exclusively ownership/open - Set chunk sectors from stacked device stripe size and use it for the atomic write size limit - Switch to folios in bcache read_super() - Fix for CD-ROM MRW exit flush handling - Various tweaks, fixes, and cleanups * tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits) block: restore two stage elevator switch while running nr_hw_queue update cdrom: Call cdrom_mrw_exit from cdrom_release function sunvdc: Balance device refcount in vdc_port_mpgroup_check nvme-pci: try function level reset on init failure dm: split write BIOs on zone boundaries when zone append is not emulated block: use chunk_sectors when evaluating stacked atomic write limits dm-stripe: limit chunk_sectors to the stripe size md/raid10: set chunk_sectors limit md/raid0: set chunk_sectors limit block: sanitize chunk_sectors for atomic write limits ilog2: add max_pow_of_two_factor() nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails nvme-tcp: log TLS handshake failures at error level docs: nvme: fix grammar in nvme-pci-endpoint-target.rst nvme: fix typo in status code constant for self-test in progress nvmet: remove redundant assignment of error code in nvmet_ns_enable() nvme: fix incorrect variable in io cqes error message nvme: fix multiple spelling and grammar issues in host drivers block: fix blk_zone_append_update_request_bio() kernel-doc md/raid10: fix set but not used variable in sync_request_write() ...
7 daysMerge tag 'vfs-6.17-rc1.integrity' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs 'protection info' updates from Christian Brauner: "This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and protection info (PI) capabilities. This ioctl returns information about the files integrity profile. This is useful for userspace applications to understand a files end-to-end data protection support and configure the I/O accordingly. For now this interface is only supported by block devices. However the design and placement of this ioctl in generic FS ioctl space allows us to extend it to work over files as well. This maybe useful when filesystems start supporting PI-aware layouts. A new structure struct logical_block_metadata_cap is introduced, which contains the following fields: - lbmd_flags: bitmask of logical block metadata capability flags - lbmd_interval: the amount of data described by each unit of logical block metadata - lbmd_size: size in bytes of the logical block metadata associated with each interval - lbmd_opaque_size: size in bytes of the opaque block tag associated with each interval - lbmd_opaque_offset: offset in bytes of the opaque block tag within the logical block metadata - lbmd_pi_size: size in bytes of the T10 PI tuple associated with each interval - lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical block metadata - lbmd_pi_guard_tag_type: T10 PI guard tag type - lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag - lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag - lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag The internal logic to fetch the capability is encapsulated in a helper function blk_get_meta_cap(), which uses the blk_integrity profile associated with the device. The ioctl returns -EOPNOTSUPP, if CONFIG_BLK_DEV_INTEGRITY is not enabled" * tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl() fs: add ioctl to query metadata and protection info capabilities nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE block: introduce pi_tuple_size field in blk_integrity block: rename tuple_size field in blk_integrity to metadata_size
7 daysMerge tag 'vfs-6.17-rc1.fallocate' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fallocate updates from Christian Brauner: "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may leads to significant write amplification and performance degradation in synchronous write mode. At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth. Now that more and more flash-based storage devices are available it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device, instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth. This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes" * tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: add FALLOC_FL_WRITE_ZEROES support block: add FALLOC_FL_WRITE_ZEROES support block: factor out common part in blkdev_fallocate() fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate dm: clear unmap write zeroes limits when disabling write zeroes scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP nvmet: set WZDS and DRB if device enables unmap write zeroes operation nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit block: introduce max_{hw|user}_wzeroes_unmap_sectors to queue limits
2025-07-17nvme: fix multiple spelling and grammar issues in host driversAlok Tiwari
This commit fixes several typos and grammatical issues across various nvme host driver files: - correct "glace" to "glance" in a comment in apple.c - fix "Idependent" to "Independent" in core.c - change "unsucceesful" to "unsuccessful", "they blk-mq" to "the blk-mq", - fix "terminaed" to "terminated" and other grammar in fc.c - update "O's" to "0's" to clarify meaning in nvme.h - fix a function name reference in a comment in zns.c: *_transter_len() -> *_transfer_len(). - fix sysfs_emit() output format in pci.c (replace x%08x with 0x%08x) These changes improve the code readability and documentation consistency across the NVMe driver. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-15nvme: fix misaccounting of nvme-mpath inflight I/OYu Kuai
Procedures for nvme-mpath IO accounting: 1) initialize nvme_request and clear flags; 2) set NVME_MPATH_IO_STATS and increase inflight counter when IO started; 3) check NVME_MPATH_IO_STATS and decrease inflight counter when IO is done; However, for the case nvme_fail_nonready_command(), both step 1) and 2) are skipped, and if old nvme_request set NVME_MPATH_IO_STATS and then request is reused, step 3) will still be executed, causing inflight I/O counter to be negative. Fix the problem by clearing nvme_request in nvme_fail_nonready_command(). Fixes: ea5e5f42cd2c ("nvme-fabrics: avoid double completions in nvmf_fail_nonready_command") Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/all/CAHj4cs_+dauobyYyP805t33WMJVzOWj=7+51p4_j9rA63D9sog@mail.gmail.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-15nvme: revert the cross-controller atomic write size validationChristoph Hellwig
This was originally added by commit 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") to check the all controllers in a subsystem report the same atomic write size, but the check wasn't quite correct and caused problems for devices with multiple namespaces that report different LBA sizes. Commit f46d273449ba ("nvme: fix atomic write size validation") tried to fix this, but then caused problems for namespace rediscovery after a format with an LBA size change that changes the AWUPF value. This drops the validation and essentially reverts those two commits while keeping the cleanup that went in between the two. We'll need to figure out how to properly check for the mouse trap that nvme left us, but for now revert the check to keep devices working for users who couldn't care less about the atomic write feature. Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") Fixes: f46d273449ba ("nvme: fix atomic write size validation") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Alan Adamson <alan.adamson@oracle.com>
2025-07-14nvme: fix endianness of command word prints in nvme_log_err_passthru()John Garry
The command word members of struct nvme_common_command are __le32 type, so use helper le32_to_cpu() to read them properly. Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-14nvme: fix inconsistent RCU list manipulation in nvme_ns_add_to_ctrl_list()Zheng Qixing
When inserting a namespace into the controller's namespace list, the function uses list_add_rcu() when the namespace is inserted in the middle of the list, but falls back to a regular list_add() when adding at the head of the list. This inconsistency could lead to race conditions during concurrent access, as users might observe a partially updated list. Fix this by consistently using list_add_rcu() in both code paths to ensure proper RCU protection throughout the entire function. Fixes: be647e2c76b2 ("nvme: use srcu for iterating namespace list") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-01nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONEAnuj Gupta
protection information is treated as opaque when checksum type is BLK_INTEGRITY_CSUM_NONE. In order to maintain the right metadata semantics, set pi_offset only in cases where checksum type is not BLK_INTEGRITY_CSUM_NONE. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-4-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01block: introduce pi_tuple_size field in blk_integrityAnuj Gupta
Introduce a new pi_tuple_size field in struct blk_integrity to explicitly represent the size (in bytes) of the protection information (PI) tuple. This is a prep patch. Add validation in blk_validate_integrity_limits() to ensure that pi size matches the expected size for known checksum types and never exceeds the pi_tuple_size. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-3-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-01block: rename tuple_size field in blk_integrity to metadata_sizeAnuj Gupta
The tuple_size field in blk_integrity currently represents the total size of metadata associated with each data interval. To make the meaning more explicit, rename tuple_size to metadata_size. This is a purely mechanical rename with no functional changes. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-30nvme: correctly account for namespace head reference counterNilay Shroff
The blktests nvme/058 manifests an issue where the NVMe subsystem kobject entry remains stale in sysfs, causing a failure during subsequent NVMe module reloads[1]. Specifically, when attempting to register a new NVMe subsystem, the driver encounters a kobejct name collision because a stale kobject still exists. Though, please note that nvme/058 doesn't report any failure and test case passes and it's only during subsequent NVMe module reloads, the stale nvme sub- system kobject entry in sysfs causes the observed symptom[1]. This issue stems from an imbalance in the get/put usage of the namespace head (nshead) reference counter. The nshead holds a reference to the associated NVMe subsystem. If the nshead reference is not properly released, it prevents the cleanup of the subsystem's kobject, leaving nvme subsystem stale entry behind in sysfs. During the failure case, the last namespace path referencing a nshead is removed, but the nshead reference was not released. This occurs because the release logic currently only puts the nshead reference when its state is LIVE. However, in configurations where ANA (Asymmetric Namespace Access) is enabled, a namespace may be associated with an ANA state that is neither optimized nor non-optimized. In this case, the nshead may never transition to LIVE, and the corresponding nshead reference is then never dropped. In fact nvme/058 associates some of nvme namespaces to an inaccessible ANA state and with that nshead is created but it's state is not transitioned to LIVE. So the current logic would then causes nshead reference to be leaked for non-LIVE states. Another scenario, during namespace allocation, the driver first allocates a nshead and then issues an Identify Namespace command. If this command fails — which can happen in tests like nvme/058 that rapidly enables and disables namespaces — we must release the reference to the newly allocated nshead. However this reference release is currently missing in the failure, causing a nshead reference leak. To fix this, we now unconditionally release the nshead reference when the last nvme path referencing to the nshead is removed, regardless of the head’s state. Also during identify namespace failure case we now properly release the nshead reference. So this ensures proper cleanup of the nshead, and consequently, the NVMe subsystem and its associated kobject. This change prevents stale kobject entries from lingering in sysfs and eliminates the module reload failures observed just after running nvme/058. [1] https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Reported-by: yi.zhang@redhat.com Closes: https://lore.kernel.org/all/CAHj4cs8fOBS-eSjsd5LUBzy7faKXJtgLkCN+mDy_-ezCLLLq+Q@mail.gmail.com/ Fixes: 62188639ec16 ("nvme-multipath: introduce delayed removal of the multipath head node") Tested-by: yi.zhang@redhat.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-30nvme: Fix incorrect cdw15 value in passthru error loggingAlok Tiwari
Fix an error in nvme_log_err_passthru() where cdw14 was incorrectly printed twice instead of cdw15. This fix ensures accurate logging of the full passthrough command payload. Fixes: 9f079dda1433 ("nvme: allow passthru cmd error logging") Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-26nvme: fix atomic write size validationChristoph Hellwig
Don't mix the namespace and controller values, and validate the per-controller limit when probing the controller. This avoid spurious failures for controllers with namespaces that have different namespaces with different logical block sizes, or report the per-namespace values only for some namespaces. It also fixes a missing queue_limits_cancel_update in an error path by removing that error path. Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size") Reported-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Tested-by: Yi Zhang <yi.zhang@redhat.com>
2025-06-26nvme: refactor the atomic write unit detectionChristoph Hellwig
Move all the code out of nvme_update_disk_info into the helper, and rename the helper to have a somewhat less clumsy name. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-06-26nvme: reset delayed remove_work after reconnectKeith Busch
The remove_work will proceed with permanently disconnecting on the initial final path failure if the head shows no paths after the delay. If a new path connects while the remove_work is pending, and if that new path happens to disconnect before that remove_work executes, the delayed removal should reset based on the most recent path disconnect time, but queue_delayed_work() won't do anything if the work is already pending. Attempt to cancel the delayed work when a new path connects, and use mod_delayed_work() in case the remove_work remains pending anyway. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-23nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bitZhang Yi
When the device supports the Write Zeroes command and the DEAC bit, it indicates that the deallocate bit in the Write Zeroes command is supported, and the bytes read from a deallocated logical block are zeroes. This means the device supports unmap Write Zeroes operation, so set the max_hw_wzeroes_unmap_sectors to max_write_zeroes_sectors on the device's queue limit. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-3-yi.zhang@huaweicloud.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-04nvme: spelling fixesYi Zhang
Fix various spelling errors in comments. Signed-off-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04nvme: fix command limits status codeKeith Busch
The command specific status code, 0x183, was introduced in the NVMe 2.0 specification defined to "Command Size Limits Exceeded" and only ever applied to DSM and Copy commands. Fix the name and, remove the incorrect translation to error codes and special treatment in the target code for it. Fixes: 3b7c33b28a44d4 ("nvme.h: add Write Zeroes definitions") Cc: Chaitanya Kulkarni <chaitanyak@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-26Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...
2025-05-20nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_diskNilay Shroff
In the NVMe context, the term "shutdown" has a specific technical meaning. To avoid confusion, this commit renames the nvme_mpath_ shutdown_disk function to nvme_mpath_remove_disk to better reflect its purpose (i.e. removing the disk from the system). However, nvme_mpath_remove_disk was already in use, and its functionality is related to releasing or putting the head node disk. To resolve this naming conflict and improve clarity, the existing nvme_mpath_ remove_disk function is also renamed to nvme_mpath_put_disk. This renaming improves code readability and better aligns function names with their actual roles. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-multipath: introduce delayed removal of the multipath head nodeNilay Shroff
Currently, the multipath head node of an NVMe disk is removed immediately as soon as all paths of the disk are removed. However, this can cause issues in scenarios where: - The disk hot-removal followed by re-addition. - Transient PCIe link failures that trigger re-enumeration, temporarily removing and then restoring the disk. In these cases, removing the head node prematurely may lead to a head disk node name change upon re-addition, requiring applications to reopen their handles if they were performing I/O during the failure. To address this, introduce a delayed removal mechanism of head disk node. During transient failure, instead of immediate removal of head disk node, the system waits for a configurable timeout, allowing the disk to recover. During transient disk failure, if application sends any IO then we queue it instead of failing such IO immediately. If the disk comes back online within the timeout, the queued IOs are resubmitted to the disk ensuring seamless operation. In case disk couldn't recover from the failure then queued IOs are failed to its completion and application receives the error. So this way, if disk comes back online within the configured period, the head node remains unchanged, ensuring uninterrupted workloads without requiring applications to reopen device handles. A new sysfs attribute, named "delayed_removal_secs" is added under head disk blkdev for user who wish to configure time for the delayed removal of head disk node. The default value of this attribute is set to zero second ensuring no behavior change unless explicitly configured. Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/ Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/ Suggested-by: Keith Busch <kbusch@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> [nilay: reworked based on the original idea/POC from Christoph and Keith] Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-14nvme: all namespaces in a subsystem must adhere to a common atomic write sizeAlan Adamson
The first namespace configured in a subsystem sets the subsystem's atomic write size based on its AWUPF or NAWUPF. Subsequent namespaces must have an atomic write size (per their AWUPF or NAWUPF) less than or equal to the subsystem's atomic write size, or their probing will be rejected. Signed-off-by: Alan Adamson <alan.adamson@oracle.com> [hch: fold in review comments from John Garry] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com>
2025-05-07block: remove the q argument from blk_rq_map_kernChristoph Hellwig
Remove the q argument from blk_rq_map_kern and the internal helpers called by it as the queue can trivially be derived from the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07nvme: unblock ctrl state transition for firmware updateDaniel Wagner
The original nvme subsystem design didn't have a CONNECTING state; the state machine allowed transitions from RESETTING to LIVE directly. With the introduction of nvme fabrics the CONNECTING state was introduce. Over time the nvme-pci started to use the CONNECTING state as well. Eventually, a bug fix for the nvme-fc started to depend that the only valid transition to LIVE was from CONNECTING. Though this change didn't update the firmware update handler which was still depending on RESETTING to LIVE transition. The simplest way to address it for the time being is to switch into CONNECTING state before going to LIVE state. Fixes: d2fe192348f9 ("nvme: only allow entering LIVE from CONNECTING state") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Daniel Wagner <wagi@kernel.org> Closes: https://lore.kernel.org/all/0134ea15-8d5f-41f7-9e9a-d7e6d82accaa@roeck-us.net Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Guenter Roeck <linux@roeck-us.net>
2025-05-06nvme: fix incorrect sizeofKanchan Joshi
The plid array, head->plids, is meant to store placement IDs, each of type u16. But its size has been incorrectly calculated, as the size of the pointer is being used instead of the size of the object it points to. Use the sizeof(*head->plids) in kcalloc so that we don't allocate extra. Fixes: 38e8397dde63 ("nvme: use fdp streams if write stream is provided") Reported-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: fix write_stream_granularity initializationCaleb Sander Mateos
write_stream_granularity is set to max(info->runs, U32_MAX), which means that any RUNS value less than 2 ** 32 becomes U32_MAX, and any larger value is silently truncated to an unsigned int. Use min() instead to provide the correct semantics, capping RUNS values at U32_MAX. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer") Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250506175413.1936110-1-csander@purestorage.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: use fdp streams if write stream is providedKeith Busch
Maps a user requested write stream to an FDP placement ID if possible. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-12-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: register fdp parameters with the block layerKeith Busch
Register the device data placement limits if supported. This is just registering the limits with the block layer. Nothing beyond reporting these attributes is happening in this patch. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-11-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: pass a void pointer to nvme_get/set_features for the resultChristoph Hellwig
That allows passing in structures instead of the u32 result, and thus reduce the amount of bit shifting and masking required to parse the result. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-9-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06nvme: add a nvme_get_log_lsi helperChristoph Hellwig
For log pages that need to pass in a LSI value, while at the same time not touching all the existing nvme_get_log callers. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20250506121732.8211-8-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-04-16nvme: fixup scan failure for non-ANA multipath controllersHannes Reinecke
Commit 62baf70c3274 caused the ANA log page to be re-read, even on controllers that do not support ANA. While this should generally harmless, some controllers hang on the unsupported log page and never finish probing. Fixes: 62baf70c3274 ("nvme: re-read ANA log page after ns scan completes") Signed-off-by: Hannes Reinecke <hare@kernel.org> Tested-by: Srikanth Aithal <sraithal@amd.com> [hch: more detailed commit message] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2025-04-08nvme: re-read ANA log page after ns scan completesHannes Reinecke
When scanning for new namespaces we might have missed an ANA AEN. The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous Event Information - Notice': Asymmetric Namespace Access Change) states: A controller shall not send this even if an Attached Namespace Attribute Changed asynchronous event [...] is sent for the same event. so we need to re-read the ANA log page after we rescanned the namespace list to update the ANA states of the new namespaces. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-07nvme: requeue namespace scan on missed AENsHannes Reinecke
Scanning for namespaces can take some time, so if the target is reconfigured while the scan is running we may miss a Attached Namespace Attribute Changed AEN. Check if the NVME_AER_NOTICE_NS_CHANGED bit is set once the scan has finished, and requeue scanning to pick up any missed change. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-03Merge tag 'block-6.15-20250403' of git://git.kernel.dk/linuxLinus Torvalds
Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - PCI endpoint target cleanup (Damien) - Early import for uring_cmd fixed buffer (Caleb) - Multipath documentation and notification improvements (John) - Invalid pci sq doorbell write fix (Maurizio) - Queue init locking fix - Remove dead nsegs parameter from blk_mq_get_new_requests() * tag 'block-6.15-20250403' of git://git.kernel.dk/linux: block: don't grab elevator lock during queue initialization nvme-pci: skip nvme_write_sq_db on empty rqlist nvme-multipath: change the NVME_MULTIPATH config option nvme: update the multipath warning in nvme_init_ns_head nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io() nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request() nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer nvmet: pci-epf: Keep completion queues mapped block: remove unused nseg parameter
2025-04-01Merge tag 'mm-nonmm-stable-2025-03-30-18-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "powerpc/crash: use generic crashkernel reservation" from Sourabh Jain changes powerpc's kexec code to use more of the generic layers. - The series "get_maintainer: report subsystem status separately" from Vlastimil Babka makes some long-requested improvements to the get_maintainer output. - The series "ucount: Simplify refcounting with rcuref_t" from Sebastian Siewior cleans up and optimizing the refcounting in the ucount code. - The series "reboot: support runtime configuration of emergency hw_protection action" from Ahmad Fatoum improves the ability for a driver to perform an emergency system shutdown or reboot. - The series "Converge on using secs_to_jiffies() part two" from Easwar Hariharan performs further migrations from msecs_to_jiffies() to secs_to_jiffies(). - The series "lib/interval_tree: add some test cases and cleanup" from Wei Yang permits more userspace testing of kernel library code, adds some more tests and performs some cleanups. - The series "hung_task: Dump the blocking task stacktrace" from Masami Hiramatsu arranges for the hung_task detector to dump the stack of the blocking task and not just that of the blocked task. - The series "resource: Split and use DEFINE_RES*() macros" from Andy Shevchenko provides some cleanups to the resource definition macros. - Plus the usual shower of singleton patches - please see the individual changelogs for details. * tag 'mm-nonmm-stable-2025-03-30-18-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (77 commits) mailmap: consolidate email addresses of Alexander Sverdlin fs/procfs: fix the comment above proc_pid_wchan() relay: use kasprintf() instead of fixed buffer formatting resource: replace open coded variant of DEFINE_RES() resource: replace open coded variants of DEFINE_RES_*_NAMED() resource: replace open coded variant of DEFINE_RES_NAMED_DESC() resource: split DEFINE_RES_NAMED_DESC() out of DEFINE_RES_NAMED() samples: add hung_task detector mutex blocking sample hung_task: show the blocker task if the task is hung on mutex kexec_core: accept unaccepted kexec segments' destination addresses watchdog/perf: optimize bytes copied and remove manual NUL-termination lib/interval_tree: fix the comment of interval_tree_span_iter_next_gap() lib/interval_tree: skip the check before go to the right subtree lib/interval_tree: add test case for span iteration lib/interval_tree: add test case for interval_tree_iter_xxx() helpers lib/rbtree: add random seed lib/rbtree: split tests lib/rbtree: enable userland test suite for rbtree related data structure checkpatch: describe --min-conf-desc-length scripts/gdb/symbols: determine KASLR offset on s390 ...
2025-04-01nvme: update the multipath warning in nvme_init_ns_headJohn Meneghini
The new NVME_MULTIPATH_PARAM config option requires updates to the warning message in nvme_init_ns_head(). Signed-off-by: John Meneghini <jmeneghi@redhat.com> Tested-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-26Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - Fixes for integrity handling - NVMe pull request via Keith: - Secure concatenation for TCP transport (Hannes) - Multipath sysfs visibility (Nilay) - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li) - Correct use of 64-bit BARs for pci-epf target (Niklas) - Socket fix for selinux when used in containers (Peijie) - MD pull request via Yu: - fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai) - Series cleaning up and fixing bugs in the bad block handling code - Improve support for write failure simulation in null_blk - Various lock ordering fixes - Fixes for locking for debugfs attributes - Various ublk related fixes and improvements - Cleanups for blk-rq-qos wait handling - blk-throttle fixes - Fixes for loop dio and sync handling - Fixes and cleanups for the auto-PI code - Block side support for hardware encryption keys in blk-crypto - Various cleanups and fixes * tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) nvme-tcp: fix selinux denied when calling sock_sendmsg nvmet: pci-epf: Always configure BAR0 as 64-bit nvmet: Remove duplicate uuid_copy nvme: zns: Simplify nvme_zone_parse_entry() nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls nvmet-fc: Remove unused functions nvme-pci: remove stale comment nvme-fc: Utilise min3() to simplify queue count calculation nvme-multipath: Add visibility for queue-depth io-policy nvme-multipath: Add visibility for numa io-policy nvme-multipath: Add visibility for round-robin io-policy nvmet: add tls_concat and tls_key debugfs entries nvmet-tcp: support secure channel concatenation nvmet: Add 'sq' argument to alloc_ctrl_args nvme-fabrics: reset admin connection for secure concatenation nvme-tcp: request secure channel concatenation nvme-keyring: add nvme_tls_psk_refresh() nvme: add nvme_auth_derive_tls_psk() nvme: add nvme_auth_generate_digest() ...
2025-03-20nvme-multipath: Add visibility for round-robin io-policyNilay Shroff
This patch helps add nvme native multipath visibility for round-robin io-policy. It creates a "multipath" sysfs directory under head gendisk device node directory and then from "multipath" directory it adds a link to each namespace path device the head node refers. For instance, if we have a shared namespace accessible from two different controllers/paths then we create a soft link to each path device from head disk node as shown below: $ ls -l /sys/block/nvme1n1/multipath/ nvme1c1n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme1/nvme1c1n1 nvme1c3n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme3/nvme1c3n1 In the above example, nvme1n1 is head gendisk node created for a shared namespace and the namespace is accessible from nvme1c1n1 and nvme1c3n1 paths. For round-robin I/O policy, we could easily infer from the above output that I/O workload targeted to nvme1n1 would toggle across paths nvme1c1n1 and nvme1c3n1. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-17nvme: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies() to avoid the multiplication This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies +secs_to_jiffies (E - * \( 1000 \| MSEC_PER_SEC \) ) Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-11-a43967e36c88@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Acked-by: Keith Busch <kbusch@kernel.org> Cc: Carlos Maiolino <cem@kernel.org> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: David Sterba <dsterba@suse.com> Cc: Dick Kennedy <dick.kennedy@broadcom.com> Cc: Dongsheng Yang <dongsheng.yang@easystack.cn> Cc: Fabio Estevam <festevam@gmail.com> Cc: Frank Li <frank.li@nxp.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: James Bottomley <james.bottomley@HansenPartnership.com> Cc: James Smart <james.smart@broadcom.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jens Axboe <axboe@kernel.dk> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Brown <broonie@kernel.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Niklas Cassel <cassel@kernel.org> Cc: Oded Gabbay <ogabbay@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Sebastian Reichel <sre@kernel.org> Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com> Cc: Shawn Guo <shawnguo@kernel.org> Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com> Cc: Takashi Iwai <tiwai@suse.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-11nvme: move error logging from nvme_end_req() to __nvme_end_req()Shin'ichiro Kawasaki
Before the Commit 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions"), blk_mq_add_to_batch() did not add failed passthrough requests to batch, and returned false. After the commit, blk_mq_add_to_batch() always adds passthrough requests to batch regardless of whether the request failed or not, and returns true. This affected error logging feature in the NVME driver. Before the commit, the call chain of failed passthrough request was as follows: nvme_handle_cqe() blk_mq_add_to_batch() .. false is returned, then call nvme_pci_complete_rq() nvme_pci_complete_rq() nvme_complete_rq() nvme_end_req() nvme_log_err_passthru() .. error logging __nvme_end_req() .. end of the rqeuest After the commit, the call chain is as follows: nvme_handle_cqe() blk_mq_add_to_batch() .. true is returned, then set nvme_pci_complete_batch() .. nvme_pci_complete_batch() nvme_complete_batch() nvme_complete_batch_req() __nvme_end_req() .. end of the request, without error logging To make the error logging feature work again for passthrough requests, move the nvme_log_err_passthru() call from nvme_end_req() to __nvme_end_req(). While at it, move nvme_log_error() call for non-passthrough requests together with nvme_log_err_passthru(). Even though the trigger commit does not affect non-passthrough requests, move it together for code simplicity. Fixes: 1f47ed294a2b ("block: cleanup and fix batch completion adding conditions") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250311104359.1767728-2-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-20nvme: only allow entering LIVE from CONNECTING stateDaniel Wagner
The fabric transports and also the PCI transport are not entering the LIVE state from NEW or RESETTING. This makes the state machine more restrictive and allows to catch not supported state transitions, e.g. directly switching from RESETTING to LIVE. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-02-03Merge tag 'nvme-6.14-2025-01-31' of git://git.infradead.org/nvme into block-6.14Jens Axboe
Pull NVMe fixes from Keith: "nvme fixes for Linux 6.14 - Connection fixes for fibre channel transport (Daniel) - Endian fixes (Keith, Christoph) - Cleanup fix for host memory buffer (Francis) - Platform specific power quirks (Georg) - Target memory leak (Sagi) - Use appropriate controller state accessor (Daniel)" * tag 'nvme-6.14-2025-01-31' of git://git.infradead.org/nvme: nvme-fc: use ctrl state getter nvme: make nvme_tls_attrs_group static nvmet: add a missing endianess conversion in nvmet_execute_admin_connect nvmet: the result field in nvmet_alloc_ctrl_args is little endian nvmet: fix a memory leak in controller identify nvme-fc: do not ignore connectivity loss during connecting nvme: handle connectivity loss in nvme_set_queue_count nvme-fc: go straight to connecting state when initializing nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk nvme-pci: remove redundant dma frees in hmb nvmet: fix rw control endian access
2025-01-31block: force noio scope in blk_mq_freeze_queueChristoph Hellwig
When block drivers or the core block code perform allocations with a frozen queue, this could try to recurse into the block device to reclaim memory and deadlock. Thus all allocations done by a process that froze a queue need to be done without __GFP_IO and __GFP_FS. Instead of tying to track all of them down, force a noio scope as part of freezing the queue. Note that nvme is a bit of a mess here due to the non-owner freezes, and they will be addressed separately. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250131120352.1315351-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-23nvme: handle connectivity loss in nvme_set_queue_countDaniel Wagner
When the set feature attempts fails with any NVME status code set in nvme_set_queue_count, the function still report success. Though the numbers of queues set to 0. This is done to support controllers in degraded state (the admin queue is still up and running but no IO queues). Though there is an exception. When nvme_set_features reports an host path error, nvme_set_queue_count should propagate this error as the connectivity is lost, which means also the admin queue is not working anymore. Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-01-20Merge tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: "Not a lot in terms of features this time around, mostly just cleanups and code consolidation: - Support for PI meta data read/write via io_uring, with NVMe and SCSI covered - Cleanup the per-op structure caching, making it consistent across various command types - Consolidate the various user mapped features into a concept called regions, making the various users of that consistent - Various cleanups and fixes" * tag 'for-6.14/io_uring-20250119' of git://git.kernel.dk/linux: (56 commits) io_uring/fdinfo: fix io_uring_show_fdinfo() misuse of ->d_iname io_uring: reuse io_should_terminate_tw() for cmds io_uring: Factor out a function to parse restrictions io_uring/rsrc: require cloned buffers to share accounting contexts io_uring: simplify the SQPOLL thread check when cancelling requests io_uring: expose read/write attribute capability io_uring/rw: don't gate retry on completion context io_uring/rw: handle -EAGAIN retry at IO completion time io_uring/rw: use io_rw_recycle() from cleanup path io_uring/rsrc: simplify the bvec iter count calculation io_uring: ensure io_queue_deferred() is out-of-line io_uring/rw: always clear ->bytes_done on io_async_rw setup io_uring/rw: use NULL for rw->free_iovec assigment io_uring/rw: don't mask in f_iocb_flags io_uring/msg_ring: Drop custom destructor io_uring: Move old async data allocation helper to header io_uring/rw: Allocate async data through helper io_uring/net: Allocate msghdr async data through helper io_uring/uring_cmd: Allocate async data through generic helper io_uring/poll: Allocate apoll with generic alloc_cache helper ...
2025-01-17block: Add common atomic writes enable flagJohn Garry
Currently only stacked devices need to explicitly enable atomic writes by setting BLK_FEAT_ATOMIC_WRITES_STACKED flag. This does not work well for device mapper stacking devices, as there many sets of limits are stacked and what is the 'bottom' and 'top' device can swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set for many queue limits, which is messy. Generalize enabling atomic writes enabling by ensuring that all devices must explicitly set a flag - that includes NVMe, SCSI sd, and md raid. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13nvme: fix bogus kzalloc() return check in nvme_init_effects_log()Jens Axboe
nvme_init_effects_log() returns failure when kzalloc() is successful, which is obviously wrong and causes failures to boot. Correct the check. Fixes: d4a95adeabc6 ("nvme: Add error path for xa_store in nvme_init_effects") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-13Merge tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme into ↵Jens Axboe
for-6.14/block Pull NVMe updates from Keith: "nvme updates for Linux 6.14 - Target support for PCI-Endpoint transport (Damien) - TCP IO queue spreading fixes (Sagi, Chaitanya) - Target handling for "limited retry" flags (Guixen) - Poll type fix (Yongsoo) - Xarray storage error handling (Keisuke) - Host memory buffer free size fix on error (Francis)" * tag 'nvme-6.14-2025-01-12' of git://git.infradead.org/nvme: (25 commits) nvme-pci: use correct size to free the hmb buffer nvme: Add error path for xa_store in nvme_init_effects nvme-pci: fix comment typo Documentation: Document the NVMe PCI endpoint target driver nvmet: New NVMe PCI endpoint function target driver nvmet: Implement arbitration feature support nvmet: Implement interrupt config feature support nvmet: Implement interrupt coalescing feature support nvmet: Implement host identifier set feature support nvmet: Introduce get/set_feature controller operations nvmet: Do not require SGL for PCI target controller commands nvmet: Add support for I/O queue management admin commands nvmet: Introduce nvmet_sq_create() and nvmet_cq_create() nvmet: Introduce nvmet_req_transfer_len() nvmet: Improve nvmet_alloc_ctrl() interface and implementation nvme: Add PCI transport type nvmet: Add drvdata field to struct nvmet_ctrl nvmet: Introduce nvmet_get_cmd_effects_admin() nvmet: Export nvmet_update_cc() and nvmet_cc_xxx() helpers nvmet: Add vendor_id and subsys_vendor_id subsystem attributes ...
2025-01-12nvme: Add error path for xa_store in nvme_init_effectsKeisuke Nishimura
The xa_store() may fail due to memory allocation failure because there is no guarantee that the index NVME_CSI_NVM is already used. This fix introduces a new function to handle the error path. Fixes: cc115cbe12d9 ("nvme: always initialize known command effects") Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>