path: root/drivers
Age | Commit message | Author
2025-05-20Merge branch 'dma-mapping-for-6.16-two-step-api' of ↵Alex Williamson
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux into v6.16/vfio/next Merge two step DMA mapping API as basis for mlx5-vfio-pci uses. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2025-05-20iommu/io-pgtable-arm: Add quirk to quiet WARN_ON()Rob Clark
In situations where mapping/unmapping sequence can be controlled by userspace, attempting to map over a region that has not yet been unmapped is an error. But not something that should spam dmesg. Now that there is a quirk, we can also drop the selftest_running flag, and use the quirk instead for selftests. Acked-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Rob Clark <robdclark@chromium.org> Link: https://lore.kernel.org/r/20250519175348.11924-6-robdclark@gmail.com [will: Rename quirk to IO_PGTABLE_QUIRK_NO_WARN per Robin's suggestion] Signed-off-by: Will Deacon <will@kernel.org>
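The effect of such a quirk can be illustrated in user space. The following is only a sketch under stated assumptions: `struct toy_pgtable`, `toy_map()` and the bit value chosen for `QUIRK_NO_WARN` are hypothetical stand-ins, and a counter takes the place of WARN_ON(). The point is that mapping over a still-mapped slot remains an error either way; the quirk only controls whether it is reported loudly.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 8
/* Hypothetical bit value; stands in for IO_PGTABLE_QUIRK_NO_WARN. */
#define QUIRK_NO_WARN (1UL << 0)

/* Toy page table: one flag per slot instead of real page table entries. */
struct toy_pgtable {
	unsigned long quirks;
	bool mapped[NR_SLOTS];
	unsigned int warnings;	/* counts where a WARN_ON() would have fired */
};

/* Map one slot; mapping over an existing mapping always fails. */
int toy_map(struct toy_pgtable *pt, size_t slot)
{
	if (slot >= NR_SLOTS)
		return -EINVAL;

	if (pt->mapped[slot]) {
		/* Only complain loudly when the quirk is not set. */
		if (!(pt->quirks & QUIRK_NO_WARN))
			pt->warnings++;
		return -EEXIST;
	}

	pt->mapped[slot] = true;
	return 0;
}
```

A caller whose map/unmap sequence is driven by userspace would set the quirk and handle the returned error itself, while other callers keep the loud report.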
2025-05-20octeontx2-pf: Add tracepoint for NIX_PARSE_SSubbaraya Sundeep
The NIX_PARSE_S structure populated by hardware in the NIX RX CQE has parsing information for the received packet. A tracepoint to dump all the words of NIX_PARSE_S is helpful in debugging the packet parser. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Link: https://patch.msgid.link/1747331048-15347-1-git-send-email-sbhatta@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-20net: phy: make mdio consumer / device layer a separate moduleHeiner Kallweit
After having factored out the provider part from mdio_bus.c, we can make the mdio consumer / device layer a separate module. This also allows to remove Kconfig symbol MDIO_DEVICE. The module init / exit functions from mdio_bus.c no longer have to be called from phy_device.c. The link order defined in drivers/net/phy/Makefile ensures that init / exit functions are called in the right order. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/dba6b156-5748-44ce-b5e2-e8dc2fcee5a7@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-20platform/x86: think-lmi: Fix attribute name usage for non-compliant itemsMark Pearson
A few, quite rare, WMI attributes have names that are not compatible with filenames, e.g. "Intel VT for Directed I/O (VT-d)". For these cases the '/' gets replaced with '\' for display, but doesn't get switched back when doing the WMI access. Fix this by keeping the original attribute name and using that for sending commands to the BIOS. Fixes: a40cd7ef22fb ("platform/x86: think-lmi: Add WMI interface support on Lenovo platforms") Signed-off-by: Mark Pearson <mpearson-lenovo@squebb.ca> Link: https://lore.kernel.org/r/20250520005027.3840705-1-mpearson-lenovo@squebb.ca Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
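The idea of keeping two names per attribute can be sketched in plain C. This is not the driver's code: `struct attr_names`, `attr_set_name()` and `NAME_MAX_LEN` are hypothetical names chosen for illustration. The sketch stores the original name untouched for firmware access and derives a filename-safe copy for display.

```c
#include <stddef.h>
#include <stdio.h>

#define NAME_MAX_LEN 64	/* arbitrary size for the sketch */

/* Hypothetical stand-in for a firmware attribute entry. */
struct attr_names {
	char name[NAME_MAX_LEN];		/* original name, sent to firmware */
	char display_name[NAME_MAX_LEN];	/* filename-safe copy, shown to users */
};

/* Record the original name and derive the display name from it. */
void attr_set_name(struct attr_names *attr, const char *orig)
{
	size_t i;

	/* Keep the original verbatim (truncated to the buffer size). */
	snprintf(attr->name, sizeof(attr->name), "%s", orig);

	/* '/' cannot appear in a filename, so swap it only in the display copy. */
	for (i = 0; attr->name[i] != '\0'; i++)
		attr->display_name[i] = (attr->name[i] == '/') ? '\\' : attr->name[i];
	attr->display_name[i] = '\0';
}
```

With both fields available, code that talks to the firmware reads `name`, and code that creates files or prints labels reads `display_name`, so no reverse substitution is needed.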
2025-05-20platform/x86: thinkpad_acpi: Ignore battery threshold change event notificationMark Pearson
If the user modifies the battery charge threshold, an ACPI event is generated. The Lenovo FW team confirmed that this event is only generated on a user action. As no action is needed, ignore the event and prevent spurious kernel logs. Reported-by: Derek Barbosa <debarbos@redhat.com> Closes: https://lore.kernel.org/platform-driver-x86/7e9a1c47-5d9c-4978-af20-3949d53fb5dc@app.fastmail.com/T/#m5f5b9ae31d3fbf30d7d9a9d76c15fb3502dfd903 Signed-off-by: Mark Pearson <mpearson-lenovo@squebb.ca> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Armin Wolf <W_Armin@gmx.de> Link: https://lore.kernel.org/r/20250517023348.2962591-1-mpearson-lenovo@squebb.ca Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-05-20spi: sh-msiof: Transfer size improvements and I2SMark Brown
Merge series from Geert Uytterhoeven <geert+renesas@glider.be>: This patch series (A) improves single transfer sizes in the MSIOF driver, using two methods: - By increasing the assumed FIFO sizes, impacting both PIO and DMA transfers, - By using two groups, impacting DMA transfers, and (B) lets the recently-introduced MSIOF I2S driver reuse the SPI driver's register definitions. All of this is covered with a thick sauce of fixes for (harmless) bugs, cleanups, and refactorings. Note that the driver uses the limitations as specified in the hardware documentation. For discovering the actual FIFO sizes, I wrote some crude test code that can be found at [2]. This is based on spi/for-next and sound-asoc/for-next, and has been tested on a variety of R-Car SoCs. [1] https://lore.kernel.org/cover.1746180072.git.geert+renesas@glider.be [2] https://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git/log/?h=topic/msiof-fifo
2025-05-20Add sound card support for QCS9100 and QCS9075Mark Brown
Merge series from Mohammad Rafi Shaik <mohammad.rafi.shaik@oss.qualcomm.com>: This patchset adds support for sound card on Qualcomm QCS9100 and QCS9075 boards.
2025-05-20regmap: Move selecting for REGMAP_MDIO and REGMAP_IRQAndrew Davis
If either REGMAP_IRQ or REGMAP_MDIO are set then REGMAP is also set. This then enables the selecting of IRQ_DOMAIN or MDIO_BUS from REGMAP based on the above two symbols respectively. This makes it very easy to end up with "circular dependencies". Instead select the IRQ_DOMAIN or MDIO_BUS from the symbols that make use of them. This is almost equivalent to before but makes it less likely to end up with false circular dependency detections. Signed-off-by: Andrew Davis <afd@ti.com> Reported-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Closes: https://lore.kernel.org/r/bfe991fa-f54c-4d58-b2e0-34c4e4eb48f4@linaro.org/ Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://patch.msgid.link/20250516141722.13772-1-afd@ti.com Signed-off-by: Mark Brown <broonie@kernel.org>
2025-05-20i2c: core: add useful info when defer probeXu Yang
Add useful info when getting the irq/wakeirq fails due to -EPROBE_DEFER. Before: [ 15.737361] i2c 2-0050: deferred probe pending: (reason unknown) After: [ 15.816295] i2c 2-0050: deferred probe pending: tcpci: can't get irq Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Reviewed-by: Carlos Song <carlos.song@nxp.com> Reviewed-by: Frank Li <Frank.Li@nxp.com> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
2025-05-20ata: libata-eh: Keep DIPM disabled while modifying the allowed LPM statesNiklas Cassel
Currently, it is possible that LPM is enabled while calling the set_lpm() callback. The current code performs a SET FEATURES command to disable DIPM if policy < ATA_LPM_MED_POWER_WITH_DIPM. This means that it will currently disable DIPM for policies: ATA_LPM_UNKNOWN, ATA_LPM_MAX_POWER, ATA_LPM_MED_POWER (but not for policy ATA_LPM_MED_POWER_WITH_DIPM). The code called after calling the set_lpm() callback will later perform a SET FEATURES command to enable DIPM, if policy >= ATA_LPM_MED_POWER_WITH_DIPM. As we can see, DIPM will not be disabled before calling set_lpm() if the LPM policy is: ATA_LPM_MED_POWER_WITH_DIPM, ATA_LPM_MIN_POWER_WITH_PARTIAL, or ATA_LPM_MIN_POWER. Make sure that we always disable DIPM before calling the set_lpm() callback. This is because the set_lpm() callback is the function (for AHCI) that sets the proper bits in PxSCTL.IPM, reflecting the support of the HBA. PxSCTL.IPM controls the LPM states that the device is allowed to enter. If the device tries to enter a state disabled by PxSCTL.IPM, the host will NAK the transition. If we do not disable DIPM before modifying PxSCTL.IPM, it is possible that the device will try (and will be allowed) to enter an LPM state that the HBA does not support (since we have not yet written PxSCTL.IPM, the HBA wasn't able to NAK the transition). While at it, remove the guard of host support for DIPM around the disabling of DIPM. While it makes sense to take host support for DIPM into account when enabling DIPM, it makes zero sense to take host support into account when disabling DIPM. If the host does not support DIPM, that is an even bigger reason why DIPM should be disabled on the device side. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
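The policy comparisons in this message can be made concrete with a small sketch. It is not libata code: the enum below only mirrors the ordering of the policies as listed in the message, and the three predicate functions are hypothetical names. In the real code, enabling DIPM additionally depends on host and device support, which this sketch leaves out.

```c
#include <stdbool.h>

/* Same ordering as the ATA_LPM_* policies named in the commit message. */
enum lpm_policy {
	LPM_UNKNOWN,
	LPM_MAX_POWER,
	LPM_MED_POWER,
	LPM_MED_POWER_WITH_DIPM,
	LPM_MIN_POWER_WITH_PARTIAL,
	LPM_MIN_POWER,
};

/* Old behaviour: DIPM was disabled up front only for the low policies. */
bool old_disables_dipm_before_set_lpm(enum lpm_policy policy)
{
	return policy < LPM_MED_POWER_WITH_DIPM;
}

/* New behaviour: DIPM is disabled before set_lpm() for every policy. */
bool new_disables_dipm_before_set_lpm(enum lpm_policy policy)
{
	(void)policy;
	return true;
}

/* Unchanged: DIPM is (re-)enabled afterwards only for these policies. */
bool enables_dipm_after_set_lpm(enum lpm_policy policy)
{
	return policy >= LPM_MED_POWER_WITH_DIPM;
}
```

Comparing the first two predicates for `LPM_MIN_POWER` shows the window the commit closes: under the old logic DIPM could stay active while the allowed LPM states were being rewritten.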
2025-05-20ata: libata-eh: Rename no_dipm variable to be more clearNiklas Cassel
Rename the no_dipm variable to host_has_dipm, inverting the expression and also giving it a clearer name. No functional change. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20ata: libata-eh: Rename hipm and dipm variablesNiklas Cassel
Rename the hipm and dipm variables to have a clearer name. Also fold in the usage of no_dipm, as that is required in order to give the dipm variable a more descriptive name. No functional change. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20ata: libata-eh: Add ata_eh_set_lpm() WARN_ON_ONCENiklas Cassel
link->lpm_policy is initialized to ATA_LPM_UNKNOWN in ata_eh_reset(). ata_eh_set_lpm() is then only called if link->lpm_policy != ap->target_lpm_policy (after reset) and then only if link->lpm_policy > ATA_LPM_MAX_POWER (before revalidation). This means that ata_eh_set_lpm() is currently never called with policy == ATA_LPM_UNKNOWN. Add a WARN_ON_ONCE so that it is more obvious from reading the code that this function is never called with policy == ATA_LPM_UNKNOWN. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20ata: libata-eh: Update DIPM comments to reflect realityNiklas Cassel
The comments describing which LPM policies have DIPM enabled predate the introduction of the LPM policies ATA_LPM_MIN_POWER_WITH_PARTIAL and ATA_LPM_MED_POWER_WITH_DIPM. Update the DIPM comments to reflect reality. Also remove the sentence that claims that "Order device and link configurations such that the host always allows DIPM requests." This comment was written before 24e0e61db3cb ("ata: libata: disallow dev-initiated LPM transitions to unsupported states"). Even though the set_lpm() call is done before enabling DIPM, the host will not always allow DIPM requests. For all LPM policies where DIPM is enabled, only DIPM requests to LPM states that are supported by the HBA will be allowed. Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
2025-05-20PCI: Remove hybrid-devres usage warnings from kernel-docPhilipp Stanner
pci/iomap.c still contains warnings about those functions not behaving in a managed manner if pcim_enable_device() was called. Since all hybrid behavior that users could know about has been removed by now, those explicit warnings are no longer necessary. Remove the hybrid-devres usage warnings from the kernel-doc. Signed-off-by: Philipp Stanner <phasta@kernel.org> [kwilczynski: commit log] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/r/20250519112959.25487-8-phasta@kernel.org
2025-05-20PCI: Remove redundant set of request functionsPhilipp Stanner
When the demangling of the hybrid devres functions within PCI was implemented, it was necessary to implement several PCI functions a second time to avoid cyclic calls, since the hybrid functions in pci.c call the managed functions in devres.c, which in turn can be directly used outside of PCI and needed request infrastructure, too. Therefore, __pcim_request_region_range(), __pcim_release_region_range() and wrappers around them were implemented. The hybrid nature has recently been removed from all functions in pci.c. Therefore, the functions in devres.c can now directly use their counterparts in pci.c without causing a call-cycle. Remove __pcim_request_region_range(), __pcim_release_region_range() and the wrappers. Use the corresponding request functions from pci.c in devres.c. Signed-off-by: Philipp Stanner <phasta@kernel.org> Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/r/20250519112959.25487-7-phasta@kernel.org
2025-05-20PCI: Remove exclusive requests flags from _pcim_request_region()Philipp Stanner
pcim_request_region_exclusive(), the only user in PCI devres that needed exclusive region requests, has been removed. All features related to exclusive requests can, therefore, be removed, too. Remove them. Signed-off-by: Philipp Stanner <phasta@kernel.org> [kwilczynski: commit log] Signed-off-by: Krzysztof Wilczyński <kwilczynski@kernel.org> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250519112959.25487-6-phasta@kernel.org
2025-05-20gpiolib: remove unneeded #ifdefBartosz Golaszewski
We are already within another `#ifdef CONFIG_GPIOLIB_IRQCHIP` in gpiochip_to_irq() so there's no need for another guard. Remove it. Acked-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20250519-gpio-irq-kconfig-fixes-v1-3-fe6ba1c6116d@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-05-20gpio: mpc8xxx: select GPIOLIB_IRQCHIPBartosz Golaszewski
This driver uses gpiochip_irq_reqres() and gpiochip_irq_relres() which are only built with GPIOLIB_IRQCHIP=y. Add the missing Kconfig select. Fixes: 7688a54d5b53 ("gpio: mpc8xxx: Make irq_chip immutable") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505180309.1nosQMkI-lkp@intel.com/ Acked-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20250519-gpio-irq-kconfig-fixes-v1-2-fe6ba1c6116d@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-05-20gpio: pxa: select GPIOLIB_IRQCHIPBartosz Golaszewski
This driver uses gpiochip_irq_reqres() and gpiochip_irq_relres() which are only built with GPIOLIB_IRQCHIP=y. Add the missing Kconfig select. Fixes: 20117cf426b6 ("gpio: pxa: Make irq_chip immutable") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505181429.mzyIatOU-lkp@intel.com/ Acked-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20250519-gpio-irq-kconfig-fixes-v1-1-fe6ba1c6116d@linaro.org Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2025-05-20cpufreq: scmi: Skip SCMI devices that aren't used by the CPUsMike Tipton
Currently, all SCMI devices with performance domains attempt to register a cpufreq driver, even if their performance domains aren't used to control the CPUs. The cpufreq framework only supports registering a single driver, so only the first device will succeed. And if that device isn't used for the CPUs, then cpufreq will scale the wrong domains. To avoid this, return early from scmi_cpufreq_probe() if the probing SCMI device isn't referenced by the CPU device phandles. This keeps the existing assumption that all CPUs are controlled by a single SCMI device. Signed-off-by: Mike Tipton <quic_mdtipton@quicinc.com> Reviewed-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Cristian Marussi <cristian.marussi@arm.com> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
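The early-return check can be sketched outside the kernel. In this illustration, devices and the references held by CPU nodes are reduced to plain integers, and `dev_used_by_cpus()` and `perf_probe()` are hypothetical names, not the driver's functions. The choice of -ENODEV as the error code is likewise an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: each CPU carries one integer naming the device that
 * provides its performance domain (standing in for a phandle reference).
 */
bool dev_used_by_cpus(int dev_handle, const int *cpu_refs, size_t nr_cpus)
{
	for (size_t i = 0; i < nr_cpus; i++) {
		if (cpu_refs[i] == dev_handle)
			return true;
	}
	return false;
}

/* Probe only registers a driver for a device that the CPUs reference. */
int perf_probe(int dev_handle, const int *cpu_refs, size_t nr_cpus)
{
	if (!dev_used_by_cpus(dev_handle, cpu_refs, nr_cpus))
		return -ENODEV;	/* leave the single driver slot to the right device */

	/* ... driver registration would happen here ... */
	return 0;
}
```

Because only one driver can be registered, rejecting unrelated devices early means probe order no longer decides which device ends up scaling the CPUs.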
2025-05-20Merge branch 'rust/cpufreq-dt' into cpufreq/arm/linux-nextViresh Kumar
2025-05-20cpufreq: Add Rust-based cpufreq-dt driverViresh Kumar
Introduce a Rust-based implementation of the cpufreq-dt driver, covering most of the functionality provided by the existing C version. Some features, such as retrieving platform data from `cpufreq-dt-platdev.c`, are still pending. The driver has been tested with QEMU, and frequency scaling works as expected. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
2025-05-19hwmon: (isl28022) Fix current reading calculationYikai Tsai
According to the ISL28022 datasheet, bit 15 of the current register represents -32768. Fix the calculation to properly handle this bit, ensuring correct measurements for negative values. Signed-off-by: Yikai Tsai <yikai.tsai.wiwynn@gmail.com> Link: https://lore.kernel.org/r/20250519084055.3787-2-yikai.tsai.wiwynn@gmail.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
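The sign handling described here can be illustrated in plain C. This is a minimal user-space sketch rather than the driver's code; the function name `isl28022_raw_to_signed()` is hypothetical, and scaling to real units is left out. It only shows how a 16-bit register value whose bit 15 carries a weight of -32768 maps to a signed number.

```c
#include <stdint.h>

/*
 * Hypothetical helper: interpret a raw 16-bit register value as two's
 * complement, where bit 15 has weight -32768 and bits 14..0 keep their
 * usual positive weights.
 */
int32_t isl28022_raw_to_signed(uint16_t raw)
{
	int32_t value = raw & 0x7fff;	/* bits 14..0 */

	if (raw & 0x8000)		/* bit 15 contributes -32768 */
		value -= 32768;

	return value;
}
```

For instance, a raw value of 0xffff yields 32767 - 32768 = -1, whereas treating the register as unsigned would report 65535.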
2025-05-20nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_diskNilay Shroff
In the NVMe context, the term "shutdown" has a specific technical meaning. To avoid confusion, this commit renames the nvme_mpath_shutdown_disk function to nvme_mpath_remove_disk to better reflect its purpose (i.e. removing the disk from the system). However, nvme_mpath_remove_disk was already in use, and its functionality is related to releasing or putting the head node disk. To resolve this naming conflict and improve clarity, the existing nvme_mpath_remove_disk function is also renamed to nvme_mpath_put_disk. This renaming improves code readability and better aligns function names with their actual roles. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme: introduce multipath_always_on module paramNilay Shroff
Currently, a multipath head disk node is not created for single-ported NVMe adapters or private namespaces with non-unique NSID. However, creating a head node in these cases can help transparently handle transient PCIe link failures. Without a head node, features like delayed removal cannot be leveraged, making it difficult to tolerate such link failures. To address this, this commit introduces the nvme_core module parameter multipath_always_on. When multipath_always_on is set to true, it forces the creation of a multipath head node regardless of the NVMe disk or namespace type. This option allows the delayed removal of the head node to be used even for single-ported NVMe disks and private namespaces with a unique NSID, and thus helps transparently handle transient PCIe link failures. By default multipath_always_on is set to false, thus preserving the existing behavior. Setting it to true enables improved fault tolerance in PCIe setups. Moreover, please note that enabling this option would also implicitly enable nvme_core.multipath. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-multipath: introduce delayed removal of the multipath head nodeNilay Shroff
Currently, the multipath head node of an NVMe disk is removed as soon as all paths of the disk are removed. However, this can cause issues in scenarios such as: - Disk hot-removal followed by re-addition. - Transient PCIe link failures that trigger re-enumeration, temporarily removing and then restoring the disk. In these cases, removing the head node prematurely may lead to a head disk node name change upon re-addition, requiring applications to reopen their handles if they were performing I/O during the failure. To address this, introduce a delayed removal mechanism for the head disk node. During a transient failure, instead of removing the head disk node immediately, the system waits for a configurable timeout, allowing the disk to recover. If an application sends any I/O during the transient failure, it is queued instead of being failed immediately. If the disk comes back online within the timeout, the queued I/Os are resubmitted to the disk, ensuring seamless operation. If the disk cannot recover from the failure, the queued I/Os are failed and the application receives the error. This way, if the disk comes back online within the configured period, the head node remains unchanged, ensuring uninterrupted workloads without requiring applications to reopen device handles. A new sysfs attribute named "delayed_removal_secs" is added under the head disk blkdev for users who wish to configure the time for the delayed removal of the head disk node. The default value of this attribute is zero seconds, ensuring no behavior change unless explicitly configured.
Link: https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/ Link: https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/ Suggested-by: Keith Busch <kbusch@kernel.org> Suggested-by: Christoph Hellwig <hch@infradead.org> [nilay: reworked based on the original idea/POC from Christoph and Keith] Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
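The delayed-removal behaviour described in this commit can be modelled as a small state machine. The sketch below is a user-space illustration only: apart from the attribute name `delayed_removal_secs`, every type and function name is hypothetical, time advances through an explicit one-second tick, and queued I/O is represented by a counter rather than real requests.

```c
/* Hypothetical states of a multipath head node in this model. */
enum head_state { HEAD_LIVE, HEAD_WAITING, HEAD_REMOVED };
enum io_result { IO_SUBMITTED, IO_QUEUED, IO_FAILED };

struct head_node {
	enum head_state state;
	unsigned int delayed_removal_secs;	/* 0 means remove immediately */
	unsigned int waited_secs;
	unsigned int queued_ios;
};

/* The last path to the disk went away. */
void head_last_path_removed(struct head_node *h)
{
	if (h->delayed_removal_secs == 0) {
		h->state = HEAD_REMOVED;	/* default: behaviour unchanged */
		return;
	}
	h->state = HEAD_WAITING;
	h->waited_secs = 0;
}

/* An application submits I/O to the head node. */
enum io_result head_submit_io(struct head_node *h)
{
	switch (h->state) {
	case HEAD_LIVE:
		return IO_SUBMITTED;
	case HEAD_WAITING:
		h->queued_ios++;		/* hold it instead of failing it */
		return IO_QUEUED;
	default:
		return IO_FAILED;
	}
}

/* A path came back; returns how many queued I/Os get resubmitted. */
unsigned int head_path_restored(struct head_node *h)
{
	unsigned int resubmitted = h->queued_ios;

	if (h->state != HEAD_WAITING)
		return 0;
	h->state = HEAD_LIVE;
	h->queued_ios = 0;
	return resubmitted;
}

/* One second passed; returns how many queued I/Os fail on timeout. */
unsigned int head_tick(struct head_node *h)
{
	unsigned int failed;

	if (h->state != HEAD_WAITING)
		return 0;
	if (++h->waited_secs < h->delayed_removal_secs)
		return 0;
	failed = h->queued_ios;
	h->queued_ios = 0;
	h->state = HEAD_REMOVED;
	return failed;
}
```

In this model a head node with a two-second window that loses its last path, receives one I/O, and regains a path after one tick ends up live again with that I/O resubmitted, so the application never sees an error or a renamed device.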
2025-05-20nvme-pci: derive and better document max segments limitsChristoph Hellwig
Redefine the max segments and max integrity limits based on the limiting factors. This keeps exactly the same values for 4k PAGE_SIZE systems, but increases the number of segments for larger page size as it properly derives the scatterlist allocation based limit for them instead of assuming a 4k PAGE_SIZE. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org>
2025-05-20nvme-pci: use struct_size for allocation struct nvme_devChristoph Hellwig
This avoids open coding the variable size array arithmetics. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
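What a struct_size()-style helper computes can be shown with a user-space analogue. The structure and the function `dev_alloc_size()` below are hypothetical and are not the layout of struct nvme_dev; the sketch only demonstrates the arithmetic: header size plus element count times element size, saturating at SIZE_MAX so that an overflowing request makes the allocation fail rather than come out too small.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure ending in a flexible array member. */
struct dev_with_pools {
	unsigned int nr_entries;
	void *pools[];		/* sized at allocation time */
};

/*
 * Size needed for the structure plus 'count' trailing entries.
 * Saturates to SIZE_MAX instead of wrapping around on overflow.
 */
size_t dev_alloc_size(size_t count)
{
	const size_t base = sizeof(struct dev_with_pools);
	const size_t elem = sizeof(void *);

	if (count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;

	return base + count * elem;
}
```

Open-coded versions of this arithmetic tend to omit the overflow check, which is the main reason to route the calculation through a single helper.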
2025-05-20nvme-pci: add a symbolic name for the small pool sizeLeon Romanovsky
Open coding magic numbers in multiple places is never a good idea. Signed-off-by: Leon Romanovsky <leon@kernel.org> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
2025-05-20nvme-pci: use a better encoding for small prp pool allocationsChristoph Hellwig
Add a separate flag to encode that the transfer is using the small page sized pool, and use a normal 0..n count for the number of descriptors. Contains improvements and suggestions from Kanchan Joshi <joshi.k@samsung.com> and Leon Romanovsky <leon@kernel.org>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20nvme-pci: rename the descriptor poolsChristoph Hellwig
They are used for both PRPs and SGLs, and we use descriptor elsewhere when referring to their allocations, so use that name here as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20nvme-pci: remove struct nvme_descriptorChristoph Hellwig
There is no real point in having a union of two pointer types here, just use a void pointer as we mix and match types between the arms of the union between the allocation and freeing side already. Also rename the nr_allocations field to nr_descriptors to better describe what it does. Signed-off-by: Christoph Hellwig <hch@lst.de> [leon: ported forward to include metadata SGL support] Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2025-05-20nvme-pci: store aborted state in flags variableLeon Romanovsky
Instead of keeping a dedicated "bool aborted" variable, switch to a flags variable that can be used for other flags as well. Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
2025-05-20nvme-pci: don't try to use SGLs for metadata on the admin queueChristoph Hellwig
No admin command defined in an NVMe specification supports metadata, but to protect against vendor specific commands using metadata ensure that we don't try to use SGLs for metadata on the admin queue, as NVMe does not support SGLs on the admin queue for the PCI transport. Do this by checking if the data transfer has been setup using SGLs as that is required for using SGLs for metadata. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Leon Romanovsky <leon@kernel.org>
2025-05-20nvme-pci: make PRP list DMA pools per-NUMA-nodeCaleb Sander Mateos
NVMe commands with over 8 KB of discontiguous data allocate PRP list pages from the per-nvme_device dma_pool prp_page_pool or prp_small_pool. Each call to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock. These device-global spinlocks are a significant source of contention when many CPUs are submitting to the same NVMe devices. On a workload issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes to 23 NVMe devices, we observed 2.4% of CPU time spent in _raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free. Ideally, the dma_pools would be per-hctx to minimize contention. But that could impose considerable resource costs in a system with many NVMe devices and CPUs. As a compromise, allocate per-NUMA-node PRP list DMA pools. Map each nvme_queue to the set of DMA pools corresponding to its device and its hctx's NUMA node. This reduces the _raw_spin_lock_irqsave overhead by about half, to 1.2%. Preventing the sharing of PRP list pages across NUMA nodes also makes them cheaper to initialize. Link: https://lore.kernel.org/linux-nvme/CADUfDZqa=OOTtTTznXRDmBQo1WrFcDw1hBA7XwM7hzJ-hpckcA@mail.gmail.com/T/#u Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
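The per-node mapping can be sketched as follows. This is an illustration under stated assumptions, not the driver's data structures: `MAX_NODES`, `struct desc_pool`, `struct device_pools`, `struct queue` and both functions are hypothetical, and the pools are allocated eagerly here for simplicity. The point is that a queue resolves its pool once from its NUMA node, so queues on different nodes never contend on the same pool lock.

```c
#include <stddef.h>

#define MAX_NODES 4	/* assumption for the sketch */

/* Stand-in for one DMA pool; a real pool carries its own spinlock. */
struct desc_pool {
	int node;		/* NUMA node this pool serves */
	unsigned long allocs;	/* placeholder for pool state */
};

/* One set of pools per device, indexed by NUMA node. */
struct device_pools {
	struct desc_pool per_node[MAX_NODES];
};

struct queue {
	struct desc_pool *pool;	/* resolved once when the queue is set up */
};

void device_pools_init(struct device_pools *dev)
{
	for (int node = 0; node < MAX_NODES; node++) {
		dev->per_node[node].node = node;
		dev->per_node[node].allocs = 0;
	}
}

/* Bind a queue to the pool of the node its hardware context belongs to. */
int queue_bind_pool(struct queue *q, struct device_pools *dev, int node)
{
	if (node < 0 || node >= MAX_NODES)
		return -1;

	q->pool = &dev->per_node[node];
	return 0;
}
```

Compared with one pool per hardware queue, this keeps the number of pools bounded by the node count per device, which is the resource trade-off the message describes.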
2025-05-20nvme-pci: factor out a nvme_init_hctx_common() helperCaleb Sander Mateos
nvme_init_hctx() and nvme_admin_init_hctx() are very similar. In preparation for adding more logic, factor out a nvme_init_hctx_common() helper. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvme-fc: do not reference lsrsp after failureDaniel Wagner
The lsrsp object is maintained by the LLDD. The lifetime of the lsrsp object is implicit. Because there is no explicit cleanup/free call into the LLDD, it is not safe to assume that the lsrsp is still valid after xmt_ls_rsp() fails. The LLDD could have freed the object already. With the recent changes in how fcloop tracks the resources, this is the case. Thus don't access lsrsp after xmt_ls_rsp() fails. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: don't wait for lport cleanupDaniel Wagner
The lifetime of the fcloop_lsreq is not tied to the lifetime of the host or target port, thus there is no longer any need to synchronize the cleanup path. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: add missing fcloop_callback_host_doneDaniel Wagner
Add the missing fcloop_call_host_done calls so that the caller frees resources when something goes wrong. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fc: take tgtport refs for portentryDaniel Wagner
Ensure that the tgtport does not go away as long as portentry holds a pointer to it. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fc: free pending reqs on tgtport unregisterDaniel Wagner
When nvmet_fc_unregister_targetport is called by the LLDD, it is no longer possible to communicate with the host, so the pending requests will not be processed. Thus explicitly free them. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: drop response if targetport is goneDaniel Wagner
When the target port is gone, the lsrsp pointer is invalid. Thus don't call the done function anymore; instead, just drop the response. This happens when the target sends a disconnect association. After this the target starts tearing down all resources and doesn't expect any response. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: allocate/free fcloop_lsreq directlyDaniel Wagner
fcloop depends on the host or the target to allocate the fcloop_lsreq object. This means that the lifetime of the fcloop_lsreq is tied to either the host or the target. Consequently, the host or the target must cooperate during shutdown. Unfortunately, this approach does not work well when the target forces a shutdown, as there are dependencies that are difficult to resolve in a clean way. The simplest solution is to decouple the lifetime of the fcloop_lsreq object by managing them directly within fcloop. Since this is not a performance-critical path and only a small number of LS objects are used during setup and cleanup, it does not significantly impact performance to allocate them during normal operation. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: prevent double port deletionDaniel Wagner
The delete callback can be called either via the unregister function or from the transport directly. Thus it is necessary to ensure that resources are not freed multiple times. Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
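One common way to make a teardown path safe against being entered twice is a test-and-set guard. The sketch below uses a C11 `atomic_flag` for this; that technique is an assumption chosen for illustration and is not necessarily how fcloop solves it, and `struct port`, `port_init()` and `port_delete()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical port object with a one-shot deletion guard. */
struct port {
	atomic_flag deleted;
	int frees;		/* counts how often resources were released */
};

void port_init(struct port *p)
{
	atomic_flag_clear(&p->deleted);
	p->frees = 0;
}

/*
 * May be reached from more than one path. Only the first caller releases
 * the resources; later callers see the flag already set and back off.
 */
bool port_delete(struct port *p)
{
	if (atomic_flag_test_and_set(&p->deleted))
		return false;

	p->frees++;		/* stand-in for the actual cleanup */
	return true;
}
```

Whichever path reaches `port_delete()` first performs the cleanup, and the return value tells a later caller that nothing is left to do.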
2025-05-20nvmet-fcloop: access fcpreq only when holding reqlockDaniel Wagner
The abort handling logic expects that the state and the fcpreq are only accessed when holding the reqlock lock. While at it, only handle the aborts in the abort handler. Signed-off-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: update refs on tfcp_reqDaniel Wagner
Track the lifetime of the in-flight tfcp_req to ensure the object is not freed too early. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: refactor fcloop_delete_local_portDaniel Wagner
Use the newly introduced fcloop_lport_lookup instead of the open coded version. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-20nvmet-fcloop: refactor fcloop_nport_alloc and track lportDaniel Wagner
The checks for valid input values are mixed with the logic to insert a newly allocated nport. Refactor the function so that the checks are done first. This allows the setup steps to be untangled into a more linear form, which reduces the complexity of the functions. Also start tracking the lport when an lport is assigned to an nport. This ensures that the lport does not go away as long as it is still referenced by an nport. Signed-off-by: Daniel Wagner <wagi@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>