summaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)Author
2024-06-28drm/xe/mocs: Update MOCS assertions and remove redundant checksMatt Roper
Rely more heavily on assertions to describe the MOCS programming invariants. CI checks these assertions and will ensure no violations sneak in due to programmer error, so we can remove some of the redundant WARN and silent return checks from non-debug builds. Also tweak/augment some of the existing assertions: there's no reason we'd ever want a platform not to have a MOCS 'ops' structure hooked up so ensure info->ops is non-NULL. Likewise, we should never have a case where the bspec-defined MOCS setting table is larger than the number of MOCS registers exposed by the hardware, so add an extra assert on those sizes as well. Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240627203741.2042752-3-matthew.d.roper@intel.com
2024-06-28ice: do not init struct ice_adapter more times than neededPrzemek Kitszel
Allocate and initialize struct ice_adapter object only once per physical card instead of once per port. This is not a big deal by now, but we want to extend this struct more and more in the near future. Our plans include PTP stuff and a devlink instance representing whole-device/physical card. Transactions requiring to be sleep-able (like those doing user (here ice) memory allocation) must be performed with an additional (on top of xarray) mutex. Adding it here removes need to xa_lock() manually. Since this commit is a reimplementation of ice_adapter_get(), a rather new scoped_guard() wrapper for locking is used to simplify the logic. It's worth to mention that xa_insert() use gives us both slot reservation and checks if it is already filled, what simplifies code a tiny bit. Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-06-28hwmon: iio: Add labels from IIO channelsSean Anderson
Add labels from IIO channels to our channels. This allows userspace to display more meaningful names instead of "in0" or "temp5". Although lm-sensors gracefully handles errors when reading channel labels, the ABI says the label attribute > Should only be created if the driver has hints about what this voltage > channel is being used for, and user-space doesn't. Therefore, we test to see if the channel has a label before creating the attribute. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Acked-by: Guenter Roeck <linux@roeck-us.net> Link: https://patch.msgid.link/20240624174601.1527244-3-sean.anderson@linux.dev Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-06-28iio: Add iio_read_channel_label to inkern APISean Anderson
It can be convenient for other in-kernel drivers to reuse IIO channel labels. Export the iio_read_channel_label function to allow this. The signature is different depending on where we are calling it from, so the meat is moved to do_iio_read_channel_label. Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://patch.msgid.link/20240624174601.1527244-2-sean.anderson@linux.dev Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-06-28iio: adc: ltc2309: Fix endian type passed to be16_to_cpu()Jonathan Cameron
Picked up by sparse. Cc: Liam Beguin <liambeguin@gmail.com> Reviewed-by: Liam Beguin <liambeguin@gmail.com> Link: https://patch.msgid.link/20240624193210.347434-1-jic23@kernel.org Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-06-28cpuidle: teo: Remove recent intercepts metricChristian Loehle
The logic for recent intercepts didn't work, there is an underflow of the 'recent' value that can be observed during boot already, which teo usually doesn't recover from, making the entire logic pointless. Furthermore the recent intercepts also were never reset, thus not actually being very 'recent'. Having underflowed 'recent' values lead to teo always acting as if we were in a scenario were expected sleep length based on timers is too high and it therefore unnecessarily selecting shallower states. Experiments show that the remaining 'intercept' logic is enough to quickly react to scenarios in which teo cannot rely on the timer expected sleep length. See also here: https://lore.kernel.org/lkml/0ce2d536-1125-4df8-9a5b-0d5e389cd8af@arm.com/ Fixes: 77577558f25d ("cpuidle: teo: Rework most recent idle duration values treatment") Link: https://patch.msgid.link/20240628095955.34096-3-christian.loehle@arm.com Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28Revert: "cpuidle: teo: Introduce util-awareness"Christian Loehle
This reverts commit 9ce0f7c4bc64d820b02a1c53f7e8dba9539f942b. Util-awareness was reported to be too aggressive in selecting shallower states. Additionally a single threshold was found to not be suitable for reasoning about sleep length as, for all practical purposes, almost arbitrary sleep lengths are still possible for any load value. Fixes: 9ce0f7c4bc64 ("cpuidle: teo: Introduce util-awareness") Link: https://patch.msgid.link/20240628095955.34096-2-christian.loehle@arm.com Reported-by: Qais Yousef <qyousef@layalina.io> Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Qais Yousef <qyousef@layalina.io> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28cpufreq: make cpufreq_boost_enabled() return boolDhruva Gole
Since this function is supposed to return boost_enabled which is anyway a bool type make sure that it's return value is also marked as bool. This helps maintain better consistency in data types being used. Signed-off-by: Dhruva Gole <d-gole@ti.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/20240627060117.1809477-1-d-gole@ti.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28cpufreq: intel_pstate: Support highest performance change interruptSrinivas Pandruvada
On some systems, the HWP (Hardware P-states) highest performance level can change from the value set at boot-up. This behavior can lead to two issues: - The 'cpuinfo_max_freq' within the 'cpufreq' sysfs will not reflect the CPU's highest achievable performance. - Even if the CPU's highest performance level is increased after booting, the CPU may not reach the full expected performance. The availability of this feature is indicated by the CPUID instruction: if CPUID[6].EAX[15] is set to 1, the feature is supported. When supported, setting bit 2 of the MSR_HWP_INTERRUPT register enables notifications of the highest performance level changes. Therefore, as part of enabling the HWP interrupt, bit 2 of the MSR_HWP_INTERRUPT should also be set when this feature is supported. Upon a change in the highest performance level, a new HWP interrupt is generated, with bit 3 of the MSR_HWP_STATUS register set, and the MSR_HWP_CAPABILITIES register is updated with the new highest performance limit. The processing of the interrupt is the same as the guaranteed performance change. Notify change to cpufreq core and update MSR_HWP_REQUEST with new performance limits. The current driver implementation already takes care of the highest performance change as part of: commit dfeeedc1bf57 ("cpufreq: intel_pstate: Update cpuinfo.max_freq on HWP_CAP changes") For example: Before highest performance change interrupt: cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq 3700000 cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq 3700000 After highest performance changes interrupt: cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq 3900000 cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq 3900000 Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Link: https://patch.msgid.link/20240624161109.1427640-3-srinivas.pandruvada@linux.intel.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-06-28drm/xe: Get hwe domain specific FW to read RING_TIMESTAMPUmesh Nerlige Ramappa
Per client engine utilization uses RING_TIMESTAMP to return drm-total-cycles to the user. Current code uses XE_FW_GT to read this register on the first available engine in a GT. When testing on DG2, it is observed that this value is 0 when running test on some engines. To resolve that, get the hwe domain specific FW for reading the engine timestamp. v2: - update commit message - use domain specific FW (Matt) v3: - Drop check for hwe in the helper (Matt, Michal) v4: - checkpatch fixes v5: Rebase Fixes: 188ced1e0ff8 ("drm/xe/client: Print runtime to fdinfo") Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240627235105.2631135-1-umesh.nerlige.ramappa@intel.com
2024-06-28i2c: testunit: discard write requests while old command is runningWolfram Sang
When clearing registers on new write requests was added, the protection for currently running commands was missed leading to concurrent access to the testunit registers. Check the flag beforehand. Fixes: b39ab96aa894 ("i2c: testunit: add support for block process calls") Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Andi Shyti <andi.shyti@kernel.org>
2024-06-28i2c: testunit: don't erase registers after STOPWolfram Sang
STOP fallsthrough to WRITE_REQUESTED but this became problematic when clearing the testunit registers was added to the latter. Actually, there is no reason to clear the testunit state after STOP. Doing it when a new WRITE_REQUESTED arrives is enough. So, no need to fallthrough, at all. Fixes: b39ab96aa894 ("i2c: testunit: add support for block process calls") Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Reviewed-by: Andi Shyti <andi.shyti@kernel.org>
2024-06-28Bluetooth: btnxpuart: Enable Power Save feature on startupNeeraj Sanjay Kale
This sets the default power save mode setting to enabled. The power save feature is now stable and stress test issues, such as the TX timeout error, have been resolved. commit c7ee0bc8db32 ("Bluetooth: btnxpuart: Resolve TX timeout error in power save stress test") With this setting, the driver will send the vendor command to FW at startup, to enable power save feature. User can disable this feature using the following vendor command: hcitool cmd 3f 23 03 00 00 (HCI_NXP_AUTO_SLEEP_MODE) Signed-off-by: Neeraj Sanjay Kale <neeraj.sanjaykale@nxp.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-06-28Bluetooth: qca: Fix BT enable failure again for QCA6390 after warm rebootZijun Hu
Commit 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev") will cause below regression issue: BT can't be enabled after below steps: cold boot -> enable BT -> disable BT -> warm reboot -> BT enable failure if property enable-gpios is not configured within DT|ACPI for QCA6390. The commit is to fix a use-after-free issue within qca_serdev_shutdown() by adding condition to avoid the serdev is flushed or wrote after closed but also introduces this regression issue regarding above steps since the VSC is not sent to reset controller during warm reboot. Fixed by sending the VSC to reset controller within qca_serdev_shutdown() once BT was ever enabled, and the use-after-free issue is also fixed by this change since the serdev is still opened before it is flushed or wrote. Verified by the reported machine Dell XPS 13 9310 laptop over below two kernel commits: commit e00fc2700a3f ("Bluetooth: btusb: Fix triggering coredump implementation for QCA") of bluetooth-next tree. commit b23d98d46d28 ("Bluetooth: btusb: Fix triggering coredump implementation for QCA") of linus mainline tree. Fixes: 272970be3dab ("Bluetooth: hci_qca: Fix driver shutdown on closed serdev") Cc: stable@vger.kernel.org Reported-by: Wren Turkal <wt@penguintechs.org> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218726 Signed-off-by: Zijun Hu <quic_zijuhu@quicinc.com> Tested-by: Wren Turkal <wt@penguintechs.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-06-28Bluetooth: btintel_pcie: Fix REVERSE_INULL issue reported by coverityVijay Satija
check pdata return of skb_pull_data, instead of data. Fixes: c2b636b3f788 ("Bluetooth: btintel_pcie: Add support for PCIe transport") Signed-off-by: Vijay Satija <vijay.satija@intel.com> Signed-off-by: Kiran K <kiran.k@intel.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-06-28Bluetooth: hci_bcm4377: Fix msgid releaseHector Martin
We are releasing a single msgid, so the order argument to bitmap_release_region must be zero. Fixes: 8a06127602de ("Bluetooth: hci_bcm4377: Add new driver for BCM4377 PCIe boards") Cc: stable@vger.kernel.org Signed-off-by: Hector Martin <marcan@marcan.st> Reviewed-by: Sven Peter <sven@svenpeter.dev> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-06-28Bluetooth: Add quirk to ignore reserved PHY bits in LE Extended Adv ReportSven Peter
Some Broadcom controllers found on Apple Silicon machines abuse the reserved bits inside the PHY fields of LE Extended Advertising Report events for additional flags. Add a quirk to drop these and correctly extract the Primary/Secondary_PHY field. The following excerpt from a btmon trace shows a report received with "Reserved" for "Primary PHY" on a 4388 controller: > HCI Event: LE Meta Event (0x3e) plen 26 LE Extended Advertising Report (0x0d) Num reports: 1 Entry 0 Event type: 0x2515 Props: 0x0015 Connectable Directed Use legacy advertising PDUs Data status: Complete Reserved (0x2500) Legacy PDU Type: Reserved (0x2515) Address type: Random (0x01) Address: 00:00:00:00:00:00 (Static) Primary PHY: Reserved Secondary PHY: No packets SID: no ADI field (0xff) TX power: 127 dBm RSSI: -60 dBm (0xc4) Periodic advertising interval: 0.00 msec (0x0000) Direct address type: Public (0x00) Direct address: 00:00:00:00:00:00 (Apple, Inc.) Data length: 0x00 Cc: stable@vger.kernel.org Fixes: 2e7ed5f5e69b ("Bluetooth: hci_sync: Use advertised PHYs on hci_le_ext_create_conn_sync") Reported-by: Janne Grunau <j@jannau.net> Closes: https://lore.kernel.org/all/Zjz0atzRhFykROM9@robin Tested-by: Janne Grunau <j@jannau.net> Signed-off-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-06-28ice: Distinguish driver reset and removal for AQ shutdownPiotr Gardocki
Admin queue command for shutdown AQ contains a flag to indicate driver unload. However, the flag is always set in the driver, even for resets. It can cause the firmware to consider driver as unloaded once the PF reset is triggered on all ports of device, which could lead to unexpected results. Add an additional function parameter to functions that shutdown AQ, indicating whether the driver is actually unloading. Reviewed-by: Ahmed Zaki <ahmed.zaki@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Piotr Gardocki <piotrx.gardocki@intel.com> Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-06-28ice: Allow different FW API versions based on MAC typePaul Greenwalt
Allow the driver to be compatible with different FW API versions based on the device's MAC type. Currently, E810 is only compatible with one FW API version. Now the driver can be compatible with different FW API versions for both E810 and E830. For example, E810 FW API version is 1.5.0 and E830 is 1.7.0. Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-06-28ice: Check all ice_vsi_rebuild() errors in functionEric Joyner
Check the return value from ice_vsi_rebuild() and prevent the usage of incorrectly configured VSI. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Eric Joyner <eric.joyner@intel.com> Signed-off-by: Karen Ostrowska <karen.ostrowska@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-06-28interconnect: qcom: Add MSM8953 driverVladimir Lypak
Add driver for interconnect busses found in MSM8953 based platforms. The topology consists of four NoCs that are partially controlled by a RPM processor. Note that one of NoCs (System NoC) has a counterpart (System NoC MM) that is modelled as child device to avoid resource conflicts, since it uses same MMIO space for configuration. Signed-off-by: Vladimir Lypak <vladimir.lypak@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Barnabás Czémán <barnabas.czeman@mainlining.org> Link: https://lore.kernel.org/r/20240628-msm8953-interconnect-v3-2-a70d582182dc@mainlining.org Signed-off-by: Georgi Djakov <djakov@kernel.org>
2024-06-28drm/nouveau: fix null pointer dereference in nouveau_connector_get_modesMa Ke
In nouveau_connector_get_modes(), the return value of drm_mode_duplicate() is assigned to mode, which will lead to a possible NULL pointer dereference on failure of drm_mode_duplicate(). Add a check to avoid npd. Cc: stable@vger.kernel.org Fixes: 6ee738610f41 ("drm/nouveau: Add DRM driver for NVIDIA GPUs") Signed-off-by: Ma Ke <make24@iscas.ac.cn> Signed-off-by: Lyude Paul <lyude@redhat.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240627074204.3023776-1-make24@iscas.ac.cn
2024-06-28remoteproc: mediatek: Don't attempt to remap l1tcm memory if missingNícolas F. R. A. Prado
The current code doesn't check whether platform_get_resource_byname() succeeded to get the l1tcm memory, which is optional, before attempting to map it. This results in the following error message when it is missing: mtk-scp 10500000.scp: error -EINVAL: invalid resource (null) Add a check so that the remapping is only attempted if the memory region exists. This also allows to simplify the logic handling failure to remap, since a failure then is always a failure. Fixes: ca23ecfdbd44 ("remoteproc/mediatek: support L1TCM") Signed-off-by: Nícolas F. R. A. Prado <nfraprado@collabora.com> Reviewed-by: Tzung-Bi Shih <tzungbi@kernel.org> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20240627-scp-invalid-resource-l1tcm-v1-1-7d221e6c495a@collabora.com Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
2024-06-28cpumask: Add enabled cpumask for present CPUs that can be brought onlineJames Morse
The 'offline' file in sysfs shows all offline CPUs, including those that aren't present. User-space is expected to remove not-present CPUs from this list to learn which CPUs could be brought online. CPUs can be present but not-enabled. These CPUs can't be brought online until the firmware policy changes, which comes with an ACPI notification that will register the CPUs. With only the offline and present files, user-space is unable to determine which CPUs it can try to bring online. Add a new CPU mask that shows this based on all the registered CPUs. Signed-off-by: James Morse <james.morse@arm.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Tested-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com> Tested-by: Jianyong Wu <jianyong.wu@arm.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-20-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28irqchip/gic-v3: Add support for ACPI's disabled but 'online capable' CPUsJames Morse
To support virtual CPU hotplug, ACPI has added an 'online capable' bit to the MADT GICC entries. This indicates a disabled CPU entry may not be possible to online via PSCI until firmware has set enabled bit in _STA. This means that a "usable" GIC redistributor is one that is marked as either enabled, or online capable. The meaning of the acpi_gicc_is_usable() would become less clear than just checking the pair of flags at call sites. As such, drop that helper function. The test in gic_acpi_match_gicc() remains as testing just the enabled bit so the count of enabled distributors is correct. What about the redistributor in the GICC entry? ACPI doesn't want to say. Assume the worst: When a redistributor is described in the GICC entry, but the entry is marked as disabled at boot, assume the redistributor is inaccessible. The GICv3 driver doesn't support late online of redistributors, so this means the corresponding CPU can't be brought online either. Rather than modifying cpu masks that may already have been used, register a new cpuhp callback to fail this case. This must run earlier than the main gic_starting_cpu() so that this case can be rejected before the section of cpuhp that runs on the CPU that is coming up as that is not allowed to fail. This solution keeps the handling of this broken firmware corner case local to the GIC driver. As precise ordering of this callback doesn't need to be controlled as long as it is in that initial prepare phase, use CPUHP_BP_PREPARE_DYN. Systems that want CPU hotplug in a VM can ensure their redistributors are always-on, and describe them that way with a GICR entry in the MADT. Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Tested-by: Miguel Luis <miguel.luis@oracle.com> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240529133446.28446-15-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28irqchip/gic-v3: Don't return errors from gic_acpi_match_gicc()James Morse
gic_acpi_match_gicc() is only called via gic_acpi_count_gicr_regions(). It should only count the number of enabled redistributors, but it also tries to sanity check the GICC entry, currently returning an error if the Enabled bit is set, but the gicr_base_address is zero. Adding support for the online-capable bit to the sanity check will complicate it, for no benefit. The existing check implicitly depends on gic_acpi_count_gicr_regions() previous failing to find any GICR regions (as it is valid to have gicr_base_address of zero if the redistributors are described via a GICR entry). Instead of complicating the check, remove it. Failures that happen at this point cause the irqchip not to register, meaning no irqs can be requested. The kernel grinds to a panic() pretty quickly. Without the check, MADT tables that exhibit this problem are still caught by gic_populate_rdist(), which helpfully also prints what went wrong: | CPU4: mpidr 100 has no re-distributor! Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240529133446.28446-14-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: Add post_eject to struct acpi_scan_handler for cpu hotplugJames Morse
struct acpi_scan_handler has a detach callback that is used to remove a driver when a bus is changed. When interacting with an eject-request, the detach callback is called before _EJ0. This means the ACPI processor driver can't use _STA to determine if a CPU has been made not-present, or some of the other _STA bits have been changed. acpi_processor_remove() needs to know the value of _STA after _EJ0 has been called. Add a post_eject callback to struct acpi_scan_handler. This is called after acpi_scan_hot_remove() has successfully called _EJ0. Because acpi_scan_check_and_detach() also clears the handler pointer, it needs to be told if the caller will go on to call acpi_bus_post_eject(), so that acpi_device_clear_enumerated() and clearing the handler pointer can be deferred. An extra flag is added to flags field introduced in the previous patch to achieve this. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Tested-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com> Tested-by: Jianyong Wu <jianyong.wu@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-11-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: scan: switch to flags for acpi_scan_check_and_detach()Jonathan Cameron
Precursor patch adds the ability to pass a uintptr_t of flags into acpi_scan_check_and detach() so that additional flags can be added to indicate whether to defer portions of the eject flow. The new flag follows in the next patch. Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-10-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: processor: Register deferred CPUs from acpi_processor_get_info()James Morse
The arm64 specific arch_register_cpu() call may defer CPU registration until the ACPI interpreter is available and the _STA method can be evaluated. If this occurs, then a second attempt is made in acpi_processor_get_info(). Note that the arm64 specific call has not yet been added so for now this will be called for the original hotplug case. For architectures that do not defer until the ACPI Processor driver loads (e.g. x86), for initially present CPUs there will already be a CPU device. If present do not try to register again. Systems can still be booted with 'acpi=off', or not include an ACPI description at all as in these cases arch_register_cpu() will not have deferred registration when first called. This moves the CPU register logic back to a subsys_initcall(), while the memory nodes will have been registered earlier. Note this is where the call was prior to the cleanup series so there should be no side effects of moving it back again for this specific case. [PATCH 00/21] Initial cleanups for vCPU HP. https://lore.kernel.org/all/ZVyz%2FVe5pPu8AWoA@shell.armlinux.org.uk/ commit 5b95f94c3b9f ("x86/topology: Switch over to GENERIC_CPU_DEVICES") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Tested-by: Vishnu Pajjuri <vishnu@os.amperecomputing.com> Tested-by: Jianyong Wu <jianyong.wu@arm.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-9-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: processor: Add acpi_get_processor_handle() helperJonathan Cameron
If CONFIG_ACPI_PROCESSOR provide a helper to retrieve the acpi_handle for a given CPU allowing access to methods in DSDT. Tested-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-8-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: processor: Move checks and availability of acpi_processor earlierJonathan Cameron
Make the per_cpu(processors, cpu) entries available earlier so that they are available in arch_register_cpu() as ARM64 will need access to the acpi_handle to distinguish between acpi_processor_add() and earlier registration attempts (which will fail as _STA cannot be checked). Reorder the remove flow to clear this per_cpu() after arch_unregister_cpu() has completed, allowing it to be used in there as well. Note that on x86 for the CPU hotplug case, the pr->id prior to acpi_map_cpu() may be invalid. Thus the per_cpu() structures must be initialized after that call or after checking the ID is valid (not hotplug path). Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-7-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: processor: Fix memory leaks in error paths of processor_add()Jonathan Cameron
If acpi_processor_get_info() returned an error, pr and the associated pr->throttling.shared_cpu_map were leaked. The unwind code was in the wrong order wrt to setup, relying on some unwind actions having no affect (clearing variables that were never set etc). That makes it harder to reason about so reorder and add appropriate labels to only undo what was actually set up in the first place. Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-6-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: processor: Return an error if acpi_processor_get_info() fails in ↵Jonathan Cameron
processor_add() Rafael observed [1] that returning 0 from processor_add() will result in acpi_default_enumeration() being called which will attempt to create a platform device, but that makes little sense when the processor is known to be not available. So just return the error code from acpi_processor_get_info() instead. Link: https://lore.kernel.org/all/CAJZ5v0iKU8ra9jR+EmgxbuNm=Uwx2m1-8vn_RAZ+aCiUVLe3Pw@mail.gmail.com/ [1] Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-5-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: processor: Drop duplicated check on _STA (enabled + present)Jonathan Cameron
The ACPI bus scan will only result in acpi_processor_add() being called if _STA has already been checked and the result is that the processor is enabled and present. Hence drop this additional check. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-4-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28cpu: Do not warn on arch_register_cpu() returning -EPROBE_DEFERJonathan Cameron
For arm64 the CPU registration cannot complete until the ACPI interpreter us up and running so in those cases the arch specific arch_register_cpu() will return -EPROBE_DEFER at this stage and the registration will be attempted later. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-3-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28ACPI: processor: Simplify initial onlining to use same path for cold and hotplugJonathan Cameron
Separate code paths, combined with a flag set in acpi_processor.c to indicate a struct acpi_processor was for a hotplugged CPU ensured that per CPU data was only set up the first time that a CPU was initialized. This appears to be unnecessary as the paths can be combined by letting the online logic also handle any CPUs online at the time of driver load. Motivation for this change, beyond simplification, is that ARM64 virtual CPU HP uses the same code paths for hotplug and cold path in acpi_processor.c so had no easy way to set the flag for hotplug only. Removing this necessity will enable ARM64 vCPU HP to reuse the existing code paths. Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hanjun Guo <guohanjun@huawei.com> Tested-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240529133446.28446-2-Jonathan.Cameron@huawei.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-06-28hwmon: add MP2891 driverNoah Wang
Add support for MPS VR controller mp2891. This driver exposes telemetry and limit value readings and writtings. Signed-off-by: Noah Wang <noahwang.wang@outlook.com> Link: https://lore.kernel.org/r/SEYPR04MB64828A352836982C0184AA10FAD62@SEYPR04MB6482.apcprd04.prod.outlook.com Signed-off-by: Guenter Roeck <linux@roeck-us.net>
2024-06-28ice: Add get/set hw address for VFs using devlink commandsKarthik Sundaravel
Changing the MAC address of the VFs is currently unsupported via devlink. Add the function handlers to set and get the HW address for the VFs. Signed-off-by: Karthik Sundaravel <ksundara@redhat.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-06-28Merge tag 'mtk-soc-for-v6.11' of ↵Arnd Bergmann
https://git.kernel.org/pub/scm/linux/kernel/git/mediatek/linux into soc/drivers MediaTek driver updates for v6.11 This adds the previously missed Tone Curve Conversion (TCC) MuteX bit to enable the same in MDP3 on the the MT8188 SoC, disables 9-bit Alpha for display HDR support in MT8195 and adds math operation support in the Global Command Engine for all MediaTek SoCs, which will be used in the near future in the ISP driver. * tag 'mtk-soc-for-v6.11' of https://git.kernel.org/pub/scm/linux/kernel/git/mediatek/linux: soc: mtk-cmdq: Add cmdq_pkt_logic_command to support math operation soc: mediatek: Disable 9-bit alpha in ETHDR soc: mediatek: mtk-mutex: Add MDP_TCC0 mod to MT8188 mutex table Link: https://lore.kernel.org/r/20240628093801.126013-3-angelogioacchino.delregno@collabora.com Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2024-06-28rnbd-cnt: don't set QUEUE_FLAG_SAME_FORCEChristoph Hellwig
QUEUE_FLAG_SAME_FORCE has been set by rnbd-cnt since the initial merge. There is no good reason for a driver to force exact core delivery, which is tunable for very specific workloads and not a driver setting. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28rnbd: don't set QUEUE_FLAG_SAME_COMPChristoph Hellwig
QUEUE_FLAG_SAME_COMP is already set by default. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28mpt3sas_scsih: don't set QUEUE_FLAG_NOMERGESChristoph Hellwig
Setting QUEUE_FLAG_NOMERGES was added in commit d1b01d14b7ba ("scsi: mpt3sas: Set NVMe device queue depth as 128") without any explanation. Drivers should second guess the block layer merge decisions, so remove the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28megaraid_sas: don't set QUEUE_FLAG_NOMERGESChristoph Hellwig
Setting QUEUE_FLAG_NOMERGES was added in commit 15dd03811d99dcf ("scsi: megaraid_sas: NVME Interface detection and prop settings") without any explanation. Drivers should second guess the block layer merge decisions, so remove the flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28loop: don't set QUEUE_FLAG_NOMERGESChristoph Hellwig
QUEUE_FLAG_NOMERGES isn't really a driver interface, but a user tunable. There also isn't any good reason to set it in the loop driver. The original commit adding it (5b5e20f421c0b6d "block: loop: set QUEUE_FLAG_NOMERGES for request queue of loop") claims that "It doesn't make sense to enable merge because the I/O submitted to backing file is handled page by page." which of course isn't true for multi-page bvec now, and it never has been for direct I/O, for which commit 40326d8a33d ("block/loop: allow request merge for directio mode") alredy disabled the nomerges flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240627124926.512662-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28bcache: work around a __bitwise to bool conversion sparse warningChristoph Hellwig
Sparse is a bit dumb about bitwise operation on __bitwise types used in boolean contexts. Add a !! to explicitly propagate to boolean without a warning. Fixes: fcf865e357f8 ("block: convert features and flags to __bitwise types") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240628131657.667797-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-06-28Merge tag 'block-6.10-20240628' of git://git.kernel.dk/linuxLinus Torvalds
Pull block fixes from Jens Axboe: "NVMe fixes via Keith: - Fabrics fixes (Hannes) - Missing module description (Jeff) - Clang warning fix (Nathan)" * tag 'block-6.10-20240628' of git://git.kernel.dk/linux: nvmet-fc: Remove __counted_by from nvmet_fc_tgt_queue.fod[] nvmet: make 'tsas' attribute idempotent for RDMA nvme: fixup comment for nvme RDMA Provider Type nvme-apple: add missing MODULE_DESCRIPTION() nvmet: do not return 'reserved' for empty TSAS values nvme: fix NVME_NS_DEAC may incorrectly identifying the disk as EXT_LBA.
2024-06-28Merge tag 'iommu-fixes-v6.10-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu fixes from Joerg Roedel: - Two cache flushing fixes for Intel and AMD drivers - AMD guest translation enabling fix - Update IOMMU tree location in MAINTAINERS file * tag 'iommu-fixes-v6.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: MAINTAINERS: Update IOMMU tree location iommu/amd: Fix GT feature enablement again iommu/vt-d: Fix missed device TLB cache tag iommu/amd: Invalidate cache before removing device from domain list
2024-06-28Merge tag 'gpio-fixes-for-v6.10-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: "An assortment of driver fixes and two commits addressing a bad behavior of the GPIO uAPI when reconfiguring requested lines. - fix a race condition in i2c transfers by adding a missing i2c lock section in gpio-pca953x - validate the number of obtained interrupts in gpio-davinci - add missing raw_spinlock_init() in gpio-graniterapids - fix bad character device behavior: disallow GPIO line reconfiguration without set direction both in v1 and v2 uAPI" * tag 'gpio-fixes-for-v6.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpiolib: cdev: Ignore reconfiguration without direction gpiolib: cdev: Disallow reconfiguration without direction (uAPI v1) gpio: graniterapids: Add missing raw_spinlock_init() gpio: davinci: Validate the obtained number of IRQs gpio: pca953x: fix pca953x_irq_bus_sync_unlock race
2024-06-28iommufd/iova_bitmap: Remove iterator logicJoao Martins
The newly introduced dynamic pinning/windowing greatly simplifies the code and there's no obvious performance advantage that has been identified that justifies maintinaing both schemes. Remove the iterator logic and have iova_bitmap_for_each() just invoke the callback with the total iova/length. Fixes: 2780025e01e2 ("iommufd/iova_bitmap: Handle recording beyond the mapped pages") Link: https://lore.kernel.org/r/20240627110105.62325-12-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Matt Ochs <mochs@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2024-06-28iommufd/iova_bitmap: Dynamic pinning on iova_bitmap_set()Joao Martins
Today zerocopy iova bitmaps use a static iteration scheme where it walks the bitmap data in a max iteration size of 2M of bitmap of data at a time. That translates to a fixed window of IOVA space that can span up to 64G (e.g. base pages, x86). Here 'window' refers to the IOVA space represented by the bitmap data it is iterating. This static scheme is the ideal one where the reported page-size is the same as the one behind the dirty tracker. However, problems start to appear when the dirty tracker may dirty in many PTE sizes beyond or unaligned at the boundaries of the iteration window. Such is the case for the IOMMU and commit 2780025e01e2 ("iommufd/iova_bitmap: Handle recording beyond the mapped pages") tried to fix the problem by handling the PTEs that get dirty which surprass the end of the iteration. But the fix was incomplete and it didn't handle all the data structure issues namely: 1) when there's nothing to dirty but the end of the iteration IOVA range is a IOMMU hugepage PTE that crosses iterations, when it goes to the next iteration it finds the other end of the said hugepage but don't account that it had checked for that IOPTE already. iommu driver then walk the IOVA space as if it is a new page without accounting that it is past the start of a bigger page which ends up setting (future) dirty bits slightly offset-ed. Note that the partial ranges here are self induced due as a result of the fixed 'window' scheme being unaligned to this hugepage IOPTE. 2) on the same line of thinking between pinning pages of different iterations it could allow DMA to mark PTEs as dirty on the second part of this previously mentioned partial hugepage. This leads to marking part of the hugepage as dirty but still clearing IOPTE leading to missed dirty data. So to fix these problems more fundamentally and avoid future ones: instead of iterating the whole bitmap in fixed chunks, instead only pin the bitmap pages when it has dirty bits to set. The logic is simple in iova_bitmap_set(): check where the current iova range to be marked as dirty is pinned and pin the bitmap pages where to-be-recorded @iova starts if it's not. If it's partially mapped out of the whole set, continue pinning it and set bits until the whole dirty-size is covered. The latter is more relevant with AMD iommu pgtable v1 format where you can have up 64G/128G/256G page sizes and thus you can set 64G at a time. Code also gets simpler and easier to follow. Fixing this without changing this iteration scheme means changing iommu drivers to ignore any partial pages and not clear dirty bits, which is a bit hacky. Though getting to walk only part of a IOMMU hugepage is a self-induced due to this iteration scheme as it doesn't (and can't) align the iteration boundary to the huge IOPTE at the end. Thus it can't know what the hugepage size the iteration should align to until it walks the begin/end. Dynamically pinning adds some comparisons inside iova_bitmap_set() to check if something needs to be pinned if the IOVA range is out of range. Though it has the benefit that non-dirty IOVA ranges only walk page tables without needing to pin any bitmap pages. This dynamic scheme should be better for IOMMUs where upper layers don't need or know what PTE sizes IOVAs map into (and there could be more than one PTE size[*]) until they walk the IOMMU page tables. A follow-up change will remove the iteration logic. [*] Specially on AMD v1 iommu pgtable format where most powers of two are supported as page-size. Link: https://lore.kernel.org/linux-iommu/6b90f949-48da-4cb3-ad9a-ed54f1351a9a@oracle.com/ Fixes: 2780025e01e2 ("iommufd/iova_bitmap: Handle recording beyond the mapped pages") Link: https://lore.kernel.org/r/20240627110105.62325-11-joao.m.martins@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Matt Ochs <mochs@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>