summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-03-11mlxsw: spectrum_router: Share nexthop counters in resilient groupsPetr Machata
For resilient groups, we can reuse the same counter for all the buckets that share the same nexthop. Keep a reference count per counter, and keep all these counters in a per-next hop group xarray, which serves as a NHID->counter cache. If a counter is already present for a given NHID, just take a reference and use the same counter. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/cdd00084533fc83ac5917562f54642f008205bf3.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11mlxsw: spectrum_router: Support nexthop group hardware statisticsPetr Machata
When hw_stats is set on a group, install nexthop counters on members of a group. Counter allocation request is moved from nexthop object initialization to the update code. The previous placement made sense: when the counters are enabled by dpipe, the counters are installed to all existing nexthops and all nexthops created from then on get them. For the finer-grained nexthop group statistics, this is unsuitable. The existing placement was kept for the IPv4 and IPv6 nexthops. Resilient group replacement emits a pre_replace notification, and then any bucket_replace notifications if there were any replacements at all. If the group is balanced and the nexthop composition of the replaced group didn't change, there will be no such notifiers. Therefore hook to the pre_replace notifier and mark all buckets for update, to un/install the counters. When reporting deltas for resilient groups, use the nexthop ID that we stored in a previous patch to look up to which nexthop a bucket contributes. Co-developed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/87495a72f187df2e5d491d02729c550d235fcc85.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11mlxsw: spectrum_router: Track NH ID's of group membersPetr Machata
The core interfaces for collecting per-NH statistics are built around nexthops even for resilient groups. Because mlxsw models each bucket as a nexthop, the core next hop that a given bucket contributes to needs to be looked up. In order to be able to match the two up, we need to track nexthop ID for members of group nexthop objects. For simplicity, do it for all nexthop objects, not just group members. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/184ceb6b154e08f5bcf116a705b0fcb01c31895c.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11mlxsw: spectrum_router: Add helpers for nexthop countersPetr Machata
The next patch will add the ability to share nexthop counters among mlxsw nexthops backed by the same core nexthop. To have a place to store reference count, the counter should be kept in a dedicated structure. In this patch, introduce the structure together with the related helpers, sans the refcount, which comes in the next patch. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/61f23fa4f8c5d7879f68dacd793d8ab7425f33c0.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11mlxsw: spectrum_router: Avoid allocating NH counters twicePetr Machata
mlxsw_sp_nexthop_counter_disable() decays to a nop when called on a disabled counter, but mlxsw_sp_nexthop_counter_enable() can't similarly be called on an enabled counter. This would be useful in the following patches. Add the missing condition. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/0cc9050e196366c1387ab5ee47f1cee8ecde9c86.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11mlxsw: spectrum: Allow fetch-and-clear of flow countersPetr Machata
For the report_delta-like interface like a previous patch has added for collection of NH group statistics, it's easiest to read the counter and have the HW clear it right away. Thus, change mlxsw_sp_flow_counter_get() to take a bool indicating whether this should be done. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/6a096ede8ee92d5041e3832242c3bbc137198aba.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11mlxsw: spectrum_router: Have mlxsw_sp_nexthop_counter_enable() return intPetr Machata
In order to be able to diagnose failures in counter allocation, have the function mlxsw_sp_nexthop_counter_enable() return an error code. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/e0bb5c0cc6234ade2ade1e92abac991359c3f446.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11mlxsw: spectrum_router: Rename two functionsPetr Machata
The function mlxsw_sp_nexthop_counter_alloc() doesn't directly allocate anything, and mlxsw_sp_nexthop_counter_free() doesn't directly free. For the following patches, we will need names for functions that actually do those things. Therefore rename to mlxsw_sp_nexthop_counter_enable() and mlxsw_sp_nexthop_counter_disable() to free up the namespace. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/f59272958697a718f090f59f892d32beabcd8972.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: nexthop: Have all NH notifiers carry NH IDPetr Machata
When sending the notifications to collect NH statistics for resilient groups, the driver will need to know the nexthop IDs in individual buckets to look up the right counter. To that end, move the nexthop ID from struct nh_notifier_grp_entry_info to nh_notifier_single_info. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/8f964cd50b1a56d3606ce7ab4c50354ae019c43b.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: nexthop: Initialize NH group ID in resilient NH group notifiersPetr Machata
The NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE notifier currently keeps the group ID unset. That makes it impossible to look up the group for which the notifier is intended. This is not an issue at the moment, because the only client is netdevsim, and that just so that it veto replacements, which is a static property not tied to a particular group. But for any practical use, the ID is necessary. Set it. Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/025fef095dcfb408042568bb5439da014d47239e.1709901020.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: gro: move two declarations to include/net/gro.hEric Dumazet
Move gro_find_receive_by_type() and gro_find_complete_by_type() to include/net/gro.h where they belong. Also use _NET_GRO_H instead of _NET_IPV6_GRO_H to protect include/net/gro.h from multiple inclusions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240308102230.296224-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11r8152: fix unknown device for choose_configurationHayes Wang
For the unknown device, rtl8152_cfgselector_choose_configuration() should return a negative value. Then, usb_choose_configuration() would set a configuration for CDC ECM or NCM mode. Otherwise, there is no usb interface driver for the device. Fixes: aa4f2b3e418e ("r8152: Choose our USB config with choose_configuration() rather than probe()") Signed-off-by: Hayes Wang <hayeswang@realtek.com> Link: https://lore.kernel.org/r/20240308075206.33553-436-nic_swsd@realtek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: netconsole: Add continuation line prefix to userdata messagesMatthew Wood
Add a space (' ') prefix to every userdata line to match docs for dev-kmsg. To account for this extra character in each userdata entry, reduce userdata entry names (directory name) from 54 characters to 53. According to the dev-kmsg docs, a space is used for subsequent lines to mark them as continuation lines. > A line starting with ' ', is a continuation line, adding > key/value pairs to the log message, which provide the machine > readable context of the message, for reliable processing in > userspace. Testing for this patch:: cd /sys/kernel/config/netconsole && mkdir cmdline0 cd cmdline0 mkdir userdata/test && echo "hello" > userdata/test/value mkdir userdata/test2 && echo "hello2" > userdata/test2/value echo "message" > /dev/kmsg Outputs:: 6.8.0-rc5-virtme,12,493,231373579,-;message test=hello test2=hello2 And I confirmed all testing works as expected from the original patchset Fixes: df03f830d099 ("net: netconsole: cache userdata formatted string in netconsole_target") Signed-off-by: Matthew Wood <thepacketgeek@gmail.com> Reviewed-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20240308002525.248672-1-thepacketgeek@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11Merge tag 'irq-msi-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MSI updates from Thomas Gleixner: "Updates for the MSI interrupt subsystem and initial RISC-V MSI support. The core changes have been adopted from previous work which converted ARM[64] to the new per device MSI domain model, which was merged to support multiple MSI domain per device. The ARM[64] changes are being worked on too, but have not been ready yet. The core and platform-MSI changes have been split out to not hold up RISC-V and to avoid that RISC-V builds on the scheduled for removal interfaces. The core support provides new interfaces to handle wire to MSI bridges in a straight forward way and introduces new platform-MSI interfaces which are built on top of the per device MSI domain model. Once ARM[64] is converted over the old platform-MSI interfaces and the related ugliness in the MSI core code will be removed. The actual MSI parts for RISC-V were finalized late and have been post-poned for the next merge window. Drivers: - Add a new driver for the Andes hart-level interrupt controller - Rework the SiFive PLIC driver to prepare for MSI suport - Expand the RISC-V INTC driver to support the new RISC-V AIA controller which provides the basis for MSI on RISC-V - A few fixup for the fallout of the core changes" * tag 'irq-msi-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) irqchip/riscv-intc: Fix low-level interrupt handler setup for AIA x86/apic/msi: Use DOMAIN_BUS_GENERIC_MSI for HPET/IO-APIC domain search genirq/matrix: Dynamic bitmap allocation irqchip/riscv-intc: Add support for RISC-V AIA irqchip/sifive-plic: Improve locking safety by using irqsave/irqrestore irqchip/sifive-plic: Parse number of interrupts and contexts early in plic_probe() irqchip/sifive-plic: Cleanup PLIC contexts upon irqdomain creation failure irqchip/sifive-plic: Use riscv_get_intc_hwnode() to get parent fwnode irqchip/sifive-plic: Use devm_xyz() for managed allocation irqchip/sifive-plic: Use dev_xyz() in-place of pr_xyz() irqchip/sifive-plic: Convert PLIC driver into a platform driver irqchip/riscv-intc: Introduce Andes hart-level interrupt controller irqchip/riscv-intc: Allow large non-standard interrupt number genirq/irqdomain: Don't call ops->select for DOMAIN_BUS_ANY tokens irqchip/imx-intmux: Handle pure domain searches correctly genirq/msi: Provide MSI_FLAG_PARENT_PM_DEV genirq/irqdomain: Reroute device MSI create_mapping genirq/msi: Provide allocation/free functions for "wired" MSI interrupts genirq/msi: Optionally use dev->fwnode for device domain genirq/msi: Provide DOMAIN_BUS_WIRED_TO_MSI ...
2024-03-11io_uring: don't save/restore iowait stateJens Axboe
This kind of state is per-syscall, and since we're doing the waiting off entering the io_uring_enter(2) syscall, there's no way that iowait can already be set for this case. Simplify it by setting it if we need to, and always clearing it to 0 when done. Fixes: 7b72d661f1f2 ("io_uring: gate iowait schedule on having pending requests") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11Merge tag 'irq-core-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Core: - Make affinity changes take effect immediately for interrupt threads. This reduces the impact on isolated CPUs as it pulls over the thread right away instead of doing it after the next hardware interrupt arrived. - Cleanup and improvements for the interrupt chip simulator - Deduplication of the interrupt descriptor initialization code so the sparse and non-sparse mode share more code. Drivers: - A set of conversions to platform_drivers::remove_new() which gets rid of the pointless return value. - A new driver for the Starfive JH8100 SoC - Support for Amlogic-T7 SoCs - Improvement for the interrupt handling and EOI management for the loongson interrupt controller. - The usual fixes and improvements all over the place" * tag 'irq-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) irqchip/ts4800: Convert to platform_driver::remove_new() callback irqchip/stm32-exti: Convert to platform_driver::remove_new() callback irqchip/renesas-rza1: Convert to platform_driver::remove_new() callback irqchip/renesas-irqc: Convert to platform_driver::remove_new() callback irqchip/renesas-intc-irqpin: Convert to platform_driver::remove_new() callback irqchip/pruss-intc: Convert to platform_driver::remove_new() callback irqchip/mvebu-pic: Convert to platform_driver::remove_new() callback irqchip/madera: Convert to platform_driver::remove_new() callback irqchip/ls-scfg-msi: Convert to platform_driver::remove_new() callback irqchip/keystone: Convert to platform_driver::remove_new() callback irqchip/imx-irqsteer: Convert to platform_driver::remove_new() callback irqchip/imx-intmux: Convert to platform_driver::remove_new() callback irqchip/imgpdc: Convert to platform_driver::remove_new() callback irqchip: Add StarFive external interrupt controller dt-bindings: interrupt-controller: Add starfive,jh8100-intc arm64: dts: Add gpio_intc node for Amlogic-T7 SoCs irqchip/meson-gpio: Add support for Amlogic-T7 SoCs dt-bindings: interrupt-controller: Add support for Amlogic-T7 SoCs irqchip/vic: Fix a kernel-doc warning genirq: Wake interrupt threads immediately when changing affinity ...
2024-03-11r8169: switch to new function phy_support_eeeHeiner Kallweit
Switch to new function phy_support_eee. This allows to simplify the code because data->tx_lpi_enabled is now populated by phy_ethtool_get_eee(). Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/92462328-5c9b-4d82-9ce4-ea974cda4900@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: phy: simplify a check in phy_check_link_statusHeiner Kallweit
Handling case err == 0 in the other branch allows to simplify the code. In addition I assume in "err & phydev->eee_cfg.tx_lpi_enabled" it should have been a logical and operator. It works as expected also with the bitwise and, but using a bitwise and with a bool value looks ugly to me. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/de37bf30-61dd-49f9-b645-2d8ea11ddb5d@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: phy: marvell-88x2222: Remove unused of_gpio.hAndy Shevchenko
of_gpio.h is deprecated and subject to remove. The driver doesn't use it, simply remove the unused header. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240307122346.3677534-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: dsa: mt7530: disable LEDs before resetJustin Swartz
Disable LEDs just before resetting the MT7530 to avoid situations where the ESW_P4_LED_0 and ESW_P3_LED_0 pin states may cause an unintended external crystal frequency to be selected. The HT_XTAL_FSEL (External Crystal Frequency Selection) field of HWTRAP (the Hardware Trap register) stores a 2-bit value that represents the state of the ESW_P4_LED_0 and ESW_P4_LED_0 pins (seemingly) sampled just after the MT7530 has been reset, as: ESW_P4_LED_0 ESW_P3_LED_0 Frequency ----------------------------------------- 0 1 20MHz 1 0 40MHz 1 1 25MHz The value of HT_XTAL_FSEL is bootstrapped by pulling ESW_P4_LED_0 and ESW_P3_LED_0 up or down accordingly, but: if a 40MHz crystal has been selected and the ESW_P3_LED_0 pin is high during reset, or a 20MHz crystal has been selected and the ESW_P4_LED_0 pin is high during reset, then the value of HT_XTAL_FSEL will indicate that a 25MHz crystal is present. By default, the state of the LED pins is PHY controlled to reflect the link state. To illustrate, if a board has: 5 ports with active low LED control, and HT_XTAL_FSEL bootstrapped for 40MHz. When the MT7530 is powered up without any external connection, only the LED associated with Port 3 is illuminated as ESW_P3_LED_0 is low. In this state, directly after mt7530_setup()'s reset is performed, the HWTRAP register (0x7800) reflects the intended HT_XTAL_FSEL (HWTRAP bits 10:9) of 40MHz: mt7530-mdio mdio-bus:1f: mt7530_read: 00007800 == 00007dcf >>> bin(0x7dcf >> 9 & 0b11) '0b10' But if a cable is connected to Port 3 and the link is active before mt7530_setup()'s reset takes place, then HT_XTAL_FSEL seems to be set for 25MHz: mt7530-mdio mdio-bus:1f: mt7530_read: 00007800 == 00007fcf >>> bin(0x7fcf >> 9 & 0b11) '0b11' Once HT_XTAL_FSEL reflects 25MHz, none of the ports are functional until the MT7621 (or MT7530 itself) is reset. By disabling the LED pins just before reset, the chance of an unintended HT_XTAL_FSEL value is reduced. Signed-off-by: Justin Swartz <justin.swartz@risingedge.co.za> Link: https://lore.kernel.org/r/20240305043952.21590-1-justin.swartz@risingedge.co.za Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11net: mdio_bus: Remove unused of_gpio.hAndy Shevchenko
of_gpio.h is deprecated and subject to remove. The driver doesn't use it, simply remove the unused header. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20240307122231.3677241-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11ptp: make ptp_class constantRicardo B. Marliere
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the ptp_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240305-ptp-v1-1-ed253eb33c20@marliere.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11devlink: Fix length of eswitch inline-modeWilliam Tu
Set eswitch inline-mode to be u8, not u16. Otherwise, errors below $ devlink dev eswitch set pci/0000:08:00.0 mode switchdev \ inline-mode network Error: Attribute failed policy validation. kernel answers: Numerical result out of rang netlink: 'devlink': attribute type 26 has an invalid length. Fixes: f2f9dd164db0 ("netlink: specs: devlink: add the remaining command to generate complete split_ops") Signed-off-by: William Tu <witu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240310164547.35219-1-witu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11KVM/x86: Export RFDS_NO and RFDS_CLEAR to guestsPawan Gupta
Mitigation for RFDS requires RFDS_CLEAR capability which is enumerated by MSR_IA32_ARCH_CAPABILITIES bit 27. If the host has it set, export it to guests so that they can deploy the mitigation. RFDS_NO indicates that the system is not vulnerable to RFDS, export it to guests so that they don't deploy the mitigation unnecessarily. When the host is not affected by X86_BUG_RFDS, but has RFDS_NO=0, synthesize RFDS_NO to the guest. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11x86/rfds: Mitigate Register File Data Sampling (RFDS)Pawan Gupta
RFDS is a CPU vulnerability that may allow userspace to infer kernel stale data previously used in floating point registers, vector registers and integer registers. RFDS only affects certain Intel Atom processors. Intel released a microcode update that uses VERW instruction to clear the affected CPU buffers. Unlike MDS, none of the affected cores support SMT. Add RFDS bug infrastructure and enable the VERW based mitigation by default, that clears the affected buffers just before exiting to userspace. Also add sysfs reporting and cmdline parameter "reg_file_data_sampling" to control the mitigation. For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11Documentation/hw-vuln: Add documentation for RFDSPawan Gupta
Add the documentation for transient execution vulnerability Register File Data Sampling (RFDS) that affects Intel Atom CPUs. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-11x86/mmio: Disable KVM mitigation when X86_FEATURE_CLEAR_CPU_BUF is setPawan Gupta
Currently MMIO Stale Data mitigation for CPUs not affected by MDS/TAA is to only deploy VERW at VMentry by enabling mmio_stale_data_clear static branch. No mitigation is needed for kernel->user transitions. If such CPUs are also affected by RFDS, its mitigation may set X86_FEATURE_CLEAR_CPU_BUF to deploy VERW at kernel->user and VMentry. This could result in duplicate VERW at VMentry. Fix this by disabling mmio_stale_data_clear static branch when X86_FEATURE_CLEAR_CPU_BUF is enabled. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
2024-03-11Merge tag 'cgroup-for-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A quiet cycle. One trivial doc update patch. Two patches to drop the now defunct memory_spread_slab feature from cgroup1 cpuset" * tag 'cgroup-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset: Mark memory_spread_slab as obsolete cgroup/cpuset: Remove cpuset_do_slab_mem_spread() docs: cgroup-v1: add missing code-block tags
2024-03-11netlink: specs: support unterminated-okHangbin Liu
ynl-gen-c.py supports check unterminated-ok, but the yaml schemas don't have this key. Add this to the yaml files. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240308081239.3281710-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11tools: ynl-gen: support using pre-defined values in attr checksHangbin Liu
Support using pre-defined values in checks so we don't need to use hard code number for the string, binary length. e.g. we have a definition like #define TEAM_STRING_MAX_LEN 32 Which defined in yaml like: definitions: - name: string-max-len type: const value: 32 It can be used in the attribute-sets like attribute-sets: - name: attr-option name-prefix: team-attr-option- attributes: - name: name type: string checks: len: string-max-len With this patch it will be converted to [TEAM_ATTR_OPTION_NAME] = { .type = NLA_STRING, .len = TEAM_STRING_MAX_LEN, } Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240311140727.109562-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11Merge tag 'wq-for-6.9-bh-conversions' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Pull workqueue BH conversions from Tejun Heo: "This contains two patches that convert tasklet users to BH workqueues: backtracetest and usb hcd. DM conversions are being routed through the respective subsystem tree. Hopefully, the next cycle will see a lot more conversions" * tag 'wq-for-6.9-bh-conversions' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: usb: core: hcd: Convert from tasklet to BH workqueue backtracetest: Convert from tasklet to BH workqueue
2024-03-11net: page_pool: factor out page_pool recycle checkMina Almasry
The check is duplicated in 2 places, factor it out into a common helper. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20240308204500.1112858-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-11Merge tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "This cycle, a lot of workqueue changes including some that are significant and invasive. - During v6.6 cycle, unbound workqueues were updated so that they are more topology aware and flexible, which among other things improved workqueue behavior on modern multi-L3 CPUs. In the process, commit 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") switched unbound workqueues to use per-CPU frontend pool_workqueues as a part of increasing front-back mapping flexibility. An unwelcome side effect of this change was that this made max concurrency enforcement per-CPU blowing up the maximum number of allowed concurrent executions. I incorrectly assumed that this wouldn't cause practical problems as most unbound workqueue users are self-regulate max concurrency; however, there definitely are which don't (e.g. on IO paths) and the drastic increase in the allowed max concurrency led to noticeable perf regressions in some use cases. This is now addressed by separating out max concurrency enforcement to a separate struct - wq_node_nr_active - which makes @max_active consistently mean system-wide max concurrency regardless of the number of CPUs or (finally) NUMA nodes. This is a rather invasive and, in places, a bit clunky; however, the clunkiness rises from the the inherent requirement to handle the disagreement between the execution locality domain and max concurrency enforcement domain on some modern machines. See commit 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement for unbound workqueues") for more details. - BH workqueue support is added. They are similar to per-CPU workqueues but execute work items in the softirq context. This is expected to replace tasklet. However, currently, it's missing the ability to disable and enable work items which is needed to convert many tasklet users. To avoid crowding this merge window too much, this will be included in the next merge window. A separate pull request will be sent for the couple conversion patches that are currently pending. - Waiman plugged a long-standing hole in workqueue CPU isolation where ordered workqueues didn't follow wq_unbound_cpumask updates. Ordered workqueues now follow the same rules as other unbound workqueues. - More CPU isolation improvements: Juri fixed another deficit in workqueue isolation where unbound rescuers don't respect wq_unbound_cpumask. Leonardo fixed delayed_work timers firing on isolated CPUs. - Other misc changes" * tag 'wq-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (54 commits) workqueue: Drain BH work items on hot-unplugged CPUs workqueue: Introduce from_work() helper for cleaner callback declarations workqueue: Control intensive warning threshold through cmdline workqueue: Make @flags handling consistent across set_work_data() and friends workqueue: Remove clear_work_data() workqueue: Factor out work_grab_pending() from __cancel_work_sync() workqueue: Clean up enum work_bits and related constants workqueue: Introduce work_cancel_flags workqueue: Use variable name irq_flags for saving local irq flags workqueue: Reorganize flush and cancel[_sync] functions workqueue: Rename __cancel_work_timer() to __cancel_timer_sync() workqueue: Use rcu_read_lock_any_held() instead of rcu_read_lock_held() workqueue: Cosmetic changes workqueue, irq_work: Build fix for !CONFIG_IRQ_WORK workqueue: Fix queue_work_on() with BH workqueues async: Use a dedicated unbound workqueue with raised min_active workqueue: Implement workqueue_set_min_active() workqueue: Fix kernel-doc comment of unplug_oldest_pwq() workqueue: Bind unbound workqueue rescuer to wq_unbound_cpumask kernel/workqueue: Let rescuers follow unbound wq cpumask changes ...
2024-03-11Merge tag 'rust-6.9' of https://github.com/Rust-for-Linux/linuxLinus Torvalds
Pull Rust updates from Miguel Ojeda: "Another routine one in terms of features. We got two version upgrades this time, but in terms of lines, 'alloc' changes are not very large. Toolchain and infrastructure: - Upgrade to Rust 1.76.0 This time around, due to how the kernel and Rust schedules have aligned, there are two upgrades in fact. These allow us to remove two more unstable features ('const_maybe_uninit_zeroed' and 'ptr_metadata') from the list, among other improvements - Mark 'rustc' (and others) invocations as recursive, which fixes a new warning and prepares us for the future in case we eventually take advantage of the Make jobserver 'kernel' crate: - Add the 'container_of!' macro - Stop using the unstable 'ptr_metadata' feature by employing the now stable 'byte_sub' method to implement 'Arc::from_raw()' - Add the 'time' module with a 'msecs_to_jiffies()' conversion function to begin with, to be used by Rust Binder - Add 'notify_sync()' and 'wait_interruptible_timeout()' methods to 'CondVar', to be used by Rust Binder - Update integer types for 'CondVar' - Rename 'wait_list' field to 'wait_queue_head' in 'CondVar' - Implement 'Display' and 'Debug' for 'BStr' - Add the 'try_from_foreign()' method to the 'ForeignOwnable' trait - Add reexports for macros so that they can be used from the right module (in addition to the root) - A series of code documentation improvements, including adding intra-doc links, consistency improvements, typo fixes... 'macros' crate: - Place generated 'init_module()' function in '.init.text' Documentation: - Add documentation on Rust doctests and how they work" * tag 'rust-6.9' of https://github.com/Rust-for-Linux/linux: (29 commits) rust: upgrade to Rust 1.76.0 kbuild: mark `rustc` (and others) invocations as recursive rust: add `container_of!` macro rust: str: implement `Display` and `Debug` for `BStr` rust: module: place generated init_module() function in .init.text rust: types: add `try_from_foreign()` method docs: rust: Add description of Rust documentation test as KUnit ones docs: rust: Move testing to a separate page rust: kernel: stop using ptr_metadata feature rust: kernel: add reexports for macros rust: locked_by: shorten doclink preview rust: kernel: remove unneeded doclink targets rust: kernel: add doclinks rust: kernel: add blank lines in front of code blocks rust: kernel: mark code fragments in docs with backticks rust: kernel: unify spelling of refcount in docs rust: str: move SAFETY comment in front of unsafe block rust: str: use `NUL` instead of 0 in doc comments rust: kernel: add srctree-relative doclinks rust: ioctl: end top-level module docs with full stop ...
2024-03-11Merge tag 'compiler-attributes-6.9' of https://github.com/ojeda/linuxLinus Torvalds
Pull compiler attributes update from Miguel Ojeda: "Trivial fixes to the __counted_by comments" * tag 'compiler-attributes-6.9' of https://github.com/ojeda/linux: Compiler Attributes: counted_by: fixup clang URL Compiler Attributes: counted_by: bump min gcc version
2024-03-11vfio/fsl-mc: Block calling interrupt handler without triggerAlex Williamson
The eventfd_ctx trigger pointer of the vfio_fsl_mc_irq object is initially NULL and may become NULL if the user sets the trigger eventfd to -1. The interrupt handler itself is guaranteed that trigger is always valid between request_irq() and free_irq(), but the loopback testing mechanisms to invoke the handler function need to test the trigger. The triggering and setting ioctl paths both make use of igate and are therefore mutually exclusive. The vfio-fsl-mc driver does not make use of irqfds, nor does it support any sort of masking operations, therefore unlike vfio-pci and vfio-platform, the flow can remain essentially unchanged. Cc: Diana Craciun <diana.craciun@oss.nxp.com> Cc: <stable@vger.kernel.org> Fixes: cc0ee20bd969 ("vfio/fsl-mc: trigger an interrupt via eventfd") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-8-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/platform: Create persistent IRQ handlersAlex Williamson
The vfio-platform SET_IRQS ioctl currently allows loopback triggering of an interrupt before a signaling eventfd has been configured by the user, which thereby allows a NULL pointer dereference. Rather than register the IRQ relative to a valid trigger, register all IRQs in a disabled state in the device open path. This allows mask operations on the IRQ to nest within the overall enable state governed by a valid eventfd signal. This decouples @masked, protected by the @locked spinlock from @trigger, protected via the @igate mutex. In doing so, it's guaranteed that changes to @trigger cannot race the IRQ handlers because the IRQ handler is synchronously disabled before modifying the trigger, and loopback triggering of the IRQ via ioctl is safe due to serialization with trigger changes via igate. For compatibility, request_irq() failures are maintained to be local to the SET_IRQS ioctl rather than a fatal error in the open device path. This allows, for example, a userspace driver with polling mode support to continue to work regardless of moving the request_irq() call site. This necessarily blocks all SET_IRQS access to the failed index. Cc: Eric Auger <eric.auger@redhat.com> Cc: <stable@vger.kernel.org> Fixes: 57f972e2b341 ("vfio/platform: trigger an interrupt via eventfd") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-7-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/platform: Disable virqfds on cleanupAlex Williamson
irqfds for mask and unmask that are not specifically disabled by the user are leaked. Remove any irqfds during cleanup Cc: Eric Auger <eric.auger@redhat.com> Cc: <stable@vger.kernel.org> Fixes: a7fa7c77cf15 ("vfio/platform: implement IRQ masking/unmasking via an eventfd") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-6-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pci: Create persistent INTx handlerAlex Williamson
A vulnerability exists where the eventfd for INTx signaling can be deconfigured, which unregisters the IRQ handler but still allows eventfds to be signaled with a NULL context through the SET_IRQS ioctl or through unmask irqfd if the device interrupt is pending. Ideally this could be solved with some additional locking; the igate mutex serializes the ioctl and config space accesses, and the interrupt handler is unregistered relative to the trigger, but the irqfd path runs asynchronous to those. The igate mutex cannot be acquired from the atomic context of the eventfd wake function. Disabling the irqfd relative to the eventfd registration is potentially incompatible with existing userspace. As a result, the solution implemented here moves configuration of the INTx interrupt handler to track the lifetime of the INTx context object and irq_type configuration, rather than registration of a particular trigger eventfd. Synchronization is added between the ioctl path and eventfd_signal() wrapper such that the eventfd trigger can be dynamically updated relative to in-flight interrupts or irqfd callbacks. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-5-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio: Introduce interface to flush virqfd inject workqueueAlex Williamson
In order to synchronize changes that can affect the thread callback, introduce an interface to force a flush of the inject workqueue. The irqfd pointer is only valid under spinlock, but the workqueue cannot be flushed under spinlock. Therefore the flush work for the irqfd is queued under spinlock. The vfio_irqfd_cleanup_wq workqueue is re-used for queuing this work such that flushing the workqueue is also ordered relative to shutdown. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-4-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pci: Lock external INTx masking opsAlex Williamson
Mask operations through config space changes to DisINTx may race INTx configuration changes via ioctl. Create wrappers that add locking for paths outside of the core interrupt code. In particular, irq_type is updated holding igate, therefore testing is_intx() requires holding igate. For example clearing DisINTx from config space can otherwise race changes of the interrupt configuration. This aligns interfaces which may trigger the INTx eventfd into two camps, one side serialized by igate and the other only enabled while INTx is configured. A subsequent patch introduces synchronization for the latter flows. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-3-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pci: Disable auto-enable of exclusive INTx IRQAlex Williamson
Currently for devices requiring masking at the irqchip for INTx, ie. devices without DisINTx support, the IRQ is enabled in request_irq() and subsequently disabled as necessary to align with the masked status flag. This presents a window where the interrupt could fire between these events, resulting in the IRQ incrementing the disable depth twice. This would be unrecoverable for a user since the masked flag prevents nested enables through vfio. Instead, invert the logic using IRQF_NO_AUTOEN such that exclusive INTx is never auto-enabled, then unmask as required. Cc: <stable@vger.kernel.org> Fixes: 89e1f7d4c66d ("vfio: Add PCI device driver") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20240308230557.805580-2-alex.williamson@redhat.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11Merge tag 'rcu.next.v6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux Pull RCU updates from Boqun Feng: - Eliminate deadlocks involving do_exit() and RCU tasks, by Paul: Instead of SRCU read side critical sections, now a percpu list is used in do_exit() for scaning yet-to-exit tasks - Fix a deadlock due to the dependency between workqueue and RCU expedited grace period, reported by Anna-Maria Behnsen and Thomas Gleixner and fixed by Frederic: Now RCU expedited always uses its own kthread worker instead of a workqueue - RCU NOCB updates, code cleanups, unnecessary barrier removals and minor bug fixes - Maintain real-time response in rcu_tasks_postscan() and a minor fix for tasks trace quiescence check - Misc updates, comments and readibility improvement, boot time parameter for lazy RCU and rcutorture improvement - Documentation updates * tag 'rcu.next.v6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/boqun/linux: (34 commits) rcu-tasks: Maintain real-time response in rcu_tasks_postscan() rcu-tasks: Eliminate deadlocks involving do_exit() and RCU tasks rcu-tasks: Maintain lists to eliminate RCU-tasks/do_exit() deadlocks rcu-tasks: Initialize data to eliminate RCU-tasks/do_exit() deadlocks rcu-tasks: Initialize callback lists at rcu_init() time rcu-tasks: Add data to eliminate RCU-tasks/do_exit() deadlocks rcu-tasks: Repair RCU Tasks Trace quiescence check rcu/sync: remove un-used rcu_sync_enter_start function rcutorture: Suppress rtort_pipe_count warnings until after stalls srcu: Improve comments about acceleration leak rcu: Provide a boot time parameter to control lazy RCU rcu: Rename jiffies_till_flush to jiffies_lazy_flush doc: Update checklist.rst discussion of callback execution doc: Clarify use of slab constructors and SLAB_TYPESAFE_BY_RCU context_tracking: Fix kerneldoc headers for __ct_user_{enter,exit}() doc: Add EARLY flag to early-parsed kernel boot parameters doc: Add CONFIG_RCU_STRICT_GRACE_PERIOD to checklist.rst doc: Make checklist.rst note that spinlocks are implied RCU readers doc: Make whatisRCU.rst note that spinlocks are RCU readers doc: Spinlocks are implied RCU readers ...
2024-03-11Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
2024-03-11Revert "ARM64: Dynamically allocate cpumasks and increase supported CPUs to 512"Catalin Marinas
This reverts commit 0499a78369adacec1af29340b71ff8dd375b4697. Enabling CPUMASK_OFFSTACK on arm64 triggers a warning in the dev_pm_opp_set_config() function followed by a failure to set the regulators and cpufreq-dt probing error. There is no apparent reason why this happens, so revert this commit until further investigation. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/c1f2902d-cefc-4122-9b86-d1d32911f590@samsung.com
2024-03-11Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Make running of task_work internal loops more fair, and unify how the different methods deal with them (me) - Support for per-ring NAPI. The two minor networking patches are in a shared branch with netdev (Stefan) - Add support for truncate (Tony) - Export SQPOLL utilization stats (Xiaobing) - Multishot fixes (Pavel) - Fix for a race in manipulating the request flags via poll (Pavel) - Cleanup the multishot checking by making it generic, moving it out of opcode handlers (Pavel) - Various tweaks and cleanups (me, Kunwu, Alexander) * tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits) io_uring: Fix sqpoll utilization check racing with dying sqpoll io_uring/net: dedup io_recv_finish req completion io_uring: refactor DEFER_TASKRUN multishot checks io_uring: fix mshot io-wq checks io_uring/net: add io_req_msg_cleanup() helper io_uring/net: simplify msghd->msg_inq checking io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io io_uring/net: correctly handle multishot recvmsg retry setup io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler io_uring: fix io_queue_proc modifying req->flags io_uring: fix mshot read defer taskrun cqe posting io_uring/net: fix overflow check in io_recvmsg_mshot_prep() io_uring/net: correct the type of variable io_uring/sqpoll: statistics of the true utilization of sq threads io_uring/net: move recv/recvmsg flags out of retry loop io_uring/kbuf: flag request if buffer pool is empty after buffer pick io_uring/net: improve the usercopy for sendmsg/recvmsg io_uring/net: move receive multishot out of the generic msghdr path io_uring/net: unify how recvmsg and sendmsg copy in the msghdr ...
2024-03-11vfio/pds: Refactor/simplify reset logicBrett Creeley
The current logic for handling resets is more complicated than it needs to be. The deferred_reset flag is used to indicate a reset is needed and the deferred_reset_state is the requested, post-reset, state. Also, the deferred_reset logic was added to vfio migration drivers to prevent a circular locking dependency with respect to mm_lock and state mutex. This is mainly because of the copy_to/from_user() functions(which takes mm_lock) invoked under state mutex. Remove all of the deferred reset logic and just pass the requested next state to pds_vfio_reset() so it can be used for VMM and DSC initiated resets. This removes the need for pds_vfio_state_mutex_lock(), so remove that and replace its use with a simple mutex_unlock(). Also, remove the reset_mutex as it's no longer needed since the state_mutex can be the driver's primary protector. Suggested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20240308182149.22036-3-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/pds: Make sure migration file isn't accessed after resetBrett Creeley
It's possible the migration file is accessed after reset when it has been cleaned up, especially when it's initiated by the device. This is because the driver doesn't rip out the filep when cleaning up it only frees the related page structures and sets its local struct pds_vfio_lm_file pointer to NULL. This can cause a NULL pointer dereference, which is shown in the example below during a restore after a device initiated reset: BUG: kernel NULL pointer dereference, address: 000000000000000c PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:pds_vfio_get_file_page+0x5d/0xf0 [pds_vfio_pci] [...] Call Trace: <TASK> pds_vfio_restore_write+0xf6/0x160 [pds_vfio_pci] vfs_write+0xc9/0x3f0 ? __fget_light+0xc9/0x110 ksys_write+0xb5/0xf0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] Add a disabled flag to the driver's struct pds_vfio_lm_file that gets set during cleanup. Then make sure to check the flag when the migration file is accessed via its file_operations. By default this flag will be false as the memory for struct pds_vfio_lm_file is kzalloc'd, which means the struct pds_vfio_lm_file is enabled and accessible. Also, since the file_operations and driver's migration file cleanup happen under the protection of the same pds_vfio_lm_file.lock, using this flag is thread safe. Fixes: 8512ed256334 ("vfio/pds: Always clear the save/restore FDs on reset") Reviewed-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Brett Creeley <brett.creeley@amd.com> Link: https://lore.kernel.org/r/20240308182149.22036-2-brett.creeley@amd.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/platform: Convert to platform remove callback returning voidUwe Kleine-König
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/79d3df42fe5b359a05b8061631e72e5ed249b234.1709886922.git.u.kleine-koenig@pengutronix.de Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2024-03-11vfio/mlx5: Enforce PRE_COPY supportYishai Hadas
Enable live migration only once the firmware supports PRE_COPY. PRE_COPY has been supported by the firmware for a long time already [1] and is required to achieve a low downtime upon live migration. This lets us clean up some old code that is not applicable those days while PRE_COPY is fully supported by the firmware. [1] The minimum firmware version that supports PRE_COPY is 28.36.1010, it was released in January 2023. No firmware without PRE_COPY support ever available to users. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20240306105624.114830-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>