summaryrefslogtreecommitdiff
path: root/drivers/infiniband
AgeCommit message (Collapse)Author
2025-07-01RDMA/mlx5: Check CAP_NET_RAW in user namespace for anchor createParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the anchor. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/c2376ca75e7658e2cbd1f619cf28fbe98c906419.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01RDMA/mlx5: Check CAP_NET_RAW in user namespace for flow createParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the flow. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 322694412400 ("IB/mlx5: Introduce driver create and destroy flow methods") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/a4dcd5e3ac6904ef50b19e56942ca6ab0728794c.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow createParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the flow resource. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs") Signed-off-by: Parav Pandit <parav@nvidia.com> Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://patch.msgid.link/6df6f2f24627874c4f6d041c19dc1f6f29f68f84.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA/ipoib: Use parent rdma device net namespaceMark Bloch
Use the net namespace of the underlying rdma device. After honoring the rdma device's namespace, the ipoib netdev now also runs in the same net namespace of the rdma device. Add an API to read the net namespace of the rdma device so that ULP such as IPoIB can use it to initialize its netdev. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26RDMA/mlx5: Allocate IB device with net namespace supplied from core devMark Bloch
Use the new ib_alloc_device_with_net() API to allocate the IB device so that it is properly bound to the network namespace obtained via mlx5_core_net(). This change ensures correct namespace association (e.g., for containerized setups). Additionally, expose mlx5_core_net so that RDMA driver can use it. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26RDMA/core: Extend RDMA device registration to be net namespace awareMark Bloch
Presently, RDMA devices are always registered within the init network namespace, even if the associated devlink device's namespace was changed via a devlink reload. This mismatch leads to discrepancies between the network namespace of the devlink device and that of the RDMA device. Therefore, extend the RDMA device allocation API to optionally take the net namespace. This isn't limited to devices that support devlink but allows all users to provide the network namespace if they need to do so. If a network namespace is provided during device allocation, it's up to the caller to make sure the namespace stays valid until ib_register_device() is called. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26RDMA/rxe: Fix a couple IS_ERR() vs NULL bugsDan Carpenter
The lookup_mr() function returns NULL on error. It never returns error pointers. Fixes: 9284bc34c773 ("RDMA/rxe: Enable asynchronous prefetch for ODP MRs") Fixes: 3576b0df1588 ("RDMA/rxe: Implement synchronous prefetch for ODP MRs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/685c1430.050a0220.18b0ef.da83@mx.google.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA/siw: work around clang stack size warningArnd Bergmann
clang inlines a lot of functions into siw_qp_sq_process(), with the aggregate stack frame blowing the warning limit in some configurations: drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than] The real problem here is the array of kvec structures in siw_tx_hdt that makes up the majority of the consumed stack space. Ideally there would be a way to avoid allocating the array on the stack, but that would require a larger rework. Add a noinline_for_stack annotation to avoid the warning for now, and make clang behave the same way as gcc here. The combined stack usage is still similar, but is spread over multiple functions now. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250620114332.4072051-1-arnd@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMI: hfi1: drop cpumask_empty() call in hfi1/affinity.cYury Norov [NVIDIA]
In few places, the driver tests a cpumask for emptiness immediately before calling functions that report emptiness themself. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-8-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA: hfi1: simplify hfi1_get_proc_affinity()Yury Norov [NVIDIA]
The function protects the for loop with affinity->num_core_siblings > 0 condition, which is redundant because the loop will break immediately in that case. Drop it and save one indentation level. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-7-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA: hfi1: use rounddown in find_hw_thread_mask()Yury Norov [NVIDIA]
num_cores_per_socket is calculated by dividing by node_affinity.num_online_nodes, but all users of this variable multiply it by node_affinity.num_online_nodes again. This effectively is the same as rounding it down by node_affinity.num_online_nodes. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-6-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA: hfi1: simplify init_real_cpu_mask()Yury Norov [NVIDIA]
The function opencodes cpumask_nth() and cpumask_clear_cpus(). The dedicated helpers are easier to use and usually much faster than opencoded for-loops. While there, drop useless clear of real_cpu_mask. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-5-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA: hfi1: simplify find_hw_thread_mask()Yury Norov [NVIDIA]
The function opencodes cpumask_nth() and cpumask_clear_cpus(). The dedicated helpers are easier to use and usually much faster than opencoded for-loops. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-4-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()Yury Norov [NVIDIA]
The function divides number of online CPUs by num_core_siblings, and later checks the divider by zero. This implies a possibility to get and divide-by-zero runtime error. Fix it by moving the check prior to division. This also helps to save one indentation level. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-3-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/core: reduce stack using in nldev_stat_get_doit()Arnd Bergmann
In the s390 defconfig, gcc-10 and earlier end up inlining three functions into nldev_stat_get_doit(), and each of them uses some 600 bytes of stack. The result is a function with an overly large stack frame and a warning: drivers/infiniband/core/nldev.c:2466:1: error: the frame size of 1720 bytes is larger than 1280 bytes [-Werror=frame-larger-than=] Mark the three functions noinline_for_stack to prevent this, ensuring that only one copy of the nlattr array is on the stack of each function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250620113335.3776965-1-arnd@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Add multiple priorities support to RDMA TRANSPORT userspace tablesPatrisious Haddad
Support the creation of RDMA TRANSPORT tables over multiple priorities via matcher creation. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/bb38e50ae4504e979c6568d41939402a4cf15635.1750148083.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Support driver APIs pre_destroy_cq and post_destroy_cqMark Zhang
- pre_destroy_cq: Destroy FW CQ object so that no new CQ event would be generated; - post_destroy_cq: Release all resources. This patch, along with last one, fixes the crash below. Unable to handle kernel paging request at virtual address ffff8000114b1180 Mem abort info: ESR = 0x96000047 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000047 CM = 0, WnR = 1 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f4582000 [ffff8000114b1180] pgd=00000447fffff003, p4d=00000447fffff003, pud=00000447ffffe003, pmd=00000447ffffb003, pte=0000000000000000 Internal error: Oops: 96000047 [#1] SMP Modules linked in: udp_diag uio_pci_generic uio tcp_diag inet_diag binfmt_misc sn_core_odd(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) kpatch_9658536(OK) kpatch_9322385(OK) kpatch_8843421(OK) kpatch_8636216(OK) vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sm4_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt sg acpi_ipmi ipmi_si ipmi_msghandler m1_uncore_ddrss_pmu m1_uncore_cmn_pmu team_yosemite9rc6(OE) vnic(OE) ip_tables mlx5_ib(OE) sd_mod ast mlx5_core(OE) i2c_algo_bit drm_vram_helper psample drm_kms_helper mlxdevm(OE) auxiliary(OE) mlxfw(OE) syscopyarea sysfillrect tls sysimgblt fb_sys_fops drm_ttm_helper nvme ttm nvme_core drm t10_pi i2c_designware_platform i2c_designware_core i2c_core ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [last unloaded: ipmi_devintf] CPU: 83 PID: 59375 Comm: kworker/u253:1 Kdump: loaded Tainted: G OE K 5.10.84-004.ali5000.alios7.aarch64 #1 Hardware name: Inspur AliServer-Xuanwu2.0AM-02-2UM1P-5B/AS1221MG1, BIOS 1.2.M1.AL.P.158.00 08/31/2023 Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core] pstate: 82c00089 (Nzcv daIf +PAN +UAO +TCO BTYPE=--) pc : native_queued_spin_lock_slowpath+0x1c4/0x31c lr : mlx5_ib_poll_cq+0x18c/0x2f8 [mlx5_ib] sp : ffff80002be1bc80 x29: ffff80002be1bc80 x28: ffff000810e69000 x27: ffff000810e69000 x26: ffff000810e69200 x25: 0000000000000000 x24: ffff8000117db000 x23: ffff04000156b780 x22: 0000000000000000 x21: ffff04000ce6c160 x20: ffff0008196a4000 x19: 0000000000000010 x18: 0000000000000020 x17: 0000000000000000 x16: 0000000000000000 x15: ffff040055a364e8 x14: ffffffffffffffff x13: ffff80002318bda8 x12: ffff0400358836e8 x11: 0000000000000040 x10: 0000000000000eb0 x9 : 0000000000000000 x8 : 0000000000000000 x7 : ffff04477fa20140 x6 : ffff8000114b1140 x5 : ffff04477fa20140 x4 : ffff8000114b1180 x3 : ffff000810e69200 x2 : ffff8000114b1180 x1 : 0000000001500000 x0 : ffff04477fa20148 Call trace: native_queued_spin_lock_slowpath+0x1c4/0x31c __ib_process_cq+0x74/0x1b8 [ib_core] ib_cq_poll_work+0x34/0xa0 [ib_core] process_one_work+0x1d8/0x4b0 worker_thread+0x180/0x440 kthread+0x114/0x120 Code: 910020e0 8b0400c4 f862d929 aa0403e2 (f8296847) ---[ end trace 387be2290557729c ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x9850817,7a60aa38 Memory Limit: none Starting crashdump kernel... Bye! Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/aaf0072f350d1c7e8731f43b79e11a560bafb9e0.1750070205.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/core: Add driver APIs pre_destroy_cq() and post_destroy_cq()Mark Zhang
Currently in ib_free_cq, it disables IRQ or cancel the CQ work before driver destroy_cq. This isn't good as a new IRQ or a CQ work can be submitted immediately after disabling IRQ or canceling CQ work, which may run concurrently with destroy_cq and cause crashes. The right flow should be: 1. Driver disables CQ to make sure no new CQ event will be submitted; 2. Disables IRQ or Cancels CQ work in core layer, to make sure no CQ polling work is running; 3. Free all resources to destroy the CQ. This patch adds 2 driver APIs: - pre_destroy_cq(): Disable a CQ to prevent it from generating any new work completions, but not free any kernel resources; - post_destroy_cq(): Free all kernel resources. In ib_free_cq, the IRQ is disabled or CQ work is canceled after pre_destroy_cq, and before post_destroy_cq. Fixes: 14d3a3b2498e ("IB: add a proper completion queue abstraction") Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/b5f7ae3d75f44a3e15ff3f4eb2bbdea13e06b97f.1750062328.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix vport loopback for MPV devicePatrisious Haddad
Always enable vport loopback for both MPV devices on driver start. Previously in some cases related to MPV RoCE, packets weren't correctly executing loopback check at vport in FW, since it was disabled. Due to complexity of identifying such cases for MPV always enable vport loopback for both GVMIs when binding the slave to the master port. Fixes: 0042f9e458a5 ("RDMA/mlx5: Enable vport loopback when user context or QP mandate") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/d4298f5ebb2197459e9e7221c51ecd6a34699847.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix CC counters query for MPVPatrisious Haddad
In case, CC counters are querying for the second port use the correct core device for the query instead of always using the master core device. Fixes: aac4492ef23a ("IB/mlx5: Update counter implementation for dual port RoCE") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9cace74dcf106116118bebfa9146d40d4166c6b0.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25RDMA/mlx5: Fix HW counters query for non-representor devicesPatrisious Haddad
To get the device HW counters, a non-representor switchdev device should use the mlx5_ib_query_q_counters() function and query all of the available counters. While a representor device in switchdev mode should use the mlx5_ib_query_q_counters_vport() function and query only the Q_Counters without the PPCNT counters and congestion control counters, since they aren't relevant for a representor device. Currently a non-representor switchdev device skips querying the PPCNT counters and congestion control counters, leaving them unupdated. Fix that by properly querying those counters for non-representor devices. Fixes: d22467a71ebe ("RDMA/mlx5: Expand switchdev Q-counters to expose representor statistics") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/56bf8af4ca8c58e3fb9f7e47b1dca2009eeeed81.1750064969.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25IB/core: Annotate umem_mutex acquisition under fs_reclaim for lockdepOr Har-Toov
Following the fix in the previous commit ("IB/mlx5: Fix potential deadlock in MR deregistration"), teach lockdep explicitly about the locking order between fs_reclaim and umem_mutex. The previous commit resolved a potential deadlock scenario where kzalloc(GFP_KERNEL) was called while holding umem_mutex, which could lead to reclaim and eventually invoke the MMU notifier (mlx5_ib_invalidate_range()), causing a recursive acquisition of umem_mutex. To prevent such issues from reoccurring unnoticed in future code changes, add a lockdep annotation in ib_init_umem_odp() that simulates taking umem_mutex inside a reclaim context. This makes lockdep aware of this locking dependency and ensures that future violations—such as calling kzalloc() or any memory allocator that may enter reclaim while holding umem_mutex—will immediately raise a lockdep warning. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9d31b9d8fe1db648a9f47cec3df6b8463319dee5.1750061698.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25IB/mlx5: Fix potential deadlock in MR deregistrationOr Har-Toov
The issue arises when kzalloc() is invoked while holding umem_mutex or any other lock acquired under umem_mutex. This is problematic because kzalloc() can trigger fs_reclaim_aqcuire(), which may, in turn, invoke mmu_notifier_invalidate_range_start(). This function can lead to mlx5_ib_invalidate_range(), which attempts to acquire umem_mutex again, resulting in a deadlock. The problematic flow: CPU0 | CPU1 ---------------------------------------|------------------------------------------------ mlx5_ib_dereg_mr() | → revoke_mr() | → mutex_lock(&umem_odp->umem_mutex) | | mlx5_mkey_cache_init() | → mutex_lock(&dev->cache.rb_lock) | → mlx5r_cache_create_ent_locked() | → kzalloc(GFP_KERNEL) | → fs_reclaim() | → mmu_notifier_invalidate_range_start() | → mlx5_ib_invalidate_range() | → mutex_lock(&umem_odp->umem_mutex) → cache_ent_find_and_store() | → mutex_lock(&dev->cache.rb_lock) | Additionally, when kzalloc() is called from within cache_ent_find_and_store(), we encounter the same deadlock due to re-acquisition of umem_mutex. Solve by releasing umem_mutex in dereg_mr() after umr_revoke_mr() and before acquiring rb_lock. This ensures that we don't hold umem_mutex while performing memory allocations that could trigger the reclaim path. This change prevents the deadlock by ensuring proper lock ordering and avoiding holding locks during memory allocation operations that could trigger the reclaim path. The following lockdep warning demonstrates the deadlock: python3/20557 is trying to acquire lock: ffff888387542128 (&umem_odp->umem_mutex){+.+.}-{4:4}, at: mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib] but task is already holding lock: ffffffff82f6b840 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: unmap_vmas+0x7b/0x1a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}: fs_reclaim_acquire+0x60/0xd0 mem_cgroup_css_alloc+0x6f/0x9b0 cgroup_init_subsys+0xa4/0x240 cgroup_init+0x1c8/0x510 start_kernel+0x747/0x760 x86_64_start_reservations+0x25/0x30 x86_64_start_kernel+0x73/0x80 common_startup_64+0x129/0x138 -> #2 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x91/0xd0 __kmalloc_cache_noprof+0x4d/0x4c0 mlx5r_cache_create_ent_locked+0x75/0x620 [mlx5_ib] mlx5_mkey_cache_init+0x186/0x360 [mlx5_ib] mlx5_ib_stage_post_ib_reg_umr_init+0x3c/0x60 [mlx5_ib] __mlx5_ib_add+0x4b/0x190 [mlx5_ib] mlx5r_probe+0xd9/0x320 [mlx5_ib] auxiliary_bus_probe+0x42/0x70 really_probe+0xdb/0x360 __driver_probe_device+0x8f/0x130 driver_probe_device+0x1f/0xb0 __driver_attach+0xd4/0x1f0 bus_for_each_dev+0x79/0xd0 bus_add_driver+0xf0/0x200 driver_register+0x6e/0xc0 __auxiliary_driver_register+0x6a/0xc0 do_one_initcall+0x5e/0x390 do_init_module+0x88/0x240 init_module_from_file+0x85/0xc0 idempotent_init_module+0x104/0x300 __x64_sys_finit_module+0x68/0xc0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #1 (&dev->cache.rb_lock){+.+.}-{4:4}: __mutex_lock+0x98/0xf10 __mlx5_ib_dereg_mr+0x6f2/0x890 [mlx5_ib] mlx5_ib_dereg_mr+0x21/0x110 [mlx5_ib] ib_dereg_mr_user+0x85/0x1f0 [ib_core] uverbs_free_mr+0x19/0x30 [ib_uverbs] destroy_hw_idr_uobject+0x21/0x80 [ib_uverbs] uverbs_destroy_uobject+0x60/0x3d0 [ib_uverbs] uobj_destroy+0x57/0xa0 [ib_uverbs] ib_uverbs_cmd_verbs+0x4d5/0x1210 [ib_uverbs] ib_uverbs_ioctl+0x129/0x230 [ib_uverbs] __x64_sys_ioctl+0x596/0xaa0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 -> #0 (&umem_odp->umem_mutex){+.+.}-{4:4}: __lock_acquire+0x1826/0x2f00 lock_acquire+0xd3/0x2e0 __mutex_lock+0x98/0xf10 mlx5_ib_invalidate_range+0x5b/0x550 [mlx5_ib] __mmu_notifier_invalidate_range_start+0x18e/0x1f0 unmap_vmas+0x182/0x1a0 exit_mmap+0xf3/0x4a0 mmput+0x3a/0x100 do_exit+0x2b9/0xa90 do_group_exit+0x32/0xa0 get_signal+0xc32/0xcb0 arch_do_signal_or_restart+0x29/0x1d0 syscall_exit_to_user_mode+0x105/0x1d0 do_syscall_64+0x79/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 Chain exists of: &dev->cache.rb_lock --> mmu_notifier_invalidate_range_start --> &umem_odp->umem_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&umem_odp->umem_mutex); lock(mmu_notifier_invalidate_range_start); lock(&umem_odp->umem_mutex); lock(&dev->cache.rb_lock); *** DEADLOCK *** Fixes: abb604a1a9c8 ("RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error") Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/3c8f225a8a9fade647d19b014df1172544643e4a.1750061612.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-24scsi: RDMA/srp: Don't set a max_segment_size when virt_boundary_mask is setChristoph Hellwig
virt_boundary_mask implies an unlimited max_segment_size. Setting both can lead to data corruption because __blk_rq_map_sg() can split requests so that the virt_boundary_mask is not respected if max_segment_size is not UINT_MAX. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250624125233.219635-2-hch@lst.de Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Acked-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2025-06-17RDMA/mlx5: Initialize obj_event->obj_sub_list before xa_insertMark Zhang
The obj_event may be loaded immediately after inserted, then if the list_head is not initialized then we may get a poisonous pointer. This fixes the crash below: mlx5_core 0000:03:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced) mlx5_core.sf mlx5_core.sf.4: firmware version: 32.38.3056 mlx5_core 0000:03:00.0 en3f0pf0sf2002: renamed from eth0 mlx5_core.sf mlx5_core.sf.4: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps IPv6: ADDRCONF(NETDEV_CHANGE): en3f0pf0sf2002: link becomes ready Unable to handle kernel NULL pointer dereference at virtual address 0000000000000060 Mem abort info: ESR = 0x96000006 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000006 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000007760fb000 [0000000000000060] pgd=000000076f6d7003, p4d=000000076f6d7003, pud=0000000777841003, pmd=0000000000000000 Internal error: Oops: 96000006 [#1] SMP Modules linked in: ipmb_host(OE) act_mirred(E) cls_flower(E) sch_ingress(E) mptcp_diag(E) udp_diag(E) raw_diag(E) unix_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) bonding(OE) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) isofs(E) cdrom(E) mst_pciconf(OE) ib_umad(OE) mlx5_ib(OE) ipmb_dev_int(OE) mlx5_core(OE) kpatch_15237886(OEK) mlxdevm(OE) auxiliary(OE) ib_uverbs(OE) ib_core(OE) psample(E) mlxfw(OE) tls(E) sunrpc(E) vfat(E) fat(E) crct10dif_ce(E) ghash_ce(E) sha1_ce(E) sbsa_gwdt(E) virtio_console(E) ext4(E) mbcache(E) jbd2(E) xfs(E) libcrc32c(E) mmc_block(E) virtio_net(E) net_failover(E) failover(E) sha2_ce(E) sha256_arm64(E) nvme(OE) nvme_core(OE) gpio_mlxbf3(OE) mlx_compat(OE) mlxbf_pmc(OE) i2c_mlxbf(OE) sdhci_of_dwcmshc(OE) pinctrl_mlxbf3(OE) mlxbf_pka(OE) gpio_generic(E) i2c_core(E) mmc_core(E) mlxbf_gige(OE) vitesse(E) pwr_mlxbf(OE) mlxbf_tmfifo(OE) micrel(E) mlxbf_bootctl(OE) virtio_ring(E) virtio(E) ipmi_devintf(E) ipmi_msghandler(E) [last unloaded: mst_pci] CPU: 11 PID: 20913 Comm: rte-worker-11 Kdump: loaded Tainted: G OE K 5.10.134-13.1.an8.aarch64 #1 Hardware name: https://www.mellanox.com BlueField-3 SmartNIC Main Card/BlueField-3 SmartNIC Main Card, BIOS 4.2.2.12968 Oct 26 2023 pstate: a0400089 (NzCv daIf +PAN -UAO -TCO BTYPE=--) pc : dispatch_event_fd+0x68/0x300 [mlx5_ib] lr : devx_event_notifier+0xcc/0x228 [mlx5_ib] sp : ffff80001005bcf0 x29: ffff80001005bcf0 x28: 0000000000000001 x27: ffff244e0740a1d8 x26: ffff244e0740a1d0 x25: ffffda56beff5ae0 x24: ffffda56bf911618 x23: ffff244e0596a480 x22: ffff244e0596a480 x21: ffff244d8312ad90 x20: ffff244e0596a480 x19: fffffffffffffff0 x18: 0000000000000000 x17: 0000000000000000 x16: ffffda56be66d620 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000040 x10: ffffda56bfcafb50 x9 : ffffda5655c25f2c x8 : 0000000000000010 x7 : 0000000000000000 x6 : ffff24545a2e24b8 x5 : 0000000000000003 x4 : ffff80001005bd28 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff244e0596a480 x0 : ffff244d8312ad90 Call trace: dispatch_event_fd+0x68/0x300 [mlx5_ib] devx_event_notifier+0xcc/0x228 [mlx5_ib] atomic_notifier_call_chain+0x58/0x80 mlx5_eq_async_int+0x148/0x2b0 [mlx5_core] atomic_notifier_call_chain+0x58/0x80 irq_int_handler+0x20/0x30 [mlx5_core] __handle_irq_event_percpu+0x60/0x220 handle_irq_event_percpu+0x3c/0x90 handle_irq_event+0x58/0x158 handle_fasteoi_irq+0xfc/0x188 generic_handle_irq+0x34/0x48 ... Fixes: 759738537142 ("IB/mlx5: Enable subscription for device events over DEVX") Link: https://patch.msgid.link/r/3ce7f20e0d1a03dc7de6e57494ec4b8eaf1f05c2.1750147949.git.leon@kernel.org Signed-off-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17RDMA/core: Rate limit GID cache warning messagesMaor Gottlieb
The GID cache warning messages can flood the kernel log when there are multiple failed attempts to add GIDs. This can happen when creating many virtual interfaces without having enough space for their GIDs in the GID table. Change pr_warn to pr_warn_ratelimited to prevent log flooding while still maintaining visibility of the issue. Link: https://patch.msgid.link/r/fd45ed4a1078e743f498b234c3ae816610ba1b18.1750062357.git.leon@kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17RDMA/mlx5: Fix unsafe xarray access in implicit ODP handlingOr Har-Toov
__xa_store() and __xa_erase() were used without holding the proper lock, which led to a lockdep warning due to unsafe RCU usage. This patch replaces them with xa_store() and xa_erase(), which perform the necessary locking internally. ============================= WARNING: suspicious RCPU usage 6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1 Not tainted ----------------------------- ./include/linux/xarray.h:1211 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by kworker/u136:0/219: at: process_one_work+0xbe4/0x15f0 process_one_work+0x75c/0x15f0 pagefault_mr+0x9a5/0x1390 [mlx5_ib] stack backtrace: CPU: 14 UID: 0 PID: 219 Comm: kworker/u136:0 Not tainted 6.14.0-rc7_for_upstream_debug_2025_03_18_15_01 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: mlx5_ib_page_fault mlx5_ib_eqe_pf_action [mlx5_ib] Call Trace: dump_stack_lvl+0xa8/0xc0 lockdep_rcu_suspicious+0x1e6/0x260 xas_create+0xb8a/0xee0 xas_store+0x73/0x14c0 __xa_store+0x13c/0x220 ? xa_store_range+0x390/0x390 ? spin_bug+0x1d0/0x1d0 pagefault_mr+0xcb5/0x1390 [mlx5_ib] ? _raw_spin_unlock+0x1f/0x30 mlx5_ib_eqe_pf_action+0x3be/0x2620 [mlx5_ib] ? lockdep_hardirqs_on_prepare+0x400/0x400 ? mlx5_ib_invalidate_range+0xcb0/0xcb0 [mlx5_ib] process_one_work+0x7db/0x15f0 ? pwq_dec_nr_in_flight+0xda0/0xda0 ? assign_work+0x168/0x240 worker_thread+0x57d/0xcd0 ? rescuer_thread+0xc40/0xc40 kthread+0x3b3/0x800 ? kthread_is_per_cpu+0xb0/0xb0 ? lock_downgrade+0x680/0x680 ? do_raw_spin_lock+0x12d/0x270 ? spin_bug+0x1d0/0x1d0 ? finish_task_switch.isra.0+0x284/0x9e0 ? lockdep_hardirqs_on_prepare+0x284/0x400 ? kthread_is_per_cpu+0xb0/0xb0 ret_from_fork+0x2d/0x70 ? kthread_is_per_cpu+0xb0/0xb0 ret_from_fork_asm+0x11/0x20 Fixes: d3d930411ce3 ("RDMA/mlx5: Fix implicit ODP use after free") Link: https://patch.msgid.link/r/a85ddd16f45c8cb2bc0a188c2b0fcedfce975eb8.1750061791.git.leon@kernel.org Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17RDMA/qib: Remove outdated driverDennis Dalessandro
The qib Infiniband device has long been decommissioned from distros and unsupported by the company. We have kept it going in a compile only mode for a while now and it's time to remove the driver. It has a number of issues that are never going to be addressed and is a hindrance to improving hfi1 and rdmavt. Link: https://patch.msgid.link/r/174836076755.2436819.5981097575800950899.stgit@awdrv-04.cornelisnetworks.com Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-17sysfs: treewide: switch back to attribute_group::bin_attrsThomas Weißschuh
The normal bin_attrs field can now handle const pointers. This makes the _new variant unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-4-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-12RDMA/rxe: Remove redundant page presence checkDaisuke Matsuda
hmm_pfn_to_page() does not return NULL. ib_umem_odp_map_dma_and_lock() should return an error in case the target pages cannot be mapped until timeout, so these checks can safely be removed. Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250611162758.10000-1-dskmtsd@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12RDMA/mana_ib: Add device statistics supportShiraz Saleem
Add support for mana device level statistics. Co-developed-by: Solom Tamawy <solom.tamawy@microsoft.com> Signed-off-by: Solom Tamawy <solom.tamawy@microsoft.com> Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com> Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1749559717-3424-1-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanupArnd Bergmann
This function has an array of eight mlx5_async_cmd structures, which often fits on the stack, but depending on the configuration can end up blowing the stack frame warning limit: drivers/infiniband/hw/mlx5/devx.c:2670:6: error: stack frame size (1392) exceeds limit (1280) in 'mlx5_ib_ufile_hw_cleanup' [-Werror,-Wframe-larger-than] Change this to a dynamic allocation instead. While a kmalloc() can theoretically fail, a GFP_KERNEL allocation under a page will block until memory has been freed up, so in the worst case, this only adds extra time in an already constrained environment. Fixes: 7c891a4dbcc1 ("RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250610092846.2642535-1-arnd@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12RDMA/cxgb4: Delete an unnecessary check before kfree() in c4iw_rdev_open()Markus Elfring
It can be known that the function “kfree” performs a null pointer check for its input parameter. It is therefore not needed to repeat such a check before its call. Thus remove a redundant pointer check. The source code was transformed by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Link: https://patch.msgid.link/cdc577a5-cebd-404a-b762-cc6fee0870dc@web.de Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12IB/iser: Remove unnecessary local variableLi Jun
The 'error' variable is no longer needed,as iscsi_iser_mtask_xmit can return the result of iser_send_control(conn,task) directly. Signed-off-by: Li Jun <lijun01@kylinos.cn> Link: https://patch.msgid.link/20250604102049.130039-1-lijun01@kylinos.cn Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12RDMA/hns: Remove MW supportJunxian Huang
MW is no longer supported in hns. Delete relevant codes. Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250605024917.1132393-1-huangjunxian6@hisilicon.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12RDMA/hns: ZERO_OR_NULL_PTR macro overdetectionluoqing
sizeof(xx) these variable values' return values cannot be 0. For memory allocation requests of non-zero length, there is no need to check other return values; it is sufficient to only verify that it is not null. Signed-off-by: luoqing <luoqing@kylinos.cn> Link: https://patch.msgid.link/20250605034918.242682-1-l1138897701@163.com Reviewed-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12RDMA/rxe: Enable asynchronous prefetch for ODP MRsDaisuke Matsuda
Calling ibv_advise_mr(3) with flags other than IBV_ADVISE_MR_FLAG_FLUSH invokes an asynchronous request. It is best-effort, and thus can safely be deferred to the system-wide workqueue. The reference counter in rxe_mr is used to ensure that the MRs persist and that rxe is not terminated until the queued work is done. Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250522111955.3227-3-dskmtsd@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-12RDMA/rxe: Implement synchronous prefetch for ODP MRsDaisuke Matsuda
Minimal implementation of ibv_advise_mr(3) requires synchronous calls being successful with the IBV_ADVISE_MR_FLAG_FLUSH flag. Asynchronous requests, which are best-effort, will be supported subsequently. Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250522111955.3227-2-dskmtsd@gmail.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-08treewide, timers: Rename from_timer() to timer_container_of()Ingo Molnar
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Usual collection of driver fixes: - Small bug fixes and cleansup in hfi, hns, rxe, mlx5, mana siw - Further ODP functionality in rxe - Remote access MRs in mana, along with more page sizes - Improve CM scalability with a rwlock around the agent - More trace points for hns - ODP hmm conversion to the new two step dma API - Support the ethernet HW device in mana as well as the RNIC - Cleanups: - Use secs_to_jiffies() when appropriate - Use ERR_CAST() instead of naked casts - Don't use %pK in printk - Unusued functions removed - Allocation type matching" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (57 commits) RDMA/cma: Fix hang when cma_netevent_callback fails to queue_work RDMA/bnxt_re: Support extended stats for Thor2 VF RDMA/hns: Fix endian issue in trace events RDMA/mlx5: Avoid flexible array warning IB/cm: Remove dead code and adjust naming RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devices RDMA/rxe: Break endless pagefault loop for RO pages RDMA/bnxt_re: Fix return code of bnxt_re_configure_cc RDMA/bnxt_re: Fix missing error handling for tx_queue RDMA/bnxt_re: Fix incorrect display of inactivity_cp in debugfs output RDMA/mlx5: Add support for 200Gbps per lane speeds RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stage RDMA/iwcm: Fix use-after-free of work objects after cm_id destruction net: mana: Add support for auxiliary device servicing events RDMA/mana_ib: unify mana_ib functions to support any gdma device RDMA/mana_ib: Add support of mana_ib for RNIC and ETH nic net: mana: Probe rdma device in mana driver RDMA/siw: replace redundant ternary operator with just rv RDMA/umem: Separate implicit ODP initialization from explicit ODP RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage ...
2025-05-28Merge tag 'net-next-6.16' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-26RDMA/cma: Fix hang when cma_netevent_callback fails to queue_workJack Morgenstein
The cited commit fixed a crash when cma_netevent_callback was called for a cma_id while work on that id from a previous call had not yet started. The work item was re-initialized in the second call, which corrupted the work item currently in the work queue. However, it left a problem when queue_work fails (because the item is still pending in the work queue from a previous call). In this case, cma_id_put (which is called in the work handler) is therefore not called. This results in a userspace process hang (zombie process). Fix this by calling cma_id_put() if queue_work fails. Fixes: 45f5dcdd0497 ("RDMA/cma: Fix workqueue crash in cma_netevent_work_handler") Link: https://patch.msgid.link/r/4f3640b501e48d0166f312a64fdadf72b059bd04.1747827103.git.leon@kernel.org Signed-off-by: Jack Morgenstein <jackm@nvidia.com> Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26Merge tag 'v6.15' into rdma.git for-nextJason Gunthorpe
Following patches need the RDMA rc branch since we are past the RC cycle now. Merge conflicts resolved based on Linux-next: - For RXE odp changes keep for-next version and fixup new places that need to call is_odp_mr() https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au - irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info change from for-next https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-26Merge tag 'vfs-6.16-rc1.async.dir' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs directory lookup updates from Christian Brauner: "This contains cleanups for the lookup_one*() family of helpers. We expose a set of functions with names containing "lookup_one_len" and others without the "_len". This difference has nothing to do with "len". It's rater a historical accident that can be confusing. The functions without "_len" take a "mnt_idmap" pointer. This is found in the "vfsmount" and that is an important question when choosing which to use: do you have a vfsmount, or are you "inside" the filesystem. A related question is "is permission checking relevant here?". nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len functions. They pass nop_mnt_idmap and refuse to work on filesystems which have any other idmap. This work changes nfsd and cachefile to use the lookup_one family of functions and to explictily pass &nop_mnt_idmap which is consistent with all other vfs interfaces used where &nop_mnt_idmap is explicitly passed. The remaining uses of the "_one" functions do not require permission checks so these are renamed to be "_noperm" and the permission checking is removed. This series also changes these lookup function to take a qstr instead of separate name and len. In many cases this simplifies the call" * tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: VFS: change lookup_one_common and lookup_noperm_common to take a qstr Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS VFS: rename lookup_one_len family to lookup_noperm and remove permission check cachefiles: Use lookup_one() rather than lookup_one_len() nfsd: Use lookup_one() rather than lookup_one_len() VFS: improve interface for lookup_one functions
2025-05-25RDMA/bnxt_re: Support extended stats for Thor2 VFAjit Khaparde
The driver currently checks if the user is querying VF RoCE statistics. It will not send the query_roce_stats_ext HWRM command if it is for a VF. But Thor2 VF can support extended statistics. Allow query of extended stats for Thor2 VFs. Signed-off-by: Ajit Khaparde <ajit.khaparde@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Link: https://patch.msgid.link/20250523075952.1267827-1-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-25RDMA/hns: Fix endian issue in trace eventsJunxian Huang
Avoid using __le32 directly in trace events to fix sparse complains: drivers/infiniband/hw/hns/./hns_roce_trace.h:48:1: sparse: sparse: cast to restricted __le32 drivers/infiniband/hw/hns/./hns_roce_trace.h:48:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:48:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:90:1: sparse: sparse: cast to restricted __le32 drivers/infiniband/hw/hns/./hns_roce_trace.h:90:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:90:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:173:1: sparse: sparse: cast to restricted __le32 drivers/infiniband/hw/hns/./hns_roce_trace.h:173:1: sparse: sparse: restricted __le32 degrades to integer drivers/infiniband/hw/hns/./hns_roce_trace.h:173:1: sparse: sparse: restricted __le32 degrades to integer Fixes: 6c98c8670806 ("RDMA/hns: Add trace for WQE dumping") Fixes: 1e63e2f96613 ("RDMA/hns: Add trace for AEQE dumping") Fixes: 6bd18dabf1c9 ("RDMA/hns: Add trace for CMDQ dumping") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505170327.TNOpreil-lkp@intel.com/ Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250523023433.2171003-1-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-25RDMA/mlx5: Avoid flexible array warningLeon Romanovsky
The following warning is reported by sparse tool: drivers/infiniband/hw/mlx5/fs.c:1664:26: warning: array of flexible structures Avoid it by simply splitting array into two separate structs. Link: https://patch.msgid.link/7b891b96a9fc053d01284c184d25ae98d35db2d4.1747827041.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-25IB/cm: Remove dead code and adjust namingVlad Dumitrescu
Drop ib_send_cm_mra parameters which are always constant. Remove branch which is never taken. Adjust name to ib_prepare_cm_mra, which better reflects its functionality - no MRA is actually sent. Adjust name of related tracepoints. Push setting of the constant service timeout to cm.c and drop IB_CM_MRA_FLAG_DELAY. Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Reviewed-by: Sean Hefty <shefty@nvidia.com> Link: https://patch.msgid.link/cdd2a237acf2b495c19ce02e4b1c42c41c6751c2.1747827207.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-25RDMA/core: Avoid hmm_dma_map_alloc() for virtual DMA devicesDaisuke Matsuda
Drivers such as rxe, which use virtual DMA, must not call into the DMA mapping core since they lack physical DMA capabilities. Otherwise, a NULL pointer dereference is observed as shown below. This patch ensures the RDMA core handles virtual and physical DMA paths appropriately. This fixes the following kernel oops: BUG: kernel NULL pointer dereference, address: 00000000000002fc #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1028eb067 P4D 1028eb067 PUD 105da0067 PMD 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 3 UID: 1000 PID: 1854 Comm: python3 Tainted: G W 6.15.0-rc1+ #11 PREEMPT(voluntary) Tainted: [W]=WARN Hardware name: Trigkey Key N/Key N, BIOS KEYN101 09/02/2024 RIP: 0010:hmm_dma_map_alloc+0x25/0x100 Code: 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 d6 49 c1 e6 0c 41 55 41 54 53 49 39 ce 0f 82 c6 00 00 00 49 89 fc <f6> 87 fc 02 00 00 20 0f 84 af 00 00 00 49 89 f5 48 89 d3 49 89 cf RSP: 0018:ffffd3d3420eb830 EFLAGS: 00010246 RAX: 0000000000001000 RBX: ffff8b727c7f7400 RCX: 0000000000001000 RDX: 0000000000000001 RSI: ffff8b727c7f74b0 RDI: 0000000000000000 RBP: ffffd3d3420eb858 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 00007262a622a000 R14: 0000000000001000 R15: ffff8b727c7f74b0 FS: 00007262a62a1080(0000) GS:ffff8b762ac3e000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000002fc CR3: 000000010a1f0004 CR4: 0000000000f72ef0 PKRU: 55555554 Call Trace: <TASK> ib_init_umem_odp+0xb6/0x110 [ib_uverbs] ib_umem_odp_get+0xf0/0x150 [ib_uverbs] rxe_odp_mr_init_user+0x71/0x170 [rdma_rxe] rxe_reg_user_mr+0x217/0x2e0 [rdma_rxe] ib_uverbs_reg_mr+0x19e/0x2e0 [ib_uverbs] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xd9/0x150 [ib_uverbs] ib_uverbs_cmd_verbs+0xd19/0xee0 [ib_uverbs] ? mmap_region+0x63/0xd0 ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs] ib_uverbs_ioctl+0xba/0x130 [ib_uverbs] __x64_sys_ioctl+0xa4/0xe0 x64_sys_call+0x1178/0x2660 do_syscall_64+0x7e/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? do_syscall_64+0x8a/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? syscall_exit_to_user_mode+0x4e/0x250 ? do_syscall_64+0x8a/0x170 ? do_user_addr_fault+0x1d2/0x8d0 ? irqentry_exit_to_user_mode+0x43/0x250 ? irqentry_exit+0x43/0x50 ? exc_page_fault+0x93/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7262a6124ded Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:00007fffd08c3960 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fffd08c39f0 RCX: 00007262a6124ded RDX: 00007fffd08c3a10 RSI: 00000000c0181b01 RDI: 0000000000000007 RBP: 00007fffd08c39b0 R08: 0000000014107820 R09: 00007fffd08c3b44 R10: 000000000000000c R11: 0000000000000246 R12: 00007fffd08c3b44 R13: 000000000000000c R14: 00007fffd08c3b58 R15: 0000000014107960 </TASK> Fixes: 1efe8c0670d6 ("RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkage") Closes: https://lore.kernel.org/all/3e8f343f-7d66-4f7a-9f08-3910623e322f@gmail.com/ Signed-off-by: Daisuke Matsuda <dskmtsd@gmail.com> Link: https://patch.msgid.link/20250524144328.4361-1-dskmtsd@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>