summaryrefslogtreecommitdiff
path: root/drivers/infiniband/hw/mlx5
AgeCommit message (Collapse)Author
2025-05-26Merge tag 'v6.15' into rdma.git for-nextJason Gunthorpe
Following patches need the RDMA rc branch since we are past the RC cycle now. Merge conflicts resolved based on Linux-next: - For RXE odp changes keep for-next version and fixup new places that need to call is_odp_mr() https://lore.kernel.org/r/20250422143019.500201bd@canb.auug.org.au https://lore.kernel.org/r/20250514122455.3593b083@canb.auug.org.au - irdma is keeping the while/kfree bugfix from -rc and the pf/cdev_info change from for-next https://lore.kernel.org/r/20250513130630.280ee6c5@canb.auug.org.au Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-05-25RDMA/mlx5: Avoid flexible array warningLeon Romanovsky
The following warning is reported by sparse tool: drivers/infiniband/hw/mlx5/fs.c:1664:26: warning: array of flexible structures Avoid it by simply splitting array into two separate structs. Link: https://patch.msgid.link/7b891b96a9fc053d01284c184d25ae98d35db2d4.1747827041.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-18RDMA/mlx5: Add support for 200Gbps per lane speedsPatrisious Haddad
Add support for 200Gbps per lane speeds speed when querying PTYS and report it back correctly when needed. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://patch.msgid.link/b842d2f523e9b82e221378c444ebd5860d612959.1747134197.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-18RDMA/mlx5: Remove the redundant MLX5_IB_STAGE_UAR stageYishai Hadas
The MLX5_IB_STAGE_UAR stage in the RDMA driver is redundant and should be removed. Responsibility for initializing the device's UAR pointer (mdev->priv.uar) lies with mlx5_core, which already sets it during the mlx5_load() process. At present, the RDMA UAR stage overwrites this pointer, which was correctly initialized by mlx5_core, creating the risk of inconsistency. Ownership and management of the UAR pointer should remain exclusively within mlx5_core. In the current upstream code, we luckily receive the same pointer, since mlx5_get_uars_page() still finds available BF registers for that UAR, allowing it to be shared. However, future changes in mlx5_core may expose this flaw. For instance, if mlx5_alloc_bfreg() is invoked twice before the RDMA UAR stage runs, the RDMA driver may overwrite the UAR allocated by mlx5_core. This could lead to real bugs. For example, if mlx5_ib is unloaded (rmmod), it might free the UAR, leaving mlx5_core with a dangling reference to an invalid UAR. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Fan Li <fanl@nvidia.com> Link: https://patch.msgid.link/feaa84ec6f20468b4935c439923e9266122a93d0.1747134130.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-05-12RDMA/core: Convert UMEM ODP DMA mapping to caching IOVA and page linkageLeon Romanovsky
Reuse newly added DMA API to cache IOVA and only link/unlink pages in fast path for UMEM ODP flow. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-12RDMA/umem: Store ODP access mask information in PFNLeon Romanovsky
As a preparation to remove dma_list, store access mask in PFN pointer and not in dma_addr_t. Tested-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-05-05RDMA/mlx5: Fix error flow upon firmware failure for RQ destructionPatrisious Haddad
Upon RQ destruction if the firmware command fails which is the last resource to be destroyed some SW resources were already cleaned regardless of the failure. Now properly rollback the object to its original state upon such failure. In order to avoid a use-after free in case someone tries to destroy the object again, which results in the following kernel trace: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 37589 at lib/refcount.c:28 refcount_warn_saturate+0xf4/0x148 Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) rfkill mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) psample mlxfw(OE) mlx_compat(OE) macsec tls pci_hyperv_intf sunrpc vfat fat virtio_net net_failover failover fuse loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_console virtio_gpu virtio_blk virtio_dma_buf virtio_mmio dm_mirror dm_region_hash dm_log dm_mod xpmem(OE) CPU: 0 UID: 0 PID: 37589 Comm: python3 Kdump: loaded Tainted: G OE ------- --- 6.12.0-54.el10.aarch64 #1 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : refcount_warn_saturate+0xf4/0x148 lr : refcount_warn_saturate+0xf4/0x148 sp : ffff80008b81b7e0 x29: ffff80008b81b7e0 x28: ffff000133d51600 x27: 0000000000000001 x26: 0000000000000000 x25: 00000000ffffffea x24: ffff00010ae80f00 x23: ffff00010ae80f80 x22: ffff0000c66e5d08 x21: 0000000000000000 x20: ffff0000c66e0000 x19: ffff00010ae80340 x18: 0000000000000006 x17: 0000000000000000 x16: 0000000000000020 x15: ffff80008b81b37f x14: 0000000000000000 x13: 2e656572662d7265 x12: ffff80008283ef78 x11: ffff80008257efd0 x10: ffff80008283efd0 x9 : ffff80008021ed90 x8 : 0000000000000001 x7 : 00000000000bffe8 x6 : c0000000ffff7fff x5 : ffff0001fb8e3408 x4 : 0000000000000000 x3 : ffff800179993000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000133d51600 Call trace: refcount_warn_saturate+0xf4/0x148 mlx5_core_put_rsc+0x88/0xa0 [mlx5_ib] mlx5_core_destroy_rq_tracked+0x64/0x98 [mlx5_ib] mlx5_ib_destroy_wq+0x34/0x80 [mlx5_ib] ib_destroy_wq_user+0x30/0xc0 [ib_core] uverbs_free_wq+0x28/0x58 [ib_uverbs] destroy_hw_idr_uobject+0x34/0x78 [ib_uverbs] uverbs_destroy_uobject+0x48/0x240 [ib_uverbs] __uverbs_cleanup_ufile+0xd4/0x1a8 [ib_uverbs] uverbs_destroy_ufile_hw+0x48/0x120 [ib_uverbs] ib_uverbs_close+0x2c/0x100 [ib_uverbs] __fput+0xd8/0x2f0 __fput_sync+0x50/0x70 __arm64_sys_close+0x40/0x90 invoke_syscall.constprop.0+0x74/0xd0 do_el0_svc+0x48/0xe8 el0_svc+0x44/0x1d0 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x1a4/0x1a8 Fixes: e2013b212f9f ("net/mlx5_core: Add RQ and SQ event handling") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/3181433ccdd695c63560eeeb3f0c990961732101.1745839855.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-04-07RDMA/mlx5: Fix compilation warning when USER_ACCESS isn't setMark Bloch
The cited commit made fs.c always compile, even when INFINIBAND_USER_ACCESS isn't set. This results in a compilation warning about an unused object when compiling with W=1 and USER_ACCESS is unset. Fix this by defining uverbs_destroy_def_handler() even when USER_ACCESS isn't set. Fixes: 36e0d433672f ("RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config") Link: https://patch.msgid.link/r/20250402070944.1022093-1-mbloch@nvidia.com Signed-off-by: Mark Bloch <mbloch@nvidia.com> Tested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-04-07RDMA/mlx5: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) Link: https://patch.msgid.link/r/20250219-rdma-secs-to-jiffies-v1-2-b506746561a9@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-04-05treewide: Switch/rename to timer_delete[_sync]()Thomas Gleixner
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree over and remove the historical wrapper inlines. Conversion was done with coccinelle plus manual fixups where necessary. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-29Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: - Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser, mlx5, vmw_pvrdma, hns - Make rxe work on tun devices - mana gains more standard verbs as it moves toward supporting in-kernel verbs - DMABUF support for mana - Fix page size calculations when memory registration exceeds 4G - On Demand Paging support for rxe - mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism to access control use of them - Optional RDMA_TX/RX counters per QP in mlx5 * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (73 commits) IB/mad: Check available slots before posting receive WRs RDMA/mana_ib: Fix integer overflow during queue creation RDMA/mlx5: Fix calculation of total invalidated pages RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow RDMA/mlx5: Fix page_size variable overflow RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc() RDMA/mlx5: Fix cache entry update on dereg error RDMA/mlx5: Fix MR cache initialization error flow RDMA/mlx5: Support optional-counters binding for QPs RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config RDMA/core: Pass port to counter bind/unbind operations RDMA/core: Add support to optional-counters binding configuration RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj() RDMA/mlx5: Add optional counters for RDMA_TX/RX_packets/bytes RDMA/core: Fix use-after-free when rename device name RDMA/bnxt_re: Support perf management counters RDMA/rxe: Fix incorrect return value of rxe_odp_atomic_op() RDMA/uverbs: Propagate errors from rdma_lookup_get_uobject() RDMA/mana_ib: Handle net event for pointing to the current netdev net: mana: Change the function signature of mana_get_primary_netdev_rcu ...
2025-03-18RDMA/mlx5: Fix calculation of total invalidated pagesChiara Meiohas
When invalidating an address range in mlx5, there is an optimization to do UMR operations in chunks. Previously, the invalidation counter was incorrectly updated for the same indexes within a chunk. Now, the invalidation counter is updated only when a chunk is complete and mlx5r_umr_update_xlt() is called. This ensures that the counter accurately represents the number of pages invalidated using UMR. Fixes: a3de94e3d61e ("IB/mlx5: Introduce ODP diagnostic counters") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/560deb2433318e5947282b070c915f3c81fef77f.1741875692.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flowPatrisious Haddad
When cur_qp isn't NULL, in order to avoid fetching the QP from the radix tree again we check if the next cqe QP is identical to the one we already have. The bug however is that we are checking if the QP is identical by checking the QP number inside the CQE against the QP number inside the mlx5_ib_qp, but that's wrong since the QP number from the CQE is from FW so it should be matched against mlx5_core_qp which is our FW QP number. Otherwise we could use the wrong QP when handling a CQE which could cause the kernel trace below. This issue is mainly noticeable over QPs 0 & 1, since for now they are the only QPs in our driver whereas the QP number inside mlx5_ib_qp doesn't match the QP number inside mlx5_core_qp. BUG: kernel NULL pointer dereference, address: 0000000000000012 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP CPU: 0 UID: 0 PID: 7927 Comm: kworker/u62:1 Not tainted 6.14.0-rc3+ #189 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core] RIP: 0010:mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib] Code: 03 00 00 8d 58 ff 21 cb 66 39 d3 74 39 48 c7 c7 3c 89 6e a0 0f b7 db e8 b7 d2 b3 e0 49 8b 86 60 03 00 00 48 c7 c7 4a 89 6e a0 <0f> b7 5c 98 02 e8 9f d2 b3 e0 41 0f b7 86 78 03 00 00 83 e8 01 21 RSP: 0018:ffff88810511bd60 EFLAGS: 00010046 RAX: 0000000000000010 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88885fa1b3c0 RDI: ffffffffa06e894a RBP: 00000000000000b0 R08: 0000000000000000 R09: ffff88810511bc10 R10: 0000000000000001 R11: 0000000000000001 R12: ffff88810d593000 R13: ffff88810e579108 R14: ffff888105146000 R15: 00000000000000b0 FS: 0000000000000000(0000) GS:ffff88885fa00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000012 CR3: 00000001077e6001 CR4: 0000000000370eb0 Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x150/0x3e0 ? exc_page_fault+0x74/0x130 ? asm_exc_page_fault+0x22/0x30 ? mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib] __ib_process_cq+0x5a/0x150 [ib_core] ib_cq_poll_work+0x31/0x90 [ib_core] process_one_work+0x169/0x320 worker_thread+0x288/0x3a0 ? work_busy+0xb0/0xb0 kthread+0xd7/0x1f0 ? kthreads_online_cpu+0x130/0x130 ? kthreads_online_cpu+0x130/0x130 ret_from_fork+0x2d/0x50 ? kthreads_online_cpu+0x130/0x130 ret_from_fork_asm+0x11/0x20 </TASK> Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/4ada09d41f1e36db62c44a9b25c209ea5f054316.1741875692.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Fix page_size variable overflowMichael Guralnik
Change all variables storing mlx5_umem_mkc_find_best_pgsz() result to unsigned long to support values larger than 31 and avoid overflow. For example: If we try to register 4GB of memory that is contiguous in physical memory, the driver will optimize the page_size and try to use an mkey with 4GB entity size. The 'unsigned int' page_size variable will overflow to '0' and we'll hit the WARN_ON() in alloc_cacheable_mr(). WARNING: CPU: 2 PID: 1203 at drivers/infiniband/hw/mlx5/mr.c:1124 alloc_cacheable_mr+0x22/0x580 [mlx5_ib] Modules linked in: mlx5_ib mlx5_core bonding ip6_gre ip6_tunnel tunnel6 ip_gre gre rdma_rxe rdma_ucm ib_uverbs ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm fuse ib_core [last unloaded: mlx5_core] CPU: 2 UID: 70878 PID: 1203 Comm: rdma_resource_l Tainted: G W 6.14.0-rc4-dirty #43 Tainted: [W]=WARN Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:alloc_cacheable_mr+0x22/0x580 [mlx5_ib] Code: 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 41 52 53 48 83 ec 30 f6 46 28 04 4c 8b 77 08 75 21 <0f> 0b 49 c7 c2 ea ff ff ff 48 8d 65 d0 4c 89 d0 5b 41 5a 41 5c 41 RSP: 0018:ffffc900006ffac8 EFLAGS: 00010246 RAX: 0000000004c0d0d0 RBX: ffff888217a22000 RCX: 0000000000100001 RDX: 00007fb7ac480000 RSI: ffff8882037b1240 RDI: ffff8882046f0600 RBP: ffffc900006ffb28 R08: 0000000000000001 R09: 0000000000000000 R10: 00000000000007e0 R11: ffffea0008011d40 R12: ffff8882037b1240 R13: ffff8882046f0600 R14: ffff888217a22000 R15: ffffc900006ffe00 FS: 00007fb7ed013340(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb7ed1d8000 CR3: 00000001fd8f6006 CR4: 0000000000772eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x81/0x130 ? alloc_cacheable_mr+0x22/0x580 [mlx5_ib] ? report_bug+0xfc/0x1e0 ? handle_bug+0x55/0x90 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? alloc_cacheable_mr+0x22/0x580 [mlx5_ib] create_real_mr+0x54/0x150 [mlx5_ib] ib_uverbs_reg_mr+0x17f/0x2a0 [ib_uverbs] ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0xca/0x140 [ib_uverbs] ib_uverbs_run_method+0x6d0/0x780 [ib_uverbs] ? __pfx_ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x10/0x10 [ib_uverbs] ib_uverbs_cmd_verbs+0x19b/0x360 [ib_uverbs] ? walk_system_ram_range+0x79/0xd0 ? ___pte_offset_map+0x1b/0x110 ? __pte_offset_map_lock+0x80/0x100 ib_uverbs_ioctl+0xac/0x110 [ib_uverbs] __x64_sys_ioctl+0x94/0xb0 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fb7ecf0737b Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 2a 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffdbe03ecc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffdbe03edb8 RCX: 00007fb7ecf0737b RDX: 00007ffdbe03eda0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffdbe03ed80 R08: 00007fb7ecc84010 R09: 00007ffdbe03eed4 R10: 0000000000000009 R11: 0000000000000246 R12: 00007ffdbe03eed4 R13: 000000000000000c R14: 000000000000000c R15: 00007fb7ecc84150 </TASK> Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/2479a4a3f6fd9bd032e1b6d396274a89c4c5e22f.1741875692.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()Michael Guralnik
Drop the unused access_flags parameter. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/4d769c51eb012c62b3a92fd916b7886c25b56fbf.1741875692.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Fix cache entry update on dereg errorMichael Guralnik
Fix double decrement of 'in_use' counter on push_mkey_locked() failure while deregistering an MR. If we fail to return an mkey to the cache in cache_ent_find_and_store() it'll update the 'in_use' counter. Its caller, revoke_mr(), also updates it, thus having double decrement. Wrong value of 'in_use' counter will be exposed through debugfs and can also cause wrong resizing of the cache when users try to set cache entry size using the 'size' debugfs. To address this issue, the 'in_use' counter is now decremented within mlx5_revoke_mr() also after a successful call to cache_ent_find_and_store() and not within cache_ent_find_and_store(). Other success or failure flows remains unchanged where it was also decremented. Fixes: 8c1185fef68c ("RDMA/mlx5: Change check for cacheable mkeys") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/97e979dff636f232ff4c83ce709c17c727da1fdb.1741875692.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Fix MR cache initialization error flowMichael Guralnik
Destroy all previously created cache entries and work queue when rolling back the MR cache initialization upon an error. Fixes: 73d09b2fe833 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/c41d525fb3c72e28dd38511bf3aaccb5d584063e.1741875692.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Support optional-counters binding for QPsPatrisious Haddad
Add support to allow optional-counters binding to a QP, whereas when a bind operation is requested depending on the counter optional-counter binding state the driver will determine if to also add optional-counters to this QP binding. The optional-counter binding is done by simply adding a steering rule for the specific optional-counter condition with the additional match over that QP number. Note that optional-counters per QP rules are handled on an earlier prio than per device counters, and per device counter correctness is maintained by core whereas it is responsible to sum active counters when checking device counter and to add them to history count when they are deallocated. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/2cad1b891a6641ae61fe8d92f867e1059121813a.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS configPatrisious Haddad
Change mlx5 Makefile, fs.c and fs.h to support fs compilation regardless of INFINIBAND_USER_ACCESS config. In addition allow optional counters support regardless of the config. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/b8dd220456a91538b22c3aff150ab021d7b9e1bf.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/core: Pass port to counter bind/unbind operationsPatrisious Haddad
This will be useful for the next patches in the series since port number is needed for optional counters binding and unbinding. Note that this change is needed since when the operation is done qp->port isn't necessarily initialized yet and can't be used. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/b6f6797844acbd517358e8d2a270ea9b3e6ecba1.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj()Patrisious Haddad
Change rdma_counter allocation to use rdma_zalloc_drv_obj() instead of, explicitly allocating at core, in order to be contained inside driver specific structures. Adjust all drivers that use it to have their containing structure, and add driver specific initialization operation. This change is needed to allow upcoming patches to implement optional-counters binding whereas inside each driver specific counter struct his bound optional-counters will be maintained. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/a5a484f421fc2e5595158e61a354fba43272b02d.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-18RDMA/mlx5: Add optional counters for RDMA_TX/RX_packets/bytesPatrisious Haddad
Add the following optional counters: rdma_tx_packets,rdma_rx_bytes,rdma_rx_packets,rdma_tx_bytes. Which counts all RDMA packets/bytes sent and received per link. Note that since each direction packet and byte counter are shared, the counter is only reset when both counters of that direction are removed. But from user-perspective each can be enabled/disabled separately. The counters can be enabled using: sudo rdma stat set link rocep8s0f0/1 optional-counters rdma_tx_packets And can be seen using: rdma stat -j show link rocep8s0f0/1 Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/9f2753ad636f21704416df64b47395c8991d1123.1741875070.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/mlx5: Expose RDMA TRANSPORT flow table types to userspacePatrisious Haddad
This patch adds RDMA_TRANSPORT_RX and RDMA_TRANSPORT_TX as a new flow table type for matcher creation. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://patch.msgid.link/2287d8c50483e880450c7e8e08d9de34cdec1b14.1741261611.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/mlx5: Check enabled UCAPs when creating ucontextChiara Meiohas
Verify that the enabled UCAPs are supported by the device before creating the ucontext. If supported, create the ucontext with the associated capabilities. Store the privileged ucontext UID on creation and remove it when destroying the privileged ucontext. This allows the command interface to recognize privileged commands through its UID. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/8b180583a207cb30deb7a2967934079749cdcc44.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/mlx5: Create UCAP char devices for supported device capabilitiesChiara Meiohas
Create UCAP character devices when probing an IB device with supported firmware capabilities. If the RDMA_CTRL general object type is supported, check for specific UCTX capabilities: Create /dev/infiniband/mlx5_perm_ctrl_local for RDMA_UCAP_MLX5_CTRL_LOCAL Create /dev/infiniband/mlx5_perm_ctrl_other_vhca for RDMA_UCAP_MLX5_CTRL_OTHER_VHCA Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/30ed40e7a12a694cf4ee257459ed61b145b7837d.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-06RDMA/mlx5: Handle errors returned from mlx5r_ib_rate()Qasim Ijaz
In function create_ib_ah() the following line attempts to left shift the return value of mlx5r_ib_rate() by 4 and store it in the stat_rate_sl member of av: However the code overlooks the fact that mlx5r_ib_rate() may return -EINVAL if the rate passed to it is less than IB_RATE_2_5_GBPS or greater than IB_RATE_800_GBPS. Because of this, the code may invoke undefined behaviour when shifting a signed negative value when doing "-EINVAL << 4". To fix this check for errors before assigning stat_rate_sl and propagate any error value to the callers. Fixes: c534ffda781f ("RDMA/mlx5: Fix AH static rate parsing") Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Link: https://patch.msgid.link/20250304140246.205919-1-qasdev00@gmail.com Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/mlx5: Reorder capability check lastChristian Göttsche
capable() calls refer to enabled LSMs whether to permit or deny the request. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to permit the task the requested capability, while it does not need it, violating the principle of least privilege. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Link: https://patch.msgid.link/20250302160657.127253-10-cgoettsche@seltendoof.de Reviewed-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23RDMA/mlx5: Fix bind QP error cleanup flowPatrisious Haddad
When there is a failure during bind QP, the cleanup flow destroys the counter regardless if it is the one that created it or not, which is problematic since if it isn't the one that created it, that counter could still be in use. Fix that by destroying the counter only if it was created during this call. Fixes: 45842fc627c7 ("IB/mlx5: Support statistic q counter configuration") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/25dfefddb0ebefa668c32e06a94d84e3216257cf.1740033937.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-20RDMA/mlx5: Fix AH static rate parsingPatrisious Haddad
Previously static rate wasn't translated according to our PRM but simply used the 4 lower bytes. Correctly translate static rate value passed in AH creation attribute according to our PRM expected values. In addition change 800GB mapping to zero, which is the PRM specified value. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://patch.msgid.link/18ef4cc5396caf80728341eb74738cd777596f60.1739187089.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-20RDMA/mlx5: Fix implicit ODP hang on parent deregistrationYishai Hadas
Fix the destroy_unused_implicit_child_mr() to prevent hanging during parent deregistration as of below [1]. Upon entering destroy_unused_implicit_child_mr(), the reference count for the implicit MR parent is incremented using: refcount_inc_not_zero(). A corresponding decrement must be performed if free_implicit_child_mr_work() is not called. The code has been updated to properly manage the reference count that was incremented. [1] INFO: task python3:2157 blocked for more than 120 seconds. Not tainted 6.12.0-rc7+ #1633 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:python3 state:D stack:0 pid:2157 tgid:2157 ppid:1685 flags:0x00000000 Call Trace: <TASK> __schedule+0x420/0xd30 schedule+0x47/0x130 __mlx5_ib_dereg_mr+0x379/0x5d0 [mlx5_ib] ? __pfx_autoremove_wake_function+0x10/0x10 ib_dereg_mr_user+0x5f/0x120 [ib_core] ? lock_release+0xc6/0x280 destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs] uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs] uobj_destroy+0x3f/0x70 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs] ? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs] ? lock_acquire+0xc1/0x2f0 ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0x116/0x170 [ib_uverbs] ? lock_release+0xc6/0x280 ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] __x64_sys_ioctl+0x1b0/0xa70 ? kmem_cache_free+0x221/0x400 do_syscall_64+0x6b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f20f21f017b RSP: 002b:00007ffcfc4a77c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffcfc4a78d8 RCX: 00007f20f21f017b RDX: 00007ffcfc4a78c0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffcfc4a78a0 R08: 000056147d125190 R09: 00007f20f1f14c60 R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffcfc4a7890 R13: 000000000000001c R14: 000056147d100fc0 R15: 00007f20e365c9d0 </TASK> Fixes: d3d930411ce3 ("RDMA/mlx5: Fix implicit ODP use after free") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Artemy Kovalyov <artemyko@nvidia.com> Link: https://patch.msgid.link/80f2fcd19952dfa7d9981d93fd6359b4471f8278.1739186929.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-06RDMA/mlx5: Fix a WARN during dereg_mr for DM typeYishai Hadas
Memory regions (MR) of type DM (device memory) do not have an associated umem. In the __mlx5_ib_dereg_mr() -> mlx5_free_priv_descs() flow, the code incorrectly takes the wrong branch, attempting to call dma_unmap_single() on a DMA address that is not mapped. This results in a WARN [1], as shown below. The issue is resolved by properly accounting for the DM type and ensuring the correct branch is selected in mlx5_free_priv_descs(). [1] WARNING: CPU: 12 PID: 1346 at drivers/iommu/dma-iommu.c:1230 iommu_dma_unmap_page+0x79/0x90 Modules linked in: ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry ovelay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core CPU: 12 UID: 0 PID: 1346 Comm: ibv_rc_pingpong Not tainted 6.12.0-rc7+ #1631 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:iommu_dma_unmap_page+0x79/0x90 Code: 2b 49 3b 29 72 26 49 3b 69 08 73 20 4d 89 f0 44 89 e9 4c 89 e2 48 89 ee 48 89 df 5b 5d 41 5c 41 5d 41 5e 41 5f e9 07 b8 88 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 66 0f 1f 44 00 RSP: 0018:ffffc90001913a10 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88810194b0a8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: ffff88810194b0a8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f537abdd740(0000) GS:ffff88885fb00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f537aeb8000 CR3: 000000010c248001 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __warn+0x84/0x190 ? iommu_dma_unmap_page+0x79/0x90 ? report_bug+0xf8/0x1c0 ? handle_bug+0x55/0x90 ? exc_invalid_op+0x13/0x60 ? asm_exc_invalid_op+0x16/0x20 ? iommu_dma_unmap_page+0x79/0x90 dma_unmap_page_attrs+0xe6/0x290 mlx5_free_priv_descs+0xb0/0xe0 [mlx5_ib] __mlx5_ib_dereg_mr+0x37e/0x520 [mlx5_ib] ? _raw_spin_unlock_irq+0x24/0x40 ? wait_for_completion+0xfe/0x130 ? rdma_restrack_put+0x63/0xe0 [ib_core] ib_dereg_mr_user+0x5f/0x120 [ib_core] ? lock_release+0xc6/0x280 destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs] uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs] uobj_destroy+0x3f/0x70 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs] ? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs] ? lock_acquire+0xc1/0x2f0 ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0x116/0x170 [ib_uverbs] ? lock_release+0xc6/0x280 ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] __x64_sys_ioctl+0x1b0/0xa70 do_syscall_64+0x6b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f537adaf17b Code: 0f 1e fa 48 8b 05 1d ad 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ed ac 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffff218f0b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffff218f1d8 RCX: 00007f537adaf17b RDX: 00007ffff218f1c0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffff218f1a0 R08: 00007f537aa8d010 R09: 0000561ee2e4f270 R10: 00007f537aace3a8 R11: 0000000000000246 R12: 00007ffff218f190 R13: 000000000000001c R14: 0000561ee2e4d7c0 R15: 00007ffff218f450 </TASK> Fixes: f18ec4223117 ("RDMA/mlx5: Use a union inside mlx5_ib_mr") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/2039c22cfc3df02378747ba4d623a558b53fc263.1738587076.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-06RDMA/mlx5: Fix a race for DMABUF MR which can lead to CQE with errorYishai Hadas
This patch addresses a potential race condition for a DMABUF MR that can result in a CQE with an error on the UMR QP. During the __mlx5_ib_dereg_mr() flow, the following sequence of calls occurs: mlx5_revoke_mr() mlx5r_umr_revoke_mr() mlx5r_umr_post_send_wait() At this point, the lkey is freed from the hardware's perspective. However, concurrently, mlx5_ib_dmabuf_invalidate_cb() might be triggered by another task attempting to invalidate the MR having that freed lkey. Since the lkey has already been freed, this can lead to a CQE error, causing the UMR QP to enter an error state. To resolve this race condition, the dma_resv_lock() which was hold as part of the mlx5_ib_dmabuf_invalidate_cb() is now also acquired as part of the mlx5_revoke_mr() scope. Upon a successful revoke, we set umem_dmabuf->private which points to that MR to NULL, preventing any further invalidation attempts on its lkey. Fixes: e6fb246ccafb ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Artemy Kovalyov <artemyko@mnvidia.com> Link: https://patch.msgid.link/70617067abbfaa0c816a2544c922e7f4346def58.1738587016.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-03IB/mlx5: Set and get correct qp_num for a DCT QPMark Zhang
When a DCT QP is created on an active lag, it's dctc.port is assigned in a round-robin way, which is from 1 to dev->lag_port. In this case when querying this QP, we may get qp_attr.port_num > 2. Fix this by setting qp->port when modifying a DCT QP, and read port_num from qp->port instead of dctc.port when querying it. Fixes: 7c4b1ab9f167 ("IB/mlx5: Add DCT RoCE LAG support") Signed-off-by: Mark Zhang <markzhang@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Link: https://patch.msgid.link/94c76bf0adbea997f87ffa27674e0a7118ad92a9.1737290358.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-03RDMA/mlx5: Fix the recovery flow of the UMR QPYishai Hadas
This patch addresses an issue in the recovery flow of the UMR QP, ensuring tasks do not get stuck, as highlighted by the call trace [1]. During recovery, before transitioning the QP to the RESET state, the software must wait for all outstanding WRs to complete. Failing to do so can cause the firmware to skip sending some flushed CQEs with errors and simply discard them upon the RESET, as per the IB specification. This race condition can result in lost CQEs and tasks becoming stuck. To resolve this, the patch sends a final WR which serves only as a barrier before moving the QP state to RESET. Once a CQE is received for that final WR, it guarantees that no outstanding WRs remain, making it safe to transition the QP to RESET and subsequently back to RTS, restoring proper functionality. Note: For the barrier WR, we simply reuse the failed and ready WR. Since the QP is in an error state, it will only receive IB_WC_WR_FLUSH_ERR. However, as it serves only as a barrier we don't care about its status. [1] INFO: task rdma_resource_l:1922 blocked for more than 120 seconds. Tainted: G W 6.12.0-rc7+ #1626 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:rdma_resource_l state:D stack:0 pid:1922 tgid:1922 ppid:1369 flags:0x00004004 Call Trace: <TASK> __schedule+0x420/0xd30 schedule+0x47/0x130 schedule_timeout+0x280/0x300 ? mark_held_locks+0x48/0x80 ? lockdep_hardirqs_on_prepare+0xe5/0x1a0 wait_for_completion+0x75/0x130 mlx5r_umr_post_send_wait+0x3c2/0x5b0 [mlx5_ib] ? __pfx_mlx5r_umr_done+0x10/0x10 [mlx5_ib] mlx5r_umr_revoke_mr+0x93/0xc0 [mlx5_ib] __mlx5_ib_dereg_mr+0x299/0x520 [mlx5_ib] ? _raw_spin_unlock_irq+0x24/0x40 ? wait_for_completion+0xfe/0x130 ? rdma_restrack_put+0x63/0xe0 [ib_core] ib_dereg_mr_user+0x5f/0x120 [ib_core] ? lock_release+0xc6/0x280 destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs] uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs] uobj_destroy+0x3f/0x70 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs] ? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs] ? __lock_acquire+0x64e/0x2080 ? mark_held_locks+0x48/0x80 ? find_held_lock+0x2d/0xa0 ? lock_acquire+0xc1/0x2f0 ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] ? __fget_files+0xc3/0x1b0 ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] __x64_sys_ioctl+0x1b0/0xa70 do_syscall_64+0x6b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f99c918b17b RSP: 002b:00007ffc766d0468 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffc766d0578 RCX: 00007f99c918b17b RDX: 00007ffc766d0560 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffc766d0540 R08: 00007f99c8f99010 R09: 000000000000bd7e R10: 00007f99c94c1c70 R11: 0000000000000246 R12: 00007ffc766d0530 R13: 000000000000001c R14: 0000000040246a80 R15: 0000000000000000 </TASK> Fixes: 158e71bb69e3 ("RDMA/mlx5: Add a umr recovery flow") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/27b51b92ec42dfb09d8096fcbd51878f397ce6ec.1737290141.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-01-24Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Lighter that normal, but the now usual collection of driver fixes and small improvements: - Small fixes and minor improvements to cxgb4, bnxt_re, rxe, srp, efa, cxgb4 - Update mlx4 to use the new umem APIs, avoiding direct use of scatterlist - Support ROCEv2 in erdma - Remove various uncalled functions, constify bin_attribute - Provide core infrastructure to catch netdev events and route them to drivers, consolidating duplicated driver code - Fix rare race condition crashes in mlx5 ODP flows" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (63 commits) RDMA/mlx5: Fix implicit ODP use after free RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error RDMA/qib: Constify 'struct bin_attribute' RDMA/hfi1: Constify 'struct bin_attribute' RDMA/rxe: Fix the warning "__rxe_cleanup+0x12c/0x170 [rdma_rxe]" RDMA/cxgb4: Notify rdma stack for IB_EVENT_QP_LAST_WQE_REACHED event RDMA/bnxt_re: Allocate dev_attr information dynamically RDMA/bnxt_re: Pass the context for ulp_irq_stop RDMA/bnxt_re: Add support to handle DCB_CONFIG_CHANGE event RDMA/bnxt_re: Query firmware defaults of CC params during probe RDMA/bnxt_re: Add Async event handling support bnxt_en: Add ULP call to notify async events RDMA/mlx5: Fix indirect mkey ODP page count MAINTAINERS: Update the bnxt_re maintainers RDMA/hns: Clean up the legacy CONFIG_INFINIBAND_HNS RDMA/rtrs: Add missing deinit() call RDMA/efa: Align interrupt related fields to same type RDMA/bnxt_re: Fix to drop reference to the mmap entry in case of error RDMA/mlx5: Fix link status down event for MPV RDMA/erdma: Support create_ah/destroy_ah in non-sleepable contexts ...
2025-01-21RDMA/mlx5: Fix implicit ODP use after freePatrisious Haddad
Prevent double queueing of implicit ODP mr destroy work by using __xa_cmpxchg() to make sure this is the only time we are destroying this specific mr. Without this change, we could try to invalidate this mr twice, which in turn could result in queuing a MR work destroy twice, and eventually the second work could execute after the MR was freed due to the first work, causing a user after free and trace below. refcount_t: underflow; use-after-free. WARNING: CPU: 2 PID: 12178 at lib/refcount.c:28 refcount_warn_saturate+0x12b/0x130 Modules linked in: bonding ib_ipoib vfio_pci ip_gre geneve nf_tables ip6_gre gre ip6_tunnel tunnel6 ipip tunnel4 ib_umad rdma_ucm mlx5_vfio_pci vfio_pci_core vfio_iommu_type1 mlx5_ib vfio ib_uverbs mlx5_core iptable_raw openvswitch nsh rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay zram zsmalloc fuse [last unloaded: ib_uverbs] CPU: 2 PID: 12178 Comm: kworker/u20:5 Not tainted 6.5.0-rc1_net_next_mlx5_58c644e #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Workqueue: events_unbound free_implicit_child_mr_work [mlx5_ib] RIP: 0010:refcount_warn_saturate+0x12b/0x130 Code: 48 c7 c7 38 95 2a 82 c6 05 bc c6 fe 00 01 e8 0c 66 aa ff 0f 0b 5b c3 48 c7 c7 e0 94 2a 82 c6 05 a7 c6 fe 00 01 e8 f5 65 aa ff <0f> 0b 5b c3 90 8b 07 3d 00 00 00 c0 74 12 83 f8 01 74 13 8d 50 ff RSP: 0018:ffff8881008e3e40 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027 RDX: ffff88852c91b5c8 RSI: 0000000000000001 RDI: ffff88852c91b5c0 RBP: ffff8881dacd4e00 R08: 00000000ffffffff R09: 0000000000000019 R10: 000000000000072e R11: 0000000063666572 R12: ffff88812bfd9e00 R13: ffff8881c792d200 R14: ffff88810011c005 R15: ffff8881002099c0 FS: 0000000000000000(0000) GS:ffff88852c900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5694b5e000 CR3: 00000001153f6003 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? refcount_warn_saturate+0x12b/0x130 free_implicit_child_mr_work+0x180/0x1b0 [mlx5_ib] process_one_work+0x1cc/0x3c0 worker_thread+0x218/0x3c0 kthread+0xc6/0xf0 ret_from_fork+0x1f/0x30 </TASK> Fixes: 5256edcb98a1 ("RDMA/mlx5: Rework implicit ODP destroy") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/r/c96b8645a81085abff739e6b06e286a350d1283d.1737274283.git.leon@kernel.org Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-01-21RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with errorYishai Hadas
This patch addresses a race condition for an ODP MR that can result in a CQE with an error on the UMR QP. During the __mlx5_ib_dereg_mr() flow, the following sequence of calls occurs: mlx5_revoke_mr() mlx5r_umr_revoke_mr() mlx5r_umr_post_send_wait() At this point, the lkey is freed from the hardware's perspective. However, concurrently, mlx5_ib_invalidate_range() might be triggered by another task attempting to invalidate a range for the same freed lkey. This task will: - Acquire the umem_odp->umem_mutex lock. - Call mlx5r_umr_update_xlt() on the UMR QP. - Since the lkey has already been freed, this can lead to a CQE error, causing the UMR QP to enter an error state [1]. To resolve this race condition, the umem_odp->umem_mutex lock is now also acquired as part of the mlx5_revoke_mr() scope. Upon successful revoke, we set umem_odp->private which points to that MR to NULL, preventing any further invalidation attempts on its lkey. [1] From dmesg: infiniband rocep8s0f0: dump_cqe:277:(pid 0): WC error: 6, Message: memory bind operation error cqe_dump: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 cqe_dump: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 cqe_dump: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 cqe_dump: 00000030: 00 00 00 00 08 00 78 06 25 00 11 b9 00 0e dd d2 WARNING: CPU: 15 PID: 1506 at drivers/infiniband/hw/mlx5/umr.c:394 mlx5r_umr_post_send_wait+0x15a/0x2b0 [mlx5_ib] Modules linked in: ip6table_mangle ip6table_natip6table_filter ip6_tables iptable_mangle xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core CPU: 15 UID: 0 PID: 1506 Comm: ibv_rc_pingpong Not tainted 6.12.0-rc7+ #1626 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:mlx5r_umr_post_send_wait+0x15a/0x2b0 [mlx5_ib] [..] Call Trace: <TASK> mlx5r_umr_update_xlt+0x23c/0x3e0 [mlx5_ib] mlx5_ib_invalidate_range+0x2e1/0x330 [mlx5_ib] __mmu_notifier_invalidate_range_start+0x1e1/0x240 zap_page_range_single+0xf1/0x1a0 madvise_vma_behavior+0x677/0x6e0 do_madvise+0x1a2/0x4b0 __x64_sys_madvise+0x25/0x30 do_syscall_64+0x6b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e Fixes: e6fb246ccafb ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()") Cc: stable@vger.kernel.org Link: https://patch.msgid.link/r/68a1e007c25b2b8fe5d625f238cc3b63e5341f77.1737290229.git.leon@kernel.org Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Artemy Kovalyov <artemyko@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-01-13RDMA/mlx5: Fix indirect mkey ODP page countMichael Guralnik
Restrict the check for the number of pages handled during an ODP page fault to direct mkeys. Perform the check right after handling the page fault and don't propagate the number of handled pages to callers. Indirect mkeys and their associated direct mkeys can have different start addresses. As a result, the calculation of the number of pages to handle for an indirect mkey may not match the actual page fault handling done on the direct mkey. For example: A 4K sized page fault on a KSM mkey that has a start address that is not aligned to a page will result a calculation that assumes the number of pages required to handle are 2. While the underlying MTT might be aligned will require fetching only a single page. Thus, do the calculation and compare number of pages handled only per direct mkey. Fixes: db570d7deafb ("IB/mlx5: Add ODP support to MW") Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Reviewed-by: Artemy Kovalyov <artemyko@nvidia.com> Link: https://patch.msgid.link/86c483d9e75ce8fe14e9ff85b62df72b779f8ab1.1736187990.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-01-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h f91a5b808938 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK") 3f330db30638 ("net: reformat kdoc return statements") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-03RDMA/mlx5: Enable multiplane mode only when it is supportedMark Zhang
Driver queries vport_cxt.num_plane and enables multiplane when it is greater then 0, but some old FWs (versions from x.40.1000 till x.42.1000), report vport_cxt.num_plane = 1 unexpectedly. Fix it by querying num_plane only when HCA_CAP2.multiplane bit is set. Fixes: 2a5db20fa532 ("RDMA/mlx5: Add support to multi-plane device and port") Link: https://patch.msgid.link/r/1ef901acdf564716fcf550453cf5e94f343777ec.1734610916.git.leon@kernel.org Cc: stable@vger.kernel.org Reported-by: Francesco Poli <invernomuto@paranoici.org> Closes: https://lore.kernel.org/all/nvs4i2v7o6vn6zhmtq4sgazy2hu5kiulukxcntdelggmznnl7h@so3oul6uwgbl/ Signed-off-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-01-02RDMA/mlx5: Fix link status down event for MPVPatrisious Haddad
The commit below prevented MPV from unloading correctly due to blocking the netdev down event, allow sending the event for MPV mode to maintain proper unload flow. Fixes: 379013776222 ("RDMA/mlx5: Handle link status event only for LAG device") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://patch.msgid.link/d7731478e456f61255af798a7fd4e64b006ddebb.1735567976.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-25RDMA/mlx5: Handle link status event only for LAG deviceYuyu Li
The link status events of non-LAG devices are now handled in ib_core, so only LAG device events need to be handled in driver. Signed-off-by: Yuyu Li <liyuyu6@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-23net/mlx5: fs, add counter object to flow destinationMoshe Shemesh
Currently mlx5_flow_destination includes counter_id which is assigned in case we use flow counter on the flow steering rule. However, counter_id is not enough data in case of using HW Steering. Thus, have mlx5_fc object as part of mlx5_flow_destination instead of counter_id and assign it where needed. In case counter_id is received from user space, create a local counter object to represent it. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20241219175841.1094544-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-10RDMA/mlx5: Extend ODP statistics with operation countChiara Meiohas
The current ODP counters represent the total number of pages handled, but it is not enough to understand the effectiveness of these operations. Extend the ODP counters to include the number of times page fault and invalidation events were handled. Example for a single page fault handling 512 pages: - page_fault: incremented by 512 (total pages) - page_fault_handled: incremented by 1 (operation count) The same example is applicable for page invalidation too. Previous output: $ rdma stat mr dev rocep8s0f0 mrn 8 page_faults 27 page_invalidations 0 page_prefetch 29 New output: $ rdma stat mr dev rocep8s0f0 mrn 21 page_faults 512 page_faults_handled 1 page_invalidations 0 page_invalidations_handled 0 page_prefetch 51200 Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/b18f29ed1392996ade66e9e6c45f018925253f6a.1733234165.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-12-05RDMA/mlx5: Enforce same type port association for multiport RoCEPatrisious Haddad
Different core device types such as PFs and VFs shouldn't be affiliated together since they have different capabilities, fix that by enforcing type check before doing the affiliation. Fixes: 32f69e4be269 ("{net, IB}/mlx5: Manage port association for multiport RoCE") Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/88699500f690dff1c1852c1ddb71f8a1cc8b956e.1733233480.git.leonro@nvidia.com Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds
Pull rdma updates from Jason Gunthorpe: "Seveal fixes scattered across the drivers and a few new features: - Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns - Force disassociate the userspace FD when hns does an async reset - bnxt new features for optimized modify QP to skip certain stayes, CQ coalescing, better debug dumping - mlx5 new data placement ordering feature - Faster destruction of mlx5 devx HW objects - Improvements to RDMA CM mad handling" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (51 commits) RDMA/bnxt_re: Correct the sequence of device suspend RDMA/bnxt_re: Use the default mode of congestion control RDMA/bnxt_re: Support different traffic class IB/cm: Rework sending DREQ when destroying a cm_id IB/cm: Do not hold reference on cm_id unless needed IB/cm: Explicitly mark if a response MAD is a retransmission RDMA/mlx5: Move events notifier registration to be after device registration RDMA/bnxt_re: Cache MSIx info to a local structure RDMA/bnxt_re: Refurbish CQ to NQ hash calculation RDMA/bnxt_re: Refactor NQ allocation RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved RDMA/hns: Fix different dgids mapping to the same dip_idx RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design bnxt_en: Add support for RoCE sriov configuration RDMA/hns: Fix NULL pointer derefernce in hns_roce_map_mr_sg() RDMA/hns: Fix out-of-order issue of requester when setting FENCE RDMA/nldev: Add IB device and net device rename events RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation RDMA/core: Move ib_uverbs_file struct to uverbs_types.h ...
2024-11-14RDMA/mlx5: Move events notifier registration to be after device registrationPatrisious Haddad
Move pkey change work initialization and cleanup from device resources stage to notifier stage, since this is the stage which handles this work events. Fix a race between the device deregistration and pkey change work by moving MLX5_IB_STAGE_DEVICE_NOTIFIER to be after MLX5_IB_STAGE_IB_REG in order to ensure that the notifier is deregistered before the device during cleanup. Which ensures there are no works that are being executed after the device has already unregistered which can cause the panic below. BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 630071 Comm: kworker/1:2 Kdump: loaded Tainted: G W OE --------- --- 5.14.0-162.6.1.el9_1.x86_64 #1 Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 02/27/2023 Workqueue: events pkey_change_handler [mlx5_ib] RIP: 0010:setup_qp+0x38/0x1f0 [mlx5_ib] Code: ee 41 54 45 31 e4 55 89 f5 53 48 89 fb 48 83 ec 20 8b 77 08 65 48 8b 04 25 28 00 00 00 48 89 44 24 18 48 8b 07 48 8d 4c 24 16 <4c> 8b 38 49 8b 87 80 0b 00 00 4c 89 ff 48 8b 80 08 05 00 00 8b 40 RSP: 0018:ffffbcc54068be20 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff954054494128 RCX: ffffbcc54068be36 RDX: ffff954004934000 RSI: 0000000000000001 RDI: ffff954054494128 RBP: 0000000000000023 R08: ffff954001be2c20 R09: 0000000000000001 R10: ffff954001be2c20 R11: ffff9540260133c0 R12: 0000000000000000 R13: 0000000000000023 R14: 0000000000000000 R15: ffff9540ffcb0905 FS: 0000000000000000(0000) GS:ffff9540ffc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010625c001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mlx5_ib_gsi_pkey_change+0x20/0x40 [mlx5_ib] process_one_work+0x1e8/0x3c0 worker_thread+0x50/0x3b0 ? rescuer_thread+0x380/0x380 kthread+0x149/0x170 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x22/0x30 Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_fwctl(OE) fwctl(OE) ib_uverbs(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlx_compat(OE) psample mlxfw(OE) tls knem(OE) netconsole nfsv3 nfs_acl nfs lockd grace fscache netfs qrtr rfkill sunrpc intel_rapl_msr intel_rapl_common rapl hv_balloon hv_utils i2c_piix4 pcspkr joydev fuse ext4 mbcache jbd2 sr_mod sd_mod cdrom t10_pi sg ata_generic pci_hyperv pci_hyperv_intf hyperv_drm drm_shmem_helper drm_kms_helper hv_storvsc syscopyarea hv_netvsc sysfillrect sysimgblt hid_hyperv fb_sys_fops scsi_transport_fc hyperv_keyboard drm ata_piix crct10dif_pclmul crc32_pclmul crc32c_intel libata ghash_clmulni_intel hv_vmbus serio_raw [last unloaded: ib_core] CR2: 0000000000000000 ---[ end trace f6f8be4eae12f7bc ]--- Fixes: 7722f47e71e5 ("IB/mlx5: Create GSI transmission QPs when P_Key table is changed") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/d271ceeff0c08431b3cbbbb3e2d416f09b6d1621.1731496944.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-04RDMA/mlx5: Add implementation for ufile_hw_cleanup device operationPatrisious Haddad
Implement the device API for ufile_hw_cleanup operation, which iterates over the ufile uobjects lists, and attempts to destroy DevX QPs, by issuing up to 8 commands in parallel. This function is responsible only for cleaning the FW resources of the QP, and doesn't necessarily cleanup all of its resources. Hence the normal serialized cleanup flow is still executed after it in __uverbs_cleanup_ufile() to cleanup the remaining resources and handle the cleanup of SW objects. In order to avoid double cleanup for the FW resources, new DevX flag was added DEVX_OBJ_FLAGS_HW_FREED, which marks the object's FW resources as already freed. Since QP destruction is the most time-consuming operation in FW, parallelizing it reduces the cleanup time of applications that use DevX QPs. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/2f82675d0412542cba1c47a6b86f589521ae41e1.1730373303.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-04RDMA/mlx5: Ensure active slave attachment to the bond IB deviceChiara Meiohas
Fix a race condition when creating a lag bond in active backup mode where after the bond creation the backup slave was attached to the IB device, instead of the active slave. This caused stale entries in the GID table, as the gid updating mechanism relies on ib_device_get_netdev(), which would return the backup slave. Send an MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE event when activating the lag, additionally to when modifying the lag. This ensures that eventually the active netdevice is stored in the bond IB device. When handling this event remove the GIDs of the previously attached netdevice in this port and rescan the GIDs of the newly attached netdevice. This ensures that eventually the active slave netdevice is correctly stored in the IB device port. While there might be a brief moment where the backup slave GIDs appear in the GID table, it will eventually stabilize with the correct GIDs (of the bond and the active slave). Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/91fc2cb24f63add266a528c1c702668a80416d9f.1730381292.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2024-11-04RDMA/mlx5: Call dev_put() after the blocking notifierChiara Meiohas
Move dev_put() call to occur directly after the blocking notifier, instead of within the event handler. Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions") Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/342ff94b3dcbb07da1c7dab862a73933d604b717.1730381292.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>