summaryrefslogtreecommitdiff
path: root/drivers/infiniband
AgeCommit message (Collapse)Author
2025-03-12RDMA/hns: Fix wrong value of max_sge_rdJunxian Huang
There is no difference between the sge of READ and non-READ operations in hns RoCE. Set max_sge_rd to the same value as max_send_sge. Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-8-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12RDMA/hns: Fix missing xa_destroy()Junxian Huang
Add xa_destroy() for xarray in driver. Fixes: 5c1f167af112 ("RDMA/hns: Init SRQ table for hip08") Fixes: 27e19f451089 ("RDMA/hns: Convert cq_table to XArray") Fixes: 736b5a70db98 ("RDMA/hns: Convert qp_table_tree to XArray") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-7-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12RDMA/hns: Fix a missing rollback in error path of hns_roce_create_qp_common()Junxian Huang
When ib_copy_to_udata() fails in hns_roce_create_qp_common(), hns_roce_qp_remove() should be called in the error path to clean up resources in hns_roce_qp_store(). Fixes: 0f00571f9433 ("RDMA/hns: Use new SQ doorbell register for HIP09") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-6-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12RDMA/hns: Fix invalid sq params not being blockedJunxian Huang
SQ params from userspace are checked in by set_user_sq_size(). But when the check fails, the function doesn't return but instead keep running and overwrite 'ret'. As a result, the invalid params will not get blocked actually. Add a return right after the failed check. Besides, although the check result of kernel sq params will not be overwritten, to keep coding style unified, move default_congest_type() before set_kernel_sq_size(). Fixes: 6ec429d5887a ("RDMA/hns: Support userspace configuring congestion control algorithm with QP granularity") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-5-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12RDMA/hns: Fix unmatched condition in error path of alloc_user_qp_db()Junxian Huang
Currently the condition of unmapping sdb in error path is not exactly the same as the condition of mapping in alloc_user_qp_db(). This may cause a problem of unmapping an unmapped db in some case, such as when the QP is XRC TGT. Unified the two conditions. Fixes: 90ae0b57e4a5 ("RDMA/hns: Combine enable flags of qp") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-4-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12RDMA/hns: Fix soft lockup during bt pages loopJunxian Huang
Driver runs a for-loop when allocating bt pages and mapping them with buffer pages. When a large buffer (e.g. MR over 100GB) is being allocated, it may require a considerable loop count. This will lead to soft lockup: watchdog: BUG: soft lockup - CPU#27 stuck for 22s! ... Call trace: hem_list_alloc_mid_bt+0x124/0x394 [hns_roce_hw_v2] hns_roce_hem_list_request+0xf8/0x160 [hns_roce_hw_v2] hns_roce_mtr_create+0x2e4/0x360 [hns_roce_hw_v2] alloc_mr_pbl+0xd4/0x17c [hns_roce_hw_v2] hns_roce_reg_user_mr+0xf8/0x190 [hns_roce_hw_v2] ib_uverbs_reg_mr+0x118/0x290 watchdog: BUG: soft lockup - CPU#35 stuck for 23s! ... Call trace: hns_roce_hem_list_find_mtt+0x7c/0xb0 [hns_roce_hw_v2] mtr_map_bufs+0xc4/0x204 [hns_roce_hw_v2] hns_roce_mtr_create+0x31c/0x3c4 [hns_roce_hw_v2] alloc_mr_pbl+0xb0/0x160 [hns_roce_hw_v2] hns_roce_reg_user_mr+0x108/0x1c0 [hns_roce_hw_v2] ib_uverbs_reg_mr+0x120/0x2bc Add a cond_resched() to fix soft lockup during these loops. In order not to affect the allocation performance of normal-size buffer, set the loop count of a 100GB MR as the threshold to call cond_resched(). Fixes: 38389eaa4db1 ("RDMA/hns: Add mtr support for mixed multihop addressing") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-3-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12RDMA/hns: Inappropriate format characters cleanupGuofeng Yue
Use %u for unsigned type and %d for enum. Signed-off-by: Guofeng Yue <yueguofeng@h-partners.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250311084857.3803665-2-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-12RDMA/bnxt_re: Avoid clearing VLAN_ID mask in modify qp pathSaravanan Vajravel
Driver is always clearing the mask that sets the VLAN ID/Service Level in the adapter. Recent change for supporting multiple traffic class exposed this issue. Allow setting SL and VLAN_ID while QP is moved from INIT to RTR state. Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver") Fixes: c64b16a37b6d ("RDMA/bnxt_re: Support different traffic class") Signed-off-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1741670196-2919-1-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/mlx5: Expose RDMA TRANSPORT flow table types to userspacePatrisious Haddad
This patch adds RDMA_TRANSPORT_RX and RDMA_TRANSPORT_TX as a new flow table type for matcher creation. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Link: https://patch.msgid.link/2287d8c50483e880450c7e8e08d9de34cdec1b14.1741261611.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/mlx5: Check enabled UCAPs when creating ucontextChiara Meiohas
Verify that the enabled UCAPs are supported by the device before creating the ucontext. If supported, create the ucontext with the associated capabilities. Store the privileged ucontext UID on creation and remove it when destroying the privileged ucontext. This allows the command interface to recognize privileged commands through its UID. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/8b180583a207cb30deb7a2967934079749cdcc44.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/uverbs: Add support for UCAPs in context creationChiara Meiohas
Add support for file descriptor array attribute for GET_CONTEXT commands. Check that the file descriptor (fd) array represents fds for valid UCAPs. Store the enabled UCAPs from the fd array as a bitmask in ib_ucontext. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/ebfb30bc947e2259b193c96a319c80e82599045b.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/mlx5: Create UCAP char devices for supported device capabilitiesChiara Meiohas
Create UCAP character devices when probing an IB device with supported firmware capabilities. If the RDMA_CTRL general object type is supported, check for specific UCTX capabilities: Create /dev/infiniband/mlx5_perm_ctrl_local for RDMA_UCAP_MLX5_CTRL_LOCAL Create /dev/infiniband/mlx5_perm_ctrl_other_vhca for RDMA_UCAP_MLX5_CTRL_OTHER_VHCA Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/30ed40e7a12a694cf4ee257459ed61b145b7837d.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-09RDMA/uverbs: Introduce UCAP (User CAPabilities) APIChiara Meiohas
Implement a new User CAPabilities (UCAP) API to provide fine-grained control over specific firmware features. This approach offers more granular capabilities than the existing Linux capabilities, which may be too generic for certain FW features. This mechanism represents each capability as a character device with root read-write access. Root processes can grant users special privileges by allowing access to these character devices (e.g., using chown). UCAP character devices are located in /dev/infiniband and the class path is /sys/class/infiniband_ucaps. Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Link: https://patch.msgid.link/5a1379187cd21178e8554afc81a3c941f21af22f.1741261611.git.leon@kernel.org Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-08RDMA/mana_ib: Use safer allocation function()Dan Carpenter
My static checker says this multiplication can overflow. I'm not an expert in this code but the call tree would be: ib_uverbs_handler_UVERBS_METHOD_QP_CREATE() <- reads cap from the user -> ib_create_qp_user() -> create_qp() -> mana_ib_create_qp() -> mana_ib_create_ud_qp() -> create_shadow_queue() It can't hurt to use safer interfaces. Fixes: c8017f5b4856 ("RDMA/mana_ib: UD/GSI work requests") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/58439ac0-1ee5-4f96-a595-7ab83b59139b@stanley.mountain Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-06RDMA/erdma: Prevent use-after-free in erdma_accept_newconn()Cheng Xu
After the erdma_cep_put(new_cep) being called, new_cep will be freed, and the following dereference will cause a UAF problem. Fix this issue. Fixes: 920d93eac8b9 ("RDMA/erdma: Add connection management (CM) support") Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-06RDMA/vmw_pvrdma: Remove unused pvrdma_modify_deviceDr. David Alan Gilbert
pvrdma_modify_device() was added in 2016 as part of commit 29c8d9eba550 ("IB: Add vmw_pvrdma driver") but accidentally it was never wired into the device_ops struct. After some discussion the best course seems to be just to remove it, see discussion at: https://lore.kernel.org/all/Z8TWF6coBUF3l_jk@gallifrey/ Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20250304215637.68559-1-linux@treblig.org Acked-by: Vishnu Dasa <vishnu.dasa@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-06RDMA/mlx5: Handle errors returned from mlx5r_ib_rate()Qasim Ijaz
In function create_ib_ah() the following line attempts to left shift the return value of mlx5r_ib_rate() by 4 and store it in the stat_rate_sl member of av: However the code overlooks the fact that mlx5r_ib_rate() may return -EINVAL if the rate passed to it is less than IB_RATE_2_5_GBPS or greater than IB_RATE_800_GBPS. Because of this, the code may invoke undefined behaviour when shifting a signed negative value when doing "-EINVAL << 4". To fix this check for errors before assigning stat_rate_sl and propagate any error value to the callers. Fixes: c534ffda781f ("RDMA/mlx5: Fix AH static rate parsing") Signed-off-by: Qasim Ijaz <qasdev00@gmail.com> Link: https://patch.msgid.link/20250304140246.205919-1-qasdev00@gmail.com Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/bnxt_re: Fix reporting maximum SRQs on P7 chipsPreethi G
Firmware reports support for additional SRQs in the max_srq_ext field. In CREQ_QUERY_FUNC response, if MAX_SRQ_EXTENDED flag is set, driver should derive the total number of max SRQs by the summation of "max_srq" and "max_srq_ext" fields. Fixes: b1b66ae094cd ("bnxt_en: Use FW defined resource limits for RoCE") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Preethi G <preethi.gurusiddalingeswaraswamy@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1741021178-2569-4-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/bnxt_re: Add missing paranthesis in map_qp_id_to_tbl_indxKashyap Desai
The modulo operation returns wrong result without the paranthesis and that resulted in wrong QP table indexing. Fixes: 84cf229f4001 ("RDMA/bnxt_re: Fix the qp table indexing") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1741021178-2569-3-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/bnxt_re: Fix allocation of QP tableKashyap Desai
Driver is creating QP table too early while probing before querying firmware capabilities. Driver currently is using a hard coded values of 64K as size while creating QP table. This resulted in a crash when firmwre supported QP count is more than 64K. To fix the issue, move the QP tabel creation after the firmware capabilities are queried. Use the firmware returned maximum value of QPs while creating the QP table. Fixes: b1b66ae094cd ("bnxt_en: Use FW defined resource limits for RoCE") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1741021178-2569-2-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/rxe: Fix the failure of ibv_query_device() and ibv_query_device_ex() testsZhu Yanjun
In rdma-core, the following failures appear. " $ ./build/bin/run_tests.py -k device ssssssss....FF........s ====================================================================== FAIL: test_query_device (tests.test_device.DeviceTest.test_query_device) Test ibv_query_device() ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/ubuntu/rdma-core/tests/test_device.py", line 63, in test_query_device self.verify_device_attr(attr, dev) File "/home/ubuntu/rdma-core/tests/test_device.py", line 200, in verify_device_attr assert attr.sys_image_guid != 0 ^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError ====================================================================== FAIL: test_query_device_ex (tests.test_device.DeviceTest.test_query_device_ex) Test ibv_query_device_ex() ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/ubuntu/rdma-core/tests/test_device.py", line 222, in test_query_device_ex self.verify_device_attr(attr_ex.orig_attr, dev) File "/home/ubuntu/rdma-core/tests/test_device.py", line 200, in verify_device_attr assert attr.sys_image_guid != 0 ^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError " The root cause is: before a net device is set with rxe, this net device is used to generate a sys_image_guid. Fixes: 2ac5415022d1 ("RDMA/rxe: Remove the direct link to net_device") Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Link: https://patch.msgid.link/20250302215444.3742072-1-yanjun.zhu@linux.dev Reviewed-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Tested-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/mlx5: Reorder capability check lastChristian Göttsche
capable() calls refer to enabled LSMs whether to permit or deny the request. This is relevant in connection with SELinux, where a capability check results in a policy decision and by default a denial message on insufficient permission is issued. It can lead to three undesired cases: 1. A denial message is generated, even in case the operation was an unprivileged one and thus the syscall succeeded, creating noise. 2. To avoid the noise from 1. the policy writer adds a rule to ignore those denial messages, hiding future syscalls, where the task performs an actual privileged operation, leading to hidden limited functionality of that task. 3. To avoid the noise from 1. the policy writer adds a rule to permit the task the requested capability, while it does not need it, violating the principle of least privilege. Signed-off-by: Christian Göttsche <cgzones@googlemail.com> Link: https://patch.msgid.link/20250302160657.127253-10-cgoettsche@seltendoof.de Reviewed-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/core: Fixes infiniband sysctl boundsNicolas Bouchinet
Bound infiniband iwcm and ucma sysctl writings between SYSCTL_ZERO and SYSCTL_INT_MAX. The proc_handler has thus been updated to proc_dointvec_minmax. Signed-off-by: Nicolas Bouchinet <nicolas.bouchinet@ssi.gouv.fr> Link: https://patch.msgid.link/20250224095826.16458-6-nicolas.bouchinet@clip-os.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Joel Granados <joel.granados@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/core: Don't expose hw_counters outside of init net namespaceRoman Gushchin
Commit 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") accidentally almost exposed hw counters to non-init net namespaces. It didn't expose them fully, as an attempt to read any of those counters leads to a crash like this one: [42021.807566] BUG: kernel NULL pointer dereference, address: 0000000000000028 [42021.814463] #PF: supervisor read access in kernel mode [42021.819549] #PF: error_code(0x0000) - not-present page [42021.824636] PGD 0 P4D 0 [42021.827145] Oops: 0000 [#1] SMP PTI [42021.830598] CPU: 82 PID: 2843922 Comm: switchto-defaul Kdump: loaded Tainted: G S W I XXX [42021.841697] Hardware name: XXX [42021.849619] RIP: 0010:hw_stat_device_show+0x1e/0x40 [ib_core] [42021.855362] Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 49 89 d0 4c 8b 5e 20 48 8b 8f b8 04 00 00 48 81 c7 f0 fa ff ff <48> 8b 41 28 48 29 ce 48 83 c6 d0 48 c1 ee 04 69 d6 ab aa aa aa 48 [42021.873931] RSP: 0018:ffff97fe90f03da0 EFLAGS: 00010287 [42021.879108] RAX: ffff9406988a8c60 RBX: ffff940e1072d438 RCX: 0000000000000000 [42021.886169] RDX: ffff94085f1aa000 RSI: ffff93c6cbbdbcb0 RDI: ffff940c7517aef0 [42021.893230] RBP: ffff97fe90f03e70 R08: ffff94085f1aa000 R09: 0000000000000000 [42021.900294] R10: ffff94085f1aa000 R11: ffffffffc0775680 R12: ffffffff87ca2530 [42021.907355] R13: ffff940651602840 R14: ffff93c6cbbdbcb0 R15: ffff94085f1aa000 [42021.914418] FS: 00007fda1a3b9700(0000) GS:ffff94453fb80000(0000) knlGS:0000000000000000 [42021.922423] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [42021.928130] CR2: 0000000000000028 CR3: 00000042dcfb8003 CR4: 00000000003726f0 [42021.935194] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [42021.942257] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [42021.949324] Call Trace: [42021.951756] <TASK> [42021.953842] [<ffffffff86c58674>] ? show_regs+0x64/0x70 [42021.959030] [<ffffffff86c58468>] ? __die+0x78/0xc0 [42021.963874] [<ffffffff86c9ef75>] ? page_fault_oops+0x2b5/0x3b0 [42021.969749] [<ffffffff87674b92>] ? exc_page_fault+0x1a2/0x3c0 [42021.975549] [<ffffffff87801326>] ? asm_exc_page_fault+0x26/0x30 [42021.981517] [<ffffffffc0775680>] ? __pfx_show_hw_stats+0x10/0x10 [ib_core] [42021.988482] [<ffffffffc077564e>] ? hw_stat_device_show+0x1e/0x40 [ib_core] [42021.995438] [<ffffffff86ac7f8e>] dev_attr_show+0x1e/0x50 [42022.000803] [<ffffffff86a3eeb1>] sysfs_kf_seq_show+0x81/0xe0 [42022.006508] [<ffffffff86a11134>] seq_read_iter+0xf4/0x410 [42022.011954] [<ffffffff869f4b2e>] vfs_read+0x16e/0x2f0 [42022.017058] [<ffffffff869f50ee>] ksys_read+0x6e/0xe0 [42022.022073] [<ffffffff8766f1ca>] do_syscall_64+0x6a/0xa0 [42022.027441] [<ffffffff8780013b>] entry_SYSCALL_64_after_hwframe+0x78/0xe2 The problem can be reproduced using the following steps: ip netns add foo ip netns exec foo bash cat /sys/class/infiniband/mlx4_0/hw_counters/* The panic occurs because of casting the device pointer into an ib_device pointer using container_of() in hw_stat_device_show() is wrong and leads to a memory corruption. However the real problem is that hw counters should never been exposed outside of the non-init net namespace. Fix this by saving the index of the corresponding attribute group (it might be 1 or 2 depending on the presence of driver-specific attributes) and zeroing the pointer to hw_counters group for compat devices during the initialization. With this fix applied hw_counters are not available in a non-init net namespace: find /sys/class/infiniband/mlx4_0/ -name hw_counters /sys/class/infiniband/mlx4_0/ports/1/hw_counters /sys/class/infiniband/mlx4_0/ports/2/hw_counters /sys/class/infiniband/mlx4_0/hw_counters ip netns add foo ip netns exec foo bash find /sys/class/infiniband/mlx4_0/ -name hw_counters Fixes: 467f432a521a ("RDMA/core: Split port and device counter sysfs attributes") Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Maher Sanalla <msanalla@nvidia.com> Cc: linux-rdma@vger.kernel.org Cc: linux-kernel@vger.kernel.org Link: https://patch.msgid.link/20250227165420.3430301-1-roman.gushchin@linux.dev Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-03-03RDMA/siw: Switch to using the crc32c libraryEric Biggers
Now that the crc32c() library function directly takes advantage of architecture-specific optimizations, it is unnecessary to go through the crypto API. Just use crc32c(). This is much simpler, and it improves performance due to eliminating the crypto API overhead. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://patch.msgid.link/20250227051207.19470-1-ebiggers@kernel.org Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.14-rc5). Conflicts: drivers/net/ethernet/cadence/macb_main.c fa52f15c745c ("net: cadence: macb: Synchronize stats calculations") 75696dd0fd72 ("net: cadence: macb: Convert to get_stats64") https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au Adjacent changes: drivers/net/ethernet/intel/ice/ice_sriov.c 79990cf5e7ad ("ice: Fix deinitializing VF in error path") a203163274a4 ("ice: simplify VF MSI-X managing") net/ipv4/tcp.c 18912c520674 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace") 297d389e9e5b ("net: prefix devmem specific helpers") net/mptcp/subflow.c 8668860b0ad3 ("mptcp: reset when MPTCP opts are dropped after join") c3349a22c200 ("mptcp: consolidate subflow cleanup") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24RDMA/hfi1: Remove unused one_qsfp_writeDr. David Alan Gilbert
The last use of one_qsfp_write() was removed in 2016's commit 145dd2b39958 ("IB/hfi1: Always turn on CDRs for low power QSFP modules") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Link: https://patch.msgid.link/20250223215543.153312-1-linux@treblig.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23RDMA/bnxt_re: Fix the page details for the srq created by kernel consumersKashyap Desai
While using nvme target with use_srq on, below kernel panic is noticed. [ 549.698111] bnxt_en 0000:41:00.0 enp65s0np0: FEC autoneg off encoding: Clause 91 RS(544,514) [ 566.393619] Oops: divide error: 0000 [#1] PREEMPT SMP NOPTI .. [ 566.393799] <TASK> [ 566.393807] ? __die_body+0x1a/0x60 [ 566.393823] ? die+0x38/0x60 [ 566.393835] ? do_trap+0xe4/0x110 [ 566.393847] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393867] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393881] ? do_error_trap+0x7c/0x120 [ 566.393890] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393911] ? exc_divide_error+0x34/0x50 [ 566.393923] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393939] ? asm_exc_divide_error+0x16/0x20 [ 566.393966] ? bnxt_qplib_alloc_init_hwq+0x1d4/0x580 [bnxt_re] [ 566.393997] bnxt_qplib_create_srq+0xc9/0x340 [bnxt_re] [ 566.394040] bnxt_re_create_srq+0x335/0x3b0 [bnxt_re] [ 566.394057] ? srso_return_thunk+0x5/0x5f [ 566.394068] ? __init_swait_queue_head+0x4a/0x60 [ 566.394090] ib_create_srq_user+0xa7/0x150 [ib_core] [ 566.394147] nvmet_rdma_queue_connect+0x7d0/0xbe0 [nvmet_rdma] [ 566.394174] ? lock_release+0x22c/0x3f0 [ 566.394187] ? srso_return_thunk+0x5/0x5f Page size and shift info is set only for the user space SRQs. Set page size and page shift for kernel space SRQs also. Fixes: 0c4dcd602817 ("RDMA/bnxt_re: Refactor hardware queue memory allocation") Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1740237621-29291-1-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23RDMA/mana_ib: Ensure variable err is initializedKees Bakker
In the function mana_ib_gd_create_dma_region if there are no dma blocks to process the variable `err` remains uninitialized. Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter") Signed-off-by: Kees Bakker <kees@ijzerbout.nl> Link: https://patch.msgid.link/20250221195833.7516C16290A@bout3.ijzerbout.nl Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-23RDMA/mlx5: Fix bind QP error cleanup flowPatrisious Haddad
When there is a failure during bind QP, the cleanup flow destroys the counter regardless if it is the one that created it or not, which is problematic since if it isn't the one that created it, that counter could still be in use. Fix that by destroying the counter only if it was created during this call. Fixes: 45842fc627c7 ("IB/mlx5: Support statistic q counter configuration") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/25dfefddb0ebefa668c32e06a94d84e3216257cf.1740033937.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-21net: Use link/peer netns in newlink() of rtnl_link_opsXiao Liang
Add two helper functions - rtnl_newlink_link_net() and rtnl_newlink_peer_net() for netns fallback logic. Peer netns falls back to link netns, and link netns falls back to source netns. Convert the use of params->net in netdevice drivers to one of the helper functions for clarity. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-4-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Pack newlink() params into structXiao Liang
There are 4 net namespaces involved when creating links: - source netns - where the netlink socket resides, - target netns - where to put the device being created, - link netns - netns associated with the device (backend), - peer netns - netns of peer device. Currently, two nets are passed to newlink() callback - "src_net" parameter and "dev_net" (implicitly in net_device). They are set as follows, depending on netlink attributes in the request. +------------+-------------------+---------+---------+ | peer netns | IFLA_LINK_NETNSID | src_net | dev_net | +------------+-------------------+---------+---------+ | | absent | source | target | | absent +-------------------+---------+---------+ | | present | link | link | +------------+-------------------+---------+---------+ | | absent | peer | target | | present +-------------------+---------+---------+ | | present | peer | link | +------------+-------------------+---------+---------+ When IFLA_LINK_NETNSID is present, the device is created in link netns first and then moved to target netns. This has some side effects, including extra ifindex allocation, ifname validation and link events. These could be avoided if we create it in target netns from the beginning. On the other hand, the meaning of src_net parameter is ambiguous. It varies depending on how parameters are passed. It is the effective link (or peer netns) by design, but some drivers ignore it and use dev_net instead. To provide more netns context for drivers, this patch packs existing newlink() parameters, along with the source netns, link netns and peer netns, into a struct. The old "src_net" is renamed to "net" to avoid confusion with real source netns, and will be deprecated later. The use of src_net are converted to params->net trivially. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21RDMA/rxe: Add support for the traditional Atomic operations with ODPDaisuke Matsuda
Enable 'fetch and add' and 'compare and swap' operations to be used with ODP. This is comprised of the following steps: 1. Check the driver page table(umem_odp->dma_list) to see if the target page is both readable and writable. 2. If not, then trigger page fault to map the page. 3. Convert its user space address to a kernel logical address using PFNs in the driver page table(umem_odp->pfn_list). 4. Execute the operation. Link: https://patch.msgid.link/r/20241220100936.2193541-6-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21RDMA/rxe: Add support for Send/Recv/Write/Read with ODPDaisuke Matsuda
rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses it to load payloads of requesting packets; responder uses it to process Send, Write, and Read operaetions; completer uses it to copy data from response packets of Read and Atomic operations to a user MR. Allow these operations to be used with ODP by adding a subordinate function rxe_odp_mr_copy(). It is comprised of the following steps: 1. Check the driver page table(umem_odp->dma_list) to see if pages being accessed are present with appropriate permission. 2. If necessary, trigger page fault to map the pages. 3. Convert their user space addresses to kernel logical addresses using PFNs in the driver page table(umem_odp->pfn_list). 4. Execute data copy to/from the pages. Link: https://patch.msgid.link/r/20241220100936.2193541-5-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21RDMA/rxe: Allow registering MRs for On-Demand PagingDaisuke Matsuda
Allow userspace to register an ODP-enabled MR, in which case the flag IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there is no RDMA operation enabled right now. They will be supported later in the subsequent two patches. rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. It syncs process address space from the CPU page table to the driver page table (dma_list/pfn_list in umem_odp) when called with RXE_PAGEFAULT_SNAPSHOT flag. Additionally, It can be used to trigger page fault when pages being accessed are not present or do not have proper read/write permissions, and possibly to prefetch pages in the future. Link: https://patch.msgid.link/r/20241220100936.2193541-4-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21RDMA/rxe: Add page invalidation supportDaisuke Matsuda
On page invalidation, an MMU notifier callback is invoked to unmap DMA addresses and update the driver page table(umem_odp->dma_list). The callback is registered when an ODP-enabled MR is created. Link: https://patch.msgid.link/r/20241220100936.2193541-3-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-21RDMA/rxe: Move some code to rxe_loc.h in preparation for ODPDaisuke Matsuda
rxe_mr_init() and resp_states are going to be used in rxe_odp.c, which is to be created in the subsequent patch. Link: https://patch.msgid.link/r/20241220100936.2193541-2-matsuda-daisuke@fujitsu.com Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-02-20RDMA/mlx5: Fix AH static rate parsingPatrisious Haddad
Previously static rate wasn't translated according to our PRM but simply used the 4 lower bytes. Correctly translate static rate value passed in AH creation attribute according to our PRM expected values. In addition change 800GB mapping to zero, which is the PRM specified value. Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Link: https://patch.msgid.link/18ef4cc5396caf80728341eb74738cd777596f60.1739187089.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-20RDMA/mlx5: Fix implicit ODP hang on parent deregistrationYishai Hadas
Fix the destroy_unused_implicit_child_mr() to prevent hanging during parent deregistration as of below [1]. Upon entering destroy_unused_implicit_child_mr(), the reference count for the implicit MR parent is incremented using: refcount_inc_not_zero(). A corresponding decrement must be performed if free_implicit_child_mr_work() is not called. The code has been updated to properly manage the reference count that was incremented. [1] INFO: task python3:2157 blocked for more than 120 seconds. Not tainted 6.12.0-rc7+ #1633 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:python3 state:D stack:0 pid:2157 tgid:2157 ppid:1685 flags:0x00000000 Call Trace: <TASK> __schedule+0x420/0xd30 schedule+0x47/0x130 __mlx5_ib_dereg_mr+0x379/0x5d0 [mlx5_ib] ? __pfx_autoremove_wake_function+0x10/0x10 ib_dereg_mr_user+0x5f/0x120 [ib_core] ? lock_release+0xc6/0x280 destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs] uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs] uobj_destroy+0x3f/0x70 [ib_uverbs] ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs] ? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs] ? lock_acquire+0xc1/0x2f0 ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0x116/0x170 [ib_uverbs] ? lock_release+0xc6/0x280 ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs] ? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs] __x64_sys_ioctl+0x1b0/0xa70 ? kmem_cache_free+0x221/0x400 do_syscall_64+0x6b/0x140 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f20f21f017b RSP: 002b:00007ffcfc4a77c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffcfc4a78d8 RCX: 00007f20f21f017b RDX: 00007ffcfc4a78c0 RSI: 00000000c0181b01 RDI: 0000000000000003 RBP: 00007ffcfc4a78a0 R08: 000056147d125190 R09: 00007f20f1f14c60 R10: 0000000000000001 R11: 0000000000000246 R12: 00007ffcfc4a7890 R13: 000000000000001c R14: 000056147d100fc0 R15: 00007f20e365c9d0 </TASK> Fixes: d3d930411ce3 ("RDMA/mlx5: Fix implicit ODP use after free") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Artemy Kovalyov <artemyko@nvidia.com> Link: https://patch.msgid.link/80f2fcd19952dfa7d9981d93fd6359b4471f8278.1739186929.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-19RDMA/core: Fix best page size finding when it can cross SG entriesMichael Margolin
A single scatter-gather entry is limited by a 32 bits "length" field that is practically 4GB - PAGE_SIZE. This means that even when the memory is physically contiguous, we might need more than one entry to represent it. Additionally when using dmabuf, the sg_table might be originated outside the subsystem and optimized for other needs. For instance an SGT of 16GB GPU continuous memory might look like this: (a real life example) dma_address 34401400000, length fffff000 dma_address 345013ff000, length fffff000 dma_address 346013fe000, length fffff000 dma_address 347013fd000, length fffff000 dma_address 348013fc000, length 4000 Since ib_umem_find_best_pgsz works within SG entries, in the above case we will result with the worst possible 4KB page size. Fix this by taking into consideration only the alignment of addresses of real discontinuity points rather than treating SG entries as such, and adjust the page iterator to correctly handle cross SG entry pages. There is currently an assumption that drivers do not ask for pages bigger than maximal DMA size supported by their devices. Reviewed-by: Firas Jahjah <firasj@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250217141623.12428-1-mrgolin@amazon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-18IB/iser: fix typos in iscsi_iser.c commentsImanol
Fixes multiple occurrences of the misspelled word "occured" in the comments of `iscsi_iser.c`, replacing them with the correct spelling "occurred". This improves readability without affecting functionality. Signed-off-by: Imanol <imvalient@protonmail.com> Link: https://patch.msgid.link/20250217183048.9394-1-imvalient@protonmail.com Acked-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-18RDMA/mana_ib: Implement DMABUF MR supportKonstantin Taranov
Add support of dmabuf MRs to mana_ib. Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1739454861-4456-1-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-18RDMA: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Zack Rusin <zack.rusin@broadcom.com> Link: https://lore.kernel.org/all/37bd6895bb946f6d785ab5fe32f1a6f4b9e77c26.1738746904.git.namcao@linutronix.de
2025-02-14RDMA/irdma: Switch to using the crc32c libraryEric Biggers
Now that the crc32c() library function directly takes advantage of architecture-specific optimizations, it is unnecessary to go through the crypto API. Just use crc32c(). This is much simpler, and it improves performance due to eliminating the crypto API overhead. Note that for crc32c the equivalent of crypto_shash_digest() is cpu_to_le32(~crc32c(~0, ...)), considering that crypto_shash_digest() had before and inversions as well as a cpu_to_le32() built-in. This means that this driver is using u32 for fixed-endian types; this patch does not try to fix that but rather just keep the exact same behavior. Link: https://lore.kernel.org/r/20250207033643.59904-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://patch.msgid.link/20250207040816.69163-1-ebiggers@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-10RDMA/bnxt_re: Fix the statistics for Gen P7 VFSelvin Xavier
Gen P7 VF support the extended stats and is prevented by a VF check. Fix the check to issue the FW command for GenP7 VFs also. Fixes: 1801d87b3598 ("RDMA/bnxt_re: Support new 5760X P7 devices") Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1738657285-23968-5-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-10RDMA/bnxt_re: Fix issue in the unload pathKalesh AP
The cited comment removed the netdev notifier register call from the driver. But, it did not remove the cleanup code from the unload path. As a result, driver unload is not clean and resulted in undesired behaviour. Fixes: d3b15fcc4201 ("RDMA/bnxt_re: Remove deliver net device event") Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1738657285-23968-4-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-10RDMA/bnxt_re: Add sanity checks on rdev validityKalesh AP
There is a possibility that ulp_irq_stop and ulp_irq_start callbacks will be called when the device is in detached state. This can cause a crash due to NULL pointer dereference as the rdev is already freed. Fixes: cc5b9b48d447 ("RDMA/bnxt_re: Recover the device when FW error is detected") Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1738657285-23968-3-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-10RDMA/bnxt_re: Fix an issue in bnxt_re_async_notifierKalesh AP
In the bnxt_re_async_notifier() callback, the way driver retrieves rdev pointer is wrong. The rdev pointer should be parsed from adev pointer as while registering with the L2 for ULP, driver uses the aux device pointer for the handle. Fixes: 7fea32784068 ("RDMA/bnxt_re: Add Async event handling support") Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1738657285-23968-2-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-09RDMA/bnxt_re: Fix the condition check while programming congestion controlSelvin Xavier
Program the Congestion control values when the CC gen matches. Fix the condition check for the same. Fixes: 656dff55da19 ("RDMA/bnxt_re: Congestion control settings using debugfs hook") Reported-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reported-by: Chengchang Tang <tangchengchang@huawei.com> Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Link: https://patch.msgid.link/1739022506-8937-1-git-send-email-selvin.xavier@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-02-09RDMA/bnxt_re: Fix buffer overflow in debugfs codeDan Carpenter
Add some bounds checking to prevent memory corruption in bnxt_re_cc_config_set(). This is debugfs code so the bug can only be triggered by root. Fixes: 656dff55da19 ("RDMA/bnxt_re: Congestion control settings using debugfs hook") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/a6b081ab-55fe-4d0c-8f69-c5e5a59e9141@stanley.mountain Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>