git.armlinux.org.uk/linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-07-13	Optimize DMABUF mkey page size in mlx5	Leon Romanovsky
	From Edward: This patch series enables the mlx5 driver to dynamically choose the optimal page size for a DMABUF-based memory key (mkey), rather than always registering with a fixed page size. Previously, DMABUF memory registration used a fixed 4K page size for mkeys which could lead to suboptimal performance when the underlying memory layout may offer better page sizes. This approach did not take advantage of larger page size capabilities advertised by the HCA, and the driver was not setting the proper page size mask in the mkey mask when performing page size changes, potentially leading to invalid registrations when updating to a very large pages. This series improves DMABUF performance by dynamically selecting the best page size for a given memory region (MR) both at creation time and on page fault occurrences, based on the underlying layout and fixing related gaps and bugs. By doing so, we reduce the number of page table entries (and thus MTT/ KSM descriptors) that the HCA must traverse, which in turn reduces cache-line fetches. Thanks * mlx5-next: RDMA/mlx5: Fix UMR modifying of mkey page size net/mlx5: Expose HCA capability bits for mkey max page size Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13	RDMA/mlx5: Fix UMR modifying of mkey page size	Edward Srouji
	When changing the page size on an mkey, the driver needs to set the appropriate bits in the mkey mask to indicate which fields are being modified. The 6th bit of a page size in mlx5 driver is considered an extension, and this bit has a dedicated capability and mask bits. Previously, the driver was not setting this mask in the mkey mask when performing page size changes, regardless of its hardware support, potentially leading to an incorrect page size updates. This fixes the issue by setting the relevant bit in the mkey mask when performing page size changes on an mkey and the 6th bit of this field is supported by the hardware. Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits") Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09	RDMA/efa: Add Network HW statistics counters	Basel Nassar
	Update device API and request network counters. Expose newly added counters through ib core counters mechanism. Reviewed-by: David Shoolman <shoolman@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Basel Nassar <baselna@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250706070740.22534-1-mrgolin@amazon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09	IB/cm: Use separate agent w/o flow control for REP	Vlad Dumitrescu
	Most responses (e.g., RTU) are not subject to flow control, as there is no further response expected. However, REPs are both requests (waiting for RTUs) and responses (being waited by REQs). With agent-level flow control added to the MAD layer, REPs can get delayed by outstanding REQs. This can cause a problem in a scenario such as 2 hosts connecting to each other at the same time. Both hosts fill the flow control outstanding slots with REQs. The corresponding REPs are now blocked behind those REQs, and neither side can make progress until REQs time out. Add a separate MAD agent which is only used to send REPs. This agent does not have a recv_handler as it doesn't process responses nor does it register to receive requests. Disable flow control for agents w/o a recv_handler, as they aren't waiting for responses. This allows the newly added REP agent to send even when clients are slow to generate RTU, which would be needed to unblock flow control outstanding slots. Relax check in ib_post_send_mad to allow retries for this agent. REPs will be retried by the MAD layer until CM layer receives a response (e.g., RTU) on the normal agent and cancels them. Suggested-by: Sean Hefty <shefty@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Sean Hefty <shefty@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Link: https://patch.msgid.link/9ac12d0842b849e2c8537d6e291ee0af9f79855c.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09	IB/mad: Add flow control for solicited MADs	Or Har-Toov
	Currently, MADs sent via an agent are being forwarded directly to the corresponding MAD QP layer. MADs with a timeout value set and requiring a response (solicited MADs) will be resent if the timeout expires without receiving a response. In a congested subnet, flooding MAD QP layer with more solicited send requests from the agent will only worsen the situation by triggering more timeouts and therefore more retries. Thus, add flow control for non-user solicited MADs to block agents from issuing new solicited MAD requests to the MAD QP until outstanding requests are completed and the MAD QP is ready to process additional requests. While at it, keep track of the total outstanding solicited MAD work requests in send or wait list. The number of outstanding send WRs will be limited by a fraction of the RQ size, and any new send WR that exceeds that limit will be held in a backlog list. Backlog MADs will be forwarded to agent send list only once the total number of outstanding send WRs falls below the limit. Unsolicited MADs, RMPP MADs and MADs which are not SA, SMP or CM are not subject to this flow control mechanism and will not be affected by this change. For this purpose, a new state is introduced: - 'IB_MAD_STATE_QUEUED': MAD is in backlog list Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/c0ecaa1821badee124cd13f3bf860f67ce453beb.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09	IB/mad: Add state machine to MAD layer	Or Har-Toov
	Replace the use of refcount, timeout and status with a 'state' field to track the status of MADs send work requests (WRs). The state machine better represents the stages in the MAD lifecycle, specifically indicating whether the MAD is waiting for a completion, waiting for a response, was canceld or is done. The existing refcount only takes two values: 1 : MAD is waiting either for completion or for response. 2 : MAD is waiting for both response and completion. Also when a response was received before a completion notification. The status field represents if the MAD was canceled at some point in the flow. The timeout is used to represent if a response was received. The current state transitions are not clearly visible, and developers needs to infer the state from the refcount's, timeout's or status's value, which is error-prone and difficult to follow. Thus, replace with a state machine as the following: - 'IB_MAD_STATE_INIT': MAD is in the making and is not yet in any list - 'IB_MAD_STATE_SEND_START': MAD was sent to the QP and is waiting for completion notification in send list - 'IB_MAD_STATE_WAIT_RESP': MAD send completed successfully, waiting for a response in wait list - 'IB_MAD_STATE_EARLY_RESP': Response came early, before send completion notification, MAD is in the send list - 'IB_MAD_STATE_CANCELED': MAD was canceled while in send or wait list - 'IB_MAD_STATE_DONE': MAD processing completed, MAD is in no list Adding the state machine also make it possible to remove the double call for ib_mad_complete_send_wr in case of an early response and the use of a done list in case of a regular response. While at it, define a helper to clear error MADs which will handle freeing MADs that timed out or have been cancelled. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/48e6ae8689dc7bb8b4ba6e5ec562e1b018db88a8.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07	Merge branch 'mlx5-next' into wip/leon-for-next	Leon Romanovsky
	* mlx5-next: net/mlx5: Check device memory pointer before usage net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow net/mlx5: Add IFC bits for PCIe Congestion Event object net/mlx5: Small refactor for general object capabilities
2025-07-07	RDMA/bnxt_re: Use macro instead of hard coded value	Kalesh AP
	1. Defined a macro for the hard coded value. 2. "access" field in the request structure is of type "u8". Updated the mask accordingly. Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Link: https://patch.msgid.link/20250704043857.19158-4-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07	RDMA/bnxt_re: Support 2G message size	Selvin Xavier
	bnxt_qplib_put_sges is calculating the length in a signed int. So handling the 2G message size is not working since it is considered as negative. Use a unsigned number to calculate the total message length. As per the spec, IB message size shall be between zero and 2^31 bytes (inclusive). So adding a check for 2G message size. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Link: https://patch.msgid.link/20250704043857.19158-3-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07	RDMA/bnxt_re: Fix size of uverbs_copy_to() in BNXT_RE_METHOD_GET_TOGGLE_MEM	Kalesh AP
	Since both "length" and "offset" are of type u32, there is no functional issue here. Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250704043857.19158-2-kalesh-anakkur.purayil@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07	RDMA/hns: Fix -Wframe-larger-than issue	Junxian Huang
	Fix -Wframe-larger-than issue by allocating memory for qpc struct with kzalloc() instead of using stack memory. Fixes: 606bf89e98ef ("RDMA/hns: Refactor for hns_roce_v2_modify_qp function") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506240032.CSgIyFct-lkp@intel.com/ Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-7-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07	RDMA/hns: Drop GFP_NOWARN	Junxian Huang
	GFP_NOWARN silences all warnings on dma_alloc_coherent() failure, which might otherwise help with troubleshooting. Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-6-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07	RDMA/hns: Fix accessing uninitialized resources	Junxian Huang
	hr_dev->pgdir_list and hr_dev->pgdir_mutex won't be initialized if CQ/QP record db are not enabled, but they are also needed when using SRQ with SRQ record db enabled. Simplified the logic by always initailizing the reosurces. Fixes: c9813b0b9992 ("RDMA/hns: Support SRQ record doorbell") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-5-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07	RDMA/hns: Get message length of ack_req from FW	Junxian Huang
	ACK_REQ_FREQ indicates the number of packets (after MTU fragmentation) HW sends before setting an ACK request. When MTU is greater than or equal to 1024, the current ACK_REQ_FREQ value causes HW to request an ACK for every MTU fragment. The processing of a large number of ACKs severely impacts HW performance when sending large size payloads. Get message length of ack_req from FW so that we can adjust this parameter according to different situations. There are several constraints for ACK_REQ_FREQ: 1. mtu * (2 ^ ACK_REQ_FREQ) should not be too large, otherwise it may cause some unexpected retries when sending large payload. 2. ACK_REQ_FREQ should be larger than or equal to LP_PKTN_INI. 3. ACK_REQ_FREQ must be equal to LP_PKTN_INI when using LDCP or HC3 congestion control algorithm. Fixes: 56518a603fd2 ("RDMA/hns: Modify the value of long message loopback slice") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-4-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-06	RDMA/hns: Fix HW configurations not cleared in error flow	wenglianfa
	hns_roce_clear_extdb_list_info() will eventually do some HW configurations through FW, and they need to be cleared by calling hns_roce_function_clear() when the initialization fails. Fixes: 7e78dd816e45 ("RDMA/hns: Clear extended doorbell info before using") Signed-off-by: wenglianfa <wenglianfa@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-3-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-06	RDMA/hns: Fix double destruction of rsv_qp	wenglianfa
	rsv_qp may be double destroyed in error flow, first in free_mr_init(), and then in hns_roce_exit(). Fix it by moving the free_mr_init() call into hns_roce_v2_init(). list_del corruption, ffff589732eb9b50->next is LIST_POISON1 (dead000000000100) WARNING: CPU: 8 PID: 1047115 at lib/list_debug.c:53 __list_del_entry_valid+0x148/0x240 ... Call trace: __list_del_entry_valid+0x148/0x240 hns_roce_qp_remove+0x4c/0x3f0 [hns_roce_hw_v2] hns_roce_v2_destroy_qp_common+0x1dc/0x5f4 [hns_roce_hw_v2] hns_roce_v2_destroy_qp+0x22c/0x46c [hns_roce_hw_v2] free_mr_exit+0x6c/0x120 [hns_roce_hw_v2] hns_roce_v2_exit+0x170/0x200 [hns_roce_hw_v2] hns_roce_exit+0x118/0x350 [hns_roce_hw_v2] __hns_roce_hw_v2_init_instance+0x1c8/0x304 [hns_roce_hw_v2] hns_roce_hw_v2_reset_notify_init+0x170/0x21c [hns_roce_hw_v2] hns_roce_hw_v2_reset_notify+0x6c/0x190 [hns_roce_hw_v2] hclge_notify_roce_client+0x6c/0x160 [hclge] hclge_reset_rebuild+0x150/0x5c0 [hclge] hclge_reset+0x10c/0x140 [hclge] hclge_reset_subtask+0x80/0x104 [hclge] hclge_reset_service_task+0x168/0x3ac [hclge] hclge_service_task+0x50/0x100 [hclge] process_one_work+0x250/0x9a0 worker_thread+0x324/0x990 kthread+0x190/0x210 ret_from_fork+0x10/0x18 Fixes: fd8489294dd2 ("RDMA/hns: Fix Use-After-Free of rsv_qp on HIP08") Signed-off-by: wenglianfa <wenglianfa@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-2-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	net/mlx5: Check device memory pointer before usage	Stav Aviram
	Add a NULL check before accessing device memory to prevent a crash if dev->dm allocation in mlx5_init_once() fails. Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core") Signed-off-by: Stav Aviram <saviram@nvidia.com> Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow	Patrisious Haddad
	Failing during the initialization of root_namespace didn't cleanup the priorities of the namespace on which the failure occurred. Properly cleanup said priorities on failure. Fixes: 52931f55159e ("net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/78cf89b5d8452caf1e979350b30ada6904362f66.1751451780.git.leon@kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	Fix dma_unmap_sg() nents value	Thomas Fourier
	The dma_unmap_sg() functions should be called with the same nents as the dma_map_sg(), not the value the map function returned. Fixes: ed10435d3583 ("RDMA/erdma: Implement hierarchical MTT") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Link: https://patch.msgid.link/20250630092346.81017-2-fourier.thomas@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	RDMA/counter: Check CAP_NET_RAW check in user namespace for RDMA counters	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 1bd8e0a9d0fd ("RDMA/counter: Allow manual mode configuration support") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/68e2064e72e94558a576fdbbb987681a64f6fea8.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	RDMA/nldev: Check CAP_NET_RAW in user namespace for QP modify	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to modify the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 0cadb4db79e1 ("RDMA/uverbs: Restrict usage of privileged QKEYs") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/099eb263622ccdd27014db7e02fec824a3307829.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	RDMA/mlx5: Check CAP_NET_RAW in user namespace for devx create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the devx object. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: a8b92ca1b0e5 ("IB/mlx5: Introduce DEVX") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/36ee87e92defd81410c6a2b33f9d6c0d6dcfd64c.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02	RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/3914ef9702b01de8843a391ce397fca67d0fc7af.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 6d1e7ba241e9 ("IB/uverbs: Introduce create/destroy QP commands over ioctl") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/7b6b87505ccc28a1f7b4255af94d898d2df0fff5.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/uverbs: Check CAP_NET_RAW in user namespace for QP create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 2dee0e545894 ("IB/uverbs: Enable QP creation with a given source QP number") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/0e5920d1dfe836817bb07576b192da41b637130b.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/mlx5: Check CAP_NET_RAW in user namespace for anchor create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the anchor. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 0c6ab0ca9a66 ("RDMA/mlx5: Expose steering anchor to userspace") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/c2376ca75e7658e2cbd1f619cf28fbe98c906419.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/mlx5: Check CAP_NET_RAW in user namespace for flow create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the flow. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 322694412400 ("IB/mlx5: Introduce driver create and destroy flow methods") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/a4dcd5e3ac6904ef50b19e56942ca6ab0728794c.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01	RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow create	Parav Pandit
	Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the flow resource. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs") Signed-off-by: Parav Pandit <parav@nvidia.com> Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Link: https://patch.msgid.link/6df6f2f24627874c4f6d041c19dc1f6f29f68f84.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA/ipoib: Use parent rdma device net namespace	Mark Bloch
	Use the net namespace of the underlying rdma device. After honoring the rdma device's namespace, the ipoib netdev now also runs in the same net namespace of the rdma device. Add an API to read the net namespace of the rdma device so that ULP such as IPoIB can use it to initialize its netdev. Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26	RDMA/mlx5: Allocate IB device with net namespace supplied from core dev	Mark Bloch
	Use the new ib_alloc_device_with_net() API to allocate the IB device so that it is properly bound to the network namespace obtained via mlx5_core_net(). This change ensures correct namespace association (e.g., for containerized setups). Additionally, expose mlx5_core_net so that RDMA driver can use it. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26	RDMA/core: Extend RDMA device registration to be net namespace aware	Mark Bloch
	Presently, RDMA devices are always registered within the init network namespace, even if the associated devlink device's namespace was changed via a devlink reload. This mismatch leads to discrepancies between the network namespace of the devlink device and that of the RDMA device. Therefore, extend the RDMA device allocation API to optionally take the net namespace. This isn't limited to devices that support devlink but allows all users to provide the network namespace if they need to do so. If a network namespace is provided during device allocation, it's up to the caller to make sure the namespace stays valid until ib_register_device() is called. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-26	RDMA/rxe: Fix a couple IS_ERR() vs NULL bugs	Dan Carpenter
	The lookup_mr() function returns NULL on error. It never returns error pointers. Fixes: 9284bc34c773 ("RDMA/rxe: Enable asynchronous prefetch for ODP MRs") Fixes: 3576b0df1588 ("RDMA/rxe: Implement synchronous prefetch for ODP MRs") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://patch.msgid.link/685c1430.050a0220.18b0ef.da83@mx.google.com Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA/siw: work around clang stack size warning	Arnd Bergmann
	clang inlines a lot of functions into siw_qp_sq_process(), with the aggregate stack frame blowing the warning limit in some configurations: drivers/infiniband/sw/siw/siw_qp_tx.c:1014:5: error: stack frame size (1544) exceeds limit (1280) in 'siw_qp_sq_process' [-Werror,-Wframe-larger-than] The real problem here is the array of kvec structures in siw_tx_hdt that makes up the majority of the consumed stack space. Ideally there would be a way to avoid allocating the array on the stack, but that would require a larger rework. Add a noinline_for_stack annotation to avoid the warning for now, and make clang behave the same way as gcc here. The combined stack usage is still similar, but is spread over multiple functions now. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250620114332.4072051-1-arnd@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Acked-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMI: hfi1: drop cpumask_empty() call in hfi1/affinity.c	Yury Norov [NVIDIA]
	In few places, the driver tests a cpumask for emptiness immediately before calling functions that report emptiness themself. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-8-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA: hfi1: simplify hfi1_get_proc_affinity()	Yury Norov [NVIDIA]
	The function protects the for loop with affinity->num_core_siblings > 0 condition, which is redundant because the loop will break immediately in that case. Drop it and save one indentation level. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-7-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA: hfi1: use rounddown in find_hw_thread_mask()	Yury Norov [NVIDIA]
	num_cores_per_socket is calculated by dividing by node_affinity.num_online_nodes, but all users of this variable multiply it by node_affinity.num_online_nodes again. This effectively is the same as rounding it down by node_affinity.num_online_nodes. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-6-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA: hfi1: simplify init_real_cpu_mask()	Yury Norov [NVIDIA]
	The function opencodes cpumask_nth() and cpumask_clear_cpus(). The dedicated helpers are easier to use and usually much faster than opencoded for-loops. While there, drop useless clear of real_cpu_mask. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-5-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA: hfi1: simplify find_hw_thread_mask()	Yury Norov [NVIDIA]
	The function opencodes cpumask_nth() and cpumask_clear_cpus(). The dedicated helpers are easier to use and usually much faster than opencoded for-loops. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-4-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-26	RDMA: hfi1: fix possible divide-by-zero in find_hw_thread_mask()	Yury Norov [NVIDIA]
	The function divides number of online CPUs by num_core_siblings, and later checks the divider by zero. This implies a possibility to get and divide-by-zero runtime error. Fix it by moving the check prior to division. This also helps to save one indentation level. Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Link: https://patch.msgid.link/20250604193947.11834-3-yury.norov@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25	RDMA/core: reduce stack using in nldev_stat_get_doit()	Arnd Bergmann
	In the s390 defconfig, gcc-10 and earlier end up inlining three functions into nldev_stat_get_doit(), and each of them uses some 600 bytes of stack. The result is a function with an overly large stack frame and a warning: drivers/infiniband/core/nldev.c:2466:1: error: the frame size of 1720 bytes is larger than 1280 bytes [-Werror=frame-larger-than=] Mark the three functions noinline_for_stack to prevent this, ensuring that only one copy of the nlattr array is on the stack of each function. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://patch.msgid.link/20250620113335.3776965-1-arnd@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25	RDMA/mlx5: Add multiple priorities support to RDMA TRANSPORT userspace tables	Patrisious Haddad
	Support the creation of RDMA TRANSPORT tables over multiple priorities via matcher creation. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/bb38e50ae4504e979c6568d41939402a4cf15635.1750148083.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25	Add multiple priorities support to mlx5 RDMA TRANSPORT tables	Leon Romanovsky
	From Patrisious: This short series from Patrisious extends mlx5 flow steering logic to allow creation rule creation with priorities in RDMA TRANSPORT tables. Thanks Link: https://lore.kernel.org/all/cover.1750148083.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> * mlx5-next: (200 commits) net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain Linux 6.16-rc2 ...
2025-06-25	net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain	Patrisious Haddad
	RDMA TRANSPORT domains were initially limited to a single priority. This change allows the domains to have multiple priorities, making it possible to add several rules and control the order in which they're evaluated. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/b299cbb4c8678a33da6e6b6988b5bf6145c54b88.1750148083.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25	RDMA/mlx5: Support driver APIs pre_destroy_cq and post_destroy_cq	Mark Zhang
	- pre_destroy_cq: Destroy FW CQ object so that no new CQ event would be generated; - post_destroy_cq: Release all resources. This patch, along with last one, fixes the crash below. Unable to handle kernel paging request at virtual address ffff8000114b1180 Mem abort info: ESR = 0x96000047 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000047 CM = 0, WnR = 1 swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000f4582000 [ffff8000114b1180] pgd=00000447fffff003, p4d=00000447fffff003, pud=00000447ffffe003, pmd=00000447ffffb003, pte=0000000000000000 Internal error: Oops: 96000047 [#1] SMP Modules linked in: udp_diag uio_pci_generic uio tcp_diag inet_diag binfmt_misc sn_core_odd(OE) rpcrdma(OE) xprtrdma(OE) ib_isert(OE) ib_iser(OE) ib_srpt(OE) ib_srp(OE) ib_ipoib(OE) kpatch_9658536(OK) kpatch_9322385(OK) kpatch_8843421(OK) kpatch_8636216(OK) vfat fat aes_ce_blk crypto_simd cryptd aes_ce_cipher crct10dif_ce ghash_ce sm4_ce sha2_ce sha256_arm64 sha1_ce sbsa_gwdt sg acpi_ipmi ipmi_si ipmi_msghandler m1_uncore_ddrss_pmu m1_uncore_cmn_pmu team_yosemite9rc6(OE) vnic(OE) ip_tables mlx5_ib(OE) sd_mod ast mlx5_core(OE) i2c_algo_bit drm_vram_helper psample drm_kms_helper mlxdevm(OE) auxiliary(OE) mlxfw(OE) syscopyarea sysfillrect tls sysimgblt fb_sys_fops drm_ttm_helper nvme ttm nvme_core drm t10_pi i2c_designware_platform i2c_designware_core i2c_core ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE) [last unloaded: ipmi_devintf] CPU: 83 PID: 59375 Comm: kworker/u253:1 Kdump: loaded Tainted: G OE K 5.10.84-004.ali5000.alios7.aarch64 #1 Hardware name: Inspur AliServer-Xuanwu2.0AM-02-2UM1P-5B/AS1221MG1, BIOS 1.2.M1.AL.P.158.00 08/31/2023 Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core] pstate: 82c00089 (Nzcv daIf +PAN +UAO +TCO BTYPE=--) pc : native_queued_spin_lock_slowpath+0x1c4/0x31c lr : mlx5_ib_poll_cq+0x18c/0x2f8 [mlx5_ib] sp : ffff80002be1bc80 x29: ffff80002be1bc80 x28: ffff000810e69000 x27: ffff000810e69000 x26: ffff000810e69200 x25: 0000000000000000 x24: ffff8000117db000 x23: ffff04000156b780 x22: 0000000000000000 x21: ffff04000ce6c160 x20: ffff0008196a4000 x19: 0000000000000010 x18: 0000000000000020 x17: 0000000000000000 x16: 0000000000000000 x15: ffff040055a364e8 x14: ffffffffffffffff x13: ffff80002318bda8 x12: ffff0400358836e8 x11: 0000000000000040 x10: 0000000000000eb0 x9 : 0000000000000000 x8 : 0000000000000000 x7 : ffff04477fa20140 x6 : ffff8000114b1140 x5 : ffff04477fa20140 x4 : ffff8000114b1180 x3 : ffff000810e69200 x2 : ffff8000114b1180 x1 : 0000000001500000 x0 : ffff04477fa20148 Call trace: native_queued_spin_lock_slowpath+0x1c4/0x31c __ib_process_cq+0x74/0x1b8 [ib_core] ib_cq_poll_work+0x34/0xa0 [ib_core] process_one_work+0x1d8/0x4b0 worker_thread+0x180/0x440 kthread+0x114/0x120 Code: 910020e0 8b0400c4 f862d929 aa0403e2 (f8296847) ---[ end trace 387be2290557729c ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x9850817,7a60aa38 Memory Limit: none Starting crashdump kernel... Bye! Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/aaf0072f350d1c7e8731f43b79e11a560bafb9e0.1750070205.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-25	RDMA/core: Add driver APIs pre_destroy_cq() and post_destroy_cq()	Mark Zhang
	Currently in ib_free_cq, it disables IRQ or cancel the CQ work before driver destroy_cq. This isn't good as a new IRQ or a CQ work can be submitted immediately after disabling IRQ or canceling CQ work, which may run concurrently with destroy_cq and cause crashes. The right flow should be: 1. Driver disables CQ to make sure no new CQ event will be submitted; 2. Disables IRQ or Cancels CQ work in core layer, to make sure no CQ polling work is running; 3. Free all resources to destroy the CQ. This patch adds 2 driver APIs: - pre_destroy_cq(): Disable a CQ to prevent it from generating any new work completions, but not free any kernel resources; - post_destroy_cq(): Free all kernel resources. In ib_free_cq, the IRQ is disabled or CQ work is canceled after pre_destroy_cq, and before post_destroy_cq. Fixes: 14d3a3b2498e ("IB: add a proper completion queue abstraction") Signed-off-by: Mark Zhang <markzhang@nvidia.com> Link: https://patch.msgid.link/b5f7ae3d75f44a3e15ff3f4eb2bbdea13e06b97f.1750062328.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-17	RDMA/qib: Remove outdated driver	Dennis Dalessandro
	The qib Infiniband device has long been decommissioned from distros and unsupported by the company. We have kept it going in a compile only mode for a while now and it's time to remove the driver. It has a number of issues that are never going to be addressed and is a hindrance to improving hfi1 and rdmavt. Link: https://patch.msgid.link/r/174836076755.2436819.5981097575800950899.stgit@awdrv-04.cornelisnetworks.com Signed-off-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2025-06-14	Merge tag 'iommu-fixes-v6.16-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu fix from Joerg Roedel: - Fix PTE size calculation for NVidia Tegra * tag 'iommu-fixes-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: iommu/tegra: Fix incorrect size calculation
2025-06-14	Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux	Linus Torvalds
	Pull block fixes from Jens Axboe: - Fix for a deadlock on queue freeze with zoned writes - Fix for zoned append emulation - Two bio folio fixes, for sparsemem and for very large folios - Fix for a performance regression introduced in 6.13 when plug insertion was changed - Fix for NVMe passthrough handling for polled IO - Document the ublk auto registration feature - loop lockdep warning fix * tag 'block-6.16-20250614' of git://git.kernel.dk/linux: nvme: always punt polled uring_cmd end_io work to task_work Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists block: Fix bvec_set_folio() for very large folios bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP block: use plug request list tail for one-shot backmerge attempt block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) loop: move lo_set_size() out of queue freeze
2025-06-14	Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. Only 4 are for MM" * tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: add mmap_prepare() compatibility layer for nested file systems init: fix build warnings about export.h MAINTAINERS: add Barry as a THP reviewer drivers/rapidio/rio_cm.c: prevent possible heap overwrite mm: close theoretical race where stale TLB entries could linger mm/vma: reset VMA iterator on commit_merge() OOM failure docs: proc: update VmFlags documentation in smaps scatterlist: fix extraneous '@'-sign kernel-doc notation selftests/mm: skip failed memfd setups in gup_longterm
2025-06-13	Merge tag 'scsi-fixes' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "All fixes for drivers. The core change in the error handler is simply to translate an ALUA specific sense code into a retry the ALUA components can handle and won't impact any other devices" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: error: alua: I/O errors for ALUA state transitions scsi: storvsc: Increase the timeouts to storvsc_timeout scsi: s390: zfcp: Ensure synchronous unit_add scsi: iscsi: Fix incorrect error path labels for flashnode operations scsi: mvsas: Fix typos in per-phy comments and SAS cmd port registers scsi: core: ufs: Fix a hang in the error handler