summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-23IB: Extend UVERBS_METHOD_REG_MR to get DMAHYishai Hadas
Extend UVERBS_METHOD_REG_MR to get DMAH and pass it to all drivers. It will be used in mlx5 driver as part of the next patch from the series. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/2ae1e628c0675db81f092cc00d3ad6fbf6139405.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23RDMA/mlx5: Add DMAH object supportYishai Hadas
This patch introduces support for allocating and deallocating the DMAH object. Further details: ---------------- The DMAH API is exposed to upper layers only if the underlying device supports TPH. It uses the mlx5_core steering tag (ST) APIs to get a steering tag index based on the provided input. The obtained index is stored in the device-specific mlx5_dmah structure for future use. Upcoming patches in the series will integrate the allocated DMAH into the memory region (MR) registration process. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/778550776799d82edb4d05da249a1cff00160b50.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23RDMA/core: Introduce a DMAH object and its alloc/free APIsYishai Hadas
Introduce a new DMA handle (DMAH) object along with its corresponding allocation and deallocation APIs. This DMAH object encapsulates attributes intended for use in DMA transactions. While its initial purpose is to support TPH functionality, it is designed to be extensible for future features such as DMA PCI multipath, PCI UIO configurations, PCI traffic class selection, and more. Further details: ---------------- We ensure that a caller requesting a DMA handle for a specific CPU ID is permitted to be scheduled on it. This prevent a potential security issue where a non privilege user may trigger DMA operations toward a CPU that it's not allowed to run on. We manage reference counting for the DMAH object and its consumers (e.g., memory regions) as will be detailed in subsequent patches in the series. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/2cad097e849597e49d6b61e6865dba878257f371.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23IB/core: Add UVERBS_METHOD_REG_MR on the MR objectYishai Hadas
This new method enables us to use a single ioctl from user space which supports the below variants of reg_mr [1]. The method will be extended in the next patches from the series with an extra attribute to let us pass DMA handle to be used as part of the registration. [1] ibv_reg_mr(), ibv_reg_mr_iova(), ibv_reg_mr_iova2(), ibv_reg_dmabuf_mr(). Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Link: https://patch.msgid.link/5a3822ceef084efe967c9752e89c58d8250337c7.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23RDMA support for DMA handleLeon Romanovsky
From Yishai: This patch series introduces a new DMA Handle (DMAH) object, along with corresponding APIs for its allocation and deallocation. The DMAH object encapsulates attributes relevant for DMA transactions. While initially intended to support TLP Processing Hints (TPH) [1], the design is extensible to accommodate future features such as PCI multipath for DMA, PCI UIO configurations, traffic class selection, and more. Additionally, we introduce a new ioctl method on the MR object: UVERBS_METHOD_REG_MR. This method consolidates multiple reg_mr variants under a single user-space ioctl interface, supporting: ibv_reg_mr(), ibv_reg_mr_iova(), ibv_reg_mr_iova2() and ibv_reg_dmabuf_mr(). It also enables passing a DMA handle as part of the registration process. Throughout the patch series, the following DMAH-related stuff can also be observed in the IB layer: - Association with a CPU ID and its memory type, for use with Steering Tags [2]. - Inclusion of Processing Hints (PH) data for TPH functionality [3]. - Enforces security by ensuring that only tasks allowed to run on a given CPU may request a DMA handle for it. - Reference counting for DMAH life cycle management and safe usage across memory regions. mlx5 driver implementation: -------------------------- The series includes implementation of the above functionality in the mlx5 driver. In mlx5_core: - Enables TPH over PCIe when both firmware and OS support it. - Manages Steering Tags and corresponding indices by writing tag values to the PCI configuration space. - Exposes APIs to upper layers (e.g., mlx5_ib) to enable the PCIe TPH functionality. In mlx5_ib: - Adds full support for DMAH operations. - Utilizes mlx5_core's Steering Tag APIs to derive tag indices from input. - Stores the resulting index in a mlx5_dmah structure for use during MKEY creation with a DMA handle. - Adds support for allowing MKEYs to be created in conjunction with DMA handles. Additional details are provided in the commit messages. [1] Background, from PCIe specification 6.2. TLP Processing Hints (TPH) -------------------------- TLP Processing Hints is an optional feature that provides hints in Request TLP headers to facilitate optimized processing of Requests that target Memory Space. These Processing Hints enable the system hardware (e.g., the Root Complex and/ or Endpoints) to optimize platform resources such as system and memory interconnect on a per TLP basis. Steering Tags are system-specific values used to identify a processing resource that a Requester explicitly targets. System software discovers and identifies TPH capabilities to determine the Steering Tag allocation for each Function that supports TPH [2] Steering Tags Functions that intend to target a TLP towards a specific processing resource such as a host processor or system cache hierarchy require topological information of the target cache (e.g., which host cache). Steering Tags are system-specific values that provide information about the host or cache structure in the system cache hierarchy. These values are used to associate processing elements within the platform with the processing of Requests. [3] Processing Hints The Requester provides hints to the Root Complex or other targets about the intended use of data and data structures by the host and/or device. The hints are provided by the Requester, which has knowledge of upcoming Request patterns, and which the Completer would not be able to deduce autonomously (with good accuracy) Yishai Signed-off-by: Leon Romanovsky <leon@kernel.org> * mlx5-next: net/mlx5: Add support for device steering tag net/mlx5: Expose IFC bits for TPH PCI/TPH: Expose pcie_tph_get_st_table_size() net/mlx5: Expose cable_length field in PFCC register net/mlx5: Add IFC bits and enums for buf_ownership net/mlx5: Add IFC bits to support RSS for IPSec offload net/mlx5: IFC updates for disabled host PF net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifc
2025-07-23net/mlx5: Add support for device steering tagYishai Hadas
Background, from PCIe specification 6.2. TLP Processing Hints (TPH) -------------------------- TLP Processing Hints is an optional feature that provides hints in Request TLP headers to facilitate optimized processing of Requests that target Memory Space. These Processing Hints enable the system hardware (e.g., the Root Complex and/or Endpoints) to optimize platform resources such as system and memory interconnect on a per TLP basis. Steering Tags are system-specific values used to identify a processing resource that a Requester explicitly targets. System software discovers and identifies TPH capabilities to determine the Steering Tag allocation for each Function that supports TPH. This patch adds steering tag support for mlx5 based NICs by: - Enabling the TPH functionality over PCI if both FW and OS support it. - Managing steering tags and their matching steering indexes by writing a ST to an ST index over the PCI configuration space. - Exposing APIs to upper layers (e.g.,mlx5_ib) to allow usage of the PCI TPH infrastructure. Further details: - Upon probing of a device, the feature will be enabled based on both capability detection and OS support. - It will retrieve the appropriate ST for a given CPU ID and memory type using the pcie_tph_get_cpu_st() API. - It will track available ST indices according to the configuration space table size (expected to be 63 entries), reserving index 0 to indicate non-TPH use. - It will assign a free ST index with a ST using the pcie_tph_set_st_entry() API. - It will reuse the same index for identical (CPU ID + memory type) combinations by maintaining a reference count per entry. - It will expose APIs to upper layers (e.g., mlx5_ib) to allow usage of the PCI TPH infrastructure. - SF will use its parent PF stuff. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://patch.msgid.link/de1ae7398e9e34eacd8c10845683df44fc9e32f8.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23net/mlx5: Expose IFC bits for TPHYishai Hadas
Expose IFC bits for the TPH functionality. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Link: https://patch.msgid.link/38ea3a0d56551364214e8edf359c9c77c9a3b71b.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-23PCI/TPH: Expose pcie_tph_get_st_table_size()Yishai Hadas
Expose pcie_tph_get_st_table_size() to be used by drivers as will be done in the next patch from the series. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/9ae851e0ee42cc56d2a30276e116b65091030ceb.1752752567.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-21RDMA/mlx5: Fix incorrect MKEY maskingLeon Romanovsky
mkey_mask is __be64 type, while MLX5_MKEY_MASK_PAGE_SIZE is declared as unsigned long long. This causes to the static checkers errors: drivers/infiniband/hw/mlx5/umr.c:663:49: warning: invalid assignment: &= drivers/infiniband/hw/mlx5/umr.c:663:49: left side has type restricted __be64 drivers/infiniband/hw/mlx5/umr.c:663:49: right side has type int Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size") Link: https://patch.msgid.link/e354d70b98dfa5ecf4c236a36cd36b64add9d9de.1753003467.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-07-21RDMA/mlx5: Fix returned type from _mlx5r_umr_zap_mkey()Leon Romanovsky
As Colin reported: "The variable zapped_blocks is a size_t type and is being assigned a int return value from the call to _mlx5r_umr_zap_mkey. Since zapped_blocks is an unsigned type, the error check for zapped_blocks < 0 will never be true." So separate return error and nblocks assignment. Fixes: e73242aa14d2 ("RDMA/mlx5: Optimize DMABUF mkey page size") Reported-by: Colin King (gmail) <colin.i.king@gmail.com> Closes: https://lore.kernel.org/all/79166fb1-3b73-4d37-af02-a17b22eb8e64@gmail.com Link: https://patch.msgid.link/71d8ea208ac7eaa4438af683b9afaed78625e419.1753003467.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-07-20net/mlx5: Expose cable_length field in PFCC registerOren Sidi
Introduce new "cable_length" field in PFCC register and related fields to enhance rx buffer configuration management: 1. cable_length: Shifts cable length handling to fw by storing a manually entered length from user in PFCC.cable_length 2. lane_rate_oper: In a case where PFCC.cable_length is not supported, helps compute a default cable length Signed-off-by: Oren Sidi <osidi@nvidia.com> Reviewed-by: Alex Lazar <alazar@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752734895-257735-4-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-20net/mlx5: Add IFC bits and enums for buf_ownershipOren Sidi
Extend structure layouts and defines buf_ownership. buf_ownership indicates whether the buffer is managed by SW or FW. Signed-off-by: Oren Sidi <osidi@nvidia.com> Reviewed-by: Alex Lazar <alazar@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752734895-257735-3-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-20net/mlx5: Add IFC bits to support RSS for IPSec offloadJianbo Liu
This adds the capabilities, ipsec_next_header and inner/outer l4_type_ext fields to support RSS for the decrypted packets. These fields are specifically for firmware steering. HWS validation logic is updated to correctly handle the changes, ensuring the unsupported fields are not set. Besides, reserved_at_c4 is fixed to reserved_at_d4 to reflect the accurate offset within the structure. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752734895-257735-2-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-17RDMA/mlx5: remove redundant check on err on return expressionColin Ian King
Currently all paths that set err and then check it for an error perform immediate returns, hence err always zero at the end of the function _mlx5r_umr_zap_mkey. The return expression err ? err : nblocks has a redundant check on the err since err is always zero, so just return nblocks instead. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://patch.msgid.link/20250717112108.4036171-1-colin.i.king@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/mana_ib: add additional port countersZhiyue Qiu
Add packet and request port counters to mana_ib. Signed-off-by: Zhiyue Qiu <zhiyueqiu@microsoft.com> Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1752143395-5324-1-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/mana_ib: Fix DSCP value in modify QPShiraz Saleem
Convert the traffic_class in GRH to a DSCP value as required by the HW. Fixes: e095405b45bb ("RDMA/mana_ib: Modify QP state") Signed-off-by: Shiraz Saleem <shirazsaleem@microsoft.com> Signed-off-by: Konstantin Taranov <kotaranov@microsoft.com> Link: https://patch.msgid.link/1752143085-4169-1-git-send-email-kotaranov@linux.microsoft.com Reviewed-by: Long Li <longli@microsoft.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/efa: Add CQ with external memory supportMichael Margolin
Add an option to create CQ using external memory instead of allocating in the driver. The memory can be passed from userspace by dmabuf fd and an offset or a VA. One of the possible usages is creating CQs that reside in accelerator memory, allowing low latency asynchronous direct polling from the accelerator device. Add a capability bit to reflect on the feature support. Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250708202308.24783-4-mrgolin@amazon.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/core: Add umem "is_contiguous" and "start_dma_addr" helpersMichael Margolin
In some cases drivers may need to check if a given umem is contiguous. Add a helper function in core code so that drivers don't need to deal with umem or scatter-gather lists structure. Additionally add a helper for getting umem's start DMA address and use it in other helper functions that open code it. Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250708202308.24783-3-mrgolin@amazon.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/uverbs: Add a common way to create CQ with umemMichael Margolin
Add ioctl command attributes and a common handling for the option to create CQs with memory buffers passed from userspace. When required attributes are supplied, create umem and provide it for driver's use. The extension enables creation of CQs on top of preallocated CPU virtual or device memory buffers, by supplying VA or dmabuf fd, in a common way. Drivers can support this flow by initializing a new create_cq_umem fp field in their ops struct, with a function that can handle the new parameter. Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250708202308.24783-2-mrgolin@amazon.com Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13net/mlx5: IFC updates for disabled host PFDaniel Jurgens
The port 2 host PF can be disabled, this bit reflects that setting. Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: William Tu <witu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752064867-16874-3-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13net/mlx5: Expose disciplined_fr_counter through HCA capabilities in mlx5_ifcCarolina Jubran
Introduce the `disciplined_fr_counter` capability bit to indicate that the device’s free-running cycle counter is disciplined to real-time. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752064867-16874-2-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/mlx5: Optimize DMABUF mkey page sizeEdward Srouji
The current implementation of DMABUF memory registration uses a fixed page size for the memory key (mkey), which can lead to suboptimal performance when the underlying memory layout may offer better page size. The optimization improves performance by reducing the number of page table entries required for the mkey, leading to less MTT/KSM descriptors that the HCA must go through to find translations, fewer cache-lines, and shorter UMR work requests on mkey updates such as when re-registering or reusing a cacheable mkey. To ensure safe page size updates, the implementation uses a 5-step process: 1. Make the first X entries non-present, while X is calculated to be minimal according to a large page shift that can be used to cover the MR length. 2. Update the page size to the large supported page size 3. Load the remaining N-X entries according to the (optimized) page shift 4. Update the page size according to the (optimized) page shift 5. Load the first X entries with the correct translations This ensures that at no point is the MR accessible with a partially updated translation table, maintaining correctness and preventing access to stale or inconsistent mappings, such as having an mkey advertising the new page size while some of the underlying page table entries still contain the old page size translations. Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/bc05a6b2142c02f96a90635f9a4458ee4bbbf39f.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/mlx5: Align mkc page size capability check to PRMMichael Guralnik
Align the capabilities checked when using the log_page_size 6th bit in the mkey context to the PRM definition. The upper and lower bounds are set by max/min caps, and modification of the 6th bit by UMR is allowed only when a specific UMR cap is set. Current implementation falsely assumes all page sizes up-to 2^63 are supported when the UMR cap is set. In case the upper bound cap is lower than 63, this might result a FW syndrome on mkey creation, e.g: mlx5_core 0000:c1:00.0: mlx5_cmd_out_err:832:(pid 0): CREATE_MKEY(0×200) op_mod(0×0) failed, status bad parameter(0×3), syndrome (0×38a711), err(-22) Previous cap enforcement is still correct for all current HW, FW and driver combinations. However, this patch aligns the code to be PRM compliant in the general case. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/eab4eeb4785105a4bb5eb362dc0b3662cd840412.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13Optimize DMABUF mkey page size in mlx5Leon Romanovsky
From Edward: This patch series enables the mlx5 driver to dynamically choose the optimal page size for a DMABUF-based memory key (mkey), rather than always registering with a fixed page size. Previously, DMABUF memory registration used a fixed 4K page size for mkeys which could lead to suboptimal performance when the underlying memory layout may offer better page sizes. This approach did not take advantage of larger page size capabilities advertised by the HCA, and the driver was not setting the proper page size mask in the mkey mask when performing page size changes, potentially leading to invalid registrations when updating to a very large pages. This series improves DMABUF performance by dynamically selecting the best page size for a given memory region (MR) both at creation time and on page fault occurrences, based on the underlying layout and fixing related gaps and bugs. By doing so, we reduce the number of page table entries (and thus MTT/ KSM descriptors) that the HCA must traverse, which in turn reduces cache-line fetches. Thanks * mlx5-next: RDMA/mlx5: Fix UMR modifying of mkey page size net/mlx5: Expose HCA capability bits for mkey max page size Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13RDMA/mlx5: Fix UMR modifying of mkey page sizeEdward Srouji
When changing the page size on an mkey, the driver needs to set the appropriate bits in the mkey mask to indicate which fields are being modified. The 6th bit of a page size in mlx5 driver is considered an extension, and this bit has a dedicated capability and mask bits. Previously, the driver was not setting this mask in the mkey mask when performing page size changes, regardless of its hardware support, potentially leading to an incorrect page size updates. This fixes the issue by setting the relevant bit in the mkey mask when performing page size changes on an mkey and the 6th bit of this field is supported by the hardware. Fixes: cef7dde8836a ("net/mlx5: Expand mkey page size to support 6 bits") Signed-off-by: Edward Srouji <edwards@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/9f43a9c73bf2db6085a99dc836f7137e76579f09.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-13net/mlx5: Expose HCA capability bits for mkey max page sizeMichael Guralnik
Expose the HCA capability for maximal page size that can be configured for an mkey. Used for enforcing capabilities when working with highly contiguous memory and using large page sizes. Signed-off-by: Michael Guralnik <michaelgur@nvidia.com> Link: https://patch.msgid.link/3e4d3fda37934430f65f72601519e22bf396fd05.1751979184.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09RDMA/uverbs: Add empty rdma_uattrs_has_raw_cap() declarationLeon Romanovsky
The call to rdma_uattrs_has_raw_cap() is placed in mlx5 fs.c file, which is compiled without relation to CONFIG_INFINIBAND_USER_ACCESS. Despite the check is used only in flows with CONFIG_INFINIBAND_USER_ACCESS=y|m, the compilers generate the following error for CONFIG_INFINIBAND_USER_ACCESS=n builds. >> ERROR: modpost: "rdma_uattrs_has_raw_cap" [drivers/infiniband/hw/mlx5/mlx5_ib.ko] undefined! Fixes: f458ccd2aa2c ("RDMA/uverbs: Check CAP_NET_RAW in user namespace for flow create") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507080725.bh7xrhpg-lkp@intel.com/ Link: https://patch.msgid.link/72dee6b379bd709255a5d8e8010b576d50e47170.1751967071.git.leon@kernel.org Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-07-09RDMA/efa: Add Network HW statistics countersBasel Nassar
Update device API and request network counters. Expose newly added counters through ib core counters mechanism. Reviewed-by: David Shoolman <shoolman@amazon.com> Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Basel Nassar <baselna@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com> Link: https://patch.msgid.link/20250706070740.22534-1-mrgolin@amazon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09IB/cm: Use separate agent w/o flow control for REPVlad Dumitrescu
Most responses (e.g., RTU) are not subject to flow control, as there is no further response expected. However, REPs are both requests (waiting for RTUs) and responses (being waited by REQs). With agent-level flow control added to the MAD layer, REPs can get delayed by outstanding REQs. This can cause a problem in a scenario such as 2 hosts connecting to each other at the same time. Both hosts fill the flow control outstanding slots with REQs. The corresponding REPs are now blocked behind those REQs, and neither side can make progress until REQs time out. Add a separate MAD agent which is only used to send REPs. This agent does not have a recv_handler as it doesn't process responses nor does it register to receive requests. Disable flow control for agents w/o a recv_handler, as they aren't waiting for responses. This allows the newly added REP agent to send even when clients are slow to generate RTU, which would be needed to unblock flow control outstanding slots. Relax check in ib_post_send_mad to allow retries for this agent. REPs will be retried by the MAD layer until CM layer receives a response (e.g., RTU) on the normal agent and cancels them. Suggested-by: Sean Hefty <shefty@nvidia.com> Reviewed-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Sean Hefty <shefty@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Link: https://patch.msgid.link/9ac12d0842b849e2c8537d6e291ee0af9f79855c.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09IB/mad: Add flow control for solicited MADsOr Har-Toov
Currently, MADs sent via an agent are being forwarded directly to the corresponding MAD QP layer. MADs with a timeout value set and requiring a response (solicited MADs) will be resent if the timeout expires without receiving a response. In a congested subnet, flooding MAD QP layer with more solicited send requests from the agent will only worsen the situation by triggering more timeouts and therefore more retries. Thus, add flow control for non-user solicited MADs to block agents from issuing new solicited MAD requests to the MAD QP until outstanding requests are completed and the MAD QP is ready to process additional requests. While at it, keep track of the total outstanding solicited MAD work requests in send or wait list. The number of outstanding send WRs will be limited by a fraction of the RQ size, and any new send WR that exceeds that limit will be held in a backlog list. Backlog MADs will be forwarded to agent send list only once the total number of outstanding send WRs falls below the limit. Unsolicited MADs, RMPP MADs and MADs which are not SA, SMP or CM are not subject to this flow control mechanism and will not be affected by this change. For this purpose, a new state is introduced: - 'IB_MAD_STATE_QUEUED': MAD is in backlog list Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/c0ecaa1821badee124cd13f3bf860f67ce453beb.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-09IB/mad: Add state machine to MAD layerOr Har-Toov
Replace the use of refcount, timeout and status with a 'state' field to track the status of MADs send work requests (WRs). The state machine better represents the stages in the MAD lifecycle, specifically indicating whether the MAD is waiting for a completion, waiting for a response, was canceld or is done. The existing refcount only takes two values: 1 : MAD is waiting either for completion or for response. 2 : MAD is waiting for both response and completion. Also when a response was received before a completion notification. The status field represents if the MAD was canceled at some point in the flow. The timeout is used to represent if a response was received. The current state transitions are not clearly visible, and developers needs to infer the state from the refcount's, timeout's or status's value, which is error-prone and difficult to follow. Thus, replace with a state machine as the following: - 'IB_MAD_STATE_INIT': MAD is in the making and is not yet in any list - 'IB_MAD_STATE_SEND_START': MAD was sent to the QP and is waiting for completion notification in send list - 'IB_MAD_STATE_WAIT_RESP': MAD send completed successfully, waiting for a response in wait list - 'IB_MAD_STATE_EARLY_RESP': Response came early, before send completion notification, MAD is in the send list - 'IB_MAD_STATE_CANCELED': MAD was canceled while in send or wait list - 'IB_MAD_STATE_DONE': MAD processing completed, MAD is in no list Adding the state machine also make it possible to remove the double call for ib_mad_complete_send_wr in case of an early response and the use of a done list in case of a regular response. While at it, define a helper to clear error MADs which will handle freeing MADs that timed out or have been cancelled. Signed-off-by: Or Har-Toov <ohartoov@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Link: https://patch.msgid.link/48e6ae8689dc7bb8b4ba6e5ec562e1b018db88a8.1751278420.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07Merge branch 'mlx5-next' into wip/leon-for-nextLeon Romanovsky
* mlx5-next: net/mlx5: Check device memory pointer before usage net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow net/mlx5: Add IFC bits for PCIe Congestion Event object net/mlx5: Small refactor for general object capabilities
2025-07-07RDMA/bnxt_re: Use macro instead of hard coded valueKalesh AP
1. Defined a macro for the hard coded value. 2. "access" field in the request structure is of type "u8". Updated the mask accordingly. Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Link: https://patch.msgid.link/20250704043857.19158-4-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Hongguang Gao <hongguang.gao@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07RDMA/bnxt_re: Support 2G message sizeSelvin Xavier
bnxt_qplib_put_sges is calculating the length in a signed int. So handling the 2G message size is not working since it is considered as negative. Use a unsigned number to calculate the total message length. As per the spec, IB message size shall be between zero and 2^31 bytes (inclusive). So adding a check for 2G message size. Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Link: https://patch.msgid.link/20250704043857.19158-3-kalesh-anakkur.purayil@broadcom.com Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07RDMA/bnxt_re: Fix size of uverbs_copy_to() in BNXT_RE_METHOD_GET_TOGGLE_MEMKalesh AP
Since both "length" and "offset" are of type u32, there is no functional issue here. Reviewed-by: Saravanan Vajravel <saravanan.vajravel@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Link: https://patch.msgid.link/20250704043857.19158-2-kalesh-anakkur.purayil@broadcom.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07RDMA/hns: Fix -Wframe-larger-than issueJunxian Huang
Fix -Wframe-larger-than issue by allocating memory for qpc struct with kzalloc() instead of using stack memory. Fixes: 606bf89e98ef ("RDMA/hns: Refactor for hns_roce_v2_modify_qp function") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506240032.CSgIyFct-lkp@intel.com/ Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-7-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07RDMA/hns: Drop GFP_NOWARNJunxian Huang
GFP_NOWARN silences all warnings on dma_alloc_coherent() failure, which might otherwise help with troubleshooting. Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-6-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07RDMA/hns: Fix accessing uninitialized resourcesJunxian Huang
hr_dev->pgdir_list and hr_dev->pgdir_mutex won't be initialized if CQ/QP record db are not enabled, but they are also needed when using SRQ with SRQ record db enabled. Simplified the logic by always initailizing the reosurces. Fixes: c9813b0b9992 ("RDMA/hns: Support SRQ record doorbell") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-5-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-07RDMA/hns: Get message length of ack_req from FWJunxian Huang
ACK_REQ_FREQ indicates the number of packets (after MTU fragmentation) HW sends before setting an ACK request. When MTU is greater than or equal to 1024, the current ACK_REQ_FREQ value causes HW to request an ACK for every MTU fragment. The processing of a large number of ACKs severely impacts HW performance when sending large size payloads. Get message length of ack_req from FW so that we can adjust this parameter according to different situations. There are several constraints for ACK_REQ_FREQ: 1. mtu * (2 ^ ACK_REQ_FREQ) should not be too large, otherwise it may cause some unexpected retries when sending large payload. 2. ACK_REQ_FREQ should be larger than or equal to LP_PKTN_INI. 3. ACK_REQ_FREQ must be equal to LP_PKTN_INI when using LDCP or HC3 congestion control algorithm. Fixes: 56518a603fd2 ("RDMA/hns: Modify the value of long message loopback slice") Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-4-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-06RDMA/hns: Fix HW configurations not cleared in error flowwenglianfa
hns_roce_clear_extdb_list_info() will eventually do some HW configurations through FW, and they need to be cleared by calling hns_roce_function_clear() when the initialization fails. Fixes: 7e78dd816e45 ("RDMA/hns: Clear extended doorbell info before using") Signed-off-by: wenglianfa <wenglianfa@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-3-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-06RDMA/hns: Fix double destruction of rsv_qpwenglianfa
rsv_qp may be double destroyed in error flow, first in free_mr_init(), and then in hns_roce_exit(). Fix it by moving the free_mr_init() call into hns_roce_v2_init(). list_del corruption, ffff589732eb9b50->next is LIST_POISON1 (dead000000000100) WARNING: CPU: 8 PID: 1047115 at lib/list_debug.c:53 __list_del_entry_valid+0x148/0x240 ... Call trace: __list_del_entry_valid+0x148/0x240 hns_roce_qp_remove+0x4c/0x3f0 [hns_roce_hw_v2] hns_roce_v2_destroy_qp_common+0x1dc/0x5f4 [hns_roce_hw_v2] hns_roce_v2_destroy_qp+0x22c/0x46c [hns_roce_hw_v2] free_mr_exit+0x6c/0x120 [hns_roce_hw_v2] hns_roce_v2_exit+0x170/0x200 [hns_roce_hw_v2] hns_roce_exit+0x118/0x350 [hns_roce_hw_v2] __hns_roce_hw_v2_init_instance+0x1c8/0x304 [hns_roce_hw_v2] hns_roce_hw_v2_reset_notify_init+0x170/0x21c [hns_roce_hw_v2] hns_roce_hw_v2_reset_notify+0x6c/0x190 [hns_roce_hw_v2] hclge_notify_roce_client+0x6c/0x160 [hclge] hclge_reset_rebuild+0x150/0x5c0 [hclge] hclge_reset+0x10c/0x140 [hclge] hclge_reset_subtask+0x80/0x104 [hclge] hclge_reset_service_task+0x168/0x3ac [hclge] hclge_service_task+0x50/0x100 [hclge] process_one_work+0x250/0x9a0 worker_thread+0x324/0x990 kthread+0x190/0x210 ret_from_fork+0x10/0x18 Fixes: fd8489294dd2 ("RDMA/hns: Fix Use-After-Free of rsv_qp on HIP08") Signed-off-by: wenglianfa <wenglianfa@huawei.com> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com> Link: https://patch.msgid.link/20250703113905.3597124-2-huangjunxian6@hisilicon.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02net/mlx5: Check device memory pointer before usageStav Aviram
Add a NULL check before accessing device memory to prevent a crash if dev->dm allocation in mlx5_init_once() fails. Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core") Signed-off-by: Stav Aviram <saviram@nvidia.com> Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02net/mlx5: fs, fix RDMA TRANSPORT init cleanup flowPatrisious Haddad
Failing during the initialization of root_namespace didn't cleanup the priorities of the namespace on which the failure occurred. Properly cleanup said priorities on failure. Fixes: 52931f55159e ("net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/78cf89b5d8452caf1e979350b30ada6904362f66.1751451780.git.leon@kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02Fix dma_unmap_sg() nents valueThomas Fourier
The dma_unmap_sg() functions should be called with the same nents as the dma_map_sg(), not the value the map function returned. Fixes: ed10435d3583 ("RDMA/erdma: Implement hierarchical MTT") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Link: https://patch.msgid.link/20250630092346.81017-2-fourier.thomas@gmail.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02RDMA/counter: Check CAP_NET_RAW check in user namespace for RDMA countersParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 1bd8e0a9d0fd ("RDMA/counter: Allow manual mode configuration support") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/68e2064e72e94558a576fdbbb987681a64f6fea8.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02RDMA/nldev: Check CAP_NET_RAW in user namespace for QP modifyParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to modify the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 0cadb4db79e1 ("RDMA/uverbs: Restrict usage of privileged QKEYs") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/099eb263622ccdd27014db7e02fec824a3307829.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02RDMA/mlx5: Check CAP_NET_RAW in user namespace for devx createParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the devx object. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: a8b92ca1b0e5 ("IB/mlx5: Introduce DEVX") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/36ee87e92defd81410c6a2b33f9d6c0d6dcfd64c.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-02RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP createParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/3914ef9702b01de8843a391ce397fca67d0fc7af.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01RDMA/uverbs: Check CAP_NET_RAW in user namespace for RAW QP createParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 6d1e7ba241e9 ("IB/uverbs: Introduce create/destroy QP commands over ioctl") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/7b6b87505ccc28a1f7b4255af94d898d2df0fff5.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01RDMA/uverbs: Check CAP_NET_RAW in user namespace for QP createParav Pandit
Currently, the capability check is done in the default init_user_ns user namespace. When a process runs in a non default user namespace, such check fails. Due to this when a process is running using Podman, it fails to create the QP. Since the RDMA device is a resource within a network namespace, use the network namespace associated with the RDMA device to determine its owning user namespace. Fixes: 2dee0e545894 ("IB/uverbs: Enable QP creation with a given source QP number") Signed-off-by: Parav Pandit <parav@nvidia.com> Link: https://patch.msgid.link/0e5920d1dfe836817bb07576b192da41b637130b.1750963874.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>