path: root/drivers/net/ethernet/mellanox
Age  Commit message  Author
2025-07-07  net/mlx5: HWS, Rearrange to prevent forward declaration  (Yevgeny Kliteynik)
As a preparation for the following patch that will add support for shrinking empty matchers, rearrange the code to prevent forward declaration of functions. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07  net/mlx5: HWS, Track matcher sizes individually  (Vlad Dogaru)
Track and grow matcher sizes individually for RX and TX RTCs. This allows RX-only or TX-only use cases to effectively halve the device resources they use. For testing we used a simple module that inserts 1M RX-only rules and measured the number of pages the device requests, and memory usage as reported by `free -h`:
                     Pages   Memory
  Before this patch: 300k    1.5GiB
  After this patch:  160k    900MiB
Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07  net/mlx5: HWS, Decouple matcher RX and TX sizes  (Vlad Dogaru)
Kernel HWS only uses FDB tables and, as such, creates two lower level containers (RTCs) for each matcher: one for RX and one for TX. Allow these RTCs to differ in size by converting the size part of the matcher attribute to a two element array. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703185431.445571-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
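As an illustration of the idea only (the names below are hypothetical stand-ins, not the actual mlx5hws structures), a minimal sketch of turning a single size field into a per-direction array looks like this:

#include <stdio.h>

/* Hypothetical stand-ins for the matcher attribute; the real mlx5hws
 * structs and enum values live in the driver and differ in detail. */
enum demo_rtc_dir { DEMO_RTC_RX = 0, DEMO_RTC_TX = 1, DEMO_RTC_DIRS = 2 };

struct demo_matcher_attr {
	/* before: unsigned int size_log;  -- one size shared by both RTCs */
	unsigned int size_log[DEMO_RTC_DIRS];	/* after: per-direction sizes */
};

int main(void)
{
	struct demo_matcher_attr attr = {
		/* RX-heavy use case: grow only the RX RTC */
		.size_log = { [DEMO_RTC_RX] = 20, [DEMO_RTC_TX] = 10 },
	};

	for (int dir = 0; dir < DEMO_RTC_DIRS; dir++)
		printf("%s RTC: 2^%u entries\n",
		       dir == DEMO_RTC_RX ? "RX" : "TX", attr.size_log[dir]);
	return 0;
}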
2025-07-07  net/mlx5: HWS, Create STEs directly from matcher  (Vlad Dogaru)
Matchers were using the pool abstraction solely as a convenience to allocate two STE ranges. The pool's core functionality, that of allocating individual items from the range, was unused. Matchers rely either on the hardware to hash rules into a table, or on a user-provided index. Remove the STE pool from the matcher and allocate the STE ranges manually instead. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703185431.445571-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07  net/mlx5: HWS, Refactor rule skip logic  (Vlad Dogaru)
Reduce nesting by adding a couple of early return statements. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
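The general pattern, shown here on hypothetical names rather than the driver's actual rule-skip code, is to replace nested conditionals with early returns:

#include <stdbool.h>
#include <stdio.h>

struct demo_rule { bool is_fdb; bool rx_only; };

/* Before the refactor this logic lived in nested if/else blocks; early
 * returns let each disqualifying condition exit as soon as it is known. */
static bool demo_skip_tx(const struct demo_rule *rule)
{
	if (!rule->is_fdb)
		return false;	/* non-FDB rules are never skipped */
	if (rule->rx_only)
		return true;	/* an RX-only rule needs no TX entry */
	return false;
}

int main(void)
{
	struct demo_rule rule = { .is_fdb = true, .rx_only = true };

	printf("skip TX: %d\n", demo_skip_tx(&rule));	/* prints 1 */
	return 0;
}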
2025-07-07  net/mlx5: HWS, Export rule skip logic  (Vlad Dogaru)
The bwc layer will use `mlx5hws_rule_skip` to keep track of numbers of RX and TX rules individually, so export this function for future usage. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07  net/mlx5: HWS, remove incorrect comment  (Yevgeny Kliteynik)
Remove an incorrect comment section that is probably a copy-paste artifact. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703185431.445571-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07  net/mlx5: HWS, remove unused create_dest_array parameter  (Vlad Dogaru)
`flow_source` is not used anywhere in mlx5hws_action_create_dest_array. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703185431.445571-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07  mlxbf_gige: emit messages during open and probe failures  (David Thompson)
The open() and probe() functions of the mlxbf_gige driver check for errors during initialization, but do not provide details regarding the errors. The mlxbf_gige driver should provide error details in the kernel log, noting what step of initialization failed. Signed-off-by: David Thompson <davthompson@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701180324.29683-1-davthompson@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07  Merge branch 'mlx5-next' into wip/leon-for-next  (Leon Romanovsky)
* mlx5-next:
  net/mlx5: Check device memory pointer before usage
  net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow
  net/mlx5: Add IFC bits for PCIe Congestion Event object
  net/mlx5: Small refactor for general object capabilities
2025-07-02  net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw  (Carolina Jubran)
Introduce support for managing Traffic Class (TC) arbiter nodes and associated vports TC nodes within the E-Switch QoS hierarchy. This patch adds support for the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, and implements full support for setting tc-bw on both vports and nodes. Key changes include:
- Introduced the new scheduling node type, `SCHED_NODE_TYPE_VPORTS_TC_TSAR`, for managing vports within the TC arbiter node.
- New helper functions for creating and destroying vports TC nodes under the TC arbiter.
- Updated the minimum rate normalization function to skip nodes of type `SCHED_NODE_TYPE_VPORTS_TC_TSAR`. Vports TC TSARs have bandwidth shares configured on them but not minimum rates, so their `min_rate` cannot be normalized.
- Implementation of `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()` for initializing and cleaning up TC arbiter scheduling elements. These functions now fully support tc-bw configuration on TC arbiter nodes.
- Introduced a new helper `esw_qos_calculate_tc_bw_divider()` to compute the total TC bandwidth share, which is used as a divider for normalizing each TC's share.
- Added `esw_qos_tc_arbiter_get_bw_shares()` and `esw_qos_set_tc_arbiter_bw_shares()` to handle the settings of bandwidth shares for vports traffic class TSARs.
- `esw_qos_set_tc_arbiter_bw_shares()` normalizes each TC share based on the total and the firmware's maximum allowed TSAR bandwidth share.
- Refactored `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` to fully support configuring tc-bw on devlink rate nodes and vports, respectively.
- Refactored `mlx5_esw_qos_node_update_parent()` to ensure that tc-bw configuration remains compatible with setting a parent on a rate node, preserving level hierarchy functionality.
- Refactored `esw_qos_calc_bw_share()` to generalize its input so it can be used for both minimum rate and bandwidth share calculations.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02  net/mlx5: Add traffic class scheduling support for vport QoS  (Carolina Jubran)
Introduce support for traffic class (TC) scheduling on vports by allowing the vport to own multiple TC scheduling nodes. This patch enables more granular control of QoS by defining three distinct QoS states for vports, each providing unique scheduling behavior:
1. Regular QoS: The `sched_node` represents the vport directly, handling QoS as a single scheduling entity.
2. TC QoS on the vport: The `sched_node` acts as a TC arbiter, enabling TC scheduling directly on the vport.
3. TC QoS on the parent node: The `sched_node` functions as a rate limiter, with TC arbitration enabled at the parent level, associating multiple scheduling nodes with each vport.
Key changes include:
- Added support for new scheduling elements, vport traffic class and rate limiter.
- New helper functions for creating, destroying, and restoring vport TC scheduling nodes, handling transitions between regular QoS and TC arbitration states.
- Updated `esw_qos_vport_enable()` and `esw_qos_vport_disable()` to support both regular QoS and TC arbitration states, ensuring consistent transitions between scheduling modes.
- Introduced a `sched_nodes` array under `vport->qos` to store multiple TC scheduling nodes per vport, enabling finer control over per-TC QoS.
- Enhanced `esw_qos_vport_update_parent()` to handle transitions between the three QoS states based on the current and new parent node types.
This patch lays the groundwork for future support for configuring tc-bw on vports. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on vports will return `-EOPNOTSUPP`. No functional changes are introduced at this stage.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02  net/mlx5: Add support for setting tc-bw on nodes  (Carolina Jubran)
Introduce support for enabling and disabling Traffic Class (TC) arbitration for existing devlink rate nodes. This patch adds support for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`. Key changes include:
- New helper functions for transitioning existing rate nodes to TC arbiter nodes and vice versa. These functions handle the allocation of TC arbiter nodes, copying of child nodes, and restoring vport QoS settings when TC arbitration is disabled.
- Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage tc-bw configuration on nodes.
- Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in future patches to provide full support for tc-bw on devlink rate objects.
- Validation functions for tc-bw settings, allowing graceful handling of unsupported traffic class bandwidth configurations.
- Updated `__esw_qos_alloc_node()` to insert the new node into the parent’s children list only if the parent is not NULL. For the root TSAR, the new node is inserted directly after the allocation call.
- Don't allow `tc-bw` configuration for nodes containing non-leaf children.
This patch lays the groundwork for future support for configuring tc-bw on devlink rate nodes. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on nodes will return `-EOPNOTSUPP`. No functional changes are introduced at this stage.
Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02  net/mlx5: Add no-op implementation for setting tc-bw on rate objects  (Carolina Jubran)
Introduce `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` with no-op logic. Future patches will add support for setting traffic class bandwidth on rate objects. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02  net/mlx5: Check device memory pointer before usage  (Stav Aviram)
Add a NULL check before accessing device memory to prevent a crash if dev->dm allocation in mlx5_init_once() fails. Fixes: c9b9dcb430b3 ("net/mlx5: Move device memory management to mlx5_core") Signed-off-by: Stav Aviram <saviram@nvidia.com> Link: https://patch.msgid.link/c88711327f4d74d5cebc730dc629607e989ca187.1751370035.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
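A minimal sketch of the pattern, with placeholder types standing in for the mlx5 ones (demo_* names are hypothetical):

#include <errno.h>
#include <stddef.h>
#include <stdio.h>

struct demo_dm { int dummy; };
struct demo_dev {
	struct demo_dm *dm;	/* may be NULL if the init-time allocation failed */
};

static int demo_dm_use(struct demo_dev *dev)
{
	if (!dev->dm)		/* guard added before dereferencing */
		return -ENOMEM;
	/* ... safe to use dev->dm here ... */
	return 0;
}

int main(void)
{
	struct demo_dev dev = { .dm = NULL };

	printf("demo_dm_use: %d\n", demo_dm_use(&dev));	/* error code, no crash */
	return 0;
}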
2025-07-02  net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow  (Patrisious Haddad)
A failure during the initialization of root_namespace didn't clean up the priorities of the namespace on which the failure occurred. Properly clean up said priorities on failure. Fixes: 52931f55159e ("net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/78cf89b5d8452caf1e979350b30ada6904362f66.1751451780.git.leon@kernel.org Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-07-01  time/timecounter: Fix the lie that struct cyclecounter is const  (Greg Kroah-Hartman)
In both the read callback for struct cyclecounter, and in struct timecounter, struct cyclecounter is declared as a const pointer. Unfortunately, a number of users of this pointer treat it as a non-const pointer, as it is buried in a larger structure that is heavily modified by the callback function when accessed. This lie had been hidden by the fact that container_of() "casts away" a const attribute of a pointer without any compiler warning happening at all. Fix this all up by removing the const attribute in the needed places so that everyone can see that the structure really isn't const, but can, and is, modified by the users of it. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/2025070124-backyard-hurt-783a@gregkh
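The underlying issue can be reproduced in plain C: container_of() silently discards a const qualifier, so a callback that receives a const pointer can still modify the enclosing structure. A standalone sketch, using the usual container_of definition rather than the kernel headers, with hypothetical struct names:

#include <stddef.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct cc { unsigned long mult; };	/* stand-in for struct cyclecounter */
struct owner {
	struct cc cc;
	unsigned long state;		/* mutated on every "read" */
};

static unsigned long long demo_read(const struct cc *cc)
{
	/* The cast inside container_of drops const without any warning... */
	struct owner *o = container_of(cc, struct owner, cc);

	o->state++;	/* ...so the "const" argument is modified anyway */
	return o->state;
}

int main(void)
{
	struct owner o = { .cc = { .mult = 1 }, .state = 0 };

	printf("%llu\n", demo_read(&o.cc));	/* prints 1: state changed */
	return 0;
}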
2025-06-30  net/mlx5e: Fix error handling in RQ memory model registration  (Fushuai Wang)
Currently when xdp_rxq_info_reg_mem_model() fails in the XSK path, the error handling incorrectly jumps to err_destroy_page_pool. While this may not cause errors, we should make it jump to the correct location. Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev> Acked-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-06-26  RDMA/mlx5: Allocate IB device with net namespace supplied from core dev  (Mark Bloch)
Use the new ib_alloc_device_with_net() API to allocate the IB device so that it is properly bound to the network namespace obtained via mlx5_core_net(). This change ensures correct namespace association (e.g., for containerized setups). Additionally, expose mlx5_core_net so that RDMA driver can use it. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
2025-06-25  net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain  (Patrisious Haddad)
RDMA TRANSPORT domains were initially limited to a single priority. This change allows the domains to have multiple priorities, making it possible to add several rules and control the order in which they're evaluated. Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/b299cbb4c8678a33da6e6b6988b5bf6145c54b88.1750148083.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org>
2025-06-21  eth: mlx5: migrate to new RXFH callbacks  (Jakub Kicinski)
Migrate to new callbacks added by commit 9bb00786fc61 ("net: ethtool: add dedicated callbacks for getting and setting rxfh fields"). Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Link: https://patch.msgid.link/20250618203823.1336156-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.16-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  mlxbf_gige: return EPROBE_DEFER if PHY IRQ is not available  (David Thompson)
The message "Error getting PHY irq. Use polling instead" is emitted when the mlxbf_gige driver is loaded by the kernel before the associated gpio-mlxbf driver, and thus the call to get the PHY IRQ fails since it is not yet available. The driver probe() must return -EPROBE_DEFER if acpi_dev_gpio_irq_get_by() returns the same. Fixes: 6c2a6ddca763 ("net: mellanox: mlxbf_gige: Replace non-standard interrupt handling") Signed-off-by: David Thompson <davthompson@nvidia.com> Reviewed-by: Asmaa Mnebhi <asmaa@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250618135902.346-1-davthompson@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-19  net/mlx4_en: Remove the redundant NULL check for the 'my_ets' object  (Andrey Vatoropin)
Static analysis shows that pointer "my_ets" cannot be NULL because it points to the object "struct ieee_ets". Remove the extra NULL check. It is meaningless and harms the readability of the code. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Andrey Vatoropin <a.vatoropin@crpt.ru> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250616045034.26000-1-a.vatoropin@crpt.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
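The reasoning behind the removal can be shown in isolation: a pointer formed by taking the address of a member embedded in a larger object is never NULL, so checking it is dead code (hypothetical demo_* names, not the mlx4 structures):

#include <stdio.h>

struct demo_ets { int tc_tx_bw[8]; };
struct demo_priv { struct demo_ets ets; };	/* ets is embedded, not a pointer */

static void demo_show(struct demo_priv *priv)
{
	struct demo_ets *my_ets = &priv->ets;	/* points into priv, cannot be NULL */

	/* if (!my_ets) return;   <-- the kind of check static analysis flags */
	printf("tc0 bw: %d\n", my_ets->tc_tx_bw[0]);
}

int main(void)
{
	struct demo_priv priv = { .ets = { .tc_tx_bw = { 12 } } };

	demo_show(&priv);
	return 0;
}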
2025-06-18  udp_tunnel: remove rtnl_lock dependency  (Stanislav Fomichev)
Drivers that are using ops lock and don't depend on RTNL lock still need to manage it because of udp_tunnel's RTNL dependency. Introduce new udp_tunnel_nic_lock and use it instead of rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to grab udp_tunnel_nic_lock mutex and might sleep). Cover more places in v4:
- netlink
- udp_tunnel_notify_add_rx_port (ndo_open)
  - triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_notify_del_rx_port (ndo_stop)
  - triggers udp_tunnel_nic_device_sync_work
- udp_tunnel_get_rx_info (__netdev_update_features)
  - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- udp_tunnel_drop_rx_info (__netdev_update_features)
  - triggers NETDEV_UDP_TUNNEL_DROP_INFO
- udp_tunnel_nic_reset_ntf (ndo_open)
- notifiers
  - udp_tunnel_nic_netdevice_event, depending on the event:
    - triggers NETDEV_UDP_TUNNEL_PUSH_INFO
    - triggers NETDEV_UDP_TUNNEL_DROP_INFO
- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)
Cc: Michael Chan <michael.chan@broadcom.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Stanislav Fomichev <stfomichev@gmail.com> Link: https://patch.msgid.link/20250616162117.287806-4-stfomichev@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-18  net/mlx4e: Don't redefine IB_MTU_XXX enum  (Mark Zhang)
Rely on existing IB_MTU_XXX definitions which exist in ib_verbs.h. Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Mark Zhang <markzhang@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/382c91ee506e7f1f3c1801957df6b28963484b7d.1750147222.git.leon@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: Add TX support for netmems  (Dragos Tatulea)
Declare netmem TX support in netdev. As required, use the netmem aware dma unmapping APIs for unmapping netmems in tx completion path. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-13-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: Support ethtool tcp-data-split settings  (Saeed Mahameed)
In mlx5, tcp header-data split requires HW GRO to be on. Enabling it fails when HW GRO is off. mlx5e_fix_features now keeps HW GRO on when tcp data split is enabled. Finally, when tcp data split is disabled, features are updated to maybe remove the forced HW GRO. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-12-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: Implement queue mgmt ops and single channel swap  (Saeed Mahameed)
The bulk of the work is done in mlx5e_queue_mem_alloc, where we allocate and create the new channel resources, similar to mlx5e_safe_switch_params, but here we do it for a single channel using existing params, sort of a clone channel. To swap the old channel with the new one, we deactivate and close the old channel, then replace it with the new one; since the swap procedure doesn't fail in mlx5, we do it all in one place (mlx5e_queue_start). Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Acked-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250616141441.1243044-11-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: Add support for UNREADABLE netmem page pools  (Saeed Mahameed)
On netdev_rx_queue_restart, a special type of page pool may be expected. In this patch declare support for UNREADABLE netmem iov pages in the pool params only when header data split shampo RQ mode is enabled, and also set the queue index in the page pool params struct. Shampo mode requirement: without header split, RX needs to peek at the data, so we can't do UNREADABLE_NETMEM. The patch also enables the use of a separate page pool for headers when a memory provider is installed for the queue, otherwise the same common page pool continues to be used. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-10-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: Convert over to netmem  (Saeed Mahameed)
mlx5e_page_frag holds the physical page itself. To naturally support zc page pools, remove the physical page reference from mlx5 and replace it with netmem_ref, to avoid internal handling in mlx5 for net_iov-backed pages. SHAMPO can issue packets that are not split into header and data. These packets will be dropped if the data part resides in a net_iov as the driver can't read into this area. No performance degradation observed. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: SHAMPO: Separate pool for headers  (Saeed Mahameed)
Allow allocating a separate page pool for headers when SHAMPO is on. This will be useful for adding support for a zc page pool, which has to be different from the headers page pool. For now, the pools are the same. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: SHAMPO: Improve hw gro capability checking  (Saeed Mahameed)
Add missing HW capabilities and declare the feature in netdev->vlan_features, similar to other features in mlx5e_build_nic_netdev. No functional change here, as all features that are disabled by default are explicitly disabled at the bottom of the function. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: SHAMPO: Remove redundant params  (Saeed Mahameed)
Two SHAMPO params are static and always the same, remove them from the global mlx5e_params struct. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-17  net/mlx5e: SHAMPO: Reorganize mlx5_rq_shampo_alloc  (Saeed Mahameed)
Drop redundant SHAMPO structure alloc/free functions. Gather together function calls pertaining to header split info, and pass headers per WQE (hd_per_wqe) as a parameter to those functions to avoid future use-before-initialization mistakes. Allocate HW GRO related info outside of the header related info scope. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250616141441.1243044-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12  net: ethtool: require drivers to opt into the per-RSS ctx RXFH  (Jakub Kicinski)
RX Flow Hashing supports using different configuration for different RSS contexts. Only two drivers seem to support it. Make sure we uniformly error out for drivers which don't. Reviewed-by: Joe Damato <joe@dama.to> Link: https://patch.msgid.link/20250611145949.2674086-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)
Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12  Merge tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds)
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth and wireless.
Current release - regressions:
- af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD
Current release - new code bugs:
- eth: airoha: correct enable mask for RX queues 16-31
- veth: prevent NULL pointer dereference in veth_xdp_rcv when peer disappears under traffic
- ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid routes
Previous releases - regressions:
- phy: phy_caps: don't skip better duplex match on non-exact match
- dsa: b53: fix untagged traffic sent via cpu tagged with VID 0
- Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient packet loss, exact reason not fully understood, yet
Previous releases - always broken:
- net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6)
- sched: sfq: fix a potential crash on gso_skb handling
- Bluetooth: intel: improve rx buffer posting to avoid causing issues in the firmware
- eth: intel: i40e: make reset handling robust against multiple requests
- eth: mlx5: ensure FW pages are always allocated on the local NUMA node, even when device is configured to 'serve' another node
- wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850, prevent kernel crashes
- wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request() for 3 sec if fw_stats_done is not set"
* tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
  selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context
  net: ethtool: Don't check if RSS context exists in case of context 0
  af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD.
  ipv6: Move fib6_config_validate() to ip6_route_add().
  net: drv: netdevsim: don't napi_complete() from netpoll
  net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()
  veth: prevent NULL pointer dereference in veth_xdp_rcv
  net_sched: remove qdisc_tree_flush_backlog()
  net_sched: ets: fix a race in ets_qdisc_change()
  net_sched: tbf: fix a race in tbf_change()
  net_sched: red: fix a race in __red_change()
  net_sched: prio: fix a race in prio_tune()
  net_sched: sch_sfq: reject invalid perturb period
  net: phy: phy_caps: Don't skip better duplex macth on non-exact match
  MAINTAINERS: Update Kuniyuki Iwashima's email address.
  selftests: net: add test case for NAT46 looping back dst
  net: clear the dst when changing skb protocol
  net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper
  net/mlx5e: Fix leak of Geneve TLV option object
  net/mlx5: HWS, make sure the uplink is the last destination
  ...
2025-06-12  net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get()  (Dan Carpenter)
Check whether ida_alloc() or rhashtable_lookup_get_insert_fast() fails. Fixes: 17e0accac577 ("net/mlx5: HWS, support complex matchers") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Link: https://patch.msgid.link/aEmBONjyiF6z5yCV@stanley.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11  net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper  (Shahar Shitrit)
When the link is up, either eth_proto_oper or ext_eth_proto_oper typically reports the active link protocol, from which both speed and number of lanes can be retrieved. However, in certain cases, such as when a NIC is connected via a non-standard cable, the firmware may not report the protocol. In such scenarios, the speed can still be obtained from the data_rate_oper field in PTYS register. Since data_rate_oper provides only speed information and lacks lane details, it is incorrect to derive the number of lanes from it. This patch corrects the behavior by setting the number of lanes to UNKNOWN instead of incorrectly using MAX_LANES when relying on data_rate_oper. Fixes: 7e959797f021 ("net/mlx5e: Enable lanes configuration when auto-negotiation is off") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-10-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11  net/mlx5e: Fix leak of Geneve TLV option object  (Jianbo Liu)
Previously, a unique tunnel id was added for the matching on TC non-zero chains, to support inner header rewrite with goto action. Later, it was used to support VF tunnel offload for vxlan, then for Geneve and GRE. To support VF tunnel, a temporary mlx5_flow_spec is used to parse tunnel options. For Geneve, if there is a TLV option, an object is created, or its refcnt is incremented if it already exists. But the temporary mlx5_flow_spec is directly freed after parsing, which causes the leak because no information regarding the object is saved in the flow's mlx5_flow_spec, which is used to free the object when deleting the flow. To fix the leak, call mlx5_geneve_tlv_option_del() before freeing the temporary spec if it has a TLV object. Fixes: 521933cdc4aa ("net/mlx5e: Support Geneve and GRE with VF tunnel offload") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Alex Lazar <alazar@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
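The shape of the fix, using a toy refcounted object rather than the real Geneve TLV option API (demo_* names are hypothetical):

#include <stdio.h>
#include <stdlib.h>

struct demo_tlv_opt { int refcnt; };

static struct demo_tlv_opt *demo_tlv_get(struct demo_tlv_opt *opt)
{
	opt->refcnt++;
	return opt;
}

static void demo_tlv_put(struct demo_tlv_opt *opt)
{
	opt->refcnt--;
}

struct demo_spec { struct demo_tlv_opt *tlv_opt; };	/* temporary spec used for parsing */

int main(void)
{
	struct demo_tlv_opt shared = { .refcnt = 0 };
	struct demo_spec *tmp = calloc(1, sizeof(*tmp));

	tmp->tlv_opt = demo_tlv_get(&shared);	/* parsing took a reference */

	/* Buggy path: freeing tmp here would drop the only record of the
	 * reference, so the object would never be released (the leak). */
	if (tmp->tlv_opt)
		demo_tlv_put(tmp->tlv_opt);	/* fix: release before freeing the temp spec */
	free(tmp);

	printf("refcnt at exit: %d\n", shared.refcnt);	/* 0: balanced */
	return 0;
}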
2025-06-11  net/mlx5: HWS, make sure the uplink is the last destination  (Vlad Dogaru)
When there are more than one destination, we create a FW flow table and provide it with all the destinations. FW requires wire to be the last destination in the list (if it exists); otherwise the operation fails with a FW syndrome. This patch fixes the destination array action creation: if it contains a wire destination, it is moved to the end. Fixes: 504e536d9010 ("net/mlx5: HWS, added actions handling") Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
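A standalone sketch of the reordering (generic C with hypothetical demo_* types, not the HWS action code): if the destination array contains a wire entry, swap it to the last position before handing the list to FW.

#include <stdbool.h>
#include <stdio.h>

struct demo_dest { int id; bool is_wire; };

static void demo_move_wire_last(struct demo_dest *dests, int n)
{
	for (int i = 0; i < n - 1; i++) {
		if (!dests[i].is_wire)
			continue;
		struct demo_dest tmp = dests[i];

		dests[i] = dests[n - 1];	/* move the current last entry up */
		dests[n - 1] = tmp;		/* wire goes last, as FW expects */
		break;				/* at most one wire destination */
	}
}

int main(void)
{
	struct demo_dest dests[] = { { 1, true }, { 2, false }, { 3, false } };

	demo_move_wire_last(dests, 3);
	for (int i = 0; i < 3; i++)
		printf("dest %d%s\n", dests[i].id, dests[i].is_wire ? " (wire)" : "");
	return 0;
}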
2025-06-11  net/mlx5: HWS, fix missing ip_version handling in definer  (Yevgeny Kliteynik)
Fix missing field handling in definer - outer IP version. Fixes: 74a778b4a63f ("net/mlx5: HWS, added definers handling") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11  net/mlx5: HWS, Init mutex on the correct path  (Vlad Dogaru)
The newly introduced mutex is only used for reformat actions, but it was initialized for modify header instead. The struct that contains the mutex is zero-initialized and an all-zero mutex is valid, so the issue only shows up with CONFIG_DEBUG_MUTEXES. Fixes: b206d9ec19df ("net/mlx5: HWS, register reformat actions with fw") Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11  net/mlx5: Fix return value when searching for existing flow group  (Patrisious Haddad)
When attempting to add a rule to an existing flow group, if a matching flow group exists but is not active, the error code returned should be EAGAIN, so that the rule can be added to the matching flow group once it is active, rather than ENOENT, which indicates that no matching flow group was found. Fixes: bd71b08ec2ee ("net/mlx5: Support multiple updates of steering rules in parallel") Signed-off-by: Gavi Teitz <gavi@nvidia.com> Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
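The distinction in errno semantics, sketched outside the steering code with a hypothetical lookup helper:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct demo_fg { bool matches; bool active; };

/* Return 0 on success, -EAGAIN if the caller should retry once the
 * matching group becomes active, -ENOENT if nothing matches at all. */
static int demo_find_group(const struct demo_fg *groups, int n)
{
	for (int i = 0; i < n; i++) {
		if (!groups[i].matches)
			continue;
		return groups[i].active ? 0 : -EAGAIN;	/* was -ENOENT before the fix */
	}
	return -ENOENT;
}

int main(void)
{
	struct demo_fg groups[] = { { .matches = true, .active = false } };

	printf("%d\n", demo_find_group(groups, 1));	/* -EAGAIN: retry later */
	return 0;
}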
2025-06-11  net/mlx5: Fix ECVF vports unload on shutdown flow  (Amir Tzin)
Fix shutdown flow UAF when a virtual function is created on the embedded chip (ECVF) of a BlueField device. In such a case the vport acl ingress table is not properly destroyed. ECVF functionality is independent of ecpf_vport_exists capability and thus functions mlx5_eswitch_(enable|disable)_pf_vf_vports() should not test it when enabling/disabling ECVF vports.
kernel log:
[] refcount_t: underflow; use-after-free.
[] WARNING: CPU: 3 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0x124/0x220
----------------
[] Call trace:
[] refcount_warn_saturate+0x124/0x220
[] tree_put_node+0x164/0x1e0 [mlx5_core]
[] mlx5_destroy_flow_table+0x98/0x2c0 [mlx5_core]
[] esw_acl_ingress_table_destroy+0x28/0x40 [mlx5_core]
[] esw_acl_ingress_lgcy_cleanup+0x80/0xf4 [mlx5_core]
[] esw_legacy_vport_acl_cleanup+0x44/0x60 [mlx5_core]
[] esw_vport_cleanup+0x64/0x90 [mlx5_core]
[] mlx5_esw_vport_disable+0xc0/0x1d0 [mlx5_core]
[] mlx5_eswitch_unload_ec_vf_vports+0xcc/0x150 [mlx5_core]
[] mlx5_eswitch_disable_sriov+0x198/0x2a0 [mlx5_core]
[] mlx5_device_disable_sriov+0xb8/0x1e0 [mlx5_core]
[] mlx5_sriov_detach+0x40/0x50 [mlx5_core]
[] mlx5_unload+0x40/0xc4 [mlx5_core]
[] mlx5_unload_one_devl_locked+0x6c/0xe4 [mlx5_core]
[] mlx5_unload_one+0x3c/0x60 [mlx5_core]
[] shutdown+0x7c/0xa4 [mlx5_core]
[] pci_device_shutdown+0x3c/0xa0
[] device_shutdown+0x170/0x340
[] __do_sys_reboot+0x1f4/0x2a0
[] __arm64_sys_reboot+0x2c/0x40
[] invoke_syscall+0x78/0x100
[] el0_svc_common.constprop.0+0x54/0x184
[] do_el0_svc+0x30/0xac
[] el0_svc+0x48/0x160
[] el0t_64_sync_handler+0xa4/0x12c
[] el0t_64_sync+0x1a4/0x1a8
[] --[ end trace 9c4601d68c70030e ]---
Fixes: a7719b29a821 ("net/mlx5: Add management of EC VF vports") Reviewed-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Amir Tzin <amirtz@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11  net/mlx5: Ensure fw pages are always allocated on same NUMA  (Moshe Shemesh)
When firmware asks the driver to allocate more pages, using the give_pages event, the driver should always allocate them from the same NUMA node, the original device NUMA node. Current code uses dev_to_node() which can result in a different NUMA node as it is changed by other driver flows, such as mlx5_dma_zalloc_coherent_node(). Instead, use the saved NUMA node for allocating firmware pages. Fixes: 311c7c71c9bb ("net/mlx5e: Allocate DMA coherent memory on reader NUMA node") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250610151514.1094735-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11  net/mlx5: Expose serial numbers in devlink info  (Jiri Pirko)
Devlink info allows exposing the serial number and board serial number. Get the values from PCI VPD and expose them.
$ devlink dev info
pci/0000:08:00.0:
  driver mlx5_core
  serial_number e4397f872caeed218000846daa7d2f49
  board.serial_number MT2314XZ00YA
  versions:
      fixed:
        fw.psid MT_0000000894
      running:
        fw.version 28.41.1000
        fw 28.41.1000
      stored:
        fw.version 28.41.1000
        fw 28.41.1000
auxiliary/mlx5_core.eth.0:
  driver mlx5_core.eth
pci/0000:08:00.1:
  driver mlx5_core
  serial_number e4397f872caeed218000846daa7d2f49
  board.serial_number MT2314XZ00YA
  versions:
      fixed:
        fw.psid MT_0000000894
      running:
        fw.version 28.41.1000
        fw 28.41.1000
      stored:
        fw.version 28.41.1000
        fw 28.41.1000
auxiliary/mlx5_core.eth.1:
  driver mlx5_core.eth
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Acked-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250610025128.109232-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-08  treewide, timers: Rename from_timer() to timer_container_of()  (Ingo Molnar)
Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-05-29  net/mlx4_en: Prevent potential integer overflow calculating Hz  (Dan Carpenter)
The "freq" variable is in terms of MHz and "max_val_cycles" is in terms of Hz. The fact that "max_val_cycles" is a u64 suggests that support for high frequency is intended but the "freq_khz * 1000" would overflow the u32 type if we went above 4GHz. Use unsigned long long type for the mutliplication to prevent that. Fixes: 31c128b66e5b ("net/mlx4_en: Choose time-stamping shift value according to HW frequency") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/aDbFHe19juIJKjsb@stanley.mountain Signed-off-by: Paolo Abeni <pabeni@redhat.com>