summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet/mellanox
AgeCommit message (Collapse)Author
10 hoursMerge tag 'net-next-6.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Wrap datapath globals into net_aligned_data, to avoid false sharing - Preserve MSG_ZEROCOPY in forwarding (e.g. out of a container) - Add SO_INQ and SCM_INQ support to AF_UNIX - Add SIOCINQ support to AF_VSOCK - Add TCP_MAXSEG sockopt to MPTCP - Add IPv6 force_forwarding sysctl to enable forwarding per interface - Make TCP validation of whether packet fully fits in the receive window and the rcv_buf more strict. With increased use of HW aggregation a single "packet" can be multiple 100s of kB - Add MSG_MORE flag to optimize large TCP transmissions via sockmap, improves latency up to 33% for sockmap users - Convert TCP send queue handling from tasklet to BH workque - Improve BPF iteration over TCP sockets to see each socket exactly once - Remove obsolete and unused TCP RFC3517/RFC6675 loss recovery code - Support enabling kernel threads for NAPI processing on per-NAPI instance basis rather than a whole device. Fully stop the kernel NAPI thread when threaded NAPI gets disabled. Previously thread would stick around until ifdown due to tricky synchronization - Allow multicast routing to take effect on locally-generated packets - Add output interface argument for End.X in segment routing - MCTP: add support for gateway routing, improve bind() handling - Don't require rtnl_lock when fetching an IPv6 neighbor over Netlink - Add a new neighbor flag ("extern_valid"), which cedes refresh responsibilities to userspace. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed - Support NUD_PERMANENT for proxy neighbor entries - Add a new queuing discipline for IETF RFC9332 DualQ Coupled AQM - Add sequence numbers to netconsole messages. Unregister netconsole's console when all net targets are removed. Code refactoring. Add a number of selftests - Align IPSec inbound SA lookup to RFC 4301. Only SPI and protocol should be used for an inbound SA lookup - Support inspecting ref_tracker state via DebugFS - Don't force bonding advertisement frames tx to ~333 ms boundaries. Add broadcast_neighbor option to send ARP/ND on all bonded links - Allow providing upcall pid for the 'execute' command in openvswitch - Remove DCCP support from Netfilter's conntrack - Disallow multiple packet duplications in the queuing layer - Prevent use of deprecated iptables code on PREEMPT_RT Driver API: - Support RSS and hashing configuration over ethtool Netlink - Add dedicated ethtool callbacks for getting and setting hashing fields - Add support for power budget evaluation strategy in PSE / Power-over-Ethernet. Generate Netlink events for overcurrent etc - Support DPLL phase offset monitoring across all device inputs. Support providing clock reference and SYNC over separate DPLL inputs - Support traffic classes in devlink rate API for bandwidth management - Remove rtnl_lock dependency from UDP tunnel port configuration Device drivers: - Add a new Broadcom driver for 800G Ethernet (bnge) - Add a standalone driver for Microchip ZL3073x DPLL - Remove IBM's NETIUCV device driver - Ethernet high-speed NICs: - Broadcom (bnxt): - support zero-copy Tx of DMABUF memory - take page size into account for page pool recycling rings - Intel (100G, ice, idpf): - idpf: XDP and AF_XDP support preparations - idpf: add flow steering - add link_down_events statistic - clean up the TSPLL code - preparations for live VM migration - nVidia/Mellanox: - support zero-copy Rx/Tx interfaces (DMABUF and io_uring) - optimize context memory usage for matchers - expose serial numbers in devlink info - support PCIe congestion metrics - Meta (fbnic): - add 25G, 50G, and 100G link modes to phylink - support dumping FW logs - Marvell/Cavium: - support for CN20K generation of the Octeon chips - Amazon: - add HW clock (without timestamping, just hypervisor time access) - Ethernet virtual: - VirtIO net: - support segmentation of UDP-tunnel-encapsulated packets - Google (gve): - support packet timestamping and clock synchronization - Microsoft vNIC: - add handler for device-originated servicing events - allow dynamic MSI-X vector allocation - support Tx bandwidth clamping - Ethernet NICs consumer, and embedded: - AMD: - amd-xgbe: hardware timestamping and PTP clock support - Broadcom integrated MACs (bcmgenet, bcmasp): - use napi_complete_done() return value to support NAPI polling - add support for re-starting auto-negotiation - Broadcom switches (b53): - support BCM5325 switches - add bcm63xx EPHY power control - Synopsys (stmmac): - lots of code refactoring and cleanups - TI: - icssg-prueth: read firmware-names from device tree - icssg: PRP offload support - Microchip: - lan78xx: convert to PHYLINK for improved PHY and MAC management - ksz: add KSZ8463 switch support - Intel: - support similar queue priority scheme in multi-queue and time-sensitive networking (taprio) - support packet pre-emption in both - RealTek (r8169): - enable EEE at 5Gbps on RTL8126 - Airoha: - add PPPoE offload support - MDIO bus controller for Airoha AN7583 - Ethernet PHYs: - support for the IPQ5018 internal GE PHY - micrel KSZ9477 switch-integrated PHYs: - add MDI/MDI-X control support - add RX error counters - add cable test support - add Signal Quality Indicator (SQI) reporting - dp83tg720: improve reset handling and reduce link recovery time - support bcm54811 (and its MII-Lite interface type) - air_en8811h: support resume/suspend - support PHY counters for QCA807x and QCA808x - support WoL for QCA807x - CAN drivers: - rcar_canfd: support for Transceiver Delay Compensation - kvaser: report FW versions via devlink dev info - WiFi: - extended regulatory info support (6 GHz) - add statistics and beacon monitor for Multi-Link Operation (MLO) - support S1G aggregation, improve S1G support - add Radio Measurement action fields - support per-radio RTS threshold - some work around how FIPS affects wifi, which was wrong (RC4 is used by TKIP, not only WEP) - improvements for unsolicited probe response handling - WiFi drivers: - RealTek (rtw88): - IBSS mode for SDIO devices - RealTek (rtw89): - BT coexistence for MLO/WiFi7 - concurrent station + P2P support - support for USB devices RTL8851BU/RTL8852BU - Intel (iwlwifi): - use embedded PNVM in (to be released) FW images to fix compatibility issues - many cleanups (unused FW APIs, PCIe code, WoWLAN) - some FIPS interoperability - MediaTek (mt76): - firmware recovery improvements - more MLO work - Qualcomm/Atheros (ath12k): - fix scan on multi-radio devices - more EHT/Wi-Fi 7 features - encapsulation/decapsulation offload - Broadcom (brcm80211): - support SDIO 43751 device - Bluetooth: - hci_event: add support for handling LE BIG Sync Lost event - ISO: add socket option to report packet seqnum via CMSG - ISO: support SCM_TIMESTAMPING for ISO TS - Bluetooth drivers: - intel_pcie: support Function Level Reset - nxpuart: add support for 4M baudrate - nxpuart: implement powerup sequence, reset, FW dump, and FW loading" * tag 'net-next-6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1742 commits) dpll: zl3073x: Fix build failure selftests: bpf: fix legacy netfilter options ipv6: annotate data-races around rt->fib6_nsiblings ipv6: fix possible infinite loop in fib6_info_uses_dev() ipv6: prevent infinite loop in rt6_nlmsg_size() ipv6: add a retry logic in net6_rt_notify() vrf: Drop existing dst reference in vrf_ip6_input_dst net/sched: taprio: align entry index attr validation with mqprio net: fsl_pq_mdio: use dev_err_probe selftests: rtnetlink.sh: remove esp4_offload after test vsock: remove unnecessary null check in vsock_getname() igb: xsk: solve negative overflow of nb_pkts in zerocopy mode stmmac: xsk: fix negative overflow of budget in zerocopy mode dt-bindings: ieee802154: Convert at86rf230.txt yaml format net: dsa: microchip: Disable PTP function of KSZ8463 net: dsa: microchip: Setup fiber ports for KSZ8463 net: dsa: microchip: Write switch MAC address differently for KSZ8463 net: dsa: microchip: Use different registers for KSZ8463 net: dsa: microchip: Add KSZ8463 switch support to KSZ DSA driver dt-bindings: net: dsa: microchip: Add KSZ8463 switch support ...
29 hoursMerge tag 'timers-cleanups-2025-07-27' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide cleanup of struct cycle_counter const annotations. The initial idea of making them const was correct as they were seperate instances. When they got embedded into larger data structures, which are even modified by the callback this got moot. The only reason why this went unnoticed is that the required container_of() casts the const attribute forcefully away. Stop pretending that it is const" * tag 'timers-cleanups-2025-07-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time/timecounter: Fix the lie that struct cyclecounter is const
4 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Merge in late fixes to prepare for the 6.17 net-next PR. Conflicts: net/core/neighbour.c 1bbb76a89948 ("neighbour: Fix null-ptr-deref in neigh_flush_dev().") 13a936bb99fb ("neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.") 03dc03fa0432 ("neighbor: Add NTF_EXT_VALIDATED flag for externally validated entries") Adjacent changes: drivers/net/usb/usbnet.c 0d9cfc9b8cb1 ("net: usbnet: Avoid potential RCU stall on LINK_CHANGE event") 2c04d279e857 ("net: usb: Convert tasklet API to new bottom half workqueue mechanism") net/ipv6/route.c 31d7d67ba127 ("ipv6: annotate data-races around rt->fib6_nsiblings") 1caf27297215 ("ipv6: adopt dst_dev() helper") 3b3ccf9ed05e ("net: Remove unnecessary NULL check for lwtunnel_fill_encap()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysnet/mlx5e: Expose TIS via devlink tx reporter diagnoseFeng Liu
Underneath "TIS Config" tag expose TIS diagnostic information. Expose the tisn of each TC under each lag port. $ sudo devlink health diagnose auxiliary/mlx5_core.eth.2/131072 reporter tx ...... TIS Config: lag port: 0 tc: 0 tisn: 0 lag port: 1 tc: 0 tisn: 8 ...... Signed-off-by: Feng Liu <feliu@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1753194228-333722-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysnet/mlx5e: Support routed networks during IPsec MACs initializationAlexandre Cassen
Remote IPsec tunnel endpoint may refer to a network segment that is not directly connected to the host. In such a case, IPsec tunnel endpoints are connected to a router and reachable via a routing path. In IPsec packet offload mode, HW is initialized with the MAC address of both IPsec tunnel endpoints. Extend the current IPsec init MACs procedure to resolve nexthop for routed networks. Direct neighbour lookup and probe is still used for directly connected networks and as a fallback mechanism if fib lookup fails. Signed-off-by: Alexandre Cassen <acassen@corp.free.fr> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1753194228-333722-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysnet/mlx5e: Fix potential deadlock by deferring RX timeout recoveryShahar Shitrit
mlx5e_reporter_rx_timeout() is currently invoked synchronously in the driver's open error flow. This causes the thread holding priv->state_lock to attempt acquiring the devlink lock, which can result in a circular dependency with other devlink operations. For example: - Devlink health diagnose flow: - __devlink_nl_pre_doit() acquires the devlink lock. - devlink_nl_health_reporter_diagnose_doit() invokes the driver's diagnose callback. - mlx5e_rx_reporter_diagnose() then attempts to acquire priv->state_lock. - Driver open flow: - mlx5e_open() acquires priv->state_lock. - If an error occurs, devlink_health_reporter may be called, attempting to acquire the devlink lock. To prevent this circular locking scenario, defer the RX timeout recovery by scheduling it via a workqueue. This ensures that the recovery work acquires locks in a consistent order: first the devlink lock, then priv->state_lock. Additionally, make the recovery work acquire the netdev instance lock to safely synchronize with the open/close channel flows, similar to mlx5e_tx_timeout_work. Repeatedly attempt to acquire the netdev instance lock until it is taken or the target RQ is no longer active, as indicated by the MLX5E_STATE_CHANNELS_ACTIVE bit. Fixes: 32c57fb26863 ("net/mlx5e: Report and recover from rx timeout") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1753256672-337784-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysnet/mlx5e: Remove skb secpath if xfrm state is not foundJianbo Liu
Hardware returns a unique identifier for a decrypted packet's xfrm state, this state is looked up in an xarray. However, the state might have been freed by the time of this lookup. Currently, if the state is not found, only a counter is incremented. The secpath (sp) extension on the skb is not removed, resulting in sp->len becoming 0. Subsequently, functions like __xfrm_policy_check() attempt to access fields such as xfrm_input_state(skb)->xso.type (which dereferences sp->xvec[sp->len - 1]) without first validating sp->len. This leads to a crash when dereferencing an invalid state pointer. This patch prevents the crash by explicitly removing the secpath extension from the skb if the xfrm state is not found after hardware decryption. This ensures downstream functions do not operate on a zero-length secpath. BUG: unable to handle page fault for address: ffffffff000002c8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 282e067 P4D 282e067 PUD 0 Oops: Oops: 0000 [#1] SMP CPU: 12 UID: 0 PID: 0 Comm: swapper/12 Not tainted 6.15.0-rc7_for_upstream_min_debug_2025_05_27_22_44 #1 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:__xfrm_policy_check+0x61a/0xa30 Code: b6 77 7f 83 e6 02 74 14 4d 8b af d8 00 00 00 41 0f b6 45 05 c1 e0 03 48 98 49 01 c5 41 8b 45 00 83 e8 01 48 98 49 8b 44 c5 10 <0f> b6 80 c8 02 00 00 83 e0 0c 3c 04 0f 84 0c 02 00 00 31 ff 80 fa RSP: 0018:ffff88885fb04918 EFLAGS: 00010297 RAX: ffffffff00000000 RBX: 0000000000000002 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000000 RBP: ffffffff8311af80 R08: 0000000000000020 R09: 00000000c2eda353 R10: ffff88812be2bbc8 R11: 000000001faab533 R12: ffff88885fb049c8 R13: ffff88812be2bbc8 R14: 0000000000000000 R15: ffff88811896ae00 FS: 0000000000000000(0000) GS:ffff8888dca82000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff000002c8 CR3: 0000000243050002 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? try_to_wake_up+0x108/0x4c0 ? udp4_lib_lookup2+0xbe/0x150 ? udp_lib_lport_inuse+0x100/0x100 ? __udp4_lib_lookup+0x2b0/0x410 __xfrm_policy_check2.constprop.0+0x11e/0x130 udp_queue_rcv_one_skb+0x1d/0x530 udp_unicast_rcv_skb+0x76/0x90 __udp4_lib_rcv+0xa64/0xe90 ip_protocol_deliver_rcu+0x20/0x130 ip_local_deliver_finish+0x75/0xa0 ip_local_deliver+0xc1/0xd0 ? ip_protocol_deliver_rcu+0x130/0x130 ip_sublist_rcv+0x1f9/0x240 ? ip_rcv_finish_core+0x430/0x430 ip_list_rcv+0xfc/0x130 __netif_receive_skb_list_core+0x181/0x1e0 netif_receive_skb_list_internal+0x200/0x360 ? mlx5e_build_rx_skb+0x1bc/0xda0 [mlx5_core] gro_receive_skb+0xfd/0x210 mlx5e_handle_rx_cqe_mpwrq+0x141/0x280 [mlx5_core] mlx5e_poll_rx_cq+0xcc/0x8e0 [mlx5_core] ? mlx5e_handle_rx_dim+0x91/0xd0 [mlx5_core] mlx5e_napi_poll+0x114/0xab0 [mlx5_core] __napi_poll+0x25/0x170 net_rx_action+0x32d/0x3a0 ? mlx5_eq_comp_int+0x8d/0x280 [mlx5_core] ? notifier_call_chain+0x33/0xa0 handle_softirqs+0xda/0x250 irq_exit_rcu+0x6d/0xc0 common_interrupt+0x81/0xa0 </IRQ> Fixes: b2ac7541e377 ("net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offload") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1753256672-337784-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysnet/mlx5e: Clear Read-Only port buffer size in PBMC before updateAlexei Lazar
When updating the PBMC register, we read its current value, modify desired fields, then write it back. The port_buffer_size field within PBMC is Read-Only (RO). If this RO field contains a non-zero value when read, attempting to write it back will cause the entire PBMC register update to fail. This commit ensures port_buffer_size is explicitly cleared to zero after reading the PBMC register but before writing back the modified value. This allows updates to other fields in the PBMC register to succeed. Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1753256672-337784-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysnet: Fix typosBjorn Helgaas
Fix typos in comments and error messages. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: David Arinzon <darinzon@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250723201528.2908218-1-helgaas@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet/mlx5: Fix build -Wframe-larger-than warningsZhu Yanjun
When building, the following warnings will appear. " pci_irq.c: In function ‘mlx5_ctrl_irq_request’: pci_irq.c:494:1: warning: the frame size of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=] pci_irq.c: In function ‘mlx5_irq_request_vector’: pci_irq.c:561:1: warning: the frame size of 1040 bytes is larger than 1024 bytes [-Wframe-larger-than=] eq.c: In function ‘comp_irq_request_sf’: eq.c:897:1: warning: the frame size of 1080 bytes is larger than 1024 bytes [-Wframe-larger-than=] irq_affinity.c: In function ‘irq_pool_request_irq’: irq_affinity.c:74:1: warning: the frame size of 1048 bytes is larger than 1024 bytes [-Wframe-larger-than=] " These warnings indicate that the stack frame size exceeds 1024 bytes in these functions. To resolve this, instead of allocating large memory buffers on the stack, it is better to use kvzalloc to allocate memory dynamically on the heap. This approach reduces stack usage and eliminates these frame size warnings. Acked-by: Junxian Huang <huangjunxian6@hisilicon.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250722212023.244296-1-yanjun.zhu@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet: Use netif_threaded_enable instead of netif_set_threaded in driversSamiullah Khawaja
Prepare for adding an enum type for NAPI threaded states by adding netif_threaded_enable API. De-export the existing netif_set_threaded API and only use it internally. Update existing drivers to use netif_threaded_enable instead of the de-exported netif_set_threaded. Note that dev_set_threaded used by mt76 debugfs file is unchanged. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20250723013031.2911384-3-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc8). Conflicts: drivers/net/ethernet/microsoft/mana/gdma_main.c 9669ddda18fb ("net: mana: Fix warnings for missing export.h header inclusion") 755391121038 ("net: mana: Allocate MSI-X vectors dynamically") https://lore.kernel.org/20250711130752.23023d98@canb.auug.org.au Adjacent changes: drivers/net/ethernet/ti/icssg/icssg_prueth.h 6e86fb73de0f ("net: ti: icssg-prueth: Fix buffer allocation for ICSSG") ffe8a4909176 ("net: ti: icssg-prueth: Read firmware-names from device tree") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysmlx5: access ->pp through netmem_desc instead of pageByungchul Park
To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make mlx5 access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-11-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnetmem, mlx4: access ->pp_ref_count through netmem_desc instead of pageByungchul Park
To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make mlx4 access ->pp_ref_count through netmem_desc instead of page. While at it, add a helper, pp_page_to_nmdesc() and __pp_page_to_nmdesc(), that can be used to get netmem_desc from page only if it's a pp page. For now that netmem_desc overlays on page, it can be achieved by just casting, and use macro and _Generic to cover const casting as well. Plus, change page_pool_page_is_pp() to check for 'const struct page *' instead of 'struct page *' since it doesn't modify data and additionally covers const type. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-4-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 daysMerge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-07-22 The following pull-request contains common mlx5 updates * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Expose cable_length field in PFCC register net/mlx5: Add IFC bits and enums for buf_ownership net/mlx5: Add IFC bits to support RSS for IPSec offload ==================== Link: https://patch.msgid.link/1753175048-330044-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 daysnet/mlx5e: Remove duplicate mkey from SHAMPO headerLama Kayal
SHAMPO structure holds two variations of the mkey, which is unnecessary, a duplication that's repeated per rq. Remove duplicate mkey information and keep only one version, the one used in the fast path, rename field to reflect field type clearly. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1753081999-326247-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 daysnet/mlx5e: SHAMPO, Remove mlx5e_shampo_get_log_hd_entry_size()Lama Kayal
Refactor mlx5e_shampo_get_log_hd_entry_size() as macro, for more simplicity. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1753081999-326247-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 daysnet/mlx5e: SHAMPO, Cleanup reservation size formulaLama Kayal
The reservation size formula can be reduced to a simple evaluation of MLX5E_SHAMPO_WQ_RESRV_SIZE. This leaves mlx5e_shampo_get_log_rsrv_size() with one single use, which can be replaced with a macro for simplicity. Also, function mlx5e_shampo_get_log_rsrv_size() is used only throughout params.c, make it static. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1753081999-326247-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysnet/mlx5: Add IFC bits to support RSS for IPSec offloadJianbo Liu
This adds the capabilities, ipsec_next_header and inner/outer l4_type_ext fields to support RSS for the decrypted packets. These fields are specifically for firmware steering. HWS validation logic is updated to correctly handle the changes, ensuring the unsupported fields are not set. Besides, reserved_at_c4 is fixed to reserved_at_d4 to reflect the accurate offset within the structure. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752734895-257735-2-git-send-email-tariqt@nvidia.com Signed-off-by: Leon Romanovsky <leon@kernel.org>
12 daysnet/mlx5: E-Switch, Fix peer miss rules to use peer eswitchShahar Shitrit
In the original design, it is assumed local and peer eswitches have the same number of vfs. However, in new firmware, local and peer eswitches can have different number of vfs configured by mlxconfig. In such configuration, it is incorrect to derive the number of vfs from the local device's eswitch. Fix this by updating the peer miss rules add and delete functions to use the peer device's eswitch and vf count instead of the local device's information, ensuring correct behavior regardless of vf configuration differences. Fixes: ac004b832128 ("net/mlx5e: E-Switch, Add peer miss rules") Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752753970-261832-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysnet/mlx5: Fix memory leak in cmd_exec()Chiara Meiohas
If cmd_exec() is called with callback and mlx5_cmd_invoke() returns an error, resources allocated in cmd_exec() will not be freed. Fix the code to release the resources if mlx5_cmd_invoke() returns an error. Fixes: f086470122d5 ("net/mlx5: cmdif, Return value improvements") Reported-by: Alex Tereshkin <atereshkin@nvidia.com> Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752753970-261832-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysnet: s/dev_set_threaded/netif_set_threaded/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Note that one dev_set_threaded call still remains in mt76 for debugfs file. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-7-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
12 daysnet: s/dev_get_port_parent_id/netif_get_port_parent_id/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet/mlx5e: Properly access RCU protected qdisc_sleeping variableLeon Romanovsky
qdisc_sleeping variable is declared as "struct Qdisc __rcu" and as such needs proper annotation while accessing it. Without rtnl_dereference(), the following error is generated by sparse: drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: warning: incorrect type in initializer (different address spaces) drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: expected struct Qdisc *qdisc drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: got struct Qdisc [noderef] __rcu *qdisc_sleeping Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1752675472-201445-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet/mlx5e: fix kdoc warning on eswitch.hMoshe Shemesh
Fix the following kdoc warning: git ls-files *.[ch] | egrep drivers/net/ethernet/mellanox/mlx5/core/ |\ xargs scripts/kernel-doc --none drivers/net/ethernet/mellanox/mlx5/core/eswitch.h:824: warning: cannot understand function prototype: 'struct mlx5_esw_event_info ' Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1752675472-201445-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet/mlx5: HWS, Enable IPSec hardware offload in legacy modeLama Kayal
IPSec hardware offload in legacy mode should not be affected by the steering mode, hence it should also work properly with hmfs mode. Remove steering mode validation when calculating the cap for packet offload, this will also enable the missing cap MLX5_IPSEC_CAP_PRIO needed for crypto offload. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1752675472-201445-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet/mlx5: Fix an IS_ERR() vs NULL bug in esw_qos_move_node()Dan Carpenter
The __esw_qos_alloc_node() function returns NULL on error. It doesn't return error pointers. Update the error checking to match. Fixes: 96619c485fa6 ("net/mlx5: Add support for setting tc-bw on nodes") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/0ce4ec2a-2b5d-4652-9638-e715a99902a7@sabinyo.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet/mlx5e: TX, Fix dma unmapping for devmem txDragos Tatulea
net_iovs should have the dma address set to 0 so that netmem_dma_unmap_page_attrs() correctly skips the unmap. This was not done in mlx5 when support for devmem tx was added and resulted in the splat below when the platform iommu was enabled. This patch addresses the issue by using netmem_dma_unmap_addr_set() which handles the net_iov case when setting the dma address. A small refactoring of mlx5e_dma_push() was required to be able to use this API. The function was split in two versions and each version called accordingly. Note that netmem_dma_unmap_addr_set() introduces an additional if case. Splat: WARNING: CPU: 14 PID: 2587 at drivers/iommu/dma-iommu.c:1228 iommu_dma_unmap_page+0x7d/0x90 Modules linked in: [...] Unloaded tainted modules: i10nm_edac(E):1 fjes(E):1 CPU: 14 UID: 0 PID: 2587 Comm: ncdevmem Tainted: G S E 6.15.0+ #3 PREEMPT(voluntary) Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE Hardware name: HPE ProLiant DL380 Gen10 Plus/ProLiant DL380 Gen10 Plus, BIOS U46 06/01/2022 RIP: 0010:iommu_dma_unmap_page+0x7d/0x90 Code: [...] RSP: 0000:ff6b1e3ea0b2fc58 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ff46ef2d0a2340c8 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001 RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff8827a120 R10: 0000000000000000 R11: 0000000000000000 R12: 00000000d8000000 R13: 0000000000000008 R14: 0000000000000001 R15: 0000000000000000 FS: 00007feb69adf740(0000) GS:ff46ef2c779f1000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007feb69cca000 CR3: 0000000154b97006 CR4: 0000000000773ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> dma_unmap_page_attrs+0x227/0x250 mlx5e_poll_tx_cq+0x163/0x510 [mlx5_core] mlx5e_napi_poll+0x94/0x720 [mlx5_core] __napi_poll+0x28/0x1f0 net_rx_action+0x33a/0x420 ? mlx5e_completion_event+0x3d/0x40 [mlx5_core] handle_softirqs+0xe8/0x2f0 __irq_exit_rcu+0xcd/0xf0 common_interrupt+0x47/0xa0 asm_common_interrupt+0x26/0x40 RIP: 0033:0x7feb69cd08ec Code: [...] RSP: 002b:00007ffc01b8c880 EFLAGS: 00000246 RAX: 00000000c3a60cf7 RBX: 0000000000045e12 RCX: 000000000000000e RDX: 00000000000035b4 RSI: 0000000000000000 RDI: 00007ffc01b8c8c0 RBP: 00007ffc01b8c8b0 R08: 0000000000000000 R09: 0000000000000064 R10: 00007ffc01b8c8c0 R11: 0000000000000000 R12: 00007feb69cca000 R13: 00007ffc01b90e48 R14: 0000000000427e18 R15: 00007feb69d07000 </TASK> Cc: Mina Almasry <almasrymina@google.com> Reported-by: Stanislav Fomichev <stfomichev@gmail.com> Closes: https://lore.kernel.org/all/aFM6r9kFHeTdj-25@mini-arch/ Fixes: 5a842c288cfa ("net/mlx5e: Add TX support for netmems") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/1752649242-147678-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysMerge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc7). Conflicts: Documentation/netlink/specs/ovpn.yaml 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") af52020fc599 ("ovpn: reject unexpected netlink attributes") drivers/net/phy/phy_device.c a44312d58e78 ("net: phy: Don't register LEDs for genphy") f0f2b992d818 ("net: phy: Don't register LEDs for genphy") https://lore.kernel.org/20250710114926.7ec3a64f@kernel.org drivers/net/wireless/intel/iwlwifi/fw/regulatory.c drivers/net/wireless/intel/iwlwifi/mld/regulatory.c 5fde0fcbd760 ("wifi: iwlwifi: mask reserved bits in chan_state_active_bitmap") ea045a0de3b9 ("wifi: iwlwifi: add support for accepting raw DSM tables by firmware") net/ipv6/mcast.c ae3264a25a46 ("ipv6: mcast: Delay put pmc->idev in mld_del_delrec()") a8594c956cc9 ("ipv6: mcast: Avoid a duplicate pointer check in mld_del_delrec()") https://lore.kernel.org/8cc52891-3653-4b03-a45e-05464fe495cf@kernel.org No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13 daysnet/mlx5: Update the list of the PCI supported devicesMaor Gottlieb
Add the upcoming ConnectX-10 device ID to the table of supported PCI device IDs. Cc: stable@vger.kernel.org Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Eran Ben Elisha <eranbe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752650969-148501-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-16net/mlx5e: Add device PCIe congestion ethtool statsDragos Tatulea
Implement the PCIe Congestion Event notifier which triggers a work item to query the PCIe Congestion Event object. The result of the congestion state is reflected in the new ethtool stats: * pci_bw_inbound_high: the device has crossed the high threshold for inbound PCIe traffic. * pci_bw_inbound_low: the device has crossed the low threshold for inbound PCIe traffic * pci_bw_outbound_high: the device has crossed the high threshold for outbound PCIe traffic. * pci_bw_outbound_low: the device has crossed the low threshold for outbound PCIe traffic The high and low thresholds are currently configured at 90% and 75%. These are hysteresis thresholds which help to check if the PCI bus on the device side is in a congested state. If low + 1 = high then the device is in a congested state. If low == high then the device is not in a congested state. The counters are also documented. A follow-up patch will make the thresholds configurable. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752589821-145787-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-16net/mlx5e: Create/destroy PCIe Congestion Event objectDragos Tatulea
Add initial infrastructure to create and destroy the PCIe Congestion Event object if the object is supported. The verb for the object creation function is "set" instead of "create" because the function will accommodate the modify operation as well in a subsequent patch. The next patches will hook it up to the event handler and will add actual functionality. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1752589821-145787-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-16net/mlx5: Correctly set gso_size when LRO is usedChristoph Paasch
gso_size is expected by the networking stack to be the size of the payload (thus, not including ethernet/IP/TCP-headers). However, cqe_bcnt is the full sized frame (including the headers). Dividing cqe_bcnt by lro_num_seg will then give incorrect results. For example, running a bpftrace higher up in the TCP-stack (tcp_event_data_recv), we commonly have gso_size set to 1450 or 1451 even though in reality the payload was only 1448 bytes. This can have unintended consequences: - In tcp_measure_rcv_mss() len will be for example 1450, but. rcv_mss will be 1448 (because tp->advmss is 1448). Thus, we will always recompute scaling_ratio each time an LRO-packet is received. - In tcp_gro_receive(), it will interfere with the decision whether or not to flush and thus potentially result in less gro'ed packets. So, we need to discount the protocol headers from cqe_bcnt so we can actually divide the payload by lro_num_seg to get the real gso_size. v2: - Use "(unsigned char *)tcp + tcp->doff * 4 - skb->data)" to compute header-len (Tariq Toukan <tariqt@nvidia.com>) - Improve commit-message (Gal Pressman <gal@nvidia.com>) Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Signed-off-by: Christoph Paasch <cpaasch@openai.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250715-cpaasch-pf-925-investigate-incorrect-gso_size-on-cx-7-nic-v2-1-e06c3475f3ac@openai.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc6-2). No conflicts. Adjacent changes: drivers/net/wireless/mediatek/mt76/mt7925/mcu.c c701574c5412 ("wifi: mt76: mt7925: fix invalid array index in ssid assignment during hw scan") b3a431fe2e39 ("wifi: mt76: mt7925: fix off by one in mt7925_mcu_hw_scan()") drivers/net/wireless/mediatek/mt76/mt7996/mac.c 62da647a2b20 ("wifi: mt76: mt7996: Add MLO support to mt7996_tx_check_aggr()") dc66a129adf1 ("wifi: mt76: add a wrapper for wcid access with validation") drivers/net/wireless/mediatek/mt76/mt7996/main.c 3dd6f67c669c ("wifi: mt76: Move RCU section in mt7996_mcu_add_rate_ctrl()") 8989d8e90f5f ("wifi: mt76: mt7996: Do not set wcid.sta to 1 in mt7996_mac_sta_event()") net/mac80211/cfg.c 58fcb1b4287c ("wifi: mac80211: reject VHT opmode for unsupported channel widths") 037dc18ac3fb ("wifi: mac80211: add support for storing station S1G capabilities") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net/mlx5e: Add new prio for promiscuous modeJianbo Liu
An optimization for promiscuous mode adds a high-priority steering table with a single catch-all rule to steer all traffic directly to the TTC table. However, a gap exists between the creation of this table and the insertion of the catch-all rule. Packets arriving in this brief window would miss as no rule was inserted yet, unnecessarily incrementing the 'rx_steer_missed_packets' counter and dropped. This patch resolves the issue by introducing a new prio for this table, placing it between MLX5E_TC_PRIO and MLX5E_NIC_PRIO. By doing so, packets arriving during the window now fall through to the next prio (at MLX5E_NIC_PRIO) instead of being dropped. Fixes: 1c46d7409f30 ("net/mlx5e: Optimize promiscuous mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net/mlx5e: Fix race between DIM disable and net_dim()Carolina Jubran
There's a race between disabling DIM and NAPI callbacks using the dim pointer on the RQ or SQ. If NAPI checks the DIM state bit and sees it still set, it assumes `rq->dim` or `sq->dim` is valid. But if DIM gets disabled right after that check, the pointer might already be set to NULL, leading to a NULL pointer dereference in net_dim(). Fix this by calling `synchronize_net()` before freeing the DIM context. This ensures all in-progress NAPI callbacks are finished before the pointer is cleared. Kernel log: BUG: kernel NULL pointer dereference, address: 0000000000000000 ... RIP: 0010:net_dim+0x23/0x190 ... Call Trace: <TASK> ? __die+0x20/0x60 ? page_fault_oops+0x150/0x3e0 ? common_interrupt+0xf/0xa0 ? sysvec_call_function_single+0xb/0x90 ? exc_page_fault+0x74/0x130 ? asm_exc_page_fault+0x22/0x30 ? net_dim+0x23/0x190 ? mlx5e_poll_ico_cq+0x41/0x6f0 [mlx5_core] ? sysvec_apic_timer_interrupt+0xb/0x90 mlx5e_handle_rx_dim+0x92/0xd0 [mlx5_core] mlx5e_napi_poll+0x2cd/0xac0 [mlx5_core] ? mlx5e_poll_ico_cq+0xe5/0x6f0 [mlx5_core] busy_poll_stop+0xa2/0x200 ? mlx5e_napi_poll+0x1d9/0xac0 [mlx5_core] ? mlx5e_trigger_irq+0x130/0x130 [mlx5_core] __napi_busy_loop+0x345/0x3b0 ? sysvec_call_function_single+0xb/0x90 ? asm_sysvec_call_function_single+0x16/0x20 ? sysvec_apic_timer_interrupt+0xb/0x90 ? pcpu_free_area+0x1e4/0x2e0 napi_busy_loop+0x11/0x20 xsk_recvmsg+0x10c/0x130 sock_recvmsg+0x44/0x70 __sys_recvfrom+0xbc/0x130 ? __schedule+0x398/0x890 __x64_sys_recvfrom+0x20/0x30 do_syscall_64+0x4c/0x100 entry_SYSCALL_64_after_hwframe+0x4b/0x53 ... ---[ end trace 0000000000000000 ]--- ... ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]--- Fixes: 445a25f6e1a2 ("net/mlx5e: Support updating coalescing configuration without resetting channels") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-11net/mlx5: Reset bw_share field when changing a node's parentCarolina Jubran
When changing a node's parent, its scheduling element is destroyed and re-created with bw_share 0. However, the node's bw_share field was not updated accordingly. Set the node's bw_share to 0 after re-creation to keep the software state in sync with the firmware configuration. Fixes: 9c7bbf4c3304 ("net/mlx5: Add support for setting parent of nodes") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: RX, Remove unnecessary RQT redirectsTariq Toukan
RQTs (Receive Queue Table) should redirect traffic to the channels' RQs when they're active. Otherwise, redirect to the designated "drop RQ". RQTs are created in "inactive" state, pointing to the "drop RQ". In activate and de-activate flows, do not "deactivate" the rest of RQTs (beyond the num of channels), as they are already inactive. This cuts down unnecessary execution of FW commands (MODIFY_RQT), and improves the latency of open/close channels or configuration change. Perf: NIC: Connect-X7. Configuration: 1 combined channel, max num channels 248. Measure time for "interface up + interface down". Before: 0.313 sec After: 0.057 sec (5.5x faster) 247 MODIFY_RQT commands saved in interface up. 247 MODIFY_RQT commands saved in interface down. Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5: Warn when write combining is not supportedMaor Gottlieb
Warn if write combining is not supported, as it can impact latency. Add the warning message to be printed only when the driver actually run the test and detect unsupported state, rather than when inheriting parent's result for SFs. Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Michael Guralnik <michaelgur@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: Replace recursive VLAN push handling with an iterative loopGal Pressman
mlx5e_tc_act_vlan_add_push_action() uses tail-recursion to walk through a stack of VLAN devices. There is no need for a complicated recursion with unnecessary stack consumption and less obvious code flow, rewrite the function so that it uses a do while loop instead. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: CT: extract a memcmp from a spinlock sectionCosmin Ratiu
This reduces the time the lock is held and reduces contention. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/mlx5e: Remove unused VLAN insertion logic in TX pathCarolina Jubran
The VLAN insertion capability (`wqe_vlan_insert`) was never enabled on all mlx5 devices. When VLAN TX offload is advertised but this capability is not supported, the driver uses inline headers to insert the VLAN tag. To support this, the driver used to set the `MLX5E_SQ_STATE_VLAN_NEED_L2_INLINE` bit to enforce L2 inline mode when `wqe_vlan_insert` was not supported. Since the capability is disabled on all devices, this logic was always active, and the SQ flag has become redundant. L2 inline is enforced unconditionally for VLAN-tagged packets. The `skb_vlan_tag_present()` check in the else-if block of `mlx5e_sq_xmit_wqe()` is never true by this point in the TX flow, as the VLAN tag has already been inserted by the driver using inline headers. As a result, this code is never executed. Remove the redundant SQ state, dead VLAN insertion code block, and related logic. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1752009387-13300-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-07-08 The following pull-request contains common mlx5 updates for your *net-next* tree. v2: https://lore.kernel.org/1751574385-24672-1-git-send-email-tariqt@nvidia.com * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Check device memory pointer before usage net/mlx5: fs, fix RDMA TRANSPORT init cleanup flow net/mlx5: Add IFC bits for PCIe Congestion Event object net/mlx5: Small refactor for general object capabilities net/mlx5: fs, add multiple prios to RDMA TRANSPORT steering domain ==================== Link: https://patch.msgid.link/1752002102-11316-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08eth: mlx5: migrate to the *_rxfh_context opsJakub Kicinski
Convert mlx5 to dedicated RXFH ops. This is a fairly shallow conversion, TBH, most of the driver code stays as is, but we let the core allocate the context ID for the driver. mlx5e_rx_res_rss_get_rxfh() and friends are made void, since core only calls the driver for context 0. The second call is right after context creation so it must exist (tm). Tested with drivers/net/hw/rss_ctx.py on MCX6. Reviewed-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250707184115.2285277-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net/mlx5: Fix spelling mistake "disabliing" -> "disabling"Colin Ian King
There is a spelling mistake in a NL_SET_ERR_MSG_MOD message. Fix it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703102219.1248399-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07net/mlx5: Add HWS as secondary steering modeMoshe Shemesh
Add HW Steering (HWS) as a secondary option for device steering mode. If the device does not support SW Steering (SWS), HW Steering will be used as the default, provided it is supported. FW Steering will now be selected as the default only if both HWS and SWS are unavailable. Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-11-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07net/mlx5: HWS, Shrink empty matchersYevgeny Kliteynik
Matcher size is dynamic: it starts at initial size, and then it grows through rehash as more and more rules are added to this matcher. When rules are deleted, matcher's size is not decreased. Rehash approach is greedy. The idea is: if the matcher got to a certain size at some point, chances are - it will get to this size again, so it is better to avoid costly rehash operations whenever possible. However, when all the rules of the matcher are deleted, this should be viewed as special case. If the matcher actually got to the point where it has zero rules, it might be an indication that some usecase from the past is no longer happening. This is where some ICM can be freed. This patch handles this case: when a number of rules in a matcher goes down to zero, the matcher's tables are shrunk to the initial size. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-10-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07net/mlx5: HWS, Rearrange to prevent forward declarationYevgeny Kliteynik
As a preparation for the following patch that will add support for shrinking empty matchers, rearrange the code to prevent forward declaration of functions. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07net/mlx5: HWS, Track matcher sizes individuallyVlad Dogaru
Track and grow matcher sizes individually for RX and TX RTCs. This allows RX-only or TX-only use cases to effectively halve the device resources they use. For testing we used a simple module that inserts 1M RX-only rules and measured the number of pages the device requests, and memory usage as reported by `free -h`. Pages Memory Before this patch: 300k 1.5GiB After this patch: 160k 900MiB Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250703185431.445571-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07net/mlx5: HWS, Decouple matcher RX and TX sizesVlad Dogaru
Kernel HWS only uses FDB tables and, as such, creates two lower level containers (RTCs) for each matcher: one for RX and one for TX. Allow these RTCs to differ in size by converting the size part of the matcher attribute to a two element array. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250703185431.445571-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>