summaryrefslogtreecommitdiff
path: root/drivers/net/ethernet
AgeCommit message (Collapse)Author
2025-08-28net: macb: Disable clocks onceNeil Mandir
When the driver is removed the clocks are disabled twice: once in macb_remove and a second time by runtime pm. Disable wakeup in remove so all the clocks are disabled and skip the second call to macb_clks_disable. Always suspend the device as we always set it active in probe. Fixes: d54f89af6cc4 ("net: macb: Add pm runtime support") Signed-off-by: Neil Mandir <neil.mandir@seco.com> Co-developed-by: Sean Anderson <sean.anderson@linux.dev> Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Link: https://patch.msgid.link/20250826143022.935521-1-sean.anderson@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-27fbnic: Move phylink resume out of service_task and into open/closeAlexander Duyck
The fbnic driver was presenting with the following locking assert coming out of a PM resume: [ 42.208116][ T164] RTNL: assertion failed at drivers/net/phy/phylink.c (2611) [ 42.208492][ T164] WARNING: CPU: 1 PID: 164 at drivers/net/phy/phylink.c:2611 phylink_resume+0x190/0x1e0 [ 42.208872][ T164] Modules linked in: [ 42.209140][ T164] CPU: 1 UID: 0 PID: 164 Comm: bash Not tainted 6.17.0-rc2-virtme #134 PREEMPT(full) [ 42.209496][ T164] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.17.0-5.fc42 04/01/2014 [ 42.209861][ T164] RIP: 0010:phylink_resume+0x190/0x1e0 [ 42.210057][ T164] Code: 83 e5 01 0f 85 b0 fe ff ff c6 05 1c cd 3e 02 01 90 ba 33 0a 00 00 48 c7 c6 20 3a 1d a5 48 c7 c7 e0 3e 1d a5 e8 21 b8 90 fe 90 <0f> 0b 90 90 e9 86 fe ff ff e8 42 ea 1f ff e9 e2 fe ff ff 48 89 ef [ 42.210708][ T164] RSP: 0018:ffffc90000affbd8 EFLAGS: 00010296 [ 42.210983][ T164] RAX: 0000000000000000 RBX: ffff8880078d8400 RCX: 0000000000000000 [ 42.211235][ T164] RDX: 0000000000000000 RSI: 1ffffffff4f10938 RDI: 0000000000000001 [ 42.211466][ T164] RBP: 0000000000000000 R08: ffffffffa2ae79ea R09: fffffbfff4b3eb84 [ 42.211707][ T164] R10: 0000000000000003 R11: 0000000000000000 R12: ffff888007ad8000 [ 42.211997][ T164] R13: 0000000000000002 R14: ffff888006a18800 R15: ffffffffa34c59e0 [ 42.212234][ T164] FS: 00007f0dc8e39740(0000) GS:ffff88808f51f000(0000) knlGS:0000000000000000 [ 42.212505][ T164] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 42.212704][ T164] CR2: 00007f0dc8e9fe10 CR3: 000000000b56d003 CR4: 0000000000772ef0 [ 42.213227][ T164] PKRU: 55555554 [ 42.213366][ T164] Call Trace: [ 42.213483][ T164] <TASK> [ 42.213565][ T164] __fbnic_pm_attach.isra.0+0x8e/0xa0 [ 42.213725][ T164] pci_reset_function+0x116/0x1d0 [ 42.213895][ T164] reset_store+0xa0/0x100 [ 42.214025][ T164] ? pci_dev_reset_attr_is_visible+0x50/0x50 [ 42.214221][ T164] ? sysfs_file_kobj+0xc1/0x1e0 [ 42.214374][ T164] ? sysfs_kf_write+0x65/0x160 [ 42.214526][ T164] kernfs_fop_write_iter+0x2f8/0x4c0 [ 42.214677][ T164] ? kernfs_vma_page_mkwrite+0x1f0/0x1f0 [ 42.214836][ T164] new_sync_write+0x308/0x6f0 [ 42.214987][ T164] ? __lock_acquire+0x34c/0x740 [ 42.215135][ T164] ? new_sync_read+0x6f0/0x6f0 [ 42.215288][ T164] ? lock_acquire.part.0+0xbc/0x260 [ 42.215440][ T164] ? ksys_write+0xff/0x200 [ 42.215590][ T164] ? perf_trace_sched_switch+0x6d0/0x6d0 [ 42.215742][ T164] vfs_write+0x65e/0xbb0 [ 42.215876][ T164] ksys_write+0xff/0x200 [ 42.215994][ T164] ? __ia32_sys_read+0xc0/0xc0 [ 42.216141][ T164] ? do_user_addr_fault+0x269/0x9f0 [ 42.216292][ T164] ? rcu_is_watching+0x15/0xd0 [ 42.216442][ T164] do_syscall_64+0xbb/0x360 [ 42.216591][ T164] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 42.216784][ T164] RIP: 0033:0x7f0dc8ea9986 A bit of digging showed that we were invoking the phylink_resume as a part of the fbnic_up path when we were enabling the service task while not holding the RTNL lock. We should be enabling this sooner as a part of the ndo_open path and then just letting the service task come online later. This will help to enforce the correct locking and brings the phylink interface online at the same time as the network interface, instead of at a later time. I tested this on QEMU to verify this was working by putting the system to sleep using "echo mem > /sys/power/state" to put the system to sleep in the guest and then using the command "system_wakeup" in the QEMU monitor. Fixes: 69684376eed5 ("eth: fbnic: Add link detection") Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/175616257316.1963577.12238158800417771119.stgit@ahduyck-xeon-server.home.arpa Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27fbnic: Fixup rtnl_lock and devl_lock handling related to mailbox codeAlexander Duyck
The exception handling path for the __fbnic_pm_resume function had a bug in that it was taking the devlink lock and then exiting to exception handling instead of waiting until after it released the lock to do so. In order to handle that I am swapping the placement of the unlock and the exception handling jump to label so that we don't trigger a deadlock by holding the lock longer than we need to. In addition this change applies the same ordering to the rtnl_lock/unlock calls in the same function as it should make the code easier to follow if it adheres to a consistent pattern. Fixes: 82534f446daa ("eth: fbnic: Add devlink dev flash support") Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/175616256667.1963577.5543500806256052549.stgit@ahduyck-xeon-server.home.arpa Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net: stmmac: Set CIC bit only for TX queues with COERohan G Thomas
Currently, in the AF_XDP transmit paths, the CIC bit of TX Desc3 is set for all packets. Setting this bit for packets transmitting through queues that don't support checksum offloading causes the TX DMA to get stuck after transmitting some packets. This patch ensures the CIC bit of TX Desc3 is set only if the TX queue supports checksum offloading. Fixes: 132c32ee5bc0 ("net: stmmac: Add TX via XDP zero-copy socket") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20250825-xgmac-minor-fixes-v3-3-c225fe4444c0@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net: stmmac: xgmac: Correct supported speed modesRohan G Thomas
Correct supported speed modes as per the XGMAC databook. Commit 9cb54af214a7 ("net: stmmac: Fix IP-cores specific MAC capabilities") removes support for 10M, 100M and 1000HD. 1000HD is not supported by XGMAC IP, but it does support 10M and 100M FD mode for XGMAC version >= 2_20, and it also supports 10M and 100M HD mode if the HDSEL bit is set in the MAC_HW_FEATURE0 reg. This commit enables support for 10M and 100M speed modes for XGMAC IP based on XGMAC version and MAC capabilities. Fixes: 9cb54af214a7 ("net: stmmac: Fix IP-cores specific MAC capabilities") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Link: https://patch.msgid.link/20250825-xgmac-minor-fixes-v3-2-c225fe4444c0@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net: stmmac: xgmac: Do not enable RX FIFO Overflow interruptsRohan G Thomas
Enabling RX FIFO Overflow interrupts is counterproductive and causes an interrupt storm when RX FIFO overflows. Disabling this interrupt has no side effect and eliminates interrupt storms when the RX FIFO overflows. Commit 8a7cb245cf28 ("net: stmmac: Do not enable RX FIFO overflow interrupts") disables RX FIFO overflow interrupts for DWMAC4 IP and removes the corresponding handling of this interrupt. This patch is doing the same thing for XGMAC IP. Fixes: 2142754f8b9c ("net: stmmac: Add MAC related callbacks for XGMAC2") Signed-off-by: Rohan G Thomas <rohan.g.thomas@altera.com> Reviewed-by: Matthew Gerlach <matthew.gerlach@altera.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250825-xgmac-minor-fixes-v3-1-c225fe4444c0@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5e: Set local Xoff after FW updateAlexei Lazar
The local Xoff value is being set before the firmware (FW) update. In case of a failure where the FW is not updated with the new value, there is no fallback to the previous value. Update the local Xoff value after the FW has been successfully set. Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-12-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5e: Update and set Xon/Xoff upon port speed setAlexei Lazar
Xon/Xoff sizes are derived from calculations that include the port speed. These settings need to be updated and applied whenever the port speed is changed. The port speed is typically set after the physical link goes down and is negotiated as part of the link-up process between the two connected interfaces. Xon/Xoff parameters being updated at the point where the new negotiated speed is established. Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-11-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5e: Update and set Xon/Xoff upon MTU setAlexei Lazar
Xon/Xoff sizes are derived from calculation that include the MTU size. Set Xon/Xoff when MTU is set. If Xon/Xoff fails, set the previous MTU. Fixes: 0696d60853d5 ("net/mlx5e: Receive buffer configuration") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-10-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: Prevent flow steering mode changes in switchdev modeMoshe Shemesh
Changing flow steering modes is not allowed when eswitch is in switchdev mode. This fix ensures that any steering mode change, including to firmware steering, is correctly blocked while eswitch mode is switchdev. Fixes: e890acd5ff18 ("net/mlx5: Add devlink flow_steering_mode parameter") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: Nack sync reset when SFs are presentMoshe Shemesh
If PF (Physical Function) has SFs (Sub-Functions), since the SFs are not taking part in the synchronization flow, sync reset can lead to fatal error on the SFs, as the function will be closed unexpectedly from the SF point of view. Add a check to prevent sync reset when there are SFs on a PF device which is not ECPF, as ECPF is teardowned gracefully before reset. Fixes: 92501fa6e421 ("net/mlx5: Ack on sync_reset_request only if PF can do reset_now") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: Fix lockdep assertion on sync reset unload eventMoshe Shemesh
Fix lockdep assertion triggered during sync reset unload event. When the sync reset flow is initiated using the devlink reload fw_activate option, the PF already holds the devlink lock while handling unload event. In this case, delegate sync reset unload event handling back to the devlink callback process to avoid double-locking and resolve the lockdep warning. Kernel log: WARNING: CPU: 9 PID: 1578 at devl_assert_locked+0x31/0x40 [...] Call Trace: <TASK> mlx5_unload_one_devl_locked+0x2c/0xc0 [mlx5_core] mlx5_sync_reset_unload_event+0xaf/0x2f0 [mlx5_core] process_one_work+0x222/0x640 worker_thread+0x199/0x350 kthread+0x10b/0x230 ? __pfx_worker_thread+0x10/0x10 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x8e/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Fixes: 7a9770f1bfea ("net/mlx5: Handle sync reset unload event") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: Reload auxiliary drivers on fw_activateMoshe Shemesh
The devlink reload fw_activate command performs firmware activation followed by driver reload, while devlink reload driver_reinit triggers only driver reload. However, the driver reload logic differs between the two modes, as on driver_reinit mode mlx5 also reloads auxiliary drivers, while in fw_activate mode the auxiliary drivers are suspended where applicable. Additionally, following the cited commit, if the device has multiple PFs, the behavior during fw_activate may vary between PFs: one PF may suspend auxiliary drivers, while another reloads them. Align devlink dev reload fw_activate behavior with devlink dev reload driver_reinit, to reload all auxiliary drivers. Fixes: 72ed5d5624af ("net/mlx5: Suspend auxiliary devices only in case of PCI device suspend") Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Akiva Goldberger <agoldberger@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: HWS, Fix pattern destruction in mlx5hws_pat_get_pattern error pathLama Kayal
In mlx5hws_pat_get_pattern(), when mlx5hws_pat_add_pattern_to_cache() fails, the function attempts to clean up the pattern created by mlx5hws_cmd_header_modify_pattern_create(). However, it incorrectly uses *pattern_id which hasn't been set yet, instead of the local ptrn_id variable that contains the actual pattern ID. This results in attempting to destroy a pattern using uninitialized data from the output parameter, rather than the valid pattern ID returned by the firmware. Use ptrn_id instead of *pattern_id in the cleanup path to properly destroy the created pattern. Fixes: aefc15a0fa1c ("net/mlx5: HWS, added modify header pattern and args handling") Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: HWS, Fix uninitialized variables in mlx5hws_pat_calc_nop error flowLama Kayal
In mlx5hws_pat_calc_nop(), src_field and dst_field are passed to hws_action_modify_get_target_fields() which should set their values. However, if an invalid action type is encountered, these variables remain uninitialized and are later used to update prev_src_field and prev_dst_field. Initialize both variables to INVALID_FIELD to ensure they have defined values in all code paths. Fixes: 01e035fd0380 ("net/mlx5: HWS, handle modify header actions dependency") Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: HWS, Fix memory leak in hws_action_get_shared_stc_nic error flowLama Kayal
When an invalid stc_type is provided, the function allocates memory for shared_stc but jumps to unlock_and_out without freeing it, causing a memory leak. Fix by jumping to free_shared_stc label instead to ensure proper cleanup. Fixes: 504e536d9010 ("net/mlx5: HWS, added actions handling") Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net/mlx5: HWS, Fix memory leak in hws_pool_buddy_init error pathLama Kayal
In the error path of hws_pool_buddy_init(), the buddy allocator cleanup doesn't free the allocator structure itself, causing a memory leak. Add the missing kfree() to properly release all allocated memory. Fixes: c61afff94373 ("net/mlx5: HWS, added memory management handling") Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250825143435.598584-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2025-08-25 (ice, ixgbe) For ice: Emil adds a check to ensure auxiliary device was created before tear down to prevent NULL a pointer dereference. Jake reworks flow for failed Tx scheduler configuration to allow for proper recovery and operation. He also adjusts ice_adapter index for E825C devices as use of DSN is incompatible with this device. Michal corrects tracking of buffer allocation failure in ice_clean_rx_irq(). For ixgbe: Jedrzej adds __packed attribute to ixgbe_orom_civd_info to compatibility with device OROM data. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: ixgbe: fix ixgbe_orom_civd_info struct layout ice: fix incorrect counter for buffer allocation failures ice: use fixed adapter index for E825C embedded devices ice: don't leave device non-functional if Tx scheduler config fails ice: fix NULL pointer dereference in ice_unplug_aux_dev() on reset ==================== Link: https://patch.msgid.link/20250825215019.3442873-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26bnxt_en: Fix stats context reservation logicMichael Chan
The HW resource reservation logic allows the L2 driver to use the RoCE resources if the RoCE driver is not registered. When calculating the stats contexts available for L2, we should not blindly subtract the stats contexts reserved for RoCE unless the RoCE driver is registered. This bug may cause the L2 rings to be less than the number requested when we are close to running out of stats contexts. Fixes: 2e4592dc9bee ("bnxt_en: Change MSIX/NQs allocation policy") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250825175927.459987-4-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26bnxt_en: Adjust TX rings if reservation is less than requestedMichael Chan
Before we accept an ethtool request to increase a resource (such as rings), we call the FW to check that the requested resource is likely available first before we commit. But it is still possible that the actual reservation or allocation can fail. The existing code is missing the logic to adjust the TX rings in case the reserved TX rings are less than requested. Add a warning message (a similar message for RX rings already exists) and add the logic to adjust the TX rings. Without this fix, the number of TX rings reported to the stack can exceed the actual TX rings and ethtool -l will report more than the actual TX rings. Fixes: 674f50a5b026 ("bnxt_en: Implement new method to reserve rings.") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250825175927.459987-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26bnxt_en: Fix memory corruption when FW resources change during ifdownSreekanth Reddy
bnxt_set_dflt_rings() assumes that it is always called before any TC has been created. So it doesn't take bp->num_tc into account and assumes that it is always 0 or 1. In the FW resource or capability change scenario, the FW will return flags in bnxt_hwrm_if_change() that will cause the driver to reinitialize and call bnxt_cancel_reservations(). This will lead to bnxt_init_dflt_ring_mode() calling bnxt_set_dflt_rings() and bp->num_tc may be greater than 1. This will cause bp->tx_ring[] to be sized too small and cause memory corruption in bnxt_alloc_cp_rings(). Fix it by properly scaling the TX rings by bp->num_tc in the code paths mentioned above. Add 2 helper functions to determine bp->tx_nr_rings and bp->tx_nr_rings_per_tc. Fixes: ec5d31e3c15d ("bnxt_en: Handle firmware reset status during IF_UP.") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250825175927.459987-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-26net: macb: Fix offset error in gem_update_statsSean Anderson
hw_stats now has only one variable for tx_octets/rx_octets, so we should only increment p once, not twice. This would cause the statistics to be reported under the wrong categories in `ethtool -S --all-groups` (which uses hw_stats) but not `ethtool -S` (which uses ethtool_stats). Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Fixes: f6af690a295a ("net: cadence: macb: Report standard stats") Link: https://patch.msgid.link/20250825172134.681861-1-sean.anderson@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25net: dlink: fix multicast stats being counted incorrectlyYeounsu Moon
`McstFramesRcvdOk` counts the number of received multicast packets, and it reports the value correctly. However, reading `McstFramesRcvdOk` clears the register to zero. As a result, the driver was reporting only the packets since the last read, instead of the accumulated total. Fix this by updating the multicast statistics accumulatively instaed of instantaneously. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Tested-on: D-Link DGE-550T Rev-A3 Signed-off-by: Yeounsu Moon <yyyynoom@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250823182927.6063-3-yyyynoom@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25Octeontx2-af: Fix NIX X2P calibration failuresHariprasad Kelam
Before configuring the NIX block, the AF driver initiates the "NIX block X2P bus calibration" and verifies that NIX interfaces such as CGX and LBK are active and functioning correctly. On few silicon variants(CNF10KA and CNF10KB), X2P calibration failures have been observed on some CGX blocks that are not mapped to the NIX block. Since both NIX-mapped and non-NIX-mapped CGX blocks share the same VENDOR,DEVICE,SUBSYS_DEVID, it's not possible to skip probe based on these parameters. This patch introuduces "is_cgx_mapped_to_nix" API to detect and skip probe of non NIX mapped CGX blocks. Fixes: aba53d5dbcea ("octeontx2-af: NIX block admin queue init") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20250822105805.2236528-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-25ixgbe: fix ixgbe_orom_civd_info struct layoutJedrzej Jagielski
The current layout of struct ixgbe_orom_civd_info causes incorrect data storage due to compiler-inserted padding. This results in issues when writing OROM data into the structure. Add the __packed attribute to ensure the structure layout matches the expected binary format without padding. Fixes: 70db0788a262 ("ixgbe: read the OROM version information") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-25ice: fix incorrect counter for buffer allocation failuresMichal Kubiak
Currently, the driver increments `alloc_page_failed` when buffer allocation fails in `ice_clean_rx_irq()`. However, this counter is intended for page allocation failures, not buffer allocation issues. This patch corrects the counter by incrementing `alloc_buf_failed` instead, ensuring accurate statistics reporting for buffer allocation failures. Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side") Reported-by: Jacob Keller <jacob.e.keller@intel.com> Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Priya Singh <priyax.singh@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-25ice: use fixed adapter index for E825C embedded devicesJacob Keller
The ice_adapter structure is used by the ice driver to connect multiple physical functions of a device in software. It was introduced by commit 0e2bddf9e5f9 ("ice: add ice_adapter for shared data across PFs on the same NIC") and is primarily used for PTP support, as well as for handling certain cross-PF synchronization. The original design of ice_adapter used PCI address information to determine which devices should be connected. This was extended to support E825C devices by commit fdb7f54700b1 ("ice: Initial support for E825C hardware in ice_adapter"), which used the device ID for E825C devices instead of the PCI address. Later, commit 0093cb194a75 ("ice: use DSN instead of PCI BDF for ice_adapter index") replaced the use of Bus/Device/Function addressing with use of the device serial number. E825C devices may appear in "Dual NAC" configuration which has multiple physical devices tied to the same clock source and which need to use the same ice_adapter. Unfortunately, each "NAC" has its own NVM which has its own unique Device Serial Number. Thus, use of the DSN for connecting ice_adapter does not work properly. It "worked" in the pre-production systems because the DSN was not initialized on the test NVMs and all the NACs had the same zero'd serial number. Since we cannot rely on the DSN, lets fall back to the logic in the original E825C support which used the device ID. This is safe for E825C only because of the embedded nature of the device. It isn't a discreet adapter that can be plugged into an arbitrary system. All E825C devices on a given system are connected to the same clock source and need to be configured through the same PTP clock. To make this separation clear, reserve bit 63 of the 64-bit index values as a "fixed index" indicator. Always clear this bit when using the device serial number as an index. For E825C, use a fixed value defined as the 0x579C E825C backplane device ID bitwise ORed with the fixed index indicator. This is slightly different than the original logic of just using the device ID directly. Doing so prevents a potential issue with systems where only one of the NACs is connected with an external PHY over SGMII. In that case, one NAC would have the E825C_SGMII device ID, but the other would not. Separate the determination of the full 64-bit index from the 32-bit reduction logic. Provide both ice_adapter_index() and a wrapping ice_adapter_xa_index() which handles reducing the index to a long on 32-bit systems. As before, cache the full index value in the adapter structure to warn about collisions. This fixes issues with E825C not initializing PTP on both NACs, due to failure to connect the appropriate devices to the same ice_adapter. Fixes: 0093cb194a75 ("ice: use DSN instead of PCI BDF for ice_adapter index") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-25ice: don't leave device non-functional if Tx scheduler config failsJacob Keller
The ice_cfg_tx_topo function attempts to apply Tx scheduler topology configuration based on NVM parameters, selecting either a 5 or 9 layer topology. As part of this flow, the driver acquires the "Global Configuration Lock", which is a hardware resource associated with programming the DDP package to the device. This "lock" is implemented by firmware as a way to guarantee that only one PF can program the DDP for a device. Unlike a traditional lock, once a PF has acquired this lock, no other PF will be able to acquire it again (including that PF) until a CORER of the device. Future requests to acquire the lock report that global configuration has already completed. The following flow is used to program the Tx topology: * Read the DDP package for scheduler configuration data * Acquire the global configuration lock * Program Tx scheduler topology according to DDP package data * Trigger a CORER which clears the global configuration lock This is followed by the flow for programming the DDP package: * Acquire the global configuration lock (again) * Download the DDP package to the device * Release the global configuration lock. However, if configuration of the Tx topology fails, (i.e. ice_get_set_tx_topo returns an error code), the driver exits ice_cfg_tx_topo() immediately, and fails to trigger CORER. While the global configuration lock is held, the firmware rejects most AdminQ commands, as it is waiting for the DDP package download (or Tx scheduler topology programming) to occur. The current driver flows assume that the global configuration lock has been reset by CORER after programming the Tx topology. Thus, the same PF attempts to acquire the global lock again, and fails. This results in the driver reporting "an unknown error occurred when loading the DDP package". It then attempts to enter safe mode, but ultimately fails to finish ice_probe() since nearly all AdminQ command report error codes, and the driver stops loading the device at some point during its initialization. The only currently known way that ice_get_set_tx_topo() can fail is with certain older DDP packages which contain invalid topology configuration, on firmware versions which strictly validate this data. The most recent releases of the DDP have resolved the invalid data. However, it is still poor practice to essentially brick the device, and prevent access to the device even through safe mode or recovery mode. It is also plausible that this command could fail for some other reason in the future. We cannot simply release the global lock after a failed call to ice_get_set_tx_topo(). Releasing the lock indicates to firmware that global configuration (downloading of the DDP) has completed. Future attempts by this or other PFs to load the DDP will fail with a report that the DDP package has already been downloaded. Then, PFs will enter safe mode as they realize that the package on the device does not meet the minimum version requirement to load. The reported error messages are confusing, as they indicate the version of the default "safe mode" package in the NVM, rather than the version of the file loaded from /lib/firmware. Instead, we need to trigger CORER to clear global configuration. This is the lowest level of hardware reset which clears the global configuration lock and related state. It also clears any already downloaded DDP. Crucially, it does *not* clear the Tx scheduler topology configuration. Refactor ice_cfg_tx_topo() to always trigger a CORER after acquiring the global lock, regardless of success or failure of the topology configuration. We need to re-initialize the HW structure when we trigger the CORER. Thus, it makes sense for this to be the responsibility of ice_cfg_tx_topo() rather than its caller, ice_init_tx_topology(). This avoids needless re-initialization in cases where we don't attempt to update the Tx scheduler topology, such as if it has already been programmed. There is one catch: failure to re-initialize the HW struct should stop ice_probe(). If this function fails, we won't have a valid HW structure and cannot ensure the device is functioning properly. To handle this, ensure ice_cfg_tx_topo() returns a limited set of error codes. Set aside one specifically, -ENODEV, to indicate that the ice_init_tx_topology() should fail and stop probe. Other error codes indicate failure to apply the Tx scheduler topology. This is treated as a non-fatal error, with an informational message informing the system administrator that the updated Tx topology did not apply. This allows the device to load and function with the default Tx scheduler topology, rather than failing to load entirely. Note that this use of CORER will not result in loops with future PFs attempting to also load the invalid Tx topology configuration. The first PF will acquire the global configuration lock as part of programming the DDP. Each PF after this will attempt to acquire the global lock as part of programming the Tx topology, and will fail with the indication from firmware that global configuration is already complete. Tx scheduler topology configuration is only performed during driver init (probe or devlink reload) and not during cleanup for a CORER that happens after probe completes. Fixes: 91427e6d9030 ("ice: Support 5 layer topology") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-25ice: fix NULL pointer dereference in ice_unplug_aux_dev() on resetEmil Tantilov
Issuing a reset when the driver is loaded without RDMA support, will results in a crash as it attempts to remove RDMA's non-existent auxbus device: echo 1 > /sys/class/net/<if>/device/reset BUG: kernel NULL pointer dereference, address: 0000000000000008 ... RIP: 0010:ice_unplug_aux_dev+0x29/0x70 [ice] ... Call Trace: <TASK> ice_prepare_for_reset+0x77/0x260 [ice] pci_dev_save_and_disable+0x2c/0x70 pci_reset_function+0x88/0x130 reset_store+0x5a/0xa0 kernfs_fop_write_iter+0x15e/0x210 vfs_write+0x273/0x520 ksys_write+0x6b/0xe0 do_syscall_64+0x79/0x3b0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ice_unplug_aux_dev() checks pf->cdev_info->adev for NULL pointer, but pf->cdev_info will also be NULL, leading to the deref in the trace above. Introduce a flag to be set when the creation of the auxbus device is successful, to avoid multiple NULL pointer checks in ice_unplug_aux_dev(). Fixes: c24a65b6a27c7 ("iidc/ice/irdma: Update IDC to support multiple consumers") Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-22Merge branch '200GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue Tony Nguyen says: ==================== idpf: replace Tx flow scheduling buffer ring with buffer pool Joshua Hay says: This series fixes a stability issue in the flow scheduling Tx send/clean path that results in a Tx timeout. The existing guardrails in the Tx path were not sufficient to prevent the driver from reusing completion tags that were still in flight (held by the HW). This collision would cause the driver to erroneously clean the wrong packet thus leaving the descriptor ring in a bad state. The main point of this fix is to replace the flow scheduling buffer ring with a large pool/array of buffers. The completion tag then simply is the index into this array. The driver tracks the free tags and pulls the next free one from a refillq. The cleaning routines simply use the completion tag from the completion descriptor to index into the array to quickly find the buffers to clean. All of the code to support this is added first to ensure traffic still passes with each patch. The final patch then removes all of the obsolete stashing code. * '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue: idpf: remove obsolete stashing code idpf: stop Tx if there are insufficient buffer resources idpf: replace flow scheduling buffer ring with buffer pool idpf: simplify and fix splitq Tx packet rollback error path idpf: improve when to set RE bit logic idpf: add support for Tx refillqs in flow scheduling mode ==================== Link: https://patch.msgid.link/20250821180100.401955-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-22Octeontx2-vf: Fix max packet length errorsHariprasad Kelam
Once driver submits the packets to the hardware, each packet traverse through multiple transmit levels in the following order: SMQ -> TL4 -> TL3 -> TL2 -> TL1 The SMQ supports configurable minimum and maximum packet sizes. It enters to a hang state, if driver submits packets with out of bound lengths. To avoid the same, implement packet length validation before submitting packets to the hardware. Increment tx_dropped counter on failure. Fixes: 3184fb5ba96e ("octeontx2-vf: Virtual function driver support") Fixes: 22f858796758 ("octeontx2-pf: Add basic net_device_ops") Fixes: 3ca6c4c882a7 ("octeontx2-pf: Add packet transmission support") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20250821062528.1697992-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net: macb: fix unregister_netdev call order in macb_remove()luoguangfei
When removing a macb device, the driver calls phy_exit() before unregister_netdev(). This leads to a WARN from kernfs: ------------[ cut here ]------------ kernfs: can not remove 'attached_dev', no directory WARNING: CPU: 1 PID: 27146 at fs/kernfs/dir.c:1683 Call trace: kernfs_remove_by_name_ns+0xd8/0xf0 sysfs_remove_link+0x24/0x58 phy_detach+0x5c/0x168 phy_disconnect+0x4c/0x70 phylink_disconnect_phy+0x6c/0xc0 [phylink] macb_close+0x6c/0x170 [macb] ... macb_remove+0x60/0x168 [macb] platform_remove+0x5c/0x80 ... The warning happens because the PHY is being exited while the netdev is still registered. The correct order is to unregister the netdev before shutting down the PHY and cleaning up the MDIO bus. Fix this by moving unregister_netdev() ahead of phy_exit() in macb_remove(). Fixes: 8b73fa3ae02b ("net: macb: Added ZynqMP-specific initialization") Signed-off-by: luoguangfei <15388634752@163.com> Link: https://patch.msgid.link/20250818232527.1316-1-15388634752@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21idpf: remove obsolete stashing codeJoshua Hay
With the new Tx buffer management scheme, there is no need for all of the stashing mechanisms, the hash table, the reserve buffer stack, etc. Remove all of that. Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: stop Tx if there are insufficient buffer resourcesJoshua Hay
The Tx refillq logic will cause packets to be silently dropped if there are not enough buffer resources available to send a packet in flow scheduling mode. Instead, determine how many buffers are needed along with number of descriptors. Make sure there are enough of both resources to send the packet, and stop the queue if not. Fixes: 7292af042bcf ("idpf: fix a race in txq wakeup") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: replace flow scheduling buffer ring with buffer poolJoshua Hay
Replace the TxQ buffer ring with one large pool/array of buffers (only for flow scheduling). This eliminates the tag generation and makes it impossible for a tag to be associated with more than one packet. The completion tag passed to HW through the descriptor is the index into the array. That same completion tag is posted back to the driver in the completion descriptor, and used to index into the array to quickly retrieve the buffer during cleaning. In this way, the tags are treated as a fix sized resource. If all tags are in use, no more packets can be sent on that particular queue (until some are freed up). The tag pool size is 64K since the completion tag width is 16 bits. For each packet, the driver pulls a free tag from the refillq to get the next free buffer index. When cleaning is complete, the tag is posted back to the refillq. A multi-frag packet spans multiple buffers in the driver, therefore it uses multiple buffer indexes/tags from the pool. Each frag pulls from the refillq to get the next free buffer index. These are tracked in a next_buf field that replaces the completion tag field in the buffer struct. This chains the buffers together so that the packet can be cleaned from the starting completion tag taken from the completion descriptor, then from the next_buf field for each subsequent buffer. In case of a dma_mapping_error occurs or the refillq runs out of free buf_ids, the packet will execute the rollback error path. This unmaps any buffers previously mapped for the packet. Since several free buf_ids could have already been pulled from the refillq, we need to restore its original state as well. Otherwise, the buf_ids/tags will be leaked and not used again until the queue is reallocated. Descriptor completions only advance the descriptor ring index to "clean" the descriptors. The packet completions only clean the buffers associated with the given packet completion tag and do not update the descriptor ring index. When operating in queue based scheduling mode, the array still acts as a ring and will only have TxQ descriptor count entries. The tx_bufs are still associated 1:1 with the descriptor ring entries and we can use the conventional indexing mechanisms. Fixes: c2d548cad150 ("idpf: add TX splitq napi poll support") Signed-off-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: simplify and fix splitq Tx packet rollback error pathJoshua Hay
Move (and rename) the existing rollback logic to singleq.c since that will be the only consumer. Create a simplified splitq specific rollback function to loop through and unmap tx_bufs based on the completion tag. This is critical before replacing the Tx buffer ring with the buffer pool since the previous rollback indexing will not work to unmap the chained buffers from the pool. Cache the next_to_use index before any portion of the packet is put on the descriptor ring. In case of an error, the rollback will bump tail to the correct next_to_use value. Because the splitq path now supports different types of context descriptors (and potentially multiple in the future), this will take care of rolling back any and all context descriptors encoded on the ring for the erroneous packet. The previous rollback logic was broken for PTP packets since it would not account for the PTP context descriptor. Fixes: 1a49cf814fe1 ("idpf: add Tx timestamp flows") Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: improve when to set RE bit logicJoshua Hay
Track the gap between next_to_use and the last RE index. Set RE again if the gap is large enough to ensure RE bit is set frequently. This is critical before removing the stashing mechanisms because the opportunistic descriptor ring cleaning from the out-of-order completions will go away. Previously the descriptors would be "cleaned" by both the descriptor (RE) completion and the out-of-order completions. Without the latter, we must ensure the RE bit is set more frequently. Otherwise, it's theoretically possible for the descriptor ring next_to_clean to never advance. The previous implementation was dependent on the start of a packet falling on a 64th index in the descriptor ring, which is not guaranteed with large packets. Signed-off-by: Luigi Rizzo <lrizzo@google.com> Signed-off-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21idpf: add support for Tx refillqs in flow scheduling modeJoshua Hay
In certain production environments, it is possible for completion tags to collide, meaning N packets with the same completion tag are in flight at the same time. In this environment, any given Tx queue is effectively used to send both slower traffic and higher throughput traffic simultaneously. This is the result of a customer's specific configuration in the device pipeline, the details of which Intel cannot provide. This configuration results in a small number of out-of-order completions, i.e., a small number of packets in flight. The existing guardrails in the driver only protect against a large number of packets in flight. The slower flow completions are delayed which causes the out-of-order completions. The fast flow will continue sending traffic and generating tags. Because tags are generated on the fly, the fast flow eventually uses the same tag for a packet that is still in flight from the slower flow. The driver has no idea which packet it should clean when it processes the completion with that tag, but it will look for the packet on the buffer ring before the hash table. If the slower flow packet completion is processed first, it will end up cleaning the fast flow packet on the ring prematurely. This leaves the descriptor ring in a bad state resulting in a crash or Tx timeout. In summary, generating a tag when a packet is sent can lead to the same tag being associated with multiple packets. This can lead to resource leaks, crashes, and/or Tx timeouts. Before we can replace the tag generation, we need a new mechanism for the send path to know what tag to use next. The driver will allocate and initialize a refillq for each TxQ with all of the possible free tag values. During send, the driver grabs the next free tag from the refillq from next_to_clean. While cleaning the packet, the clean routine posts the tag back to the refillq's next_to_use to indicate that it is now free to use. This mechanism works exactly the same way as the existing Rx refill queues, which post the cleaned buffer IDs back to the buffer queue to be reposted to HW. Since we're using the refillqs for both Rx and Tx now, genericize some of the existing refillq support. Note: the refillqs will not be used yet. This is only demonstrating how they will be used to pass free tags back to the send path. Signed-off-by: Joshua Hay <joshua.a.hay@intel.com> Reviewed-by: Madhu Chittim <madhu.chittim@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-08-21net/mlx5e: Preserve shared buffer capacity during headroom updatesArmen Ratner
When port buffer headroom changes, port_update_shared_buffer() recalculates the shared buffer size and splits it in a 3:1 ratio (lossy:lossless) - Currently, the calculation is: lossless = shared / 4; lossy = (shared / 4) * 3; Meaning, the calculation dropped the remainder of shared % 4 due to integer division, unintentionally reducing the total shared buffer by up to three cells on each update. Over time, this could shrink the buffer below usable size. Fix it by changing the calculation to: lossless = shared / 4; lossy = shared - lossless; This retains all buffer cells while still approximating the intended 3:1 split, preventing capacity loss over time. While at it, perform headroom calculations in units of cells rather than in bytes for more accurate calculations avoiding extra divisions. Fixes: a440030d8946 ("net/mlx5e: Update shared buffer along with device buffer changes") Signed-off-by: Armen Ratner <armeng@nvidia.com> Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Alexei Lazar <alazar@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250820133209.389065-9-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5e: Query FW for buffer ownershipAlexei Lazar
The SW currently saves local buffer ownership when setting the buffer. This means that the SW assumes it has ownership of the buffer after the command is set. If setting the buffer fails and we remain in FW ownership, the local buffer ownership state incorrectly remains as SW-owned. This leads to incorrect behavior in subsequent PFC commands, causing failures. Instead of saving local buffer ownership in SW, query the FW for buffer ownership when setting the buffer. This ensures that the buffer ownership state is accurately reflected, avoiding the issues caused by incorrect ownership states. Fixes: ecdf2dadee8e ("net/mlx5e: Receive buffer support for DCBX") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Restore missing scheduling node cleanup on vport enable failureCarolina Jubran
Restore the __esw_qos_free_node() call removed by the offending commit. Fixes: 97733d1e00a0 ("net/mlx5: Add traffic class scheduling support for vport QoS") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-7-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Fix QoS reference leak in vport enable error pathCarolina Jubran
Add missing esw_qos_put() call when __esw_qos_alloc_node() fails in mlx5_esw_qos_vport_enable(). Fixes: be034baba83e ("net/mlx5: Make vport QoS enablement more flexible for future extensions") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Destroy vport QoS element when no configuration remainsCarolina Jubran
If a VF has been configured and the user later clears all QoS settings, the vport element remains in the firmware QoS tree. This leads to inconsistent behavior compared to VFs that were never configured, since the FW assumes that unconfigured VFs are outside the QoS hierarchy. As a result, the bandwidth share across VFs may differ, even though none of them appear to have any configuration. Align the driver behavior with the FW expectation by destroying the vport QoS element when all configurations are removed. Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate") Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250820133209.389065-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5e: Preserve tc-bw during parent changesCarolina Jubran
When changing parent of a node/leaf with tc-bw configured, the code saves and restores tc-bw values. However, it was reading the converted hardware bw_share values (where 0 becomes 1) instead of the original user values, causing incorrect tc-bw calculations after parent change. Store original tc-bw values in the node structure and use them directly for save/restore operations. Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Remove default QoS group and attach vports directly to root TSARCarolina Jubran
Currently, the driver creates a default group (`node0`) and attaches all vports to it unless the user explicitly sets a parent group. As a result, when a user configures tx_share on a group and tx_share on a VF, the expectation is for the group and the VF to share bandwidth relatively. However, since the VF is not connected to the same parent (but to the default node), the proportional share logic is not applied correctly. To fix this, remove the default group (`node0`) and instead connect vports directly to the root TSAR when no parent is specified. This ensures that vports and groups share the same root scheduler and their tx_share values are compared directly under the same hierarchy. Fixes: 0fe132eac38c ("net/mlx5: E-switch, Allow to add vports to rate groups") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net/mlx5: Base ECVF devlink port attrs from 0Daniel Jurgens
Adjust the vport number by the base ECVF vport number so the port attributes start at 0. Previously the port attributes would start 1 after the maximum number of host VFs. Fixes: dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21Octeontx2-af: Skip overlap check for SPI fieldHariprasad Kelam
Octeontx2/CN10K silicon supports generating a 256-bit key per packet. The specific fields to be extracted from a packet for key generation are configurable via a Key Extraction (MKEX) Profile. The AF driver scans the configured extraction profile to ensure that fields from upper layers do not overwrite fields from lower layers in the key. Example Packet Field Layout: LA: DMAC + SMAC LB: VLAN LC: IPv4/IPv6 LD: TCP/UDP Valid MKEX Profile Configuration: LA -> DMAC -> key_offset[0-5] LC -> SIP -> key_offset[20-23] LD -> SPORT -> key_offset[30-31] Invalid MKEX profile configuration: LA -> DMAC -> key_offset[0-5] LC -> SIP -> key_offset[20-23] LD -> SPORT -> key_offset[2-3] // Overlaps with DMAC field In another scenario, if the MKEX profile is configured to extract the SPI field from both AH and ESP headers at the same key offset, the driver rejecting this configuration. In a regular traffic, ipsec packet will be having either AH(LD) or ESP (LE). This patch relaxes the check for the same. Fixes: 12aa0a3b93f3 ("octeontx2-af: Harden rule validation.") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20250820063919.1463518-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21net: airoha: ppe: Do not invalid PPE entries in case of SW hash collisionLorenzo Bianconi
SW hash computed by airoha_ppe_foe_get_entry_hash routine (used for foe_flow hlist) can theoretically produce collisions between two different HW PPE entries. In airoha_ppe_foe_insert_entry() if the collision occurs we will mark the second PPE entry in the list as stale (setting the hw hash to 0xffff). Stale entries are no more updated in airoha_ppe_foe_flow_entry_update routine and so they are removed by Netfilter. Fix the problem not marking the second entry as stale in airoha_ppe_foe_insert_entry routine if we have already inserted the brand new entry in the PPE table and let Netfilter remove real stale entries according to their timestamp. Please note this is just a theoretical issue spotted reviewing the code and not faced running the system. Fixes: cd53f622611f9 ("net: airoha: Add L2 hw acceleration support") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20250818-airoha-en7581-hash-collision-fix-v1-1-d190c4b53d1c@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-20Revert "net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN flag"Ryan Wanner
This reverts commit db400061b5e7cc55f9b4dd15443e9838964119ea. This commit can cause a Devicetree ABI break for older DTS files that rely this flag for RMII configuration. Adding this back in ensures that the older DTBs will not break. Fixes: db400061b5e7 ("net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN flag") Signed-off-by: Ryan Wanner <Ryan.Wanner@microchip.com> Link: https://patch.msgid.link/20250819163236.100680-1-Ryan.Wanner@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-20igc: fix disabling L1.2 PCI-E link substate on I226 on initValdikSS
Device ID comparison in igc_is_device_id_i226 is performed before the ID is set, resulting in always failing check on init. Before the patch: * L1.2 is not disabled on init * L1.2 is properly disabled after suspend-resume cycle With the patch: * L1.2 is properly disabled both on init and after suspend-resume How to test: Connect to the 1G link with 300+ mbit/s Internet speed, and run the download speed test, such as: curl -o /dev/null http://speedtest.selectel.ru/1GB Without L1.2 disabled, the speed would be no more than ~200 mbit/s. With L1.2 disabled, the speed would reach 1 gbit/s. Note: it's required that the latency between your host and the remote be around 3-5 ms, the test inside LAN (<1 ms latency) won't trigger the issue. Link: https://lore.kernel.org/intel-wired-lan/15248b4f-3271-42dd-8e35-02bfc92b25e1@intel.com Fixes: 0325143b59c6 ("igc: disable L1.2 PCI-E link substate to avoid performance issue") Signed-off-by: ValdikSS <iam@valdikss.org.ru> Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20250819222000.3504873-6-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>