summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-05-14net: cpsw: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()Vladimir Oltean
New timestamping API was introduced in commit 66f7223039c0 ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the two cpsw drivers to the new API, so that the ndo_eth_ioctl() path can be removed completely. The cpsw_hwtstamp_get() and cpsw_hwtstamp_set() methods (and their shim definitions, for the case where CONFIG_TI_CPTS is not enabled) must have their prototypes adjusted. These methods are used by two drivers (cpsw and cpsw_new), with vastly different configurations: - cpsw has two operating modes: - "dual EMAC" - enabled through the "dual_emac" device tree property - creates one net_device per EMAC / slave interface (but there is no bridging offload) - "switch mode" - default - there is a single net_device, with two EMACs/slaves behind it (and switching between them happens unbeknownst to the network stack). - cpsw_new always registers one net_device for each EMAC which doesn't have status = "disabled". In terms of switching, it has two modes: - "dual EMAC": default, no switching between ports, no switchdev offload. - "switch mode": enabled through the "switch_mode" devlink parameter, offloads the Linux bridge through switchdev Essentially, in 3 out of 4 operating modes, there is a bijective relation between the net_device and the slave. Timestamping can thus be configured on individual slaves. But in the "switch mode" of the cpsw driver, ndo_eth_ioctl() targets a single slave, designated using the "active_slave" device tree property. To deal with these different cases, the common portion of the drivers, cpsw_priv.c, has the cpsw_slave_index() function pointer, set to separate, identically named cpsw_slave_index_priv() by the 2 drivers. This is all relevant because cpsw_ndo_ioctl() has the old-style phy_has_hwtstamp() logic which lets the PHY handle the timestamping ioctls. Normally, that logic should be obsoleted by the more complex logic in the core, which permits dynamically selecting the timestamp provider - see dev_set_hwtstamp_phylib(). But I have doubts as to how this works for the "switch mode" of the dual EMAC driver, because the core logic only engages if the PHY is visible through ndev->phydev (this is set by phy_attach_direct()). In cpsw.c, we have: cpsw_ndo_open() -> for_each_slave(priv, cpsw_slave_open, priv); // continues on errors -> of_phy_connect() -> phy_connect_direct() -> phy_attach_direct() OR -> phy_connect() -> phy_connect_direct() -> phy_attach_direct() The problem for "switch mode" is that the behavior of phy_attach_direct() called twice in a row for the same net_device (once for each slave) is probably undefined. For sure it will overwrite dev->phydev. I don't see any explicit error checks for this case, and even if there were, the for_each_slave() call makes them non-fatal to cpsw_ndo_open() anyway. I have no idea what is the extent to which this provides a usable result, but the point is: only the last attached PHY will be visible in dev->phydev, and this may well be a different PHY than cpsw->slaves[slave_no].phy for the "active_slave". In dual EMAC mode, as well as in cpsw_new, this should not be a problem. I don't know whether PHY timestamping is a use case for the cpsw "switch mode" as well, and I hope that there isn't, because for the sake of simplicity, I've decided to deliberately break that functionality, by refusing all PHY timestamping. Keeping it would mean blocking the old API from ever being removed. In the new dev_set_hwtstamp_phylib() API, it is not possible to operate on a phylib PHY other than dev->phydev, and I would very much prefer not adding that much complexity for bizarre driver decisions. Final point about the cpsw_hwtstamp_get() conversion: we don't need to propagate the unnecessary "config.flags = 0;", because dev_get_hwtstamp() provides a zero-initialized struct kernel_hwtstamp_config. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://patch.msgid.link/20250512114422.4176010-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14Merge tag 'tpmdd-next-6.15-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm fixes from Jarkko Sakkinen: "A few last minute fixes for v6.15" * tag 'tpmdd-next-6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm: tis: Double the timeout B to 4s char: tpm: tpm-buf: Add sanity check fallback in read helpers tpm: Mask TPM RC in tpm2_start_auth_session()
2025-05-14Merge branch 'misc-drivers-sw-timestamp-changes'Jakub Kicinski
Jason Xing says: ==================== misc drivers' sw timestamp changes This series modified three outstanding drivers among more than 100 drivers because the software timestamp generation is too early. The idea of this series is derived from the brief talk[1] with Willem. In conclusion, this series makes the generation of software timestamp near/before kicking the doorbell for drivers. [1]: https://lore.kernel.org/all/681b9d2210879_1f6aad294bc@willemb.c.googlers.com.notmuch/ v2: https://lore.kernel.org/20250508033328.12507-1-kerneljasonxing@gmail.com ==================== Link: https://patch.msgid.link/20250510134812.48199-1-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: stmmac: generate software timestamp just before the doorbellJason Xing
Make sure the call of skb_tx_timestamp is as close as possbile to the doorbell. The patch also adjusts the order of setting SKBTX_IN_PROGRESS and generate software timestamp so that without SOF_TIMESTAMPING_OPT_TX_SWHW being set the software and hardware timestamps will not appear in the error queue of socket nearly at the same time (Please see __skb_tstamp_tx()). Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://patch.msgid.link/20250510134812.48199-4-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: cxgb4: generate software timestamp just before the doorbellJason Xing
Make sure the call of skb_tx_timestamp is as close as possible to the doorbell. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://patch.msgid.link/20250510134812.48199-3-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: atlantic: generate software timestamp just before the doorbellJason Xing
Make sure the call of skb_tx_timestamp is as close as possible to the doorbell. Signed-off-by: Jason Xing <kernelxing@tencent.com> Link: https://patch.msgid.link/20250510134812.48199-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14octeontx2-af: Fix CGX Receive countersHariprasad Kelam
Each CGX block supports 4 logical MACs (LMACS). Receive counters CGX_CMR_RX_STAT0-8 are per LMAC and CGX_CMR_RX_STAT9-12 are per CGX. Due a bug in previous patch, stale Per CGX counters values observed. Fixes: 66208910e57a ("octeontx2-af: Support to retrieve CGX LMAC stats") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20250513071554.728922-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: ethernet: mtk_eth_soc: fix typo for declaration MT7988 ESW capabilityBo-Cun Chen
Since MTK_ESW_BIT is a bit number rather than a bitmap, it causes MTK_HAS_CAPS to produce incorrect results. This leads to the ETH driver not declaring MAC capabilities correctly for the MT7988 ESW. Fixes: 445eb6448ed3 ("net: ethernet: mtk_eth_soc: add basic support for MT7988 SoC") Signed-off-by: Bo-Cun Chen <bc-bocun.chen@mediatek.com> Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/b8b37f409d1280fad9c4d32521e6207f63cd3213.1747110258.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: libwx: Fix FW mailbox unknown commandJiawen Wu
For the new SW-FW interaction, missing the error return if there is an unknown command. It causes the driver to mistakenly believe that the interaction is complete. This problem occurs when new driver is paired with old firmware, which does not support the new mailbox commands. Fixes: 2e5af6b2ae85 ("net: txgbe: Add basic support for new AML devices") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/64DBB705D35A0016+20250513021009.145708-4-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: libwx: Fix FW mailbox reply timeoutJiawen Wu
For the new SW-FW interaction, the timeout waiting for the firmware to return is too short. So that some mailbox commands cannot be completed. Use the 'timeout' parameter instead of fixed timeout value for flexible configuration. Fixes: 2e5af6b2ae85 ("net: txgbe: Add basic support for new AML devices") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/5D5BDE3EA501BDB8+20250513021009.145708-3-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: txgbe: Fix to calculate EEPROM checksum for AML devicesJiawen Wu
In the new firmware version, the shadow ram reserves some space to store I2C information, so the checksum calculation needs to skip this section. Otherwise, the driver will fail to probe because the invalid EEPROM checksum. Fixes: 2e5af6b2ae85 ("net: txgbe: Add basic support for new AML devices") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1C6BF7A937237F5A+20250513021009.145708-2-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: apple: bmac: use crc32() instead of hand-rolled equivalentEric Biggers
The calculation done by bmac_crc(addr) followed by taking the low 6 bits and reversing them is equivalent to taking the high 6 bits from crc32(~0, addr, ETH_ALEN). Just do that instead. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://patch.msgid.link/20250513050142.635391-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14octeontx2-pf: macsec: Fix incorrect max transmit size in TX secySubbaraya Sundeep
MASCEC hardware block has a field called maximum transmit size for TX secy. Max packet size going out of MCS block has be programmed taking into account full packet size which has L2 header,SecTag and ICV. MACSEC offload driver is configuring max transmit size as macsec interface MTU which is incorrect. Say with 1500 MTU of real device, macsec interface created on top of real device will have MTU of 1468(1500 - (SecTag + ICV)). This is causing packets from macsec interface of size greater than or equal to 1468 are not getting transmitted out because driver programmed max transmit size as 1468 instead of 1514(1500 + ETH_HDR_LEN). Fixes: c54ffc73601c ("octeontx2-pf: mcs: Introduce MACSEC hardware offloading") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1747053756-4529-1-git-send-email-sbhatta@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14openvswitch: Stricter validation for the userspace actionEelco Chaudron
This change enhances the robustness of validate_userspace() by ensuring that all Netlink attributes are fully contained within the parent attribute. The previous use of nla_parse_nested_deprecated() could silently skip trailing or malformed attributes, as it stops parsing at the first invalid entry. By switching to nla_parse_deprecated_strict(), we make sure only fully validated attributes are copied for later use. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Ilya Maximets <i.maximets@ovn.org> Link: https://patch.msgid.link/67eb414e2d250e8408bb8afeb982deca2ff2b10b.1747037304.git.echaudro@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net: phy: remove Kconfig symbol MDIO_DEVRESHeiner Kallweit
MDIO_DEVRES is only set where PHYLIB/PHYLINK are set which select MDIO_DEVRES. So we can remove this symbol. Note: Due to circular module dependencies we can't simply make mdio_devres.c part of phylib. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/27cba535-f507-4b32-84a3-0744c783a465@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14netlink: specs: tc: all actions are indexed arraysJakub Kicinski
Some TC filters have actions listed as indexed arrays of nests and some as just nests. They are all indexed arrays, the handling is common across filters. Fixes: 2267672a6190 ("doc/netlink/specs: Update the tc spec") Link: https://patch.msgid.link/20250513221638.842532-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14netlink: specs: tc: fix a couple of attribute namesJakub Kicinski
Fix up spelling of two attribute names. These are clearly typoes and will prevent C codegen from working. Let's treat this as a fix to get the correction into users' hands ASAP, and prevent anyone depending on the wrong names. Fixes: a1bcfde83669 ("doc/netlink/specs: Add a spec for tc") Link: https://patch.msgid.link/20250513221316.841700-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14net/tg3: use crc32() instead of hand-rolled equivalentEric Biggers
The calculation done by calc_crc() is equivalent to ~crc32(~0, buf, len), so just use that instead. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://patch.msgid.link/20250513041402.541527-1-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14dt-bindings: net: snps,dwmac: Align mdio node in example with bindingsGeert Uytterhoeven
According to the bindings, the MDIO subnode should be called "mdio". Update the example to match this. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/308d72c2fe8e575e6e137b99743329c2d53eceea.1747121550.git.geert+renesas@glider.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-14documentation: networking: devlink: Fix a typo in devlink-trap.rstAlper Ak
Fix a typo in the documentation: "errorrs" -> "errors". Signed-off-by: Alper Ak <alperyasinak1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250513092451.22387-1-alperyasinak1@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15tpm: tis: Double the timeout B to 4sMichal Suchanek
With some Infineon chips the timeouts in tpm_tis_send_data (both B and C) can reach up to about 2250 ms. Timeout C is retried since commit de9e33df7762 ("tpm, tpm_tis: Workaround failed command reception on Infineon devices") Timeout B still needs to be extended. The problem is most commonly encountered with context related operation such as load context/save context. These are issued directly by the kernel, and there is no retry logic for them. When a filesystem is set up to use the TPM for unlocking the boot fails, and restarting the userspace service is ineffective. This is likely because ignoring a load context/save context result puts the real TPM state and the TPM state expected by the kernel out of sync. Chips known to be affected: tpm_tis IFX1522:00: 2.0 TPM (device-id 0x1D, rev-id 54) Description: SLB9672 Firmware Revision: 15.22 tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1B, rev-id 22) Firmware Revision: 7.83 tpm_tis MSFT0101:00: 2.0 TPM (device-id 0x1A, rev-id 16) Firmware Revision: 5.63 Link: https://lore.kernel.org/linux-integrity/Z5pI07m0Muapyu9w@kitsune.suse.cz/ Signed-off-by: Michal Suchanek <msuchanek@suse.de> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-05-15char: tpm: tpm-buf: Add sanity check fallback in read helpersPurva Yeshi
Fix Smatch-detected issue: drivers/char/tpm/tpm-buf.c:208 tpm_buf_read_u8() error: uninitialized symbol 'value'. drivers/char/tpm/tpm-buf.c:225 tpm_buf_read_u16() error: uninitialized symbol 'value'. drivers/char/tpm/tpm-buf.c:242 tpm_buf_read_u32() error: uninitialized symbol 'value'. Zero-initialize the return values in tpm_buf_read_u8(), tpm_buf_read_u16(), and tpm_buf_read_u32() to guard against uninitialized data in case of a boundary overflow. Add defensive initialization ensures the return values are always defined, preventing undefined behavior if the unexpected happens. Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-05-15tpm: Mask TPM RC in tpm2_start_auth_session()Jarkko Sakkinen
tpm2_start_auth_session() does not mask TPM RC correctly from the callers: [ 28.766528] tpm tpm0: A TPM error (2307) occurred start auth session Process TPM RCs inside tpm2_start_auth_session(), and map them to POSIX error codes. Cc: stable@vger.kernel.org # v6.10+ Fixes: 699e3efd6c64 ("tpm: Add HMAC session start and end functions") Reported-by: Herbert Xu <herbert@gondor.apana.org.au> Closes: https://lore.kernel.org/linux-integrity/Z_NgdRHuTKP6JK--@gondor.apana.org.au/ Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2025-05-14Merge tag 'for-6.15-rc6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential endless loop when discarding a block group when disabling discard - reinstate message when setting a large value of mount option 'commit' - fix a folio leak when async extent submission fails * tag 'for-6.15-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: add back warning for mount option commit values exceeding 300 btrfs: fix folio leak in submit_one_async_extent() btrfs: fix discard worker infinite loop after disabling discard
2025-05-14Merge tag 'trace-v6.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix sample code that uses trace_array_printk() The sample code for in kernel use of trace_array (that creates an instance for use within the kernel) and shows how to use trace_array_printk() that writes into the created instance, used trace_printk_init_buffers(). But that function is used to initialize normal trace_printk() and produces the NOTICE banner which is not needed for use of trace_array_printk(). The function to initialize that is trace_array_init_printk() that takes the created trace array instance as a parameter. Update the sample code to reflect the proper usage. - Fix preemption count output for stacktrace event The tracing buffer shows the preempt count level when an event executes. Because writing the event itself disables preemption, this needs to be accounted for when recording. The stacktrace event did not account for this so the output of the stacktrace event showed preemption was disabled while the event that triggered the stacktrace shows preemption is enabled and this leads to confusion. Account for preemption being disabled for the stacktrace event. The same happened for stack traces triggered by function tracer. - Fix persistent ring buffer when trace_pipe is used The ring buffer swaps the reader page with the next page to read from the write buffer when trace_pipe is used. If there's only a page of data in the ring buffer, this swap will cause the "commit" pointer (last data written) to be on the reader page. If more data is written to the buffer, it is added to the reader page until it falls off back into the write buffer. If the system reboots and the commit pointer is still on the reader page, even if new data was written, the persistent buffer validator will miss finding the commit pointer because it only checks the write buffer and does not check the reader page. This causes the validator to fail the validation and clear the buffer, where the new data is lost. There was a check for this, but it checked the "head pointer", which was incorrect, because the "head pointer" always stays on the write buffer and is the next page to swap out for the reader page. Fix the logic to catch this case and allow the user to still read the data after reboot. * tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Fix persistent buffer when commit page is the reader page ftrace: Fix preemption accounting for stacktrace filter command ftrace: Fix preemption accounting for stacktrace trigger command tracing: samples: Initialize trace_array_printk() with the correct function
2025-05-14ring-buffer: Fix persistent buffer when commit page is the reader pageSteven Rostedt
The ring buffer is made up of sub buffers (sometimes called pages as they are by default PAGE_SIZE). It has the following "pages": "tail page" - this is the page that the next write will write to "head page" - this is the page that the reader will swap the reader page with. "reader page" - This belongs to the reader, where it will swap the head page from the ring buffer so that the reader does not race with the writer. The writer may end up on the "reader page" if the ring buffer hasn't written more than one page, where the "tail page" and the "head page" are the same. The persistent ring buffer has meta data that points to where these pages exist so on reboot it can re-create the pointers to the cpu_buffer descriptor. But when the commit page is on the reader page, the logic is incorrect. The check to see if the commit page is on the reader page checked if the head page was the reader page, which would never happen, as the head page is always in the ring buffer. The correct check would be to test if the commit page is on the reader page. If that's the case, then it can exit out early as the commit page is only on the reader page when there's only one page of data in the buffer. There's no reason to iterate the ring buffer pages to find the "commit page" as it is already found. To trigger this bug: # echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/syscalls/sys_enter_fchownat/enable # touch /tmp/x # chown sshd /tmp/x # reboot On boot up, the dmesg will have: Ring buffer meta [0] is from previous boot! Ring buffer meta [1] is from previous boot! Ring buffer meta [2] is from previous boot! Ring buffer meta [3] is from previous boot! Ring buffer meta [4] commit page not found Ring buffer meta [5] is from previous boot! Ring buffer meta [6] is from previous boot! Ring buffer meta [7] is from previous boot! Where the buffer on CPU 4 had a "commit page not found" error and that buffer is cleared and reset causing the output to be empty and the data lost. When it works correctly, it has: # cat /sys/kernel/tracing/instances/boot_mapped/trace_pipe <...>-1137 [004] ..... 998.205323: sys_enter_fchownat: __syscall_nr=0x104 (260) dfd=0xffffff9c (4294967196) filename=(0xffffc90000a0002c) user=0x3e8 (1000) group=0xffffffff (4294967295) flag=0x0 (0 Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250513115032.3e0b97f7@gandalf.local.home Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events") Reported-by: Tasos Sahanidis <tasos@tasossah.com> Tested-by: Tasos Sahanidis <tasos@tasossah.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14ftrace: Fix preemption accounting for stacktrace filter commandpengdonglin
The preemption count of the stacktrace filter command to trace ksys_read is consistently incorrect: $ echo ksys_read:stacktrace > set_ftrace_filter <...>-453 [004] ...1. 38.308956: <stack trace> => ksys_read => do_syscall_64 => entry_SYSCALL_64_after_hwframe The root cause is that the trace framework disables preemption when invoking the filter command callback in function_trace_probe_call: preempt_disable_notrace(); probe_ops->func(ip, parent_ip, probe_opsbe->tr, probe_ops, probe->data); preempt_enable_notrace(); Use tracing_gen_ctx_dec() to account for the preempt_disable_notrace(), which will output the correct preemption count: $ echo ksys_read:stacktrace > set_ftrace_filter <...>-410 [006] ..... 31.420396: <stack trace> => ksys_read => do_syscall_64 => entry_SYSCALL_64_after_hwframe Cc: stable@vger.kernel.org Fixes: 36590c50b2d07 ("tracing: Merge irqflags + preempt counter.") Link: https://lore.kernel.org/20250512094246.1167956-2-dolinux.peng@gmail.com Signed-off-by: pengdonglin <dolinux.peng@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14ftrace: Fix preemption accounting for stacktrace trigger commandpengdonglin
When using the stacktrace trigger command to trace syscalls, the preemption count was consistently reported as 1 when the system call event itself had 0 ("."). For example: root@ubuntu22-vm:/sys/kernel/tracing/events/syscalls/sys_enter_read $ echo stacktrace > trigger $ echo 1 > enable sshd-416 [002] ..... 232.864910: sys_read(fd: a, buf: 556b1f3221d0, count: 8000) sshd-416 [002] ...1. 232.864913: <stack trace> => ftrace_syscall_enter => syscall_trace_enter => do_syscall_64 => entry_SYSCALL_64_after_hwframe The root cause is that the trace framework disables preemption in __DO_TRACE before invoking the trigger callback. Use the tracing_gen_ctx_dec() that will accommodate for the increase of the preemption count in __DO_TRACE when calling the callback. The result is the accurate reporting of: sshd-410 [004] ..... 210.117660: sys_read(fd: 4, buf: 559b725ba130, count: 40000) sshd-410 [004] ..... 210.117662: <stack trace> => ftrace_syscall_enter => syscall_trace_enter => do_syscall_64 => entry_SYSCALL_64_after_hwframe Cc: stable@vger.kernel.org Fixes: ce33c845b030c ("tracing: Dump stacktrace trigger to the corresponding instance") Link: https://lore.kernel.org/20250512094246.1167956-1-dolinux.peng@gmail.com Signed-off-by: pengdonglin <dolinux.peng@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14Merge tag 'execve-v6.15-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fix from Kees Cook: "This fixes a corner case for ASLR-disabled static-PIE brk collision with vdso allocations: - binfmt_elf: Move brk for static PIE even if ASLR disabled" * tag 'execve-v6.15-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_elf: Move brk for static PIE even if ASLR disabled
2025-05-14Merge tag 'soc-fixes-6.15-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull SoC fixes from Arnd Bergmann: "These all address issues in devicetree files: - The Rockchip rk3588j are now limited the same way as the vendor kernel, to allow room for the industrial-grade temperature ranges. - Seven more Rockchip fixes address minor issues with specific boards - Invalid clk controller references in multiple amlogic chips, plus one accidentally disabled audio on clock - Two devicetree fixes for i.MX8MP boards, both for incorrect regulator settings - A power domain change for apple laptop touchbar, fixing suspend/resume problems - An incorrect DMA controller setting for sophgo cv18xx chips" * tag 'soc-fixes-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: arm64: dts: amazon: Fix simple-bus node name schema warnings MAINTAINERS: delete email for Shiraz Hashim arm64: dts: imx8mp-var-som: Fix LDO5 shutdown causing SD card timeout arm64: dts: imx8mp: use 800MHz NoC OPP for nominal drive mode arm64: dts: amlogic: dreambox: fix missing clkc_audio node riscv: dts: sophgo: fix DMA data-width configuration for CV18xx arm64: dts: rockchip: fix Sige5 RTC interrupt pin arm64: dts: rockchip: Assign RT5616 MCLK rate on rk3588-friendlyelec-cm3588 arm64: dts: rockchip: Align wifi node name with bindings in CB2 arm64: dts: amlogic: g12: fix reference to unknown/untested PWM clock arm64: dts: amlogic: gx: fix reference to unknown/untested PWM clock ARM: dts: amlogic: meson8b: fix reference to unknown/untested PWM clock ARM: dts: amlogic: meson8: fix reference to unknown/untested PWM clock arm64: dts: apple: touchbar: Mark ps_dispdfr_be as always-on mailmap: Update email for Asahi Lina arm64: dts: rockchip: Fix mmc-pwrseq clock name on rock-pi-4 arm64: dts: rockchip: Use "regulator-fixed" for btreg on px30-engicam for vcc3v3-btreg arm64: dts: rockchip: Add pinmuxing for eMMC on QNAP TS433 arm64: dts: rockchip: Remove overdrive-mode OPPs from RK3588J SoC dtsi arm64: dts: rockchip: Allow Turing RK1 cooling fan to spin down
2025-05-14octeontx2-pf: Fix ethtool support for SDP representorsHariprasad Kelam
The hardware supports multiple MAC types, including RPM, SDP, and LBK. However, features such as link settings and pause frames are only available on RPM MAC, and not supported on SDP or LBK. This patch updates the ethtool operations logic accordingly to reflect this behavior. Fixes: 2f7f33a09516 ("octeontx2-pf: Add representors for sdp MAC") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-14net: enetc: fix implicit declaration of function FIELD_PREPWei Fang
The kernel test robot reported the following error: drivers/net/ethernet/freescale/enetc/ntmp.c: In function 'ntmp_fill_request_hdr': drivers/net/ethernet/freescale/enetc/ntmp.c:203:38: error: implicit declaration of function 'FIELD_PREP' [-Wimplicit-function-declaration] 203 | cbd->req_hdr.access_method = FIELD_PREP(NTMP_ACCESS_METHOD, | ^~~~~~~~~~ Therefore, add "bitfield.h" to ntmp_private.h to fix this issue. Fixes: 4701073c3deb ("net: enetc: add initial netc-lib driver to support NTMP") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202505101047.NTMcerZE-lkp@intel.com/ Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-14net: wangxun: Correct clerical errors in commentsJiawen Wu
There are wrong "#endif" comments in .h files need to be corrected. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-05-13qlcnic: fix memory leak in qlcnic_sriov_channel_cfg_cmd()Abdun Nihaal
In one of the error paths in qlcnic_sriov_channel_cfg_cmd(), the memory allocated in qlcnic_sriov_alloc_bc_mbx_args() for mailbox arguments is not freed. Fix that by jumping to the error path that frees them, by calling qlcnic_free_mbx_args(). This was found using static analysis. Fixes: f197a7aa6288 ("qlcnic: VF-PF communication channel implementation") Signed-off-by: Abdun Nihaal <abdun.nihaal@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250512044829.36400-1-abdun.nihaal@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net: phy: remove stub for mdiobus_register_board_infoHeiner Kallweit
The functionality of mdiobus_register_board_info() typically isn't optional for the caller. Therefore remove the stub. Note: Currently we have only one caller of mdiobus_register_board_info(), in a DSA/PHYLINK context. Therefore CONFIG_MDIO_DEVICE is selected anyway. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/410a2222-c4e8-45b0-9091-d49674caeb00@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net: mlxsw: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()Vladimir Oltean
New timestamping API was introduced in commit 66f7223039c0 ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the mlxsw driver to the new API, so that the ndo_eth_ioctl() path can be removed completely. The UAPI is still ioctl-only, but it's best to remove the "ioctl" mentions from the driver in case a netlink variant appears. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250512154411.848614-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net: ipa: Make the SMEM item ID constantKonrad Dybcio
It can't vary, stop storing the same magic number everywhere. Signed-off-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Reviewed-by: Alex Elder <elder@kernel.org> Link: https://patch.msgid.link/20250512-topic-ipa_smem-v1-1-302679514a0d@oss.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13docs: networking: timestamping: improve stacked PHC sentenceVladimir Oltean
The first paragraph makes no grammatical sense. I suppose a portion of the intended sentece is missing: "[The challenge with ] stacked PHCs (...) is that they uncover bugs". Rephrase, and at the same time simplify the structure of the sentence a little bit, it is not easy to follow. Fixes: 94d9f78f4d64 ("docs: networking: timestamping: add section for stacked PHC devices") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://patch.msgid.link/20250512131751.320283-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net: enetc: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()Vladimir Oltean
New timestamping API was introduced in commit 66f7223039c0 ("net: add NDOs for configuring hardware timestamping") from kernel v6.6. It is time to convert the ENETC driver to the new API, so that the ndo_eth_ioctl() path can be removed completely. Move the enetc_hwtstamp_get() and enetc_hwtstamp_set() calls away from enetc_ioctl() to dedicated net_device_ops for the LS1028A PF and VF (NETC v4 does not yet implement enetc_ioctl()), adapt the prototypes and export these symbols (enetc_ioctl() is also exported). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Wei Fang <wei.fang@nxp.com> Link: https://patch.msgid.link/20250512112402.4100618-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net: txgbe: Fix pending interruptJiawen Wu
For unknown reasons, sometimes the value of MISC interrupt is 0 in the IRQ handle function. In this case, wx_intr_enable() is also should be invoked to clear the interrupt. Otherwise, the next interrupt would never be reported. Fixes: a9843689e2de ("net: txgbe: add sriov function support") Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/F4F708403CE7090B+20250512100652.139510-1-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5e: Disable MACsec offload for uplink representor profileCarolina Jubran
MACsec offload is not supported in switchdev mode for uplink representors. When switching to the uplink representor profile, the MACsec offload feature must be cleared from the netdevice's features. If left enabled, attempts to add offloads result in a null pointer dereference, as the uplink representor does not support MACsec offload even though the feature bit remains set. Clear NETIF_F_HW_MACSEC in mlx5e_fix_uplink_rep_features(). Kernel log: Oops: general protection fault, probably for non-canonical address 0xdffffc000000000f: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f] CPU: 29 UID: 0 PID: 4714 Comm: ip Not tainted 6.14.0-rc4_for_upstream_debug_2025_03_02_17_35 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:__mutex_lock+0x128/0x1dd0 Code: d0 7c 08 84 d2 0f 85 ad 15 00 00 8b 35 91 5c fe 03 85 f6 75 29 49 8d 7e 60 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 a6 15 00 00 4d 3b 76 60 0f 85 fd 0b 00 00 65 ff RSP: 0018:ffff888147a4f160 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 000000000000000f RSI: 0000000000000000 RDI: 0000000000000078 RBP: ffff888147a4f2e0 R08: ffffffffa05d2c19 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: 0000000000000018 R15: ffff888152de0000 FS: 00007f855e27d800(0000) GS:ffff88881ee80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004e5768 CR3: 000000013ae7c005 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 Call Trace: <TASK> ? die_addr+0x3d/0xa0 ? exc_general_protection+0x144/0x220 ? asm_exc_general_protection+0x22/0x30 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] ? __mutex_lock+0x128/0x1dd0 ? lockdep_set_lock_cmp_fn+0x190/0x190 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] ? mutex_lock_io_nested+0x1ae0/0x1ae0 ? lock_acquire+0x1c2/0x530 ? macsec_upd_offload+0x145/0x380 ? lockdep_hardirqs_on_prepare+0x400/0x400 ? kasan_save_stack+0x30/0x40 ? kasan_save_stack+0x20/0x40 ? kasan_save_track+0x10/0x30 ? __kasan_kmalloc+0x77/0x90 ? __kmalloc_noprof+0x249/0x6b0 ? genl_family_rcv_msg_attrs_parse.constprop.0+0xb5/0x240 ? mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] mlx5e_macsec_add_secy+0xf9/0x700 [mlx5_core] ? mlx5e_macsec_add_rxsa+0x11a0/0x11a0 [mlx5_core] macsec_update_offload+0x26c/0x820 ? macsec_set_mac_address+0x4b0/0x4b0 ? lockdep_hardirqs_on_prepare+0x284/0x400 ? _raw_spin_unlock_irqrestore+0x47/0x50 macsec_upd_offload+0x2c8/0x380 ? macsec_update_offload+0x820/0x820 ? __nla_parse+0x22/0x30 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x15e/0x240 genl_family_rcv_msg_doit+0x1cc/0x2a0 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240 ? cap_capable+0xd4/0x330 genl_rcv_msg+0x3ea/0x670 ? genl_family_rcv_msg_dumpit+0x2a0/0x2a0 ? lockdep_set_lock_cmp_fn+0x190/0x190 ? macsec_update_offload+0x820/0x820 netlink_rcv_skb+0x12b/0x390 ? genl_family_rcv_msg_dumpit+0x2a0/0x2a0 ? netlink_ack+0xd80/0xd80 ? rwsem_down_read_slowpath+0xf90/0xf90 ? netlink_deliver_tap+0xcd/0xac0 ? netlink_deliver_tap+0x155/0xac0 ? _copy_from_iter+0x1bb/0x12c0 genl_rcv+0x24/0x40 netlink_unicast+0x440/0x700 ? netlink_attachskb+0x760/0x760 ? lock_acquire+0x1c2/0x530 ? __might_fault+0xbb/0x170 netlink_sendmsg+0x749/0xc10 ? netlink_unicast+0x700/0x700 ? __might_fault+0xbb/0x170 ? netlink_unicast+0x700/0x700 __sock_sendmsg+0xc5/0x190 ____sys_sendmsg+0x53f/0x760 ? import_iovec+0x7/0x10 ? kernel_sendmsg+0x30/0x30 ? __copy_msghdr+0x3c0/0x3c0 ? filter_irq_stacks+0x90/0x90 ? stack_depot_save_flags+0x28/0xa30 ___sys_sendmsg+0xeb/0x170 ? kasan_save_stack+0x30/0x40 ? copy_msghdr_from_user+0x110/0x110 ? do_syscall_64+0x6d/0x140 ? lock_acquire+0x1c2/0x530 ? __virt_addr_valid+0x116/0x3b0 ? __virt_addr_valid+0x1da/0x3b0 ? lock_downgrade+0x680/0x680 ? __delete_object+0x21/0x50 __sys_sendmsg+0xf7/0x180 ? __sys_sendmsg_sock+0x20/0x20 ? kmem_cache_free+0x14c/0x4e0 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x6d/0x140 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f855e113367 Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 RSP: 002b:00007ffd15e90c88 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f855e113367 RDX: 0000000000000000 RSI: 00007ffd15e90cf0 RDI: 0000000000000004 RBP: 00007ffd15e90dbc R08: 0000000000000028 R09: 000000000045d100 R10: 00007f855e011dd8 R11: 0000000000000246 R12: 0000000000000019 R13: 0000000067c6b785 R14: 00000000004a1e80 R15: 0000000000000000 </TASK> Modules linked in: 8021q garp mrp sch_ingress openvswitch nsh mlx5_ib mlx5_fwctl mlx5_dpll mlx5_core rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay zram zsmalloc fuse [last unloaded: mlx5_core] ---[ end trace 0000000000000000 ]--- Fixes: 8ff0ac5be144 ("net/mlx5: Add MACsec offload Tx command support") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1746958552-561295-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13Merge branch 'net-mlx5-hws-complex-matchers-and-rehash-mechanism-fixes'Jakub Kicinski
Tariq Toukan says: ==================== net/mlx5: HWS, Complex Matchers and rehash mechanism fixes Motivation: ---------- A matcher can match a certain set of match parameters. However, the number and size of match params for a single matcher are limited — all the parameters must fit within a single definer. A common example of this limitation is IPv6 address matching, where matching both source and destination IPs requires more bits than a single definer can support. SW Steering addresses this limitation by chaining multiple Steering Table Entries (STEs) within the same matcher, where each STE matches on a subset of the parameters. In HW Steering, such chaining is not possible — the matcher's STEs are managed in a hash table, and a single definer is used to calculate the hash index for STEs. Overview: -------- To address this limitation in HW Steering, we introduce *Complex Matchers*, which consist of two chained matchers. This allows matching on twice as many parameters. Complex Matchers are filled with *Complex Rules* — rules that are split into two parts and inserted into their respective matchers. The first half of the Complex Matcher is a regular matcher and points to the second half, which is an *Isolated Matcher*. An Isolated Matcher has its own isolated table and is accessible only by traffic coming from the first half of the Complex Matcher. This splitting of matchers/rules into multiple parts is transparent to users. It is hidden behind the BWC HWS API. It becomes visible only when dumping steering debug information, where the Complex Matcher appears as two separate matchers: one in the user-created table and another in its isolated table. Implementation Details: ---------------------- All user actions are performed on the second part of the rules only. The first part handles matching and applies two actions: modify header (set metadata, see details below) and go-to-table (directing traffic to the isolated table containing the isolated matcher). Rule updates (updating rule actions) are applied to the second part of the rule since user-provided actions are not executed in the first matcher. We use REG_C_6 metadata register to set and match on unique per-rule tag (see details below). Splitting rules into two parts introduces new challenges: 1. Invalid Combinations Consider two rules with different matching values: - Rule 1: A+B - Rule 2: C+D Let's split the rules into two parts as follows: |-----Complex Matcher-------| | | | 1st matcher 2nd matcher | | |---| |---| | | | A | | B | | | |---| -----> |---| | | | C | | D | | | |---| |---| | | | |---------------------------| Splitting these rules results in invalid combinations: A+D and C+B: any packet that matched on A will be forwarded to the 2nd matcher, where it will try to match on B (which is legal, and it is what the user asked for), but it will also try to match on D (which is not what the user asked for). To resolve this, we assign unique tags to each rule on the first matcher and match on these tags on the second matcher: |----------| |---------| | A | | B, TagA | | action: | | | | set TagA | | | |----------| --> |---------| | C | | D, TagB | | action: | | | | set TagB | | | |----------| |---------| 2. Duplicated Entries: Consider two rules with overlapping values: - Rule 1: A+B - Rule 2: A+D Let's split the rules into two parts as follows: |---| |---| | A | | B | |---| --> |---| | | | D | |---| |---| This leads to the duplicated entries on the first matcher, which HWS doesn't allow: subsequent delete of either of the rules will delete the only entry in the first matcher, leaving the remaining rule broken. To address this, we use a reference count for entries in the first matcher and delete STEs only when their refcount reaches zero. Both challenges are resolved by having a per-matcher data structure (implemented with rhashtable) that manages refcounts for the first part of the rules and holds unique tags (managed via IDA) for these rules to set and to match on the second matcher. Limitations: ----------- We utilize metadata register REG_C_6 in this implementation, so its usage anywhere along the flow that might include the need for Complex Matcher is prohibited. The number and size of match parameters remain limited — now constrained by what can be represented by two definers instead of one. This architectural limitation arises from the structure of Complex Matchers. If future requirements demand more parameters, Complex Matchers can be extended beyond two matchers. Additionally, there is an implementation limit of 32 match parameters per matcher (disregarding parameter size). This limit can be lifted if needed. Patches: ------- - Patches 1-3: small additions/refactoring in preparation for Complex Matcher: exposed mlx5hws_table_ft_set_next_ft() in header, added definer function to convert field name enum to string, expose the polling function mlx5hws_bwc_queue_poll() in a header. - Patch 4: in preparation for Complex Matcher, this patch adds support for Isolated Matcher. - Patch 5: the main patch - Complex Matchers implementation. [2] Patch 6: fixing the usecase where rule insertion was failing, but rehash couldn't be initiated if the number of rules in the table is below the rehash threshold. Patch 7: fixing the usecase where many rules in parallel would require rehash, due to the way the counting of rules was done. Patch 8: fixing the case where rules were requiring action template extension in parallel, leading to unneeded extensions with the same templates. Patch 9: refactor and simplify the rehash loop. Patch 10: dump error completion details, which helps a lot in trying to understand what went wrong, especially during rehash. ==================== Link: https://patch.msgid.link/1746992290-568936-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, dump bad completion detailsYevgeny Kliteynik
Failing to insert/delete a rule should not happen. If it does happen, it would be good to know at which stage it happened and what was the failure. This patch adds printing of bad CQE details. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-11-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, rework rehash loopYevgeny Kliteynik
Reworking the rehash loop - simplifying the code and making it less error prone: - Instead of doing round-robin on all the queues with batch of rules in each cycle, just go over all the queues and move all the rules that belong to this queue. - If at some stage of moving the rule we get a failure (which should not happen), this can't be rolled back. So instead of aborting rehash and leaving the matcher in a broken state, allow the loop to continue: attempt to move the rest of the rules and delete the old matcher. A rule that failed to move to a new matcher will loose its match STE once the rehash is completed and the old matcher is deleted, so the rule won't match any traffic any more. This rule's packets will fall back to the steering pipeline w/o HW offload. Rehash procedure will return an error, which will cause the rule insertion to fail for the rule that started this whole rehash. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-10-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, fix redundant extension of action templatesYevgeny Kliteynik
When a rule is inserted into a matcher, we search for the suitable action template. If such template is not found, action template array is extended with the new template. However, when several threads are performing this in parallel, there is a race - we can end up with extending the action templates array with the same template. This patch is doing the following: - refactor the code to find action template index in rule create and update, have the common code in an auxiliary function - after locking all the queues, check again if the action template array still needs to be extended Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-9-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, fix counting of rules in the matcherYevgeny Kliteynik
Currently the counter that counts number of rules in a matcher is increased only when rule insertion is completed. In a multi-threaded usecase this can lead to a scenario that many rules can be in process of insertion in the same matcher, while none of them has completed the insertion and the rule counter is not updated. This results in a rule insertion failure for many of them at first attempt, which leads to all of them requiring rehash and requiring locking of all the queue locks. This patch fixes the case by increasing the rule counter in the beginning of insertion process and decreasing in case of any failure. Signed-off-by: Vlad Dogaru <vdogaru@nvidia.com> Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-8-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, force rehash when rule insertion failedYevgeny Kliteynik
Rules are inserted into hash table in accordance with their hash index. When a certain number of rules is reached, the table is rehashed: a bigger new table is allocated and all the rules are moved there. But sometimes a new rule can't be inserted into the hash table because its index is full, even though the number of rules in the table is well below the threshold. The hash function is not perfect, so such cases are not rare. When that happens, we want to do the same rehash, in order to increase the table size and lower the probability for such cases. This patch fixes the usecase where rule insertion was failing, but rehash couldn't be initiated due to low number of rules: it adds flag that denotes that rehash is required, even if the number of rules in the table is below the rehash threshold. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-7-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, support complex matchersYevgeny Kliteynik
This patch adds support for Complex Matchers/Rules Overview: -------- A matcher can match on a certain set of match parameters. However, the number and size of match params for a single matcher are limited: all the parameters must fit within a single definer. A common example of this limitation is IPv6 address matching, where matching both source and destination IPs requires more bits than a single definer can support. SW Steering addresses this limitation by chaining multiple Steering Table Entries (STEs) within the same matcher, where each STE matches on a subset of the parameters. In HW Steering, such chaining is not possible — the matcher's STEs are managed in a hash table, and a single definer is used to calculate the hash index for STEs. To address this limitation in HW Steering, we introduce Complex Matchers, which consist of two chained matchers. This allows matching on twice as many parameters. Complex Matchers are filled with Complex Rules — rules that are split into two parts and inserted into their respective matchers. The first half of the Complex Matcher is a regular matcher and points to the second half, which is an Isolated Matcher. An Isolated Matcher has its own isolated table and is accessible only by traffic coming from the first half of the Complex Matcher. This splitting of matchers/rules into multiple parts is transparent to users. It is hidden under the BWC HWS API. It becomes visible only when dumping steering debug information, where the Complex Matcher appears as two separate matchers: one in the user-created table and another in its isolated table. Some implementation details: --------------------------- All user actions are performed on the second part of the rules only. The first part handles matching and applies two actions: modify header (set metadata, see details below) and go-to-table (directing traffic to the isolated table containing the isolated matcher). Rule updates (updating rule actions) are applied to the second part of the rule since user-provided actions are not executed in the first matcher. We use REG_C_6 metadata register to set and match on unique per-rule tag (see details below). Splitting rules into two parts introduces new challenges: 1. Invalid Combinations Consider two rules with different matching values: - Rule 1: A+B - Rule 2: C+D Let's split the rules into two parts as follows: |---| |---| | A | | B | |---| --> |---| | C | | D | |---| |---| Splitting these rules results in invalid combinations like A+D and C+B. To resolve this, we assign unique tags to each rule on the first matcher and match these tags on the second matcher (the tag is implemented through modify_hdr action that sets value to metadata register REG_C_6): |----------| |---------| | A | | B, TagA | | action: | | | | set TagA | | | |----------| --> |---------| | C | | D, TagB | | action: | | | | set TagB | | | |----------| |---------| 2. Duplicated Entries: Consider two rules with overlapping values: - Rule 1: A+B - Rule 2: A+D Let's split the rules into two parts as follows: |---| |---| | A | | B | |---| --> |---| | | | D | |---| |---| This leads to the duplicated entries on the first matcher, which HWS doesn't allow: subsequent delete of either of the rules will delete the only entry in the first matcher, leaving the remaining rule broken. To address this, we use a reference count for entries in the first matcher and delete STEs only when their refcount reaches zero. Both challenges are resolved by having a per-matcher data structure (implemented with rhashtable) that manages refcounts for the first part of the rules and holds unique tags (managed via IDA) for these rules to set and to match on the second matcher. Limitations: ----------- We utilize metadata register REG_C_6 in this implementation, so its usage anywhere along the steering of the flow that might include the need for Complex Matcher is prohibited. The number and size of match parameters remain limited — now it is constrained by what can be represented by two definers instead of one. This architectural limitation arises from the structure of Complex Matchers. If future requirements demand more parameters, Complex Matchers can be extended beyond two matchers. Additionally, there is an implementation limit of 32 match parameters per rule (disregarding parameter size). This limit can be lifted if needed. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-6-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, introduce isolated matchersYevgeny Kliteynik
In preparation for complex matcher support, introduce the isolated matcher. Isolated matcher is a matcher that has its own isolated table. It is used as the second half of the complex matcher: when the rule is split into two parts (complex rule), then matching on the first part will send the packet to the isolated matcher that will try to match on the second part. In case of miss, the packet goes back to the matcher's end flow table. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-5-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-13net/mlx5: HWS, expose polling function in header fileYevgeny Kliteynik
In preparation for complex matcher, expose the function that is polling queue for completion (mlx5hws_bwc_queue_poll) in header file, so that it will be used by complex matcher code. Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Vlad Dogaru <vdogaru@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1746992290-568936-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>