summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-02-26net: move aRFS rmap management and CPU affinity to coreAhmed Zaki
A common task for most drivers is to remember the user-set CPU affinity to its IRQs. On each netdev reset, the driver should re-assign the user's settings to the IRQs. Unify this task across all drivers by moving the CPU affinity to napi->config. However, to move the CPU affinity to core, we also need to move aRFS rmap management since aRFS uses its own IRQ notifiers. For the aRFS, add a new netdev flag "rx_cpu_rmap_auto". Drivers supporting aRFS should set the flag via netif_enable_cpu_rmap() and core will allocate and manage the aRFS rmaps. Freeing the rmap is also done by core when the netdev is freed. For better IRQ affinity management, move the IRQ rmap notifier inside the napi_struct and add new notify.notify and notify.release functions: netif_irq_cpu_rmap_notify() and netif_napi_affinity_release(). Now we have the aRFS rmap management in core, add CPU affinity mask to napi_config. To delegate the CPU affinity management to the core, drivers must: 1 - set the new netdev flag "irq_affinity_auto": netif_enable_irq_affinity(netdev) 2 - create the napi with persistent config: netif_napi_add_config() 3 - bind an IRQ to the napi instance: netif_napi_set_irq() the core will then make sure to use re-assign affinity to the napi's IRQ. The default IRQ mask is set to one cpu starting from the closest NUMA. Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Link: https://patch.msgid.link/20250224232228.990783-2-ahmed.zaki@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26bonding: report duplicate MAC address in all situationsHangbin Liu
Normally, a bond uses the MAC address of the first added slave as the bond’s MAC address. And the bond will set active slave’s MAC address to bond’s address if fail_over_mac is set to none (0) or follow (2). When the first slave is removed, the bond will still use the removed slave’s MAC address, which can lead to a duplicate MAC address and potentially cause issues with the switch. To avoid confusion, let's warn the user in all situations, including when fail_over_mac is set to 2 or not in active-backup mode. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20250225033914.18617-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26net: skb: free up one bit in tx_flagsWillem de Bruijn
The linked series wants to add skb tx completion timestamps. That needs a bit in skb_shared_info.tx_flags, but all are in use. A per-skb bit is only needed for features that are configured on a per packet basis. Per socket features can be read from sk->sk_tsflags. Per packet tsflags can be set in sendmsg using cmsg, but only those in SOF_TIMESTAMPING_TX_RECORD_MASK. Per packet tsflags can also be set without cmsg by sandwiching a send inbetween two setsockopts: val |= SOF_TIMESTAMPING_$FEATURE; setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); write(fd, buf, sz); val &= ~SOF_TIMESTAMPING_$FEATURE; setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); Changing a datapath test from skb_shinfo(skb)->tx_flags to skb->sk->sk_tsflags can change behavior in that case, as the tx_flags is written before the second setsockopt updates sk_tsflags. Therefore, only bits can be reclaimed that cannot be set by cmsg and are also highly unlikely to be used to target individual packets otherwise. Free up the bit currently used for SKBTX_HW_TSTAMP_USE_CYCLES. This selects between clock and free running counter source for HW TX timestamps. It is probable that all packets of the same socket will always use the same source. Link: https://lore.kernel.org/netdev/cover.1739988644.git.pav@iki.fi/ Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/20250225023416.2088705-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26Merge branch 'expand-cmsg_ipv6-sh-with-ipv4-support'Jakub Kicinski
Willem de Bruijn says: ==================== expand cmsg_ipv6.sh with ipv4 support Expand IPV6_TCLASS to also cover IP_TOS. Expand IPV6_HOPLIMIT to also cover IP_TTL. A series of two patches for basic readability (patch 1 is a noop), and so that git does not interpret code changes + file rename as a whole file del + add. ==================== Link: https://patch.msgid.link/20250225022431.2083926-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26selftests/net: expand cmsg_ipv6.sh with ipv4Willem de Bruijn
Expand IPV6_TCLASS to also cover IP_TOS. Expand IPV6_HOPLIMIT to also cover IP_TTL. Expand csmg_sender.c to allow setting IPv4 setsockopts. Also rename struct v6 to cmsg to match its expanded scope. Don't bother updating all occurrences of tclass and hoplimit. Rename cmsg_ipv6.sh to cmsg_ip.sh to match the expanded scope. Be careful around the subtle API difference between TCLASS and TOS. IP_TOS includes ECN bits. Add a test to verify that these are masked when making routing decisions. Diff is more concise with --word-diff Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250225022431.2083926-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26selftests/net: prepare cmsg_ipv6.sh for ipv4Willem de Bruijn
Move IPV6_TCLASS and IPV6_HOPLIMIT into loops, to be able to use them for IP_TOS and IP_TTL in a follow-on patch. Indentation in this file is a mix of four spaces and tabs for double indents. To minimize code churn, maintain that pattern. Very small diff if viewing with -w. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250225022431.2083926-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26tcp: be less liberal in TSEcr received while in SYN_RECV stateEric Dumazet
Yong-Hao Zou mentioned that linux was not strict as other OS in 3WHS, for flows using TCP TS option (RFC 7323) As hinted by an old comment in tcp_check_req(), we can check the TSEcr value in the incoming packet corresponds to one of the SYNACK TSval values we have sent. In this patch, I record the oldest and most recent values that SYNACK packets have used. Send a challenge ACK if we receive a TSEcr outside of this range, and increase a new SNMP counter. nstat -az | grep TSEcrRejected TcpExtTSEcrRejected 0 0.0 Due to TCP fastopen implementation, do not apply yet these checks for fastopen flows. v2: No longer use req->num_timeout, but treq->snt_tsval_first to detect when first SYNACK is prepared. This means we make sure to not send an initial zero TSval. Make sure MPTCP and TCP selftests are passing. Change MIB name to TcpExtTSEcrRejected v1: https://lore.kernel.org/netdev/CADVnQykD8i4ArpSZaPKaoNxLJ2if2ts9m4As+=Jvdkrgx1qMHw@mail.gmail.com/T/ Reported-by: Yong-Hao Zou <yonghaoz1994@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250225171048.3105061-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25enic: add dependency on Page PoolJohn Daley
Driver was not configured to select page_pool, causing a compile error if page pool module was not already selected. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202502211253.3XRosM9I-lkp@intel.com/ Fixes: d24cb52b2d8a ("enic: Use the Page Pool API for RX") Reviewed-by: Nelson Escobar <neescoba@cisco.com> Signed-off-by: John Daley <johndale@cisco.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20250224234350.23157-2-johndale@cisco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25Merge branch 'mlx5-next' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-02-24 The following pull-request contains common mlx5 updates for your *net-next* tree. * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Change POOL_NEXT_SIZE define value and make it global net/mlx5: Add new health syndrome error and crr bit offset ==================== Link: https://patch.msgid.link/20250224212446.523259-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25net: wangxun: fix LIBWX dependenciesArnd Bergmann
Selecting LIBWX requires that its dependencies are met first: WARNING: unmet direct dependencies detected for LIBWX Depends on [m]: NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_WANGXUN [=y] && PTP_1588_CLOCK_OPTIONAL [=m] Selected by [y]: - TXGBE [=y] && NETDEVICES [=y] && ETHERNET [=y] && NET_VENDOR_WANGXUN [=y] && PCI [=y] && COMMON_CLK [=y] && I2C_DESIGNWARE_PLATFORM [=y] ld.lld-21: error: undefined symbol: ptp_schedule_worker >>> referenced by wx_ptp.c:747 (/home/arnd/arm-soc/drivers/net/ethernet/wangxun/libwx/wx_ptp.c:747) >>> drivers/net/ethernet/wangxun/libwx/wx_ptp.o:(wx_ptp_reset) in archive vmlinux.a Add the smae dependency on PTP_1588_CLOCK_OPTIONAL to the two driver using this library module. Fixes: 06e75161b9d4 ("net: wangxun: Add support for PTP clock") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20250224140516.1168214-1-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25Merge branch 'symmetric-or-xor-rss-hash'Jakub Kicinski
Gal Pressman says: ==================== Symmetric OR-XOR RSS hash Add support for a new type of input_xfrm: Symmetric OR-XOR. Symmetric OR-XOR performs hash as follows: (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT) Configuration is done through ethtool -x/X command. For mlx5, the default is already symmetric hash, this patch now exposes this to userspace and allows enabling/disabling of the feature. v5: https://lore.kernel.org/20250220113435.417487-1-gal@nvidia.com v4: https://lore.kernel.org/20250216182453.226325-1-gal@nvidia.com v3: https://lore.kernel.org/20250205135341.542720-1-gal@nvidia.com v2: https://lore.kernel.org/20250203150039.519301-1-gal@nvidia.com ==================== Link: https://patch.msgid.link/20250224174416.499070-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25selftests: drv-net-hw: Add a test for symmetric RSS hashGal Pressman
Add a selftest that verifies symmetric RSS hash is working as intended. The test runs iterations of traffic, swapping the src/dst UDP ports, and verifies that the same RX queue is receiving the traffic in both cases. Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250224174416.499070-5-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25selftests: drv-net: Make rand_port() get a port more reliablyGal Pressman
Instead of guessing a port and checking whether it's available, get an available port from the OS. Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250224174416.499070-4-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25net/mlx5e: Symmetric OR-XOR RSS hash controlGal Pressman
Allow control over the symmetric RSS hash, which was previously set to enabled by default by the driver. Symmetric OR-XOR RSS can now be queried and controlled using the 'ethtool -x/X' command. Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250224174416.499070-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25ethtool: Symmetric OR-XOR RSS hashGal Pressman
Add an additional type of symmetric RSS hash type: OR-XOR. The "Symmetric-OR-XOR" algorithm transforms the input as follows: (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT) Change 'cap_rss_sym_xor_supported' to 'supported_input_xfrm', a bitmap of supported RXH_XFRM_* types. Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250224174416.499070-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25drivers: net: xgene: Don't use "proxy" headersAndy Shevchenko
Update header inclusions to follow IWYU (Include What You Use) principle. In this case replace *gpio.h, which are subject to remove by the GPIOLIB subsystem, with the respective headers that are being used by the driver. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/20250224120037.3801609-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25Merge branch 'net-stmmac-dwc-qos-clean-up-clock-initialisation'Jakub Kicinski
Russell King says: ==================== net: stmmac: dwc-qos: clean up clock initialisation My single v1 patch has become two patches as a result of the build error, as it appears this code uses "data" differently from others. v2 still produced build warnings despite local builds being clean, so v3 addresses those. The first patch brings some consistency with other drivers, naming local variables that refer to struct plat_stmmacenet_data as "plat_dat", as is used elsewhere in this driver. ==================== Link: https://patch.msgid.link/Z7yj_BZa6yG02KcI@shell.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25net: stmmac: dwc-qos: clean up clock initialisationRussell King (Oracle)
Clean up the clock initialisation by providing a helper to find a named clock in the bulk clocks, and provide the name of the stmmac clock in match data so we can locate the stmmac clock in generic code. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Thierry Reding <treding@nvidia.com> Link: https://patch.msgid.link/E1tmbhj-004vSz-Pt@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25net: stmmac: dwc-qos: name struct plat_stmmacenet_data consistentlyRussell King (Oracle)
Most of the stmmac driver uses "plat_dat" to name the platform data, and "data" is used for the glue driver data. make dwc-qos follow this pattern to avoid silly mistakes. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Thierry Reding <treding@nvidia.com> Link: https://patch.msgid.link/E1tmbhe-004vSt-M3@rmk-PC.armlinux.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25Add OVN to `rtnetlink.h`Jonas Gottlieb
- The Open Virtual Network (OVN) routing netlink handler uses ID 84 - Will also add to `/etc/iproute2/rt_protos` once this is accepted - For more information: https://github.com/ovn-org/ovn Signed-off-by: Jonas Gottlieb <jonas.gottlieb@stackit.cloud> Link: https://patch.msgid.link/Z7w_e7cfA3xmHDa6@SIT-SDELAP4051.int.lidl.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25selftests/net: ensure mptcp is enabled in netnsHangbin Liu
Some distributions may not enable MPTCP by default. All other MPTCP tests source mptcp_lib.sh to ensure MPTCP is enabled before testing. However, the ip_local_port_range test is the only one that does not include this step. Let's also ensure MPTCP is enabled in netns for ip_local_port_range so that it passes on all distributions. Suggested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250224094013.13159-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25Octeontx2-af: RPM: Register driver with PCI subsys IDsHariprasad Kelam
Although the PCI device ID and Vendor ID for the RPM (MAC) block have remained the same across Octeon CN10K and the next-generation CN20K silicon, Hardware architecture has changed (NIX mapped RPMs and RFOE Mapped RPMs). Add PCI Subsystem IDs to the device table to ensure that this driver can be probed from NIX mapped RPM devices only. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20250224035603.1220913-1-hkelam@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25sctp: Remove unused payload from sctp_idatahdrThorsten Blum
Remove the unused payload array from the struct sctp_idatahdr. Cc: Kees Cook <kees@kernel.org> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20250223204505.2499-3-thorsten.blum@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-25Merge branch 'eth-fbnic-update-fbnic-driver'Paolo Abeni
Mohsin Bashir says: ==================== eth: fbnic: Update fbnic driver This patchset makes following trivial changes to the fbnic driver: 1) Add coverage for PCIe CSRs in the ethtool register dump. 2) Consolidate the PUL_USER CSR section, update the end boundary, and remove redundant definition of the end boundary. 3) Update the return value in kdoc for fbnic_netdev_alloc(). ==================== Link: https://patch.msgid.link/20250221201813.2688052-1-mohsin.bashr@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-25eth: fbnic: Update return value in kdocMohsin Bashir
Fix return value in kdoc for fbnic_netdev_alloc() Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-25eth: fbnic: Consolidate PUL_USER CSR sectionMohsin Bashir
Move PUL_USER CSRs in the relevant section, update the end boundary address, and remove the redundant definition of end boundary. Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-25eth: fbnic: Add PCIe registers dumpMohsin Bashir
Provide coverage to PCIe registers in ethtool register dump Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-24net: phy: add phylib-internal.hHeiner Kallweit
This patch is a starting point for moving phylib-internal declarations to a private header file. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/082eacd2-a888-4716-8797-b3491ce02820@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24Merge branch 'mptcp-pm-misc-cleanups-part-3'Jakub Kicinski
Matthieu Baerts says: ==================== mptcp: pm: misc cleanups, part 3 These cleanups lead the way to the unification of the path-manager interfaces, and allow future extensions. The following patches are not all linked to each others, but are all related to the path-managers, except the last three. - Patch 1: remove unused returned value in mptcp_nl_set_flags(). - Patch 2: new flag: avoid iterating over all connections if not needed. - Patch 3: add a build check making sure there is enough space in cb-ctx. - Patch 4: new mptcp_pm_genl_fill_addr helper to reduce duplicated code. - Patch 5: simplify userspace_pm_append_new_local_addr helper. - Patch 6: drop unneeded inet6_sk(). - Patch 7: use ipv6_addr_equal() instead of !ipv6_addr_cmp() - Patch 8: scheduler: split an interface in two. - Patch 9: scheduler: save 64 bytes of currently unused data. - Patch 10: small optimisation to exit early in case of retransmissions. ==================== Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-0-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: blackhole: avoid checking the state twiceMatthieu Baerts (NGI0)
A small cleanup, reordering the conditions to avoid checking things twice. The code here is called in case of timeout on a TCP connection, before triggering a retransmission. But it only acts on SYN + MPC packets. So the conditions can be re-order to exit early in case of non-MPTCP SYN + MPC. This also reduce the indentation levels. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-10-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: sched: reduce size for unused dataMatthieu Baerts (NGI0)
Thanks for the previous commit ("mptcp: sched: split get_subflow interface into two"), the mptcp_sched_data structure is now currently unused. This structure has been added to allow future extensions that are not ready yet. At the end, this structure will not even be used at all when mptcp_subflow bpf_iter will be supported [1]. Here is a first step to save 64 bytes on the stack for each scheduling operation. The structure is not removed yet not to break the WIP work on these extensions, but will be done when [1] will be ready and applied. Link: https://lore.kernel.org/6645ad6e-8874-44c5-8730-854c30673218@linux.dev [1] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-9-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: sched: split get_subflow interface into twoGeliang Tang
get_retrans() interface of the burst packet scheduler invokes a sleeping function mptcp_pm_subflow_chk_stale(), which calls __lock_sock_fast(). So get_retrans() interface should be set with BPF_F_SLEEPABLE flag in BPF. But get_send() interface of this scheduler can't be set with BPF_F_SLEEPABLE flag since it's invoked in ack_update_msk() under mptcp data lock. So this patch has to split get_subflow() interface of packet scheduer into two interfaces: get_send() and get_retrans(). Then we can set get_retrans() interface alone with BPF_F_SLEEPABLE flag. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-8-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: use ipv6_addr_equal in addresses_equalGeliang Tang
Use ipv6_addr_equal() to check whether two IPv6 addresses are equal in mptcp_addresses_equal(). This is more appropriate than using !ipv6_addr_cmp(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-7-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: drop inet6_sk after inet_skGeliang Tang
In mptcp_event_add_subflow(), mptcp_event_pm_listener() and mptcp_nl_find_ssk(), 'issk' has already been got through inet_sk(). No need to use inet6_sk() to get 'ipv6_pinfo' again, just use issk->pinet6 instead. This patch also drops these 'ipv6_pinfo' variables. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-6-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: drop match in userspace_pm_append_new_local_addrGeliang Tang
The variable 'match' in mptcp_userspace_pm_append_new_local_addr() is a redundant one, and this patch drops it. No need to define 'match' as 'struct mptcp_pm_addr_entry *' type. In this function, it's only used to check whether it's NULL. It can be defined as a Boolean one. Also other variables 'addr_match' and 'id_match' make 'match' a redundant one, which can be replaced by directly checking 'addr_match && id_match'. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-5-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: add mptcp_pm_genl_fill_addr helperGeliang Tang
To save some redundant code in dump_addr() interfaces of both the netlink PM and userspace PM, the code that calls netlink message helpers (genlmsg_put/cancel/end) and mptcp_nl_fill_addr() is wrapped into a new helper mptcp_pm_genl_fill_addr(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-4-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: add a build check for userspace_pm_dump_addrGeliang Tang
This patch adds a build check for mptcp_userspace_pm_dump_addr() to make sure there is enough space in 'cb->ctx' to store an address id bitmap. Just in case info stored in 'cb->ctx' are increased later. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-3-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: change to fullmesh only for 'subflow'Matthieu Baerts (NGI0)
If an endpoint doesn't have the 'subflow' flag -- in fact, has no type, so not 'subflow', 'signal', nor 'implicit' -- there are then no subflows created from this local endpoint to at least the initial destination address. In this case, no need to call mptcp_pm_nl_fullmesh() which is there to recreate the subflows to reflect the new value of the fullmesh attribute. Similarly, there is then no need to iterate over all connections to do nothing, if only the 'fullmesh' flag has been changed, and the endpoint doesn't have the 'subflow' one. So stop early when dealing with this specific case. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-2-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: remove unused ret value to set flagsMatthieu Baerts (NGI0)
The returned value is not used, it can then be dropped. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-1-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5: Use secs_to_jiffies() instead of msecs_to_jiffies()Thorsten Blum
Use secs_to_jiffies() and simplify the code. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Link: https://patch.msgid.link/20250221085350.198024-3-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24Merge branch 'net-mlx5e-move-ipsec-policy-check-after-decryption'Jakub Kicinski
Tariq Toukan says: ==================== net/mlx5e: Move IPSec policy check after decryption This series by Jianbo adds IPsec policy check after decryption. In current mlx5 driver, the policy check is done before decryption for IPSec crypto and packet offload. This series changes that order to make it consistent with the processing in kernel xfrm. Besides, RX state with UPSPEC selector is supported correctly after new steering table is added after decryption and before the policy check. ==================== Link: https://patch.msgid.link/20250220213959.504304-1-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Support RX xfrm state selector's UPSPEC for packet offloadJianbo Liu
Previously, the upper layer matches are added for the decryption rule when xfrm selector's UPSPEC is specified in the command. However, it's impossible as packets are not decrypted, and there is no way to do match on the upper protocol (TCP/UDP) with specific source/destination port. The result is that packets are not decrypted by hardware because of this mismatch. Instead, they are forwarded to kernel, and decryption is done by software. To resolve this issue, this patch adds new table (sa_sel) after status table and before policy table. When UPSPEC's proto is specified in xfrm state's selector, a rule is added in status table to forward the decrypted packets to sa_sel table, where the corresponding rule for selector's UPSPEC is added, and packet's upper headers are checked there. If matched, they will be forward to policy table to do policy check. Otherwise, they are dropped immediately. Besides, add a global count for this kind of packet drop. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-9-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Add pass flow group for IPSec RX status tableJianbo Liu
This flow group is added for the pass rules for both crypto offload and packet offload. It is placed at the end of the table, and right before the miss group. There are two entries, and the default pass rules for both offloads are added in this group. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-8-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Add num_reserved_entries param for ipsec_ft_create()Jianbo Liu
Add parameter for ipsec_ft_create() to pass the number of the reserved entries when creating auto-grouped flow table. It's used to create table with pre-defined group(s) which may have more than one rule. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-7-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Skip IPSec RX policy check for crypto offloadJianbo Liu
For crypto offload, there is no xfrm policy rule offloaded to hardware, so no need to continue with policy check for it. Previously, for crypto offload, the hardware metadata reg c4 is not used and not changed, but set to ASO_OK(0) before decryption to avoid garbage data. Then a default rule is added to check ipsec_syndrome and this register. Packets are forwarded to policy table if succeed, or drop if fails. According to hardware document, this register value could be 0, 1. So a special value (0xAA), which is not used by hardware, is chosen as an indication for crypto offload. It is set to c4 before decryption. Then a default rule, which matches on 0xAA (and ipsec_syndrome on 0), is added, which means packets are done by crypto offload, and sends them to kernel directly, thus skips the policy check. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-6-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Move IPSec policy check after decryptionJianbo Liu
Currently, xfrm policy check is done before decryption in mlx5 driver. If matching any policy, packets are forwarded to xfrm state table for decryption. But this is exact opposite to what software does. For kernel implementation, xfrm decode is unconditionally activated whenever an IPSec packet reaches the input flow if there’s a matching state rule. This patch changes the order, move policy check after decryption. Besides, a miss flow table is added at the end for legacy mode, to make it easier to update the default destination of the steering rules. So ESP packets are firstly forwarded to SA table for decryption, then the result is checked in status table. If the decryption succeeds, packets are forwarded to another table to check xfrm policy rules. When a policy with allow action is matched, if in legacy mode packets are forwarded to miss flow table with one rule to forward them to RoCE tables, if in switchdev mode they are forwarded directly to TC root chain instead. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-5-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Add correct match to check IPSec syndromes for switchdev modeJianbo Liu
In commit dddb49b63d86 ("net/mlx5e: Add IPsec and ASO syndromes check in HW"), IPSec and ASO syndromes checks after decryption for the specified ASO object were added. But they are correct only for eswith in legacy mode. For switchdev mode, metadata register c1 is used to save the mapped id (not ASO object id). So, need to change the match accordingly for the check rules in status table. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-4-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Change the destination of IPSec RX SA miss ruleJianbo Liu
For eswitch in legacy mode, the packets decrypted in RX SA table will continue to be processed for RoCE. But this is not necessary for the un-decrypted packets, which don't match any decryption rules but hit the miss rule at the end of the table. So, change the destination of miss rule to TTC default one and skip RoCE. For eswitch in switchdev mode, the destination is unchanged. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net/mlx5e: Add helper function to update IPSec default destinationJianbo Liu
The default destination of IPSec steering rules for MPV mode will be updated when the master device is brought up or down. Move the common code into the helper function. It’s convenient to update destinations in later patches. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Patrisious Haddad <phaddad@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20250220213959.504304-2-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net: wangxun: Replace the judgement of MAC type with flagsJiawen Wu
Since device MAC types are constantly being added, the judgments of wx->mac.type are complex. Try to convert the types to flags depending on functions. Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Link: https://patch.msgid.link/20250221065718.197544-2-jiawenwu@trustnetic.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>