linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-07-23	sched: Add enqueue/dequeue of dualpi2 qdisc	Koen De Schepper
	DualPI2 provides L4S-type low latency & loss to traffic that uses a scalable congestion controller (e.g. TCP-Prague, DCTCP) without degrading the performance of 'classic' traffic (e.g. Reno, Cubic etc.). It is to be the reference implementation of IETF RFC9332 DualQ Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332). Note that creating two independent queues cannot meet the goal of DualPI2 mentioned in RFC9332: "...to preserve fairness between ECN-capable and non-ECN-capable traffic." Further, it could even lead to starvation of Classic traffic, which is also inconsistent with the requirements in RFC9332: "...although priority MUST be bounded in order not to starve Classic traffic." DualPI2 is designed to maintain approximate per-flow fairness on L-queue and C-queue by forming a single qdisc using the coupling factor and scheduler between two queues. The qdisc provides two queues called low latency and classic. It classifies packets based on the ECN field in the IP headers. By default it directs non-ECN and ECT(0) into the classic queue and ECT(1) and CE into the low latency queue, as per the IETF spec. Each queue runs its own AQM: * The classic AQM is called PI2, which is similar to the PIE AQM but more responsive and simpler. Classic traffic requires a decent target queue (default 15ms for Internet deployment) to fully utilize the link and to avoid high drop rates. * The low latency AQM is, by default, a very shallow ECN marking threshold (1ms) similar to that used for DCTCP. The DualQ isolates the low queuing delay of the Low Latency queue from the larger delay of the 'Classic' queue. However, from a bandwidth perspective, flows in either queue will share out the link capacity as if there was just a single queue. This bandwidth pooling effect is achieved by coupling together the drop and ECN-marking probabilities of the two AQMs. The PI2 AQM has two main parameters in addition to its target delay. The integral gain factor alpha is used to slowly correct any persistent standing queue error from the target delay, while the proportional gain factor beta is used to quickly compensate for queue changes (growth or shrinkage). Either alpha and beta are given as a parameter, or they can be calculated by tc from alternative typical and maximum RTT parameters. Internally, the output of a linear Proportional Integral (PI) controller is used for both queues. This output is squared to calculate the drop or ECN-marking probability of the classic queue. This counterbalances the square-root rate equation of Reno/Cubic, which is the trick that balances flow rates across the queues. For the ECN-marking probability of the low latency queue, the output of the base AQM is multiplied by a coupling factor. This determines the balance between the flow rates in each queue. The default setting makes the flow rates roughly equal, which should be generally applicable. If DUALPI2 AQM has detected overload (due to excessive non-responsive traffic in either queue), it will switch to signaling congestion solely using drop, irrespective of the ECN field. Alternatively, it can be configured to limit the drop probability and let the queue grow and eventually overflow (like tail-drop). GSO splitting in DUALPI2 is configurable from userspace while the default behavior is to split gso. When running DUALPI2 at unshaped 10gigE with 4 download streams test, splitting gso apart results in halving the latency with no loss in throughput: Summary of tcp_4down run 'no_split_gso': avg median # data pts Ping (ms) ICMP : 0.53 0.30 ms 350 TCP download avg : 2326.86 N/A Mbits/s 350 TCP download sum : 9307.42 N/A Mbits/s 350 TCP download::1 : 2672.99 2568.73 Mbits/s 350 TCP download::2 : 2586.96 2570.51 Mbits/s 350 TCP download::3 : 1786.26 1798.82 Mbits/s 350 TCP download::4 : 2261.21 2309.49 Mbits/s 350 Summart of tcp_4down run 'split_gso': avg median # data pts Ping (ms) ICMP : 0.22 0.23 ms 350 TCP download avg : 2335.02 N/A Mbits/s 350 TCP download sum : 9340.09 N/A Mbits/s 350 TCP download::1 : 2335.30 2334.22 Mbits/s 350 TCP download::2 : 2334.72 2334.20 Mbits/s 350 TCP download::3 : 2335.28 2334.58 Mbits/s 350 TCP download::4 : 2334.79 2334.39 Mbits/s 350 A similar result is observed when running DUALPI2 at unshaped 1gigE with 1 download stream test: Summary of tcp_1down run 'no_split_gso': avg median # data pts Ping (ms) ICMP : 1.13 1.25 ms 350 TCP download : 941.41 941.46 Mbits/s 350 Summart of tcp_1down run 'split_gso': avg median # data pts Ping (ms) ICMP : 0.51 0.55 ms 350 TCP download : 941.41 941.45 Mbits/s 350 Additional details can be found in the draft: https://datatracker.ietf.org/doc/html/rfc9332 Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Co-developed-by: Olga Albisser <olga@albisser.org> Signed-off-by: Olga Albisser <olga@albisser.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Henrik Steen <henrist@henrist.net> Signed-off-by: Henrik Steen <henrist@henrist.net> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Bob Briscoe <research@bobbriscoe.net> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Acked-by: Dave Taht <dave.taht@gmail.com> Link: https://patch.msgid.link/20250722095915.24485-4-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	sched: Dump configuration and statistics of dualpi2 qdisc	Chia-Yu Chang
	The configuration and statistics dump of the DualPI2 Qdisc provides information related to both queues, such as packet numbers and queuing delays in the L-queue and C-queue, as well as general information such as probability value, WRR credits, memory usage, packet marking counters, max queue size, etc. The following patch includes enqueue/dequeue for DualPI2. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Link: https://patch.msgid.link/20250722095915.24485-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	sched: Struct definition and parsing of dualpi2 qdisc	Chia-Yu Chang
	DualPI2 is the reference implementation of IETF RFC9332 DualQ Coupled AQM (https://datatracker.ietf.org/doc/html/rfc9332) providing two queues called low latency (L-queue) and classic (C-queue). By default, it enqueues non-ECN and ECT(0) packets into the C-queue and ECT(1) and CE packets into the low latency queue (L-queue), as per IETF RFC9332 spec. This patch defines the dualpi2 Qdisc structure and parsing, and the following two patches include dumping and enqueue/dequeue for the DualPI2. Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Link: https://patch.msgid.link/20250722095915.24485-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	Merge branch 'split-netmem-from-struct-page'	Jakub Kicinski
	Byungchul Park says: ==================== Split netmem from struct page The MM subsystem is trying to reduce struct page to a single pointer. See the following link for your information: https://kernelnewbies.org/MatthewWilcox/Memdescs/Path The first step towards that is splitting struct page by its individual users, as has already been done with folio and slab. This patchset does that for page pool. Matthew Wilcox tried and stopped the same work, you can see in: https://lore.kernel.org/20230111042214.907030-1-willy@infradead.org I focused on removing the page pool members in struct page this time, not moving the allocation code of page pool from net to mm. It can be done later if needed. ==================== Link: https://patch.msgid.link/20250721021835.63939-1-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	libeth: xdp: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make xdp access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-13-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	net: ti: icssg-prueth: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make icssg-prueth access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-12-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	mlx5: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make mlx5 access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-11-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	idpf: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make idpf access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-10-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	iavf: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make iavf access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-9-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	octeontx2-pf: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make octeontx2-pf access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-8-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	net: fec: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make fec access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-7-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	mt76: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make mt76 access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-6-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	netdevsim: access ->pp through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make netdevsim access ->pp through netmem_desc instead of page. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-5-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	netmem, mlx4: access ->pp_ref_count through netmem_desc instead of page	Byungchul Park
	To eliminate the use of struct page in page pool, the page pool users should use netmem descriptor and APIs instead. Make mlx4 access ->pp_ref_count through netmem_desc instead of page. While at it, add a helper, pp_page_to_nmdesc() and __pp_page_to_nmdesc(), that can be used to get netmem_desc from page only if it's a pp page. For now that netmem_desc overlays on page, it can be achieved by just casting, and use macro and _Generic to cover const casting as well. Plus, change page_pool_page_is_pp() to check for 'const struct page ' instead of 'struct page ' since it doesn't modify data and additionally covers const type. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-4-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	netmem: use netmem_desc instead of page to access ->pp in __netmem_get_pp()	Byungchul Park
	To eliminate the use of the page pool fields in struct page, the page pool code should use netmem descriptor and APIs instead. However, __netmem_get_pp() still accesses ->pp via struct page. So change it to use struct netmem_desc instead, since ->pp no longer will be available in struct page. While at it, add a helper, __netmem_to_nmdesc(), that can be used to unsafely get pointer to netmem_desc backing the netmem_ref, only when the netmem_ref is always backed by system memory. Signed-off-by: Byungchul Park <byungchul@sk.com> Link: https://patch.msgid.link/20250721021835.63939-3-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	netmem: introduce struct netmem_desc mirroring struct page	Byungchul Park
	To simplify struct page, the page pool members of struct page should be moved to other, allowing these members to be removed from struct page. Introduce a network memory descriptor to store the members, struct netmem_desc, and make it union'ed with the existing fields in struct net_iov, allowing to organize the fields of struct net_iov. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20250721021835.63939-2-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	vxlan: remove redundant conversion of vni in vxlan_nl2conf	Wang Liang
	The IFLA_VXLAN_ID data has been converted to local variable vni in vxlan_nl2conf(), there is no need to do it again when set conf->vni. Signed-off-by: Wang Liang <wangliang74@huawei.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250722093049.1527505-1-wangliang74@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	netdevsim: add fw_update_flash_chunk_time_ms debugfs knobs	Jiri Pirko
	Netdevsim emulates firmware update and it takes 5 seconds to complete. For some use cases, this is too long and unnecessary. Allow user to configure the time by exposing debugfs a knob to set chunk time. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250722091945.79506-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23	devlink: Fix excessive stack usage in rate TC bandwidth parsing	Carolina Jubran
	The devlink_nl_rate_tc_bw_parse function uses a large stack array for devlink attributes, which triggers a warning about excessive stack usage: net/devlink/rate.c: In function 'devlink_nl_rate_tc_bw_parse': net/devlink/rate.c:382:1: error: the frame size of 1648 bytes is larger than 1536 bytes [-Werror=frame-larger-than=] Introduce a separate attribute set specifically for rate TC bandwidth parsing that only contains the two attributes actually used: index and bandwidth. This reduces the stack array from DEVLINK_ATTR_MAX entries to just 2 entries, solving the stack usage issue. Update devlink selftest to use the new 'index' and 'bw' attribute names consistent with the YAML spec. Example usage with ynl with the new spec: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"index": 0, "bw": 50}, {"index": 1, "bw": 50}, {"index": 2, "bw": 0}, {"index": 3, "bw": 0}, {"index": 4, "bw": 0}, {"index": 5, "bw": 0}, {"index": 6, "bw": 0}, {"index": 7, "bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'bw': 50, 'index': 0}, {'bw': 50, 'index': 1}, {'bw': 0, 'index': 2}, {'bw': 0, 'index': 3}, {'bw': 0, 'index': 4}, {'bw': 0, 'index': 5}, {'bw': 0, 'index': 6}, {'bw': 0, 'index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Fixes: 566e8f108fc7 ("devlink: Extend devlink rate API with traffic classes bandwidth management") Reported-by: Arnd Bergmann <arnd@arndb.de> Closes: https://lore.kernel.org/netdev/20250708160652.1810573-1-arnd@kernel.org/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507171943.W7DJcs6Y-lkp@intel.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Tested-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/1753175609-330621-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	Merge branch 'mlx5-next' of ↵	Jakub Kicinski
	git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux Tariq Toukan says: ==================== mlx5-next updates 2025-07-22 The following pull-request contains common mlx5 updates * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: net/mlx5: Expose cable_length field in PFCC register net/mlx5: Add IFC bits and enums for buf_ownership net/mlx5: Add IFC bits to support RSS for IPSec offload ==================== Link: https://patch.msgid.link/1753175048-330044-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	Merge branch 'tcp-a-couple-of-fixes'	Jakub Kicinski
	Paolo Abeni says: ==================== tcp: a couple of fixes This series includes a couple of follow-up for the recent tcp receiver changes, addressing issues outlined by the nipa CI and the mptcp self-tests. Note that despite the affected self-tests where MPTCP ones, the issues are really in the TCP code, see patch 1 for the details. v1: https://lore.kernel.org/cover.1752859383.git.pabeni@redhat.com ==================== Link: https://patch.msgid.link/cover.1753118029.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	tcp: do not increment BeyondWindow MIB for old seq	Paolo Abeni
	The mentioned MIB is currently incremented even when a packet with an old sequence number (i.e. a zero window probe) is received, which is IMHO misleading. Explicitly restrict such MIB increment at the relevant events. Fixes: 6c758062c64d ("tcp: add LINUX_MIB_BEYOND_WINDOW") Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20d147292eb4b13b6535e0ad6f56be64d9c330d3.1753118029.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	tcp: do not set a zero size receive buffer	Paolo Abeni
	The nipa CI is reporting frequent failures in the mptcp_connect self-tests. In the failing scenarios (TCP -> MPTCP) the involved sockets are actually plain TCP ones, as fallback for passive socket at 2whs time cause the MPTCP listener to actually create a TCP socket. The transfer is stuck due to the receiver buffer being zero. With the stronger check in place, tcp_clamp_window() can be invoked while the TCP socket has sk_rmem_alloc == 0, and the receive buffer will be zeroed, too. Check for the critical condition in tcp_prune_queue() and just drop the packet without shrinking the receiver buffer. Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20c18165d3f848e1c5c1b782d88c1a5ab38b3f70.1753118029.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	Merge branch 'net-mlx5-misc-changes-2025-07-21'	Jakub Kicinski
	Tariq Toukan says: ==================== net/mlx5: misc changes 2025-07-21 This series by Lama contains misc enhancements to the SHAMPO parameters. ==================== Link: https://patch.msgid.link/1753081999-326247-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	net/mlx5e: Remove duplicate mkey from SHAMPO header	Lama Kayal
	SHAMPO structure holds two variations of the mkey, which is unnecessary, a duplication that's repeated per rq. Remove duplicate mkey information and keep only one version, the one used in the fast path, rename field to reflect field type clearly. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1753081999-326247-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	net/mlx5e: SHAMPO, Remove mlx5e_shampo_get_log_hd_entry_size()	Lama Kayal
	Refactor mlx5e_shampo_get_log_hd_entry_size() as macro, for more simplicity. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1753081999-326247-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	net/mlx5e: SHAMPO, Cleanup reservation size formula	Lama Kayal
	The reservation size formula can be reduced to a simple evaluation of MLX5E_SHAMPO_WQ_RESRV_SIZE. This leaves mlx5e_shampo_get_log_rsrv_size() with one single use, which can be replaced with a macro for simplicity. Also, function mlx5e_shampo_get_log_rsrv_size() is used only throughout params.c, make it static. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1753081999-326247-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	tcp: trace retransmit failures in tcp_retransmit_skb	Fan Yu
	Background ========== When TCP retransmits a packet due to missing ACKs, the retransmission may fail for various reasons (e.g., packets stuck in driver queues, receiver zero windows, or routing issues). The original tcp_retransmit_skb tracepoint: 'commit e086101b150a ("tcp: add a tracepoint for tcp retransmission")' lacks visibility into these failure causes, making production diagnostics difficult. Solution ======== Adds the retval("err") to the tcp_retransmit_skb tracepoint. Enables users to know why some tcp retransmission failed and users can filter retransmission failures by retval. Compatibility description ========================= This patch extends the tcp_retransmit_skb tracepoint by adding a new "err" field at the end of its existing structure (within TP_STRUCT__entry). The compatibility implications are detailed as follows: 1) Structural compatibility for legacy user-space tools Legacy tools/BPF programs accessing existing fields (by offset or name) can still work without modification or recompilation.The new field is appended to the end, preserving original memory layout. 2) Note: semantic changes The original tracepoint primarily only focused on successfully retransmitted packets. With this patch, the tracepoint now can figure out packets that may terminate early due to specific reasons. For accurate statistics, users should filter using "err" to distinguish outcomes. Before patched: field:const void * skbaddr; offset:8; size:8; signed:0; field:const void * skaddr; offset:16; size:8; signed:0; field:int state; offset:24; size:4; signed:1; field:__u16 sport; offset:28; size:2; signed:0; field:__u16 dport; offset:30; size:2; signed:0; field:__u16 family; offset:32; size:2; signed:0; field:__u8 saddr[4]; offset:34; size:4; signed:0; field:__u8 daddr[4]; offset:38; size:4; signed:0; field:__u8 saddr_v6[16]; offset:42; size:16; signed:0; field:__u8 daddr_v6[16]; offset:58; size:16; signed:0; print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s" After patched: field:const void * skbaddr; offset:8; size:8; signed:0; field:const void * skaddr; offset:16; size:8; signed:0; field:int state; offset:24; size:4; signed:1; field:__u16 sport; offset:28; size:2; signed:0; field:__u16 dport; offset:30; size:2; signed:0; field:__u16 family; offset:32; size:2; signed:0; field:__u8 saddr[4]; offset:34; size:4; signed:0; field:__u8 daddr[4]; offset:38; size:4; signed:0; field:__u8 saddr_v6[16]; offset:42; size:16; signed:0; field:__u8 daddr_v6[16]; offset:58; size:16; signed:0; field:int err; offset:76; size:4; signed:1; print fmt: "skbaddr=%p skaddr=%p family=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c state=%s err=%d" Co-developed-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: xu xin <xu.xin16@zte.com.cn> Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250721111607626_BDnIJB0ywk6FghN63bor@zte.com.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	net: Kconfig: add endif/endmenu comments	Randy Dunlap
	Add comments on endif & endmenu blocks. This can save time when searching & trying to understand kconfig menu dependencies. The other endif & endmenu statements are already commented like this. This makes it similar to drivers/net/Kconfig, which is already commented like this. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250721020420.3555128-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	Merge branch 'selftests-drv-net-test-xdp-native-support'	Jakub Kicinski
	Mohsin Bashir says: ==================== selftests: drv-net: Test XDP native support This patch series add tests to validate XDP native support for PASS, DROP, ABORT, and TX actions, as well as headroom and tailroom adjustment. For adjustment tests, validate support for both the extension and shrinking cases across various packet sizes and offset values. The pass criteria for head/tail adjustment tests require that at-least one adjustment value works for at-least one packet size. This ensure that the variability in maximum supported head/tail adjustment offset across different drivers is being incorporated. The results reported in this series are based on netdevsim. However, the series is tested against multiple other drivers including fbnic. Note: The XDP support for fbnic will be added later. ==================== Link: https://patch.msgid.link/20250719083059.3209169-1-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	selftests: drv-net: Test head-adjustment support	Mohsin Bashir
	Add test to validate the headroom adjustment support for both extension and the shrinking cases. For the extension part, eat up space from the start of payload data whereas, for the shrinking part, populate the newly available space with a tag. In the user-space, validate that a test string is manipulated accordingly. The negative and positive offset values result in shrinking and growing of headroom (growing and shrinking of payload) respectively. TAP version 13 1..9 ok 1 xdp.test_xdp_native_pass_sb ok 2 xdp.test_xdp_native_pass_mb ok 3 xdp.test_xdp_native_drop_sb ok 4 xdp.test_xdp_native_drop_mb ok 5 xdp.test_xdp_native_tx_mb \# Failed run: pkt_sz 512, ... offset 1. Reason: Adjustment failed ok 6 xdp.test_xdp_native_adjst_tail_grow_data ok 7 xdp.test_xdp_native_adjst_tail_shrnk_data \# Failed run: pkt_sz 512, ... offset -128. Reason: Adjustment failed ok 8 xdp.test_xdp_native_adjst_head_grow_data \# Failed run: pkt_sz (512) > HDS threshold (0) and offset 64 > 48 ok 9 xdp.test_xdp_native_adjst_head_shrnk_data \# Totals: pass:9 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250719083059.3209169-6-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	selftests: drv-net: Test tail-adjustment support	Mohsin Bashir
	Add test to validate support for the two cases of tail adjustment: 1) tail extension, and 2) tail shrinking across different frame sizes and offset values. For each of the two cases, test both the single and multi-buffer cases by choosing appropriate packet size. The negative offset value result in growing of tailroom (shrinking of payload) while the positive offset result in shrinking of tailroom (growing of payload). Since the support for tail adjustment varies across drivers, classify the test as pass if at least one combination of packet size and offset from a pre-selected list results in a successful run. In case of an unsuccessful run, report the failure and highlight the packet size and offset values that caused the test to fail, as well as the values that resulted in the last successful run. Note: The growing part of this test for netdevsim may appear flaky when the offset value is larger than 1. This behavior occurs because tailroom is not explicitly reserved for netdevsim, with 1 being the typical tailroom value. However, in certain cases, such as payload being the last in the page with additional available space, the truesize is expanded. This also result increases the tailroom causing the test to pass intermittently. In contrast, when tailrrom is explicitly reserved, such as in the of fbnic, the test results are deterministic. ./drivers/net/xdp.py TAP version 13 1..7 ok 1 xdp.test_xdp_native_pass_sb ok 2 xdp.test_xdp_native_pass_mb ok 3 xdp.test_xdp_native_drop_sb ok 4 xdp.test_xdp_native_drop_mb ok 5 xdp.test_xdp_native_tx_mb \# Failed run: ... successful run: ... offset 1. Reason: Adjustment failed ok 6 xdp.test_xdp_native_adjst_tail_grow_data ok 7 xdp.test_xdp_native_adjst_tail_shrnk_data \# Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250719083059.3209169-5-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	selftests: drv-net: Test XDP_TX support	Mohsin Bashir
	Add test to verify the XDP_TX functionality by generating traffic from a remote node on a specific UDP port and redirecting it back to the sender. ./drivers/net/xdp.py TAP version 13 1..5 ok 1 xdp.test_xdp_native_pass_sb ok 2 xdp.test_xdp_native_pass_mb ok 3 xdp.test_xdp_native_drop_sb ok 4 xdp.test_xdp_native_drop_mb ok 5 xdp.test_xdp_native_tx_mb \# Totals: pass:5 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250719083059.3209169-4-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	selftests: drv-net: Test XDP_PASS/DROP support	Mohsin Bashir
	Test XDP_PASS/DROP in single buffer and multi buffer mode when XDP native support is available. ./drivers/net/xdp.py TAP version 13 1..4 ok 1 xdp.test_xdp_native_pass_sb ok 2 xdp.test_xdp_native_pass_mb ok 3 xdp.test_xdp_native_drop_sb ok 4 xdp.test_xdp_native_drop_mb \# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250719083059.3209169-3-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	net: netdevsim: hook in XDP handling	Jakub Kicinski
	Add basic XDP support by hooking in do_xdp_generic(). This should be enough to validate most basic XDP tests. Signed-off-by: Mohsin Bashir <mohsin.bashr@gmail.com> Link: https://patch.msgid.link/20250719083059.3209169-2-mohsin.bashr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-22	Merge branch 'octeontx2-af-rpm-misc-feaures'	Paolo Abeni
	Hariprasad Kelam says: ==================== Octeontx2-af: RPM: misc feaures This series patches adds different features like debugfs support for shared firmware structure and DMAC filter related enhancements. Patch1: Saves interface MAC address configured from DMAC filters. Patch2: Disables the stale DMAC filters in driver initialization Patch3: Configure dma mask for CGX/RPM drivers Patch4: Debugfs support for shared firmware data. ==================== Link: https://patch.msgid.link/20250720163638.1560323-1-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	Octeontx2-af: Debugfs support for firmware data	Hariprasad Kelam
	MAC address, Link modes (supported and advertised) and eeprom data for the Netdev interface are read from the shared firmware data. This patch adds debugfs support for the same. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://patch.msgid.link/20250720163638.1560323-5-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	Octeontx2-af: RPM: Update DMA mask	Hariprasad Kelam
	CGX/RPM driver supports 48 bits of DMA addressing. Update the DMA mask accordingly. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250720163638.1560323-4-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	Octeontx2-af: Disable stale DMAC filters	Subbaraya Sundeep
	During driver initialization disable stale DMAC filters in CGX/RPM set by firmware. Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250720163638.1560323-3-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	Octeontx2-af: Add programmed macaddr to RVU pfvf	Hariprasad Kelam
	Octeontx2/CN10k MAC block supports DMAC filters. DMAC filters can be installed on the interface through ethtool. When a user installs a DMAC filter, the interface's MAC address is implicitly added to the filter list. To ensure consistency, this MAC address must be kept in sync with the pfvf->mac_addr field, which is used to install MAC-based NPC rules. This patch updates the pfvf->mac_addr field with the programmed MAC address and also enables VF interfaces to install DMAC filters. Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250720163638.1560323-2-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	ibmveth: Add multi buffers rx replenishment hcall support	Mingming Cao
	This patch enables batched RX buffer replenishment in ibmveth by using the new firmware-supported h_add_logical_lan_buffers() hcall to submit up to 8 RX buffers in a single call, instead of repeatedly calling the single-buffer h_add_logical_lan_buffer() hcall. During the probe, with the patch, the driver queries ILLAN attributes to detect IBMVETH_ILLAN_RX_MULTI_BUFF_SUPPORT bit. If the attribute is present, rx_buffers_per_hcall is set to 8, enabling batched replenishment. Otherwise, it defaults to 1, preserving the original upstream behavior with no change in code flow for unsupported systems. The core rx replenish logic remains the same. But when batching is enabled, the driver aggregates up to 8 fully prepared descriptors into a single h_add_logical_lan_buffers() hypercall. If any allocation or DMA mapping fails while preparing a batch, only the successfully prepared buffers are submitted, and the remaining are deferred for the next replenish cycle. If at runtime the firmware stops accepting the batched hcall—e,g, after a Live Partition Migration (LPM) to a host that does not support h_add_logical_lan_buffers(), the hypercall returns H_FUNCTION. In that case, the driver transparently disables batching, resets rx_buffers_per_hcall to 1, and falls back to the single-buffer hcall in next future replenishments to take care of these and future buffers. Test were done on systems with firmware that both supports and does not support the new h_add_logical_lan_buffers hcall. On supported firmware, this reduces hypercall overhead significantly over multiple buffers. SAR measurements showed about a 15% improvement in packet processing rate under moderate RX load, with heavier traffic seeing gains more than 30% Signed-off-by: Mingming Cao <mmc@linux.ibm.com> Reviewed-by: Brian King <bjking1@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250719091356.57252-1-mmc@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	ip6_gre: Factor out common ip6gre tunnel match into helper	Yue Haibing
	Extract common ip6gre tunnel match from ip6gre_tunnel_lookup() into new helper function ip6gre_tunnel_match() to reduce code duplication. No functional change intended. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250719081551.963670-1-yuehaibing@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	Merge branch 'gve-af_xdp-zero-copy-for-dqo-rda'	Paolo Abeni
	Joshua Washington says: ==================== gve: AF_XDP zero-copy for DQO RDA This patch series adds support for AF_XDP zero-copy in the DQO RDA queue format. XSK infrastructure is updated to re-post buffers when adding XSK pools because XSK umem will be posted directly to the NIC, a departure from the bounce buffer model used in GQI QPL. A registry of XSK pools is introduced to prevent the usage of XSK pools when in copy mode. v1: https://lore.kernel.org/netdev/20250714160451.124671-1-jeroendb@google.com/ ==================== Link: https://patch.msgid.link/20250717152839.973004-1-jeroendb@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	gve: implement DQO RX datapath and control path for AF_XDP zero-copy	Joshua Washington
	Add the RX datapath for AF_XDP zero-copy for DQ RDA. The RX path is quite similar to that of the normal XDP case. Parallel methods are introduced to properly handle XSKs instead of normal driver buffers. To properly support posting from XSKs, queues are destroyed and recreated, as the driver was initially making use of page pool buffers instead of the XSK pool memory. Expose support for AF_XDP zero-copy, as the TX and RX datapaths both exist. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Link: https://patch.msgid.link/20250717152839.973004-6-jeroendb@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	gve: implement DQO TX datapath for AF_XDP zero-copy	Joshua Washington
	In the descriptor clean path, a number of changes need to be made to accommodate out of order completions and double completions. The XSK stack can only handle completions being processed in order, as a single counter is incremented in xsk_tx_completed to sigify how many XSK descriptors have been completed. Because completions can come back out of order in DQ, a separate queue of XSK descriptors must be maintained. This queue keeps the pending packets in the order that they were written so that the descriptors can be counted in xsk_tx_completed in the same order. For double completions, a new pending packet state and type are introduced. The new type, GVE_TX_PENDING_PACKET_DQO_XSK, plays an anlogous role to pre-existing _SKB and _XDP_FRAME pending packet types for XSK descriptors. The new state, GVE_PACKET_STATE_XSK_COMPLETE, represents packets for which no more completions are expected. This includes packets which have received a packet completion or reinjection completion, as well as packets whose reinjection completion timer have timed out. At this point, such packets can be counted as part of xsk_tx_completed() and freed. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Link: https://patch.msgid.link/20250717152839.973004-5-jeroendb@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	gve: keep registry of zc xsk pools in netdev_priv	Joshua Washington
	Relying on xsk_get_pool_from_qid for getting whether zero copy is enabled on a queue is erroneous, as an XSK pool is registered in xp_assign_dev whether AF_XDP zero-copy is enabled or not. This becomes problematic when queues are restarted in copy mode, as all RX queues with XSKs will register a pool, causing the driver to exercise the zero-copy codepath. This patch adds a bitmap to keep track of which queues have zero-copy enabled. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Link: https://patch.msgid.link/20250717152839.973004-4-jeroendb@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	gve: merge xdp and xsk registration	Joshua Washington
	The existence of both of these xdp_rxq and xsk_rxq is redundant. xdp_rxq can be used in both the zero-copy mode and the copy mode case. XSK pool memory model registration is prioritized over normal memory model registration to ensure that memory model registration happens only once per queue. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Link: https://patch.msgid.link/20250717152839.973004-3-jeroendb@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-22	gve: deduplicate xdp info and xsk pool registration logic	Joshua Washington
	The XDP registration path currently has a lot of reused logic, leading changes to the codepaths to be unnecessarily complex. gve_reg_xsk_pool extracts the logic of registering an XSK pool with a queue into a method that can be used by both XDP_SETUP_XSK_POOL and gve_reg_xdp_info. gve_unreg_xdp_info is used to undo XDP info registration in the error path instead of explicitly unregistering the XDP info, as it is more complete and idempotent. This patch will be followed by other changes to the XDP registration logic, and will simplify those changes due to the use of common methods. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Joshua Washington <joshwash@google.com> Signed-off-by: Jeroen de Borst <jeroendb@google.com> Link: https://patch.msgid.link/20250717152839.973004-2-jeroendb@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-21	Merge branch 'ethtool-rss-support-creating-and-removing-contexts-via-netlink'	Jakub Kicinski
	Jakub Kicinski says: ==================== ethtool: rss: support creating and removing contexts via Netlink This series completes support of RSS configuration via Netlink. All functionality supported by the IOCTL is now supported by Netlink. Future series (time allowing) will add: - hashing on the flow label, which started this whole thing; - pinning the RSS context to a Netlink socket for auto-cleanup. The first patch is a leftover held back from previous series to avoid conflicting with Gal's fix. Next 4 patches refactor existing code to make reusing it for context creation possible. 2 patches after that add create and delete commands. Last but not least the test is extended. ==================== Link: https://patch.msgid.link/20250717234343.2328602-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-21	selftests: drv-net: rss_api: context create and delete tests	Jakub Kicinski
	Add test cases for creating and deleting contexts. TAP version 13 1..12 ok 1 rss_api.test_rxfh_nl_set_fail ok 2 rss_api.test_rxfh_nl_set_indir ok 3 rss_api.test_rxfh_nl_set_indir_ctx ok 4 rss_api.test_rxfh_indir_ntf ok 5 rss_api.test_rxfh_indir_ctx_ntf ok 6 rss_api.test_rxfh_nl_set_key ok 7 rss_api.test_rxfh_fields ok 8 rss_api.test_rxfh_fields_set ok 9 rss_api.test_rxfh_fields_set_xfrm # SKIP no input-xfrm supported ok 10 rss_api.test_rxfh_fields_ntf ok 11 rss_api.test_rss_ctx_add ok 12 rss_api.test_rss_ctx_ntf # Totals: pass:11 fail:0 xfail:0 xpass:0 skip:1 error:0 Link: https://patch.msgid.link/20250717234343.2328602-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>