summaryrefslogtreecommitdiff
path: root/Documentation/networking
AgeCommit message (Collapse)Author
21 hoursMerge tag 'net-next-6.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core & protocols: - Improve drop account scalability on NUMA hosts for RAW and UDP sockets and the backlog, almost doubling the Pps capacity under DoS - Optimize the UDP RX performance under stress, reducing contention, revisiting the binary layout of the involved data structs and implementing NUMA-aware locking. This improves UDP RX performance by an additional 50%, even more under extreme conditions - Add support for PSP encryption of TCP connections; this mechanism has some similarities with IPsec and TLS, but offers superior HW offloads capabilities - Ongoing work to support Accurate ECN for TCP. AccECN allows more than one congestion notification signal per RTT and is a building block for Low Latency, Low Loss, and Scalable Throughput (L4S) - Reorganize the TCP socket binary layout for data locality, reducing the number of touched cachelines in the fastpath - Refactor skb deferral free to better scale on large multi-NUMA hosts, this improves TCP and UDP RX performances significantly on such HW - Increase the default socket memory buffer limits from 256K to 4M to better fit modern link speeds - Improve handling of setups with a large number of nexthop, making dump operating scaling linearly and avoiding unneeded synchronize_rcu() on delete - Improve bridge handling of VLAN FDB, storing a single entry per bridge instead of one entry per port; this makes the dump order of magnitude faster on large switches - Restore IP ID correctly for encapsulated packets at GSO segmentation time, allowing GRO to merge packets in more scenarios - Improve netfilter matching performance on large sets - Improve MPTCP receive path performance by leveraging recently introduced core infrastructure (skb deferral free) and adopting recent TCP autotuning changes - Allow bridges to redirect to a backup port when the bridge port is administratively down - Introduce MPTCP 'laminar' endpoint that con be used only once per connection and simplify common MPTCP setups - Add RCU safety to dst->dev, closing a lot of possible races - A significant crypto library API for SCTP, MPTCP and IPv6 SR, reducing code duplication - Supports pulling data from an skb frag into the linear area of an XDP buffer Things we sprinkled into general kernel code: - Generate netlink documentation from YAML using an integrated YAML parser Driver API: - Support using IPv6 Flow Label in Rx hash computation and RSS queue selection - Introduce API for fetching the DMA device for a given queue, allowing TCP zerocopy RX on more H/W setups - Make XDP helpers compatible with unreadable memory, allowing more easily building DevMem-enabled drivers with a unified XDP/skbs datapath - Add a new dedicated ethtool callback enabling drivers to provide the number of RX rings directly, improving efficiency and clarity in RX ring queries and RSS configuration - Introduce a burst period for the health reporter, allowing better handling of multiple errors due to the same root cause - Support for DPLL phase offset exponential moving average, controlling the average smoothing factor Device drivers: - Add a new Huawei driver for 3rd gen NIC (hinic3) - Add a new SpacemiT driver for K1 ethernet MAC - Add a generic abstraction for shared memory communication devices (dibps) - Ethernet high-speed NICs: - nVidia/Mellanox: - Use multiple per-queue doorbell, to avoid MMIO contention issues - support adjacent functions, allowing them to delegate their SR-IOV VFs to sibling PFs - support RSS for IPSec offload - support exposing raw cycle counters in PTP and mlx5 - support for disabling host PFs. - Intel (100G, ice, idpf): - ice: support for SRIOV VFs over an Active-Active link aggregate - ice: support for firmware logging via debugfs - ice: support for Earliest TxTime First (ETF) hardware offload - idpf: support basic XDP functionalities and XSk - Broadcom (bnxt): - support Hyper-V VF ID - dynamic SRIOV resource allocations for RoCE - Meta (fbnic): - support queue API, zero-copy Rx and Tx - support basic XDP functionalities - devlink health support for FW crashes and OTP mem corruptions - expand hardware stats coverage to FEC, PHY, and Pause - Wangxun: - support ethtool coalesce options - support for multiple RSS contexts - Ethernet virtual: - Macsec: - replace custom netlink attribute checks with policy-level checks - Bonding: - support aggregator selection based on port priority - Microsoft vNIC: - use page pool fragments for RX buffers instead of full pages to improve memory efficiency - Ethernet NICs consumer, and embedded: - Qualcomm: support Ethernet function for IPQ9574 SoC - Airoha: implement wlan offloading via NPU - Freescale - enetc: add NETC timer PTP driver and add PTP support - fec: enable the Jumbo frame support for i.MX8QM - Renesas (R-Car S4): - support HW offloading for layer 2 switching - support for RZ/{T2H, N2H} SoCs - Cadence (macb): support TAPRIO traffic scheduling - TI: - support for Gigabit ICSS ethernet SoC (icssm-prueth) - Synopsys (stmmac): a lot of cleanups - Ethernet PHYs: - Support 10g-qxgmi phy-mode for AQR412C, Felix DSA and Lynx PCS driver - Support bcm63268 GPHY power control - Support for Micrel lan8842 PHY and PTP - Support for Aquantia AQR412 and AQR115 - CAN: - a large CAN-XL preparation work - reorganize raw_sock and uniqframe struct to minimize memory usage - rcar_canfd: update the CAN-FD handling - WiFi: - extended Neighbor Awareness Networking (NAN) support - S1G channel representation cleanup - improve S1G support - WiFi drivers: - Intel (iwlwifi): - major refactor and cleanup - Broadcom (brcm80211): - support for AP isolation - RealTek (rtw88/89) rtw88/89: - preparation work for RTL8922DE support - MediaTek (mt76): - HW restart improvements - MLO support - Qualcomm/Atheros (ath10k): - GTK rekey fixes - Bluetooth drivers: - btusb: support for several new IDs for MT7925 - btintel: support for BlazarIW core - btintel_pcie: support for _suspend() / _resume() - btintel_pcie: support for Scorpious, Panther Lake-H484 IDs" * tag 'net-next-6.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1536 commits) net: stmmac: Add support for Allwinner A523 GMAC200 dt-bindings: net: sun8i-emac: Add A523 GMAC200 compatible Revert "Documentation: net: add flow control guide and document ethtool API" octeontx2-pf: fix bitmap leak octeontx2-vf: fix bitmap leak net/mlx5e: Use extack in set rxfh callback net/mlx5e: Introduce mlx5e_rss_params for RSS configuration net/mlx5e: Introduce mlx5e_rss_init_params net/mlx5e: Remove unused mdev param from RSS indir init net/mlx5: Improve QoS error messages with actual depth values net/mlx5e: Prevent entering switchdev mode with inconsistent netns net/mlx5: HWS, Generalize complex matchers net/mlx5: Improve write-combining test reliability for ARM64 Grace CPUs selftests/net: add tcp_port_share to .gitignore Revert "net/mlx5e: Update and set Xon/Xoff upon MTU set" net: add NUMA awareness to skb_attempt_defer_free() net: use llist for sd->defer_list net: make softnet_data.defer_count an atomic selftests: drv-net: psp: add tests for destroying devices selftests: drv-net: psp: add test for auto-adjusting TCP MSS ...
26 hoursMerge tag 'for-6.18/io_uring-20250929' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Store ring provided buffers locally for the users, rather than stuff them into struct io_kiocb. These types of buffers must always be fully consumed or recycled in the current context, and leaving them in struct io_kiocb is hence not a good ideas as that struct has a vastly different life time. Basically just an architecture cleanup that can help prevent issues with ring provided buffers in the future. - Support for mixed CQE sizes in the same ring. Before this change, a CQ ring either used the default 16b CQEs, or it was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where a few 32b CQEs were needed, this caused everything else to use big CQEs. This is wasteful both in terms of memory usage, but also memory bandwidth for the posted CQEs. With IORING_SETUP_CQE_MIXED, applications may use request types that post both normal 16b and big 32b CQEs on the same ring. - Add helpers for async data management, to make it harder for opcode handlers to mess it up. - Add support for multishot for uring_cmd, which ublk can use. This helps improve efficiency, by providing a persistent request type that can trigger multiple CQEs. - Add initial support for ring feature querying. We had basic support for probe operations, but the API isn't great. Rather than expand that, add support for QUERY which is easily expandable and can cover a lot more cases than the existing probe support. This will help applications get a better idea of what operations are supported on a given host. - zcrx improvements from Pavel: - Improve refill entry alignment for better caching - Various cleanups, especially around deduplicating normal memory vs dmabuf setup. - Generalisation of the niov size (Patch 12). It's still hard coded to PAGE_SIZE on init, but will let the user to specify the rx buffer length on setup. - Syscall / synchronous bufer return. It'll be used as a slow fallback path for returning buffers when the refill queue is full. Useful for tolerating slight queue size misconfiguration or with inconsistent load. - Accounting more memory to cgroups. - Additional independent cleanups that will also be useful for mutli-area support. - Various fixes and cleanups * tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring/cmd: drop unused res2 param from io_uring_cmd_done() io_uring: fix nvme's 32b cqes on mixed cq io_uring/query: cap number of queries io_uring/query: prevent infinite loops io_uring/zcrx: account niov arrays to cgroup io_uring/zcrx: allow synchronous buffer return io_uring/zcrx: introduce io_parse_rqe() io_uring/zcrx: don't adjust free cache space io_uring/zcrx: use guards for the refill lock io_uring/zcrx: reduce netmem scope in refill io_uring/zcrx: protect netdev with pp_lock io_uring/zcrx: rename dma lock io_uring/zcrx: make niov size variable io_uring/zcrx: set sgt for umem area io_uring/zcrx: remove dmabuf_offset io_uring/zcrx: deduplicate area mapping io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback() io_uring/zcrx: check all niovs filled with dma addresses io_uring/zcrx: move area reg checks into io_import_area io_uring/zcrx: don't pass slot to io_zcrx_create_area ...
2 daysRevert "Documentation: net: add flow control guide and document ethtool API"Paolo Abeni
This reverts commit 7bd80ed89d72285515db673803b021469ba71ee8. I should not have merged it to begin with due to pending review and changes to be addressed. Link: https://patch.msgid.link/c6f3af12df9b7998920a02027fc8893ce82afc4c.1759239721.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
3 daysDocumentation: net: add flow control guide and document ethtool APIOleksij Rempel
Introduce a new document, flow_control.rst, to provide a comprehensive guide on Ethernet Flow Control in Linux. The guide explains how flow control works, how autonegotiation resolves pause capabilities, and how to configure it using ethtool and Netlink. In parallel, document the pause and pause-stat attributes in the ethtool.yaml netlink spec. This enables the ynl tool to generate kernel-doc comments for the corresponding enums in the UAPI header, making the C interface self-documenting. Finally, replace the legacy flow control section in phy.rst with a reference to the new document and add pointers in the relevant C source files. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250924120241.724850-1-o.rempel@pengutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
4 daysdocs: networking: phy: clarify abbreviation "PAL"Markus Heidelberg
It is suddenly used in the text without introduction, so the meaning might have been unclear to readers. Signed-off-by: Markus Heidelberg <m.heidelberg@cab.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250926131520.222346-1-m.heidelberg@cab.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 daysnet/mlx5: Expose uar access and odp page fault countersAkiva Goldberger
Add three counters to vnic health reporter: bar_uar_access, odp_local_triggered_page_fault, and odp_remote_triggered_page_fault. - bar_uar_access number of WRITE or READ access operations to the UAR on the PCIe BAR. - odp_local_triggered_page_fault number of locally-triggered page-faults due to ODP. - odp_remote_triggered_page_fault number of remotly-triggered page-faults due to ODP. Example access: $ devlink health diagnose pci/0000:08:00.0 reporter vnic vNIC env counters: total_error_queues: 0 send_queue_priority_update_flow: 0 comp_eq_overrun: 0 async_eq_overrun: 0 cq_overrun: 0 invalid_command: 0 quota_exceeded_command: 0 nic_receive_steering_discard: 0 icm_consumption: 1032 bar_uar_access: 1279 odp_local_triggered_page_fault: 20 odp_remote_triggered_page_fault: 34 Signed-off-by: Akiva Goldberger <agoldberger@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1758797130-829564-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysethtool: add FEC bins histogram reportVadim Fedorenko
IEEE 802.3ck-2022 defines counters for FEC bins and 802.3df-2024 clarifies it a bit further. Implement reporting interface through as addition to FEC stats available in ethtool. Drivers can leave bin counter uninitialized if per-lane values are provided. In this case the core will recalculate summ for the bin. Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Link: https://patch.msgid.link/20250924124037.1508846-2-vadim.fedorenko@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet: dns_resolver: Fix request-key cross-referenceBagas Sanjaya
Link to "Key Request Service" docs uses file:// scheme instead due to angled brackets markup. Fix it to proper cross-reference. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20250924020626.17073-4-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet: dns_resolver: Move dns_query() explanation out of code blockBagas Sanjaya
Documentation for dns_query() is placed in the function's literal code block snippet instead. Move it out of there. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250924020626.17073-3-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet: dns_resolver: Use reST bullet list for features listBagas Sanjaya
Features overview list uses an asterisk in parentheses (``(*)``) as bullet list marker, which isn't supported by Sphinx as proper bullet. Replace it with just asterisk. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://patch.msgid.link/20250924020626.17073-2-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 daysnet: gso: restore ids of outer ip headers correctlyRichard Gobert
Currently, NETIF_F_TSO_MANGLEID indicates that the inner-most ID can be mangled. Outer IDs can always be mangled. Make GSO preserve outer IDs by default, with NETIF_F_TSO_MANGLEID allowing both inner and outer IDs to be mangled. This commit also modifies a few drivers that use SKB_GSO_FIXEDID directly. Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250923085908.4687-4-richardbgobert@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
9 daysDocumentation: rxrpc: Demote three sectionsBagas Sanjaya
Three sections ("Socket Options", "Security", and "Example Client Usage") use title headings, which increase number of entries in the networking docs toctree by three, and also make the rest of sections headed under "Example Client Usage". Demote these sections back to section headings. Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: David Howells <dhowells@redhat.com> Link: https://patch.msgid.link/20250922124137.5266-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daystcp: move tcp_clean_acked to tcp_sock_read_tx groupEric Dumazet
tp->tcp_clean_acked is fetched in tx path when snd_una is updated. This field thus belongs to tcp_sock_read_tx group. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250919204856.2977245-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daystcp: move recvmsg_inq to tcp_sock_read_txrxEric Dumazet
Fill a hole in tcp_sock_read_txrx, instead of possibly wasting a cache line. Note that tcp_recvmsg_locked() is also reading tp->repair, so this removes one cache line miss in tcp recvmsg(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250919204856.2977245-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daystcp: move tcp->rcv_tstamp to tcp_sock_write_txrx groupEric Dumazet
tcp_ack() writes this field, it belongs to tcp_sock_write_txrx. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250919204856.2977245-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-18Merge branch 'add-basic-psp-encryption-for-tcp-connections'Paolo Abeni
Daniel Zahka says: ================== add basic PSP encryption for TCP connections This is v13 of the PSP RFC [1] posted by Jakub Kicinski one year ago. General developments since v1 include a fork of packetdrill [2] with support for PSP added, as well as some test cases, and an implementation of PSP key exchange and connection upgrade [3] integrated into the fbthrift RPC library. Both [2] and [3] have been tested on server platforms with PSP-capable CX7 NICs. Below is the cover letter from the original RFC: Add support for PSP encryption of TCP connections. PSP is a protocol out of Google: https://github.com/google/psp/blob/main/doc/PSP_Arch_Spec.pdf which shares some similarities with IPsec. I added some more info in the first patch so I'll keep it short here. The protocol can work in multiple modes including tunneling. But I'm mostly interested in using it as TLS replacement because of its superior offload characteristics. So this patch does three things: - it adds "core" PSP code PSP is offload-centric, and requires some additional care and feeding, so first chunk of the code exposes device info. This part can be reused by PSP implementations in xfrm, tunneling etc. - TCP integration TLS style Reuse some of the existing concepts from TLS offload, such as attaching crypto state to a socket, marking skbs as "decrypted", egress validation. PSP does not prescribe key exchange protocols. To use PSP as a more efficient TLS offload we intend to perform a TLS handshake ("inline" in the same TCP connection) and negotiate switching to PSP based on capabilities of both endpoints. This is also why I'm not including a software implementation. Nobody would use it in production, software TLS is faster, it has larger crypto records. - mlx5 implementation That's mostly other people's work, not 100% sure those folks consider it ready hence the RFC in the title. But it works :) Not posted, queued a branch [4] are follow up pieces: - standard stats - netdevsim implementation and tests [1] https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/ [2] https://github.com/danieldzahka/packetdrill [3] https://github.com/danieldzahka/fbthrift/tree/dzahka/psp [4] https://github.com/kuba-moo/linux/tree/psp Comments we intend to defer to future series: - we prefer to keep the version field in the tx-assoc netlink request, because it makes parsing keys require less state early on, but we are willing to change in the next version of this series. - using a static branch to wrap psp_enqueue_set_decrypted() and other functions called from tcp. - using INDIRECT_CALL for tls/psp in sk_validate_xmit_skb(). We prefer to address this in a dedicated patch series, so that this series does not need to modify the way tls_validate_xmit_skb() is declared and stubbed out. v12: https://lore.kernel.org/netdev/20250916000559.1320151-1-kuba@kernel.org/ v11: https://lore.kernel.org/20250911014735.118695-1-daniel.zahka@gmail.com v10: https://lore.kernel.org/netdev/20250828162953.2707727-1-daniel.zahka@gmail.com/ v9: https://lore.kernel.org/netdev/20250827155340.2738246-1-daniel.zahka@gmail.com/ v8: https://lore.kernel.org/netdev/20250825200112.1750547-1-daniel.zahka@gmail.com/ v7: https://lore.kernel.org/netdev/20250820113120.992829-1-daniel.zahka@gmail.com/ v6: https://lore.kernel.org/netdev/20250812003009.2455540-1-daniel.zahka@gmail.com/ v5: https://lore.kernel.org/netdev/20250723203454.519540-1-daniel.zahka@gmail.com/ v4: https://lore.kernel.org/netdev/20250716144551.3646755-1-daniel.zahka@gmail.com/ v3: https://lore.kernel.org/netdev/20250702171326.3265825-1-daniel.zahka@gmail.com/ v2: https://lore.kernel.org/netdev/20250625135210.2975231-1-daniel.zahka@gmail.com/ v1: https://lore.kernel.org/netdev/20240510030435.120935-1-kuba@kernel.org/ ================== Links: https://patch.msgid.link/20250917000954.859376-1-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> --- * add-basic-psp-encryption-for-tcp-connections: net/mlx5e: Implement PSP key_rotate operation net/mlx5e: Add Rx data path offload psp: provide decapsulation and receive helper for drivers net/mlx5e: Configure PSP Rx flow steering rules net/mlx5e: Add PSP steering in local NIC RX net/mlx5e: Implement PSP Tx data path psp: provide encapsulation helper for drivers net/mlx5e: Implement PSP operations .assoc_add and .assoc_del net/mlx5e: Support PSP offload functionality psp: track generations of device key net: psp: update the TCP MSS to reflect PSP packet overhead net: psp: add socket security association code net: tcp: allow tcp_timewait_sock to validate skbs before handing to device net: move sk_validate_xmit_skb() to net/core/dev.c psp: add op for rotation of device key tcp: add datapath logic for PSP with inline key exchange net: modify core data structures for PSP datapath support psp: base PSP device support psp: add documentation
2025-09-18psp: add documentationJakub Kicinski
Add documentation of things which belong in the docs rather than commit messages. Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250917000954.859376-2-daniel.zahka@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18eth: fbnic: add OTP health reporterJakub Kicinski
OTP memory ("fuses") are used for secure boot and anti-rollback protection. The OTP memory is ECC protected. Check for its health periodically to notice when the chip is starting to go bad. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250916231420.1693955-10-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18eth: fbnic: report FW uptime in health diagnoseJakub Kicinski
FW crashes are detected based on uptime going back, expose the uptime via devlink health diagnose. $ devlink -j health diagnose pci/0000:01:00.0 reporter fw {"last_heartbeat":{"fw_uptime":{"sec":201,"msec":76}}} $ devlink -j health diagnose pci/0000:01:00.0 reporter fw last_heartbeat: fw_uptime: sec: 201 msec: 76 Reviewed-by: Lee Trager <lee@trager.us> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250916231420.1693955-9-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18eth: fbnic: add FW health reporterJakub Kicinski
Add a health reporter to catch FW crashes. Dumping the reporter if FW has not crashed will create a snapshot of FW memory. Reviewed-by: Lee Trager <lee@trager.us> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250916231420.1693955-8-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18tcp: accecn: AccECN option send controlChia-Yu Chang
Instead of sending the option in every ACK, limit sending to those ACKs where the option is necessary: - Handshake - "Change-triggered ACK" + the ACK following it. The 2nd ACK is necessary to unambiguously indicate which of the ECN byte counters in increasing. The first ACK has two counters increasing due to the ecnfield edge. - ACKs with CE to allow CEP delta validations to take advantage of the option. - Force option to be sent every at least once per 2^22 bytes. The check is done using the bit edges of the byte counters (avoids need for extra variables). - AccECN option beacon to send a few times per RTT even if nothing in the ECN state requires that. The default is 3 times per RTT, and its period can be set via sysctl_tcp_ecn_option_beacon. Below are the pahole outcomes before and after this patch, in which the group size of tcp_sock_write_tx is increased from 89 to 97 due to the new u64 accecn_opt_tstamp member: [BEFORE THIS PATCH] struct tcp_sock { [...] u64 tcp_wstamp_ns; /* 2488 8 */ struct list_head tsorted_sent_queue; /* 2496 16 */ [...] __cacheline_group_end__tcp_sock_write_tx[0]; /* 2521 0 */ __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */ u8 nonagle:4; /* 2521: 0 1 */ u8 rate_app_limited:1; /* 2521: 4 1 */ /* XXX 3 bits hole, try to pack */ /* Force alignment to the next boundary: */ u8 :0; u8 received_ce_pending:4;/* 2522: 0 1 */ u8 unused2:4; /* 2522: 4 1 */ u8 accecn_minlen:2; /* 2523: 0 1 */ u8 est_ecnfield:2; /* 2523: 2 1 */ u8 unused3:4; /* 2523: 4 1 */ [...] __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */ [...] /* size: 3200, cachelines: 50, members: 171 */ } [AFTER THIS PATCH] struct tcp_sock { [...] u64 tcp_wstamp_ns; /* 2488 8 */ u64 accecn_opt_tstamp; /* 2596 8 */ struct list_head tsorted_sent_queue; /* 2504 16 */ [...] __cacheline_group_end__tcp_sock_write_tx[0]; /* 2529 0 */ __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2529 0 */ u8 nonagle:4; /* 2529: 0 1 */ u8 rate_app_limited:1; /* 2529: 4 1 */ /* XXX 3 bits hole, try to pack */ /* Force alignment to the next boundary: */ u8 :0; u8 received_ce_pending:4;/* 2530: 0 1 */ u8 unused2:4; /* 2530: 4 1 */ u8 accecn_minlen:2; /* 2531: 0 1 */ u8 est_ecnfield:2; /* 2531: 2 1 */ u8 accecn_opt_demand:2; /* 2531: 4 1 */ u8 prev_ecnfield:2; /* 2531: 6 1 */ [...] __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2636 0 */ [...] /* size: 3200, cachelines: 50, members: 173 */ } Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Co-developed-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Ilpo Järvinen <ij@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-8-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18tcp: accecn: AccECN optionIlpo Järvinen
The Accurate ECN allows echoing back the sum of bytes for each IP ECN field value in the received packets using AccECN option. This change implements AccECN option tx & rx side processing without option send control related features that are added by a later change. Based on specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt (Some features of the spec will be added in the later changes rather than in this one). A full-length AccECN option is always attempted but if it does not fit, the minimum length is selected based on the counters that have changed since the last update. The AccECN option (with 24-bit fields) often ends in odd sizes so the option write code tries to take advantage of some nop used to pad the other TCP options. The delivered_ecn_bytes pairs with received_ecn_bytes similar to how delivered_ce pairs with received_ce. In contrast to ACE field, however, the option is not always available to update delivered_ecn_bytes. For ACK w/o AccECN option, the delivered bytes calculated based on the cumulative ACK+SACK information are assigned to one of the counters using an estimation heuristic to select the most likely ECN byte counter. Any estimation error is corrected when the next AccECN option arrives. It may occur that the heuristic gets too confused when there are enough different byte counter deltas between ACKs with the AccECN option in which case the heuristic just gives up on updating the counters for a while. tcp_ecn_option sysctl can be used to select option sending mode for AccECN: TCP_ECN_OPTION_DISABLED, TCP_ECN_OPTION_MINIMUM, and TCP_ECN_OPTION_FULL. This patch increases the size of tcp_info struct, as there is no existing holes for new u32 variables. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_info { [...] __u32 tcpi_total_rto_time; /* 244 4 */ /* size: 248, cachelines: 4, members: 61 */ } [AFTER THIS PATCH] struct tcp_info { [...] __u32 tcpi_total_rto_time; /* 244 4 */ __u32 tcpi_received_ce; /* 248 4 */ __u32 tcpi_delivered_e1_bytes; /* 252 4 */ __u32 tcpi_delivered_e0_bytes; /* 256 4 */ __u32 tcpi_delivered_ce_bytes; /* 260 4 */ __u32 tcpi_received_e1_bytes; /* 264 4 */ __u32 tcpi_received_e0_bytes; /* 268 4 */ __u32 tcpi_received_ce_bytes; /* 272 4 */ /* size: 280, cachelines: 5, members: 68 */ } This patch uses the existing 1-byte holes in the tcp_sock_write_txrx group for new u8 members, but adds a 4-byte hole in tcp_sock_write_rx group after the new u32 delivered_ecn_bytes[3] member. Therefore, the group size of tcp_sock_write_rx is increased from 96 to 112. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_sock { [...] u8 received_ce_pending:4; /* 2522: 0 1 */ u8 unused2:4; /* 2522: 4 1 */ /* XXX 1 byte hole, try to pack */ [...] u32 rcv_rtt_last_tsecr; /* 2668 4 */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2728 0 */ [...] /* size: 3200, cachelines: 50, members: 167 */ } [AFTER THIS PATCH] struct tcp_sock { [...] u8 received_ce_pending:4;/* 2522: 0 1 */ u8 unused2:4; /* 2522: 4 1 */ u8 accecn_minlen:2; /* 2523: 0 1 */ u8 est_ecnfield:2; /* 2523: 2 1 */ u8 unused3:4; /* 2523: 4 1 */ [...] u32 rcv_rtt_last_tsecr; /* 2668 4 */ u32 delivered_ecn_bytes[3];/* 2672 12 */ /* XXX 4 bytes hole, try to pack */ [...] __cacheline_group_end__tcp_sock_write_rx[0]; /* 2744 0 */ [...] /* size: 3200, cachelines: 50, members: 171 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-7-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18tcp: accecn: add AccECN rx byte countersIlpo Järvinen
These three byte counters track IP ECN field payload byte sums for all arriving (acceptable) packets for ECT0, ECT1, and CE. The AccECN option (added by a later patch in the series) echoes these counters back to sender side; therefore, it is placed within the group of tcp_sock_write_txrx. Below are the pahole outcomes before and after this patch, in which the group size of tcp_sock_write_txrx is increased from 95 + 4 to 107 + 4 and an extra 4-byte hole is created but will be exploited in later patches: [BEFORE THIS PATCH] struct tcp_sock { [...] u32 delivered_ce; /* 2576 4 */ u32 received_ce; /* 2580 4 */ u32 app_limited; /* 2584 4 */ u32 rcv_wnd; /* 2588 4 */ struct tcp_options_received rx_opt; /* 2592 24 */ __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */ [...] /* size: 3200, cachelines: 50, members: 166 */ } [AFTER THIS PATCH] struct tcp_sock { [...] u32 delivered_ce; /* 2576 4 */ u32 received_ce; /* 2580 4 */ u32 received_ecn_bytes[3];/* 2584 12 */ u32 app_limited; /* 2596 4 */ u32 rcv_wnd; /* 2600 4 */ struct tcp_options_received rx_opt; /* 2604 24 */ __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2628 0 */ /* XXX 4 bytes hole, try to pack */ [...] /* size: 3200, cachelines: 50, members: 167 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Signed-off-by: Neal Cardwell <ncardwell@google.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-4-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18tcp: accecn: AccECN negotiationIlpo Järvinen
Accurate ECN negotiation parts based on the specification: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN is negotiated using ECE, CWR and AE flags in the TCP header. TCP falls back into using RFC3168 ECN if one of the ends supports only RFC3168-style ECN. The AccECN negotiation includes reflecting IP ECN field value seen in SYN and SYNACK back using the same bits as negotiation to allow responding to SYN CE marks and to detect ECN field mangling. CE marks should not occur currently because SYN=1 segments are sent with Non-ECT in IP ECN field (but proposal exists to remove this restriction). Reflecting SYN IP ECN field in SYNACK is relatively simple. Reflecting SYNACK IP ECN field in the final/third ACK of the handshake is more challenging. Linux TCP code is not well prepared for using the final/third ACK a signalling channel which makes things somewhat complicated here. tcp_ecn sysctl can be used to select the highest ECN variant (Accurate ECN, ECN, No ECN) that is attemped to be negotiated and requested for incoming connection and outgoing connection: TCP_ECN_IN_NOECN_OUT_NOECN, TCP_ECN_IN_ECN_OUT_ECN, TCP_ECN_IN_ECN_OUT_NOECN, TCP_ECN_IN_ACCECN_OUT_ACCECN, TCP_ECN_IN_ACCECN_OUT_ECN, and TCP_ECN_IN_ACCECN_OUT_NOECN. After this patch, the size of tcp_request_sock remains unchanged and no new holes are added. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_request_sock { [...] u32 rcv_nxt; /* 352 4 */ u8 syn_tos; /* 356 1 */ /* size: 360, cachelines: 6, members: 16 */ } [AFTER THIS PATCH] struct tcp_request_sock { [...] u32 rcv_nxt; /* 352 4 */ u8 syn_tos; /* 356 1 */ bool accecn_ok; /* 357 1 */ u8 syn_ect_snt:2; /* 358: 0 1 */ u8 syn_ect_rcv:2; /* 358: 2 1 */ u8 accecn_fail_mode:4; /* 358: 4 1 */ /* size: 360, cachelines: 6, members: 20 */ } After this patch, the size of tcp_sock remains unchanged and no new holes are added. Also, 4 bits of the existing 2-byte hole are exploited. Below are the pahole outcomes before and after this patch: [BEFORE THIS PATCH] struct tcp_sock { [...] u8 dup_ack_counter:2; /* 2761: 0 1 */ u8 tlp_retrans:1; /* 2761: 2 1 */ u8 unused:5; /* 2761: 3 1 */ u8 thin_lto:1; /* 2762: 0 1 */ u8 fastopen_connect:1; /* 2762: 1 1 */ u8 fastopen_no_cookie:1; /* 2762: 2 1 */ u8 fastopen_client_fail:2; /* 2762: 3 1 */ u8 frto:1; /* 2762: 5 1 */ /* XXX 2 bits hole, try to pack */ [...] u8 keepalive_probes; /* 2765 1 */ /* XXX 2 bytes hole, try to pack */ [...] /* size: 3200, cachelines: 50, members: 164 */ } [AFTER THIS PATCH] struct tcp_sock { [...] u8 dup_ack_counter:2; /* 2761: 0 1 */ u8 tlp_retrans:1; /* 2761: 2 1 */ u8 syn_ect_snt:2; /* 2761: 3 1 */ u8 syn_ect_rcv:2; /* 2761: 5 1 */ u8 thin_lto:1; /* 2761: 7 1 */ u8 fastopen_connect:1; /* 2762: 0 1 */ u8 fastopen_no_cookie:1; /* 2762: 1 1 */ u8 fastopen_client_fail:2; /* 2762: 2 1 */ u8 frto:1; /* 2762: 4 1 */ /* XXX 3 bits hole, try to pack */ [...] u8 keepalive_probes; /* 2765 1 */ u8 accecn_fail_mode:4; /* 2766: 0 1 */ /* XXX 4 bits hole, try to pack */ /* XXX 1 byte hole, try to pack */ [...] /* size: 3200, cachelines: 50, members: 166 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-3-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-18tcp: AccECN coreIlpo Järvinen
This change implements Accurate ECN without negotiation and AccECN Option (that will be added by later changes). Based on AccECN specifications: https://tools.ietf.org/id/draft-ietf-tcpm-accurate-ecn-28.txt Accurate ECN allows feeding back the number of CE (congestion experienced) marks accurately to the sender in contrast to RFC3168 ECN that can only signal one marks-seen-yes/no per RTT. Congestion control algorithms can take advantage of the accurate ECN information to fine-tune their congestion response to avoid drastic rate reduction when only mild congestion is encountered. With Accurate ECN, tp->received_ce (r.cep in AccECN spec) keeps track of how many segments have arrived with a CE mark. Accurate ECN uses ACE field (ECE, CWR, AE) to communicate the value back to the sender which updates tp->delivered_ce (s.cep) based on the feedback. This signalling channel is lossy when ACE field overflow occurs. Conservative strategy is selected here to deal with the ACE overflow, however, some strategies using the AccECN option later in the overall patchset mitigate against false overflows detected. The ACE field values on the wire are offset by TCP_ACCECN_CEP_INIT_OFFSET. Delivered_ce/received_ce count the real CE marks rather than forcing all downstream users to adapt to the wire offset. This patch uses the first 1-byte hole and the last 4-byte hole of the tcp_sock_write_txrx for 'received_ce_pending' and 'received_ce'. Also, the group size of tcp_sock_write_txrx is increased from 91 + 4 to 95 + 4 due to the new u32 received_ce member. Below are the trimmed pahole outcomes before and after this patch. [BEFORE THIS PATCH] struct tcp_sock { [...] __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */ u8 nonagle:4; /* 2521: 0 1 */ u8 rate_app_limited:1; /* 2521: 4 1 */ /* XXX 3 bits hole, try to pack */ /* XXX 2 bytes hole, try to pack */ [...] u32 delivered_ce; /* 2576 4 */ u32 app_limited; /* 2580 4 */ u32 rcv_wnd; /* 2684 4 */ struct tcp_options_received rx_opt; /* 2688 24 */ __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2612 0 */ /* XXX 4 bytes hole, try to pack */ [...] /* size: 3200, cachelines: 50, members: 161 */ } [AFTER THIS PATCH] struct tcp_sock { [...] __cacheline_group_begin__tcp_sock_write_txrx[0]; /* 2521 0 */ u8 nonagle:4; /* 2521: 0 1 */ u8 rate_app_limited:1; /* 2521: 4 1 */ /* XXX 3 bits hole, try to pack */ /* Force alignment to the next boundary: */ u8 :0; u8 received_ce_pending:4;/* 2522: 0 1 */ u8 unused2:4; /* 2522: 4 1 */ /* XXX 1 byte hole, try to pack */ [...] u32 delivered_ce; /* 2576 4 */ u32 received_ce; /* 2580 4 */ u32 app_limited; /* 2584 4 */ u32 rcv_wnd; /* 2588 4 */ struct tcp_options_received rx_opt; /* 2592 24 */ __cacheline_group_end__tcp_sock_write_txrx[0]; /* 2616 0 */ [...] /* size: 3200, cachelines: 50, members: 164 */ } Signed-off-by: Ilpo Järvinen <ij@kernel.org> Co-developed-by: Olivier Tilmans <olivier.tilmans@nokia.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia.com> Co-developed-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Signed-off-by: Chia-Yu Chang <chia-yu.chang@nokia-bell-labs.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250916082434.100722-2-chia-yu.chang@nokia-bell-labs.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-17net/mlx5e: Use the 'num_doorbells' devlink paramCosmin Ratiu
Use the new devlink param to control how many doorbells mlx5e devices allocate and use. The maximum number of doorbells configurable is capped to the maximum number of channels. This only applies to the Ethernet part, the RDMA devices using mlx5 manage their own doorbells. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-17devlink: Add a 'num_doorbells' driverinit paramCosmin Ratiu
This parameter can be used by drivers to configure a different number of doorbells. Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-16docs: devlink: Sort table of contents alphabeticallyKory Maincent (Dent Project)
Sort devlink documentation table of contents alphabetically to improve readability and make it easier to locate specific chapters. Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20250915-feature_poe_permanent_conf-v3-3-78871151088b@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-15dpll: zl3073x: Implement devlink flash callbackIvan Vecera
Use the introduced functionality to read firmware files and flash their contents into the device's internal flash memory to implement the devlink flash update callback. Sample output on EDS2 development board: # devlink -j dev info i2c/1-0070 | jq '.[][]["versions"]["running"]' { "fw": "6026" } # devlink dev flash i2c/1-0070 file firmware_fw2.hex [utility] Prepare flash mode [utility] Downloading image 100% [utility] Flash mode enabled [firmware1-part1] Downloading image 100% [firmware1-part1] Flashing image [firmware1-part2] Downloading image 100% [firmware1-part2] Flashing image [firmware1] Flashing done [firmware2] Downloading image 100% [firmware2] Flashing image 100% [firmware2] Flashing done [utility] Leaving flash mode Flashing done # devlink -j dev info i2c/1-0070 | jq '.[][]["versions"]["running"]' { "fw": "7006" } Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Link: https://patch.msgid.link/20250909091532.11790-6-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc6). Conflicts: net/netfilter/nft_set_pipapo.c net/netfilter/nft_set_pipapo_avx2.c c4eaca2e1052 ("netfilter: nft_set_pipapo: don't check genbit from packetpath lookups") 84c1da7b38d9 ("netfilter: nft_set_pipapo: use avx2 algorithm for insertions too") Only trivial adjacent changes (in a doc and a Makefile). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-10Merge tag 'linux-can-fixes-for-6.17-20250910' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2025-09-10 The 1st patch is by Alex Tran and fixes the Documentation of the struct bcm_msg_head. Davide Caratti's patch enabled the VCAN driver as a module for the Linux self tests. Tetsuo Handa contributes 3 patches that fix various problems in the CAN j1939 protocol. Anssi Hannula's patch fixes a potential use-after-free in the xilinx_can driver. Geert Uytterhoeven's patch fixes the rcan_can's suspend to RAM on R-Car Gen3 using PSCI. * tag 'linux-can-fixes-for-6.17-20250910' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: rcar_can: rcar_can_resume(): fix s2ram with PSCI can: xilinx_can: xcan_write_frame(): fix use-after-free of transmitted SKB can: j1939: j1939_local_ecu_get(): undo increment when j1939_local_ecu_get() fails can: j1939: j1939_sk_bind(): call j1939_priv_put() immediately when j1939_local_ecu_get() failed can: j1939: implement NETDEV_UNREGISTER notification handler selftests: can: enable CONFIG_CAN_VCAN as a module docs: networking: can: change bcm_msg_head frames member to support flexible array ==================== Link: https://patch.msgid.link/20250910162907.948454-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-10doc: mptcp: fix Netlink specs linkMatthieu Baerts (NGI0)
The Netlink specs RST files are no longer generated inside the source tree. In other words, the path to mptcp_pm.rst has changed, and needs to be updated to the new location. Fixes: 1ce4da3dd99e ("docs: use parser_yaml extension to handle Netlink specs") Reported-by: Kory Maincent <kory.maincent@bootlin.com> Closes: https://lore.kernel.org/20250828185037.07873d04@kmaincent-XPS-13-7390 Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250909-net-next-mptcp-pm-link-v1-1-0f1c4b8439c6@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09net/mlx5e: Add stale counter for PCIe congestion eventsDragos Tatulea
This ethtool counter is meant to help with observing how many times the congestion event was triggered but on query there was no state change. This would help to indicate when a work item was scheduled to run too late and in the meantime the congestion state changed back to previous state. While at it, do a driveby typo fix in documentation for pci_bw_inbound_high. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1757237976-531416-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09net/mlx5e: Make PCIe congestion event thresholds configurableDragos Tatulea
Add devlink driverinit parameters for configuring the thresholds for PCIe congestion events. These parameters are registered only when the firmware supports this feature. Update the mlx5 devlink docs as well on these new params. Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/1757237976-531416-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09net/mlx5: Implement devlink total_vfs parameterVlad Dumitrescu
Some devices support both symmetric (same value for all PFs) and asymmetric, while others only support symmetric configuration. This implementation prefers asymmetric, since it is closer to the devlink model (per function settings), but falls back to symmetric when needed. Example usage: devlink dev param set pci/0000:01:00.0 name total_vfs value <u16> cmode permanent devlink dev reload pci/0000:01:00.0 action fw_activate echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove echo 1 >/sys/bus/pci/rescan cat /sys/bus/pci/devices/0000:01:00.0/sriov_totalvfs Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-5-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09net/mlx5: Implement devlink enable_sriov parameterVlad Dumitrescu
Example usage: devlink dev param set pci/0000:01:00.0 name enable_sriov value {true, false} cmode permanent devlink dev reload pci/0000:01:00.0 action fw_activate echo 1 >/sys/bus/pci/devices/0000:01:00.0/remove echo 1 >/sys/bus/pci/rescan grep ^ /sys/bus/pci/devices/0000:01:00.0/sriov_* Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-4-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09net/mlx5: Implement cqe_compress_type via devlink paramsSaeed Mahameed
Selects which algorithm should be used by the NIC in order to decide rate of CQE compression dependeng on PCIe bus conditions. Supported values: 1) balanced, merges fewer CQEs, resulting in a moderate compression ratio but maintaining a balance between bandwidth savings and performance 2) aggressive, merges more CQEs into a single entry, achieving a higher compression rate and maximizing performance, particularly under high traffic loads. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-3-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09devlink: Add 'total_vfs' generic device paramVlad Dumitrescu
NICs are typically configured with total_vfs=0, forcing users to rely on external tools to enable SR-IOV (a widely used and essential feature). Add total_vfs parameter to devlink for SR-IOV max VF configurability. Enables standard kernel tools to manage SR-IOV, addressing the need for flexible VF configuration. Signed-off-by: Vlad Dumitrescu <vdumitrescu@nvidia.com> Tested-by: Kamal Heib <kheib@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250907012953.301746-2-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09mptcp: make ADD_ADDR retransmission timeout adaptiveGeliang Tang
Currently the ADD_ADDR option is retransmitted with a fixed timeout. This patch makes the retransmission timeout adaptive by using the maximum RTO among all the subflows, while still capping it at the configured maximum value (add_addr_timeout_max). This improves responsiveness when establishing new subflows. Specifically: 1. Adds mptcp_adjust_add_addr_timeout() helper to compute the adaptive timeout. 2. Uses maximum subflow RTO (icsk_rto) when available. 3. Applies exponential backoff based on retransmission count. 4. Maintains fallback to configured max timeout when no RTO data exists. This slightly changes the behaviour of the MPTCP "add_addr_timeout" sysctl knob to be used as a maximum instead of a fixed value. But this is seen as an improvement: the ADD_ADDR might be sent quicker than before to improve the overall MPTCP connection. Also, the default value is set to 2 min, which was already way too long, and caused the ADD_ADDR not to be retransmitted for connections shorter than 2 minutes. Suggested-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/576 Reviewed-by: Christoph Paasch <cpaasch@openai.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250907-net-next-mptcp-add_addr-retrans-adapt-v1-1-824cc805772b@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09doc: mptcp: net.mptcp.pm_type is deprecatedMatthieu Baerts (NGI0)
The net.mptcp.pm_type sysctl knob has been deprecated in v6.15, net.mptcp.path_manager should be used instead. Adapt the section about path managers to suggest using the new sysctl knob instead of the deprecated one. Fixes: 595c26d122d1 ("mptcp: sysctl: set path manager by name") Cc: stable@vger.kernel.org Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250908-net-mptcp-misc-fixes-6-17-rc5-v1-2-5f2168a66079@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-09docs: networking: can: change bcm_msg_head frames member to support flexible ↵Alex Tran
array The documentation of the 'bcm_msg_head' struct does not match how it is defined in 'bcm.h'. Changed the frames member to a flexible array, matching the definition in the header file. See commit 94dfc73e7cf4 ("treewide: uapi: Replace zero-length arrays with flexible-array members") Signed-off-by: Alex Tran <alex.t.tran@gmail.com> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20250904031709.1426895-1-alex.t.tran@gmail.com Fixes: 94dfc73e7cf4 ("treewide: uapi: Replace zero-length arrays with flexible-array members") Link: https://bugzilla.kernel.org/show_bug.cgi?id=217783 Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-09-09bonding: support aggregator selection based on port priorityHangbin Liu
Add a new ad_select policy 'port_priority' that uses the per-port actor priority values (set via ad_actor_port_prio) to determine aggregator selection. This allows administrators to influence which ports are preferred for aggregation by assigning different priority values, providing more flexible load balancing control in LACP configurations. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250902064501.360822-3-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-09bonding: add support for per-port LACP actor priorityHangbin Liu
Introduce a new netlink attribute 'actor_port_prio' to allow setting the LACP actor port priority on a per-slave basis. This extends the existing bonding infrastructure to support more granular control over LACP negotiations. The priority value is embedded in LACPDU packets and will be used by subsequent patches to influence aggregator selection policies. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250902064501.360822-2-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc5). No conflicts. Adjacent changes: include/net/sock.h c51613fa276f ("net: add sk->sk_drop_counters") 5d6b58c932ec ("net: lockless sock_i_ino()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-02bonding: Remove support for use_carrierJay Vosburgh
Remove the implementation of use_carrier, the link monitoring method that utilizes ethtool or ioctl to determine the link state of an interface in a bond. Bonding will always behaves as if use_carrier=1, which relies on netif_carrier_ok() to determine the link state of interfaces. To avoid acquiring RTNL many times per second, bonding inspects link state under RCU, but not under RTNL. However, ethtool implementations in drivers may sleep, and therefore this strategy is unsuitable for use with calls into driver ethtool functions. The use_carrier option was introduced in 2003, to provide backwards compatibility for network device drivers that did not support the then-new netif_carrier_ok/on/off system. Device drivers are now expected to support netif_carrier_*, and the use_carrier backwards compatibility logic is no longer necessary. The option itself remains, but when queried always returns 1, and may only be set to 1. Link: https://lore.kernel.org/000000000000eb54bf061cfd666a@google.com Link: https://lore.kernel.org/20240718122017.d2e33aaac43a.I10ab9c9ded97163aef4e4de10985cd8f7de60d28@changeid Signed-off-by: Jay Vosburgh <jv@jvosburgh.net> Reported-by: syzbot+b8c48ea38ca27d150063@syzkaller.appspotmail.com Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/2029487.1756512517@famine Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-09-01docs: remove obsolete description about threaded NAPIKohei Enju
Commit 2677010e7793 ("Add support to set NAPI threaded for individual NAPI") introduced threaded NAPI configuration per individual NAPI instance, however obsolete description that threaded NAPI is per device has remained. Remove the old description and clarify that only NAPI instances running in threaded mode spawn kernel threads by changing "Each NAPI instance" to "Each threaded NAPI instance". Signed-off-by: Kohei Enju <enjuk@amazon.com> Reviewed-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20250829064857.51503-1-enjuk@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-27io_uring/zcrx: add support for IORING_SETUP_CQE_MIXEDJens Axboe
zcrx currently requires the ring to be set up with fixed 32b CQEs, allow it to use IORING_SETUP_CQE_MIXED as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-08-26devlink: Make health reporter burst period configurableShahar Shitrit
Enable configuration of the burst period — a time window starting from the first error recovery, during which the reporter allows recovery attempts for each reported error. This feature is helpful when a single underlying issue causes multiple errors, as it delays the start of the grace period to allow sufficient time for recovering all related errors. For example, if multiple TX queues time out simultaneously, a sufficient burst period could allow all affected TX queues to be recovered within that window. Without this period, only the first TX queue that reports a timeout will undergo recovery, while the remaining TX queues will be blocked once the grace period begins. Configuration example: $ devlink health set pci/0000:00:09.0 reporter tx burst_period 500 Configuration example with ynl: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/devlink.yaml \ --do health-reporter-set --json '{ "bus-name": "auxiliary", "dev-name": "mlx5_core.eth.0", "port-index": 65535, "health-reporter-name": "tx", "health-reporter-burst-period": 500 }' Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250824084354.533182-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.17-rc3). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-08-21docs: networking: Add PPE driver documentation for Qualcomm IPQ9574 SoCLei Wei
Add description and high-level diagram for PPE, driver overview and module enable/debug information. Signed-off-by: Lei Wei <quic_leiwei@quicinc.com> Signed-off-by: Luo Jie <quic_luoj@quicinc.com> Link: https://patch.msgid.link/20250818-qcom_ipq_ppe-v8-2-1d4ff641fce9@quicinc.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>