summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-02net/mlx5: Add support for setting tc-bw on nodesCarolina Jubran
Introduce support for enabling and disabling Traffic Class (TC) arbitration for existing devlink rate nodes. This patch adds support for a new scheduling node type, `SCHED_NODE_TYPE_TC_ARBITER_TSAR`. Key changes include: - New helper functions for transitioning existing rate nodes to TC arbiter nodes and vice versa. These functions handle the allocation of TC arbiter nodes, copying of child nodes, and restoring vport QoS settings when TC arbitration is disabled. - Implementation of `mlx5_esw_devlink_rate_node_tc_bw_set()` to manage tc-bw configuration on nodes. - Introduced stubs for `esw_qos_tc_arbiter_scheduling_setup()` and `esw_qos_tc_arbiter_scheduling_teardown()`, which will be extended in future patches to provide full support for tc-bw on devlink rate objects. - Validation functions for tc-bw settings, allowing graceful handling of unsupported traffic class bandwidth configurations. - Updated `__esw_qos_alloc_node()` to insert the new node into the parent’s children list only if the parent is not NULL. For the root TSAR, the new node is inserted directly after the allocation call. - Don't allow `tc-bw` configuration for nodes containing non-leaf children. This patch lays the groundwork for future support for configuring tc-bw on devlink rate nodes. Although the infrastructure is in place, full support for tc-bw is not yet implemented; attempts to set tc-bw on nodes will return `-EOPNOTSUPP`. No functional changes are introduced at this stage. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-6-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net/mlx5: Add no-op implementation for setting tc-bw on rate objectsCarolina Jubran
Introduce `mlx5_esw_devlink_rate_node_tc_bw_set()` and `mlx5_esw_devlink_rate_leaf_tc_bw_set()` with no-op logic. Future patches will add support for setting traffic class bandwidth on rate objects. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02selftest: netdevsim: Add devlink rate tc-bw testCarolina Jubran
Test verifies that netdevsim correctly implements devlink ops callbacks that set tc-bw on leaf or node rate object. Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02devlink: Extend devlink rate API with traffic classes bandwidth managementCarolina Jubran
Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, {'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02netlink: introduce type-checking attribute iteration for nlmsgCarolina Jubran
Add the nlmsg_for_each_attr_type() macro to simplify iteration over attributes of a specific type in a Netlink message. Convert existing users in vxlan and nfsd to use the new macro. Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02vhost-net: reduce one userspace copy when building XDP buffJason Wang
We used to do twice copy_from_iter() to copy virtio-net and packet separately. This introduce overheads for userspace access hardening as well as SMAP (for x86 it's stac/clac). So this patch tries to use one copy_from_iter() to copy them once and move the virtio-net header afterwards to reduce overheads. Testpmd + vhost_net shows 10% improvement from 5.45Mpps to 6.0Mpps. Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-2-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02tun: remove unnecessary tun_xdp_hdr structureJason Wang
With f95f0f95cfb7("net, xdp: Introduce xdp_init_buff utility routine"), buffer length could be stored as frame size so there's no need to have a dedicated tun_xdp_hdr structure. We can simply store virtio net header instead. Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250701010352.74515-1-jasowang@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02Merge branch 'preserve-msg_zerocopy-with-forwarding'Jakub Kicinski
Willem de Bruijn says: ==================== preserve MSG_ZEROCOPY with forwarding Avoid false positive copying of zerocopy skb frags when entering the ingress path if the skb is not queued locally but forwarded. Patch 1 for more details and feature. Patch 2 converts the existing selftest to a pass/fail test and adds coverage for this new feature. ==================== Link: https://patch.msgid.link/20250630194312.1571410-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02selftest: net: extend msg_zerocopy test with forwardingWillem de Bruijn
Zerocopy skbs are converted to regular copy skbs when data is queued to a local socket. This happens in the existing test with a sender and receiver communicating over a veth device. Zerocopy skbs are sent without copying if egressing a device. Verify that this behavior is maintained even in the common container setup where data is forwarded over a veth to the physical device. Update msg_zerocopy.sh to 1. Have a dummy network device to simulate a physical device. 2. Have forwarding enabled between veth and dummy. 3. Add a tx-only test that sends out dummy via the forwarding path. 4. Verify the exitcode of the sender, which signals zerocopy success. As dummy drops all packets, this cannot be a TCP connection. Test the new case with unconnected UDP only. Update msg_zerocopy.c to - Accept an argument whether send with zerocopy is expected. - Return an exitcode whether behavior matched that expectation. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-3-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: preserve MSG_ZEROCOPY with forwardingWillem de Bruijn
MSG_ZEROCOPY data must be copied before data is queued to local sockets, to avoid indefinite timeout until memory release. This test is performed by skb_orphan_frags_rx, which is called when looping an egress skb to packet sockets, error queue or ingress path. To preserve zerocopy for skbs that are looped to ingress but are then forwarded to an egress device rather than delivered locally, defer this last check until an skb enters the local IP receive path. This is analogous to existing behavior of skb_clear_delivery_time. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02Merge branch 'vsock-test-check-for-null-ptr-deref-when-transport-changes'Jakub Kicinski
Luigi Leonardi says: ==================== vsock/test: check for null-ptr-deref when transport changes This series introduces a new test that checks for a null pointer dereference that may happen when there is a transport change[1]. This bug was fixed in [2]. Note that this test *cannot* fail, it hangs if it triggers a kernel oops. The intended use-case is to run it and then check if there is any oops in the dmesg. This test is based on Hyunwoo Kim's[3] and Michal's python reproducers[4]. [1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/ [2]https://lore.kernel.org/netdev/20250110083511.30419-1-sgarzare@redhat.com/ [3]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/#t [4]https://lore.kernel.org/netdev/2b3062e3-bdaa-4c94-a3c0-2930595b9670@rbox.co/ v4: https://lore.kernel.org/20250624-test_vsock-v4-1-087c9c8e25a2@redhat.com v3: https://lore.kernel.org/20250611-test_vsock-v3-1-8414a2d4df62@redhat.com v2: https://lore.kernel.org/20250314-test_vsock-v2-1-3c0a1d878a6d@redhat.com v1: https://lore.kernel.org/20250306-test_vsock-v1-0-0320b5accf92@redhat.com ==================== Link: https://patch.msgid.link/20250630-test_vsock-v5-0-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02vsock/test: Add test for null ptr deref when transport changesLuigi Leonardi
Add a new test to ensure that when the transport changes a null pointer dereference does not occur. The bug was reported upstream [1] and fixed with commit 2cb7c756f605 ("vsock/virtio: discard packets if the transport changes"). KASAN: null-ptr-deref in range [0x0000000000000060-0x0000000000000067] CPU: 2 UID: 0 PID: 463 Comm: kworker/2:3 Not tainted Workqueue: vsock-loopback vsock_loopback_work RIP: 0010:vsock_stream_has_data+0x44/0x70 Call Trace: virtio_transport_do_close+0x68/0x1a0 virtio_transport_recv_pkt+0x1045/0x2ae4 vsock_loopback_work+0x27d/0x3f0 process_one_work+0x846/0x1420 worker_thread+0x5b3/0xf80 kthread+0x35a/0x700 ret_from_fork+0x2d/0x70 ret_from_fork_asm+0x1a/0x30 Note that this test may not fail in a kernel without the fix, but it may hang on the client side if it triggers a kernel oops. This works by creating a socket, trying to connect to a server, and then executing a second connect operation on the same socket but to a different CID (0). This triggers a transport change. If the connect operation is interrupted by a signal, this could cause a null-ptr-deref. Since this bug is non-deterministic, we need to try several times. It is reasonable to assume that the bug will show up within the timeout period. If there is a G2H transport loaded in the system, the bug is not triggered and this test will always pass. This is because `vsock_assign_transport`, when using CID 0, like in this case, sets vsk->transport to `transport_g2h` that is not NULL if a G2H transport is available. [1]https://lore.kernel.org/netdev/Z2LvdTTQR7dBmPb5@v4bel-B760M-AORUS-ELITE-AX/ Suggested-by: Hyunwoo Kim <v4bel@theori.io> Suggested-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250630-test_vsock-v5-2-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02vsock/test: Add macros to identify transportsLuigi Leonardi
Add three new macros: TRANSPORTS_G2H, TRANSPORTS_H2G and TRANSPORTS_LOCAL. They can be used to identify the type of the transport(s) loaded when using the `get_transports()` function. Suggested-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Luigi Leonardi <leonardi@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250630-test_vsock-v5-1-2492e141e80b@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02dt-bindings: net: Convert socfpga-dwmac bindings to yamlMatthew Gerlach
Convert the bindings for socfpga-dwmac to yaml. Since the original text contained descriptions for two separate nodes, two separate yaml files were created. Signed-off-by: Mun Yew Tham <mun.yew.tham@altera.com> Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Reviewed-by: Rob Herring (Arm) <robh@kernel.org> Link: https://patch.msgid.link/20250630213748.71919-1-matthew.gerlach@altera.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02amd-xgbe: add support for giant packet sizeRaju Rangoju
AMD XGBE hardware supports giant Ethernet frames up to 16K bytes. Add support for configuring and enabling giant packet handling in the driver. - Define new register fields and macros for giant packet support. - Update the jumbo frame configuration logic to enable giant packet mode when MTU exceeds the jumbo threshold. Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250701121929.319690-1-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: ifb: support BIG TCP packetsEric Dumazet
Set the driver limit to GSO_MAX_SIZE (512 KB). This allows the admin/user to set a GSO limit up to this value, to avoid segmenting too large GRO packets in the netem -> ifb path. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250701084540.459261-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02Merge branch 'net-add-data-race-annotations-around-dst-fields'Jakub Kicinski
Eric Dumazet says: ==================== net: add data-race annotations around dst fields Add annotations around various dst fields, which can change under us. Add four helpers to prepare better dst->dev protection, and start using them. More to come later. v1: https://lore.kernel.org/20250627112526.3615031-1-edumazet@google.com ==================== Link: https://patch.msgid.link/20250630121934.3399505-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: ip6_mc_input() and ip6_mr_input() cleanupsEric Dumazet
Both functions are always called under RCU. We remove the extra rcu_read_lock()/rcu_read_unlock(). We use skb_dst_dev_net_rcu() instead of skb_dst_dev_net(). We use dev_net_rcu() instead of dev_net(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: adopt skb_dst_dev() and skb_dst_dev_net[_rcu]() helpersEric Dumazet
Use the new helpers as a step to deal with potential dst->dev races. v2: fix typo in ipv6_rthdr_rcv() (kernel test robot <lkp@intel.com>) Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-10-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv6: adopt dst_dev() helperEric Dumazet
Use the new helper as a step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02ipv4: adopt dst_dev, skb_dst_dev and skb_dst_dev_net[_rcu]Eric Dumazet
Use the new helpers as a first step to deal with potential dst->dev races. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: add four helpers to annotate data-races around dst->devEric Dumazet
dst->dev is read locklessly in many contexts, and written in dst_dev_put(). Fixing all the races is going to need many changes. We probably will have to add full RCU protection. Add three helpers to ease this painful process. static inline struct net_device *dst_dev(const struct dst_entry *dst) { return READ_ONCE(dst->dev); } static inline struct net_device *skb_dst_dev(const struct sk_buff *skb) { return dst_dev(skb_dst(skb)); } static inline struct net *skb_dst_dev_net(const struct sk_buff *skb) { return dev_net(skb_dst_dev(skb)); } static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb) { return dev_net_rcu(skb_dst_dev(skb)); } Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->outputEric Dumazet
dst_dev_put() can overwrite dst->output while other cpus might read this field (for instance from dst_output()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need RCU protection in the future. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->inputEric Dumazet
dst_dev_put() can overwrite dst->input while other cpus might read this field (for instance from dst_input()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need full RCU protection later. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->lastuseEric Dumazet
(dst_entry)->lastuse is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->expiresEric Dumazet
(dst_entry)->expires is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->obsoleteEric Dumazet
(dst_entry)->obsolete is read locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02Merge branch 'net-introduce-net_aligned_data'Jakub Kicinski
Eric Dumazet says: ==================== net: introduce net_aligned_data ____cacheline_aligned_in_smp on small fields like tcp_memory_allocated and udp_memory_allocated is not good enough. It makes sure to put these fields at the start of a cache line, but does not prevent the linker from using the cache line for other fields, with potential performance impact. nm -v vmlinux|egrep -5 "tcp_memory_allocated|udp_memory_allocated" ... ffffffff83e35cc0 B tcp_memory_allocated ffffffff83e35cc8 b __key.0 ffffffff83e35cc8 b __tcp_tx_delay_enabled.1 ffffffff83e35ce0 b tcp_orphan_timer ... ffffffff849dddc0 B udp_memory_allocated ffffffff849dddc8 B udp_encap_needed_key ffffffff849dddd8 B udpv6_encap_needed_key ffffffff849dddf0 b inetsw_lock One solution is to move these sensitive fields to a structure, so that the compiler is forced to add empty holes between each member. nm -v vmlinux|egrep -2 "tcp_memory_allocated|udp_memory_allocated|net_aligned_data" ffffffff885af970 b mem_id_init ffffffff885af980 b __key.0 ffffffff885af9c0 B net_aligned_data ffffffff885afa80 B page_pool_mem_providers ffffffff885afa90 b __key.2 v1: https://lore.kernel.org/netdev/20250627200551.348096-1-edumazet@google.com/ ==================== Link: https://patch.msgid.link/20250630093540.3052835-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02udp: move udp_memory_allocated into net_aligned_dataEric Dumazet
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02tcp: move tcp_memory_allocated into net_aligned_dataEric Dumazet
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Move tcp_memory_allocated into a dedicated cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: move net_cookie into net_aligned_dataEric Dumazet
Using per-cpu data for net->net_cookie generation is overkill, because even busy hosts do not create hundreds of netns per second. Make sure to put net_cookie in a private cache line to avoid potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: add struct net_aligned_dataEric Dumazet
This structure will hold networking data that must consume a full cache line to avoid accidental false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: thunderbolt: Enable end-to-end flow control also in transmitzhangjianrong
According to USB4 specification, if E2E flow control is disabled for the Transmit Descriptor Ring, the Host Interface Adapter Layer shall not require any credits to be available before transmitting a Tunneled Packet from this Transmit Descriptor Ring, so e2e flow control should be enabled in both directions. Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Link: https://lore.kernel.org/20250624153805.GC2824380@black.fi.intel.com Signed-off-by: zhangjianrong <zhangjianrong5@huawei.com> Link: https://patch.msgid.link/20250628093813.647005-1-zhangjianrong5@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: thunderbolt: Fix the parameter passing of ↵zhangjianrong
tb_xdomain_enable_paths()/tb_xdomain_disable_paths() According to the description of tb_xdomain_enable_paths(), the third parameter represents the transmit ring and the fifth parameter represents the receive ring. tb_xdomain_disable_paths() is the same case. [Jakub] Mika says: it works now because both rings ->hop is the same Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Link: https://lore.kernel.org/20250625051149.GD2824380@black.fi.intel.com Signed-off-by: zhangjianrong <zhangjianrong5@huawei.com> Link: https://patch.msgid.link/20250628094920.656658-1-zhangjianrong5@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: tulip: Rename PCI driver struct to end in _driverUwe Kleine-König
This is not only a cosmetic change because the section mismatch checks also depend on the name and for drivers the checks are stricter than for ops. However xircom_driver also passes the stricter checks just fine, so no further changes needed. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Link: https://patch.msgid.link/20250627102220.1937649-2-u.kleine-koenig@baylibre.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: atlantic: Rename PCI driver struct to end in _driverUwe Kleine-König
This is not only a cosmetic change because the section mismatch checks (implemented in scripts/mod/modpost.c) also depend on the object's name and for drivers the checks are stricter than for ops. However aq_pci_driver also passes the stricter checks just fine, so no further changes needed. The cheating^Wmisleading name was introduced in commit 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250630164406.57589-2-u.kleine-koenig@baylibre.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: phy: air_en8811h: Introduce resume/suspend and clk_restore_context to ↵Lucien.Jheng
ensure correct CKO settings after network interface reinitialization. If the user reinitializes the network interface, the PHY will reinitialize, and the CKO settings will revert to their initial configuration(be enabled). To prevent CKO from being re-enabled, en8811h_clk_restore_context and en8811h_resume were added to ensure the CKO settings remain correct. Signed-off-by: Lucien.Jheng <lucienzx159@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250630154147.80388-1-lucienzx159@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: dsa: hellcreek: Constify struct devlink_region_ops and struct ↵Christophe JAILLET
hellcreek_fdb_entry 'struct devlink_region_ops' and 'struct hellcreek_fdb_entry' are not modified in this driver. Constifying these structures moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig: Before: ====== text data bss dec hex filename 55320 19216 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o After: ===== text data bss dec hex filename 55960 18576 320 74856 12468 drivers/net/dsa/hirschmann/hellcreek.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Link: https://patch.msgid.link/2f7e8dc30db18bade94999ac7ce79f333342e979.1751231174.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01Merge branch 'seg6-fix-typos-in-comments-within-the-srv6-subsystem'Jakub Kicinski
Andrea Mayer says: ==================== seg6: fix typos in comments within the SRv6 subsystem In this patchset, we correct some typos found both in the SRv6 Endpoints implementation (i.e., seg6local) and in some SRv6 selftests, using codespell. ==================== Link: https://patch.msgid.link/20250629171226.4988-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01selftests: seg6: fix instaces typo in commentsAndrea Mayer
Fix a typo: instaces -> instances The typo has been identified using codespell, and the tool does not report any additional issues in the selftests considered. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250629171226.4988-3-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01seg6: fix lenghts typo in a commentAndrea Mayer
Fix a typo: lenghts -> length The typo has been identified using codespell, and the tool currently does not report any additional issues in comments. Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250629171226.4988-2-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: dsa: mv88e6xxx: Use kcalloc()Christophe JAILLET
Use kcalloc() instead of hand writing it. This is less verbose. Also move the initialization of 'count' to save some LoC. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 18652 5920 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o After: ===== text data bss dec hex filename 18498 5920 64 24482 5fa2 drivers/net/dsa/mv88e6xxx/devlink.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/2f4fca4ff84950da71e007c9169f18a0272476f3.1751200453.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: dsa: mv88e6xxx: Constify struct devlink_region_ops and struct ↵Christophe JAILLET
mv88e6xxx_region 'struct devlink_region_ops' and 'struct mv88e6xxx_region' are not modified in this driver. Constifying these structures moves some data to a read-only section, so increases overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 18076 6496 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o After: ===== text data bss dec hex filename 18652 5920 64 24636 603c drivers/net/dsa/mv88e6xxx/devlink.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/46040062161dda211580002f950a6d60433243dc.1751200453.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: atlantic: add set_power to fw_ops for atl2 to fix wolEric Work
Aquantia AQC113(C) using ATL2FW doesn't properly prepare the NIC for enabling wake-on-lan. The FW operation `set_power` was only implemented for `hw_atl` and not `hw_atl2`. Implement the `set_power` functionality for `hw_atl2`. Tested with both AQC113 and AQC113C devices. Confirmed you can shutdown the system and wake from S5 using magic packets. NIC was previously powered off when entering S5. If the NIC was configured for WOL by the Windows driver, loading the atlantic driver would disable WOL. Partially cherry-picks changes from commit, https://github.com/Aquantia/AQtion/commit/37bd5cc Attributing original authors from Marvell for the referenced commit. Closes: https://github.com/Aquantia/AQtion/issues/70 Co-developed-by: Igor Russkikh <irusskikh@marvell.com> Co-developed-by: Mark Starovoitov <mstarovoitov@marvell.com> Co-developed-by: Dmitry Bogdanov <dbogdanov@marvell.com> Co-developed-by: Pavel Belous <pbelous@marvell.com> Co-developed-by: Nikita Danilov <ndanilov@marvell.com> Signed-off-by: Eric Work <work.eric@gmail.com> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Link: https://patch.msgid.link/20250629051535.5172-1-work.eric@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: mana: Handle Reset Request from MANA NICHaiyang Zhang
Upon receiving the Reset Request, pause the connection and clean up queues, wait for the specified period, then resume the NIC. In the cleanup phase, the HWC is no longer responding, so set hwc_timeout to zero to skip waiting on the response. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Link: https://patch.msgid.link/1751055983-29760-1-git-send-email-haiyangz@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01phy: micrel: add Signal Quality Indicator (SQI) support for KSZ9477 switch PHYsOleksij Rempel
Add support for the Signal Quality Indicator (SQI) feature on KSZ9477 family switches, providing a relative measure of receive signal quality. The hardware exposes separate SQI readings per channel. For 1000BASE-T, all four channels are read. For 100BASE-TX, only one channel is reported, but which receive pair is active depends on Auto MDI-X negotiation, which is not exposed by the hardware. Therefore, it is not possible to reliably map the measured channel to a specific wire pair. This resolves an earlier discussion about how to handle multi-channel SQI. Originally, the plan was to expose all channels individually. However, since pair mapping is sometimes unavailable, this implementation treats SQI as a per-link metric instead. This fallback avoids ambiguity and ensures consistent behavior. The existing get_sqi() UAPI was designed for single-pair Ethernet (SPE), where per-pair and per-link are effectively equivalent. Restricting its use to per-link metrics does not introduce regressions for existing users. The raw 7-bit SQI value (0–127, lower is better) is converted to the standard 0–7 (high is better) scale. Empirical testing showed that the link becomes unstable around a raw value of 8. The SQI raw value remains zero if no data is received, even if noise is present. This confirms that the measurement reflects the "quality" during active data reception rather than the passive line state. User space must ensure that traffic is present on the link to obtain valid SQI readings. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250627112539.895255-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01selftests: pp-bench: remove page_pool_put_page wrapperMina Almasry
Minor cleanup: remove the pointless looking _ wrapper around page_pool_put_page, and just do the call directly. Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20250627200501.1712389-2-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01selftests: pp-bench: remove unneeded linux/version.hMina Almasry
linux/version.h was used by the out-of-tree version, but not needed in the upstream one anymore. While I'm at it, sort the includes. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506271434.Gk0epC9H-lkp@intel.com/ Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/20250627200501.1712389-1-almasrymina@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01ip6_tunnel: enable to change proto of fb tunnelsNicolas Dichtel
This is possible via the ioctl API: > ip -6 tunnel change ip6tnl0 mode any Let's align the netlink API: > ip link set ip6tnl0 type ip6tnl mode any Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20250630145602.1027220-1-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01selftests/tc-testing: Enable CONFIG_IP_SETSebastian Andrzej Siewior
The config snippet specifies CONFIG_NET_EMATCH_IPSET. This option depends on CONFIG_IP_SET. Set CONFIG_IP_SET to be enabled at part for tc-testing. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250630153341.Wgh3SzGi@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>