summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-07-21and-xgbe: remove the abstraction for hwptpRaju Rangoju
Remove the hwptp abstraction and associated callbacks from the struct xgbe_hw_if {}. The callback structure was only ever assigned a single function, without null checks. This cleanup inlines the logic and moves all the hwtstamp realted code a separate file, improving readability and maintainance. Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Link: https://patch.msgid.link/20250718185628.4038779-2-Raju.Rangoju@amd.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-21selftests: tc: Add generic erspan_opts matching support for tc-flowerLi Shuang
Add test cases to tc_flower.sh to validate generic matching on ERSPAN options. Both ERSPAN Type II and Type III are covered. Also add check_tc_erspan_support() to verify whether tc supports erspan_opts. Signed-off-by: Li Shuang <shuali@redhat.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/1f354a1afd60f29bbbf02bd60cb52ecfc0b6bd17.1752848172.git.shuali@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-21net: usb: Remove duplicate assignments for net->pcpu_stat_typeZqiang
This commit remove duplicate assignments for net->pcpu_stat_type in usbnet_probe(). Signed-off-by: Zqiang <qiang.zhang@linux.dev> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-18be2net: Use correct byte order and format string for TCP seq and ack_seqAlok Tiwari
The TCP header fields seq and ack_seq are 32-bit values in network byte order as (__be32). these fields were earlier printed using ntohs(), which converts only 16-bit values and produces incorrect results for 32-bit fields. This patch is changeing the conversion to ntohl(), ensuring correct interpretation of these sequence numbers. Notably, the format specifier is updated from %d to %u to reflect the unsigned nature of these fields. improves the accuracy of debug log messages for TCP sequence and acknowledgment numbers during TX timeouts. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250717193552.3648791-1-alok.a.tiwari@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: bcmasp: Add support for re-starting auto-negotiationFlorian Fainelli
Wire-up ethtool_ops::nway_reset to phy_ethtool_nway_reset in order to support re-starting auto-negotiation. Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250717180915.2611890-1-florian.fainelli@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18Merge branch 'net-maintain-netif-vs-dev-prefix-semantics'Jakub Kicinski
Stanislav Fomichev says: ==================== net: maintain netif vs dev prefix semantics Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. We care only about driver-visible APIs, don't touch stack-internal routines. The rest seem to be ok: * dev_xdp_prog_count - mostly called by sw drivers (bonding), should not matter * dev_get_by_xxx - too many to reasonably cleanup, already have different flavors * dev_fetch_sw_netstats - don't need instance lock * dev_get_tstats64 - never called directly, only as an ndo callback * dev_pick_tx_zero - never called directly, only as an ndo callback * dev_add_pack / dev_remove_pack - called early enough (in module init) to not matter * dev_get_iflink - mostly called by sw drivers, should not matter * dev_fill_forward_path - ditto * dev_getbyhwaddr_rcu - ditto * dev_getbyhwaddr - ditto * dev_getfirstbyhwtype - ditto * dev_valid_name - ditto * __dev_forward_skb dev_forward_skb dev_queue_xmit_nit - established helpers, no netif vs dev distinction ==================== Link: https://patch.msgid.link/20250717172333.1288349-1-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_close_many/netif_close_many/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. netif_close_many is used only by vlan/dsa and one mtk driver, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-8-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_set_threaded/netif_set_threaded/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Note that one dev_set_threaded call still remains in mt76 for debugfs file. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-7-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_flags/netif_get_flags/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-6-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/__dev_set_mtu/__netif_set_mtu/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. __netif_set_mtu is used only by bond, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-5-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_pre_changeaddr_notify/netif_pre_changeaddr_notify/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. netif_pre_changeaddr_notify is used only by ipvlan/bond, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-4-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_mac_address/netif_get_mac_address/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. netif_get_mac_address is used only by tun/tap, so move it into NETDEV_INTERNAL namespace. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-3-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: s/dev_get_port_parent_id/netif_get_port_parent_id/Stanislav Fomichev
Commit cc34acd577f1 ("docs: net: document new locking reality") introduced netif_ vs dev_ function semantics: the former expects locked netdev, the latter takes care of the locking. We don't strictly follow this semantics on either side, but there are more dev_xxx handlers now that don't fit. Rename them to netif_xxx where appropriate. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250717172333.1288349-2-sdf@fomichev.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18selftests: rtnetlink: Add operational state testIdo Schimmel
Virtual devices (e.g., VXLAN) that do not have a notion of a carrier are created with an "UNKNOWN" operational state which some users find confusing [1]. It is possible to set the operational state from user space either during device creation or afterwards and some applications will start doing that in order to avoid the above problem. Add a test for this functionality to ensure it does not regress. [1] https://lore.kernel.org/netdev/20241119153703.71f97b76@hermes.local/ Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250717125151.466882-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: selftests: add PHY-loopback test for bad TCP checksumsOleksij Rempel
Detect NICs and drivers that either drop frames with a corrupted TCP checksum or, worse, pass them up as valid. The test flips one bit in the checksum, transmits the packet in internal loopback, and fails when the driver reports CHECKSUM_UNNECESSARY. Discussed at: https://lore.kernel.org/all/20250625132117.1b3264e8@kernel.org/ Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250717083524.1645069-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: track pfmemalloc drops via SKB_DROP_REASON_PFMEMALLOCJesper Dangaard Brouer
Add a new SKB drop reason (SKB_DROP_REASON_PFMEMALLOC) to track packets dropped due to memory pressure. In production environments, we've observed memory exhaustion reported by memory layer stack traces, but these drops were not properly tracked in the SKB drop reason infrastructure. While most network code paths now properly report pfmemalloc drops, some protocol-specific socket implementations still use sk_filter() without drop reason tracking: - Bluetooth L2CAP sockets - CAIF sockets - IUCV sockets - Netlink sockets - SCTP sockets - Unix domain sockets These remaining cases represent less common paths and could be converted in a follow-up patch if needed. The current implementation provides significantly improved observability into memory pressure events in the network stack, especially for key protocols like TCP and UDP, helping to diagnose problems in production environments. Reported-by: Matt Fleming <mfleming@cloudflare.com> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/175268316579.2407873.11634752355644843509.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-18net: stream: add description for sk_stream_write_space()Suchit Karunakaran
Add a proper description for the sk_stream_write_space() function as previously marked by a FIXME comment. No functional changes. Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com> Link: https://patch.msgid.link/20250716153404.7385-1-suchitkarunakaran@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17et131x: Add missing check after DMA mapThomas Fourier
The DMA map functions can fail and should be tested for errors. If the mapping fails, unmap and return an error. Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Acked-by: Mark Einon <mark.einon@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250716094733.28734-2-fourier.thomas@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: ag71xx: Add missing check after DMA mapThomas Fourier
The DMA map functions can fail and should be tested for errors. Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250716095733.37452-3-fourier.thomas@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17selftests/drivers/net: Support ipv6 for napi_id testTianyi Cui
Add support for IPv6 environment for napi_id test. Test Plan: ./run_kselftest.sh -t drivers/net:napi_id.py TAP version 13 1..1 # timeout set to 45 # selftests: drivers/net: napi_id.py # TAP version 13 # 1..1 # ok 1 napi_id.test_napi_id # # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0 ok 1 selftests: drivers/net: napi_id.py Signed-off-by: Tianyi Cui <1997cui@gmail.com> Link: https://patch.msgid.link/20250717011913.1248816-1-1997cui@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17ibmvnic: Use ndo_get_stats64 to fix inaccurate SAR reportingMingming Cao
VNIC testing on multi-core Power systems showed SAR stats drift and packet rate inconsistencies under load. Implements ndo_get_stats64 to provide safe aggregation of queue-level atomic64 counters into rtnl_link_stats64 for use by tools like 'ip -s', 'ifconfig', and 'sar'. Switch to ndo_get_stats64 to align SAR reporting with the standard kernel interface for retrieving netdev stats. This removes redundant per-adapter stat updates, reduces overhead, eliminates cacheline bouncing from hot path updates, and improves the accuracy of reported packet rates. Signed-off-by: Mingming Cao <mmc@linux.ibm.com> Reviewed-by: Brian King <bjking1@linux.ibm.com> Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> ---- Changes since v3: link to v3: https://www.spinics.net/lists/netdev/msg1107999.html -- keep per queue counters as u64 (this patch) and drop off patch 1 in v3 Link: https://patch.msgid.link/20250716152115.61143-1-mmc@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge branch 'net-mlx5-misc-changes-2025-07-16'Jakub Kicinski
Tariq Toukan says: ==================== net/mlx5: misc changes 2025-07-16 This series contains misc enhancements to the mlx5 driver. v1: https://lore.kernel.org/1752471585-18053-1-git-send-email-tariqt@nvidia.com ==================== Link: https://patch.msgid.link/1752675472-201445-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/mlx5e: Properly access RCU protected qdisc_sleeping variableLeon Romanovsky
qdisc_sleeping variable is declared as "struct Qdisc __rcu" and as such needs proper annotation while accessing it. Without rtnl_dereference(), the following error is generated by sparse: drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: warning: incorrect type in initializer (different address spaces) drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: expected struct Qdisc *qdisc drivers/net/ethernet/mellanox/mlx5/core/en/qos.c:377:40: got struct Qdisc [noderef] __rcu *qdisc_sleeping Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1752675472-201445-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/mlx5e: fix kdoc warning on eswitch.hMoshe Shemesh
Fix the following kdoc warning: git ls-files *.[ch] | egrep drivers/net/ethernet/mellanox/mlx5/core/ |\ xargs scripts/kernel-doc --none drivers/net/ethernet/mellanox/mlx5/core/eswitch.h:824: warning: cannot understand function prototype: 'struct mlx5_esw_event_info ' Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1752675472-201445-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/mlx5: HWS, Enable IPSec hardware offload in legacy modeLama Kayal
IPSec hardware offload in legacy mode should not be affected by the steering mode, hence it should also work properly with hmfs mode. Remove steering mode validation when calculating the cap for packet offload, this will also enable the missing cap MLX5_IPSEC_CAP_PRIO needed for crypto offload. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Jianbo Liu <jianbol@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/1752675472-201445-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: pcs: xpcs: mask readl() return value to 16 bitsJack Ping CHNG
readl() returns 32-bit value but Clause 22/45 registers are 16-bit wide. Masking with 0xFFFF avoids using garbage upper bits. Signed-off-by: Jack Ping CHNG <jchng@maxlinear.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20250716030349.3796806-1-jchng@maxlinear.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net/mlx5: Fix an IS_ERR() vs NULL bug in esw_qos_move_node()Dan Carpenter
The __esw_qos_alloc_node() function returns NULL on error. It doesn't return error pointers. Update the error checking to match. Fixes: 96619c485fa6 ("net/mlx5: Add support for setting tc-bw on nodes") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/0ce4ec2a-2b5d-4652-9638-e715a99902a7@sabinyo.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: ethernet: mtk_wed: Fix NULL vs IS_ERR() bug in mtk_wed_get_memory_region()Dan Carpenter
We recently changed this from using devm_ioremap() to using devm_ioremap_resource() and unfortunately the former returns NULL while the latter returns error pointers. The check for errors needs to be updated as well. Fixes: e27dba1951ce ("net: Use of_reserved_mem_region_to_resource{_byname}() for "memory-region"") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/87c10dbd-df86-4971-b4f5-40ba02c076fb@sabinyo.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: airoha: Fix a NULL vs IS_ERR() bug in airoha_npu_run_firmware()Dan Carpenter
The devm_ioremap_resource() function returns error pointers. It never returns NULL. Update the check to match. Fixes: e27dba1951ce ("net: Use of_reserved_mem_region_to_resource{_byname}() for "memory-region"") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Acked-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/fc6d194e-6bf5-49ca-bc77-3fdfda62c434@sabinyo.mountain Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge branch 'add-shared-phy-counter-support-for-qca807x-and-qca808x'Jakub Kicinski
Luo Jie says: ==================== Add shared PHY counter support for QCA807x and QCA808x The implementation of the PHY counter is identical for both QCA808x and QCA807x series devices. This includes counters for both good and bad CRC frames in the RX and TX directions, which are active when CRC checking is enabled. This patch series introduces PHY counter functions into a shared library, enabling counter support for the QCA808x and QCA807x families through this common infrastructure. Additionally, enable CRC checking and configure automatic clearing of counters after reading within config_init() to ensure accurate counter recording. v2: https://lore.kernel.org/20250714-qcom_phy_counter-v2-0-94dde9d9769f@quicinc.com v1: https://lore.kernel.org/20250709-qcom_phy_counter-v1-0-93a54a029c46@quicinc.com ==================== Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-0-8b0e460a527b@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: phy: qcom: qca807x: Support PHY counterLuo Jie
Within the QCA807X PHY operation's config_init() function, enable CRC checking for received and transmitted frames and configure counter to clear after being read to support counter recording. Additionally, add support for PHY counter operations. Signed-off-by: Luo Jie <quic_luoj@quicinc.com> Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-3-8b0e460a527b@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: phy: qcom: qca808x: Support PHY counterLuo Jie
Enable CRC checking for received and transmitted frames, and configure counters to clear after being read within config_init() for accurate counter recording. Additionally, add PHY counter operations and integrate shared functions. Signed-off-by: Luo Jie <quic_luoj@quicinc.com> Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-2-8b0e460a527b@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17net: phy: qcom: Add PHY counter supportLuo Jie
Add PHY counter functionality to the shared library. The implementation is identical for the current QCA807X and QCA808X PHYs. The PHY counter can be configured to perform CRC checking for both received and transmitted packets. Additionally, the packet counter can be set to automatically clear after it is read. The PHY counter includes 32-bit packet counters for both RX (received) and TX (transmitted) packets, as well as 16-bit counters for recording CRC error packets for both RX and TX. Signed-off-by: Luo Jie <quic_luoj@quicinc.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250715-qcom_phy_counter-v3-1-8b0e460a527b@quicinc.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17netdevsim: remove redundant branchDennis Chen
bool notify is referenced nowhere else in the function except to check whether or not to call rtnl_offload_xstats_notify(). Remove it and move the call to the previous branch. Signed-off-by: Dennis Chen <dechen@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250716165750.561175-1-dechen@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-07-17 We've added 13 non-merge commits during the last 20 day(s) which contain a total of 4 files changed, 712 insertions(+), 84 deletions(-). The main changes are: 1) Avoid skipping or repeating a sk when using a TCP bpf_iter, from Jordan Rife. 2) Clarify the driver requirement on using the XDP metadata, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: doc: xdp: Clarify driver implementation for XDP Rx metadata selftests/bpf: Add tests for bucket resume logic in established sockets selftests/bpf: Create iter_tcp_destroy test program selftests/bpf: Create established sockets in socket iterator tests selftests/bpf: Make ehash buckets configurable in socket iterator tests selftests/bpf: Allow for iteration over multiple states selftests/bpf: Allow for iteration over multiple ports selftests/bpf: Add tests for bucket resume logic in listening sockets bpf: tcp: Avoid socket skips and repeats during iteration bpf: tcp: Use bpf_tcp_iter_batch_item for bpf_tcp_iter_state batch items bpf: tcp: Get rid of st_bucket_done bpf: tcp: Make sure iter->batch always contains a full bucket snapshot bpf: tcp: Make mem flags configurable through bpf_iter_tcp_realloc_batch ==================== Link: https://patch.msgid.link/20250717191731.4142326-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17selftests: net: prevent Python from buffering the outputJakub Kicinski
Make sure Python doesn't buffer the output, otherwise for some tests we may see false positive timeouts in NIPA. NIPA thinks that a machine has hung if the test doesn't print anything for 3min. This is also nice to heave for running the tests manually, especially in vng. Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20250716205712.1787325-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17Merge branch 'neighbour-convert-rtm_getneigh-to-rcu-and-make-pneigh-rtnl-free'Jakub Kicinski
Kuniyuki Iwashima says: ==================== neighbour: Convert RTM_GETNEIGH to RCU and make pneigh RTNL-free. This is kind of v3 of the series below [0] but without NEIGHTBL patches. Patch 1 ~ 4 and 9 come from the series to convert RTM_GETNEIGH to RCU. Other patches clean up pneigh_lookup() and convert the pneigh code to RCU + private mutex so that we can easily remove RTNL from RTM_NEWNEIGH in the later series. [0]: https://lore.kernel.org/netdev/20250418012727.57033-1-kuniyu@amazon.com/ v2: https://lore.kernel.org/20250712203515.4099110-1-kuniyu@google.com v1: https://lore.kernel.org/20250711191007.3591938-1-kuniyu@google.com ==================== Link: https://patch.msgid.link/20250716221221.442239-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Update pneigh_entry in pneigh_create().Kuniyuki Iwashima
neigh_add() updates pneigh_entry() found or created by pneigh_create(). This update is serialised by RTNL, but we will remove it. Let's move the update part to pneigh_create() and make it return errno instead of a pointer of pneigh_entry. Now, the pneigh code is RTNL free. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.Kuniyuki Iwashima
tbl->phash_buckets[] is only modified in the slow path by pneigh_create() and pneigh_delete() under the table lock. Both of them are called under RTNL, so no extra lock is needed, but we will remove RTNL from the paths. pneigh_create() looks up a pneigh_entry, and this part can be lockless, but it would complicate the logic like 1. lookup 2. allocate pengih_entry for GFP_KERNEL 3. lookup again but under lock 4. if found, return it after freeing the allocated memory 5. else, return the new one Instead, let's add a per-table mutex and run lookup and allocation under it. Note that updating pneigh_entry part in neigh_add() is still protected by RTNL and will be moved to pneigh_create() in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_lookup().Kuniyuki Iwashima
Now, all callers of pneigh_lookup() are under RCU, and the read lock there is no longer needed. Let's drop the lock, inline __pneigh_lookup_1() to pneigh_lookup(), and call it from pneigh_create(). The next patch will remove tbl->lock from pneigh_create(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-14-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Remove __pneigh_lookup().Kuniyuki Iwashima
__pneigh_lookup() is the lockless version of pneigh_lookup(), but its only caller pndisc_is_router() holds the table lock and reads pneigh_netry.flags. This is because accessing pneigh_entry after pneigh_lookup() was illegal unless the caller holds RTNL or the table lock. Now, pneigh_entry is guaranteed to be alive during the RCU critical section. Let's call pneigh_lookup() and use READ_ONCE() for n->flags in pndisc_is_router() and remove __pneigh_lookup(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Use rcu_dereference() in pneigh_get_{first,next}().Kuniyuki Iwashima
Now pneigh_entry is guaranteed to be alive during the RCU critical section even without holding tbl->lock. Let's use rcu_dereference() in pneigh_get_{first,next}(). Note that neigh_seq_start() still holds tbl->lock for the normal neighbour entry. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-12-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_dump_table().Kuniyuki Iwashima
Now pneigh_entry is guaranteed to be alive during the RCU critical section even without holding tbl->lock. Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference() to iterate tbl->phash_buckets[] in pneigh_dump_table() Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Convert RTM_GETNEIGH to RCU.Kuniyuki Iwashima
Only __dev_get_by_index() is the RTNL dependant in neigh_get(). Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH to RCU. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Annotate access to struct pneigh_entry.{flags,protocol}.Kuniyuki Iwashima
We will convert pneigh readers to RCU, and its flags and protocol will be read locklessly. Let's annotate the access to the two fields. Note that all access to pn->permanent is under RTNL (neigh_add() and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE() are not needed. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Free pneigh_entry after RCU grace period.Kuniyuki Iwashima
We will convert RTM_GETNEIGH to RCU. neigh_get() looks up pneigh_entry by pneigh_lookup() and passes it to pneigh_fill_info(). Then, we must ensure that the entry is alive till pneigh_fill_info() completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not guarantee that. Also, we will convert all readers of tbl->phash_buckets[] to RCU. Let's use call_rcu() to free pneigh_entry and update phash_buckets[] and ->next by rcu_assign_pointer(). pneigh_ifdown_and_unlock() uses list_head to avoid overwriting ->next and moving RCU iterators to another list. pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it is not delayed to call_rcu(), where we cannot sleep. This is fine because the mcast code works with RCU and ipv6_dev_mc_dec() frees mcast objects after RCU grace period. While at it, we change the return type of pneigh_ifdown_and_unlock() to void. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Annotate neigh_table.phash_buckets and pneigh_entry.next with __rcu.Kuniyuki Iwashima
The next patch will free pneigh_entry with call_rcu(). Then, we need to annotate neigh_table.phash_buckets[] and pneigh_entry.next with __rcu. To make the next patch cleaner, let's annotate the fields in advance. Currently, all accesses to the fields are under the neigh table lock, so rcu_dereference_protected() is used with 1 for now, but most of them (except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be replaced with rcu_dereference() and rcu_dereference_check(). Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a local list, which is illegal because the RCU iterator could be moved to another list. This part will be fixed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Split pneigh_lookup().Kuniyuki Iwashima
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which is confusing. When called with the last argument, creat, 0, pneigh_lookup() literally looks up a proxy neighbour entry. This is the case of the reader path as the fast path and RTM_GETNEIGH. pneigh_lookup(), however, creates a pneigh_entry when called with creat 1 from RTM_NEWNEIGH and SIOCSARP, which require RTNL. Let's split pneigh_lookup() into two functions. We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock) in the new pneigh_lookup() will be dropped. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Move neigh_find_table() to neigh_get().Kuniyuki Iwashima
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[]. neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does not protect it at all; neigh_table_clear() can be called without RTNL and only waits for RCU readers by synchronize_rcu(). Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be unloaded, and DECNET was removed. To fetch neigh_tables[] by rcu_dereference() later, let's move neigh_find_table() from neigh_valid_get_req() to neigh_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Allocate skb in neigh_get().Kuniyuki Iwashima
We will remove RTNL for neigh_get() and run it under RCU instead. neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL. Let's move the allocation before __dev_get_by_index() in neigh_get(). Now, neigh_get_reply() and pneigh_get_reply() are inlined and rtnl_unicast() is factorised. We will convert pneigh_lookup() to __pneigh_lookup() later. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>