linux.git - Linus' kernel tree

Age	Commit message	Author
2025-05-23	netfilter: nf_tables: Introduce nft_register_flowtable_ops()	Phil Sutter
	Facilitate binding and registering of a flowtable hook via a single function call. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}()	Phil Sutter
	Also a pretty dull wrapper around the hook->ops.dev comparison for now. Will search the embedded nf_hook_ops list in future. The ugly cast to eliminate the const qualifier will vanish then, too. Since this future list will be RCU-protected, also introduce an _rcu() variant here. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nf_tables: Introduce functions freeing nft_hook objects	Phil Sutter
	Pointless wrappers around kfree() for now, prep work for an embedded list of nf_hook_ops. Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nf_tables: add packets conntrack state to debug trace info	Florian Westphal
	Add the minimal relevant info needed for userspace ("nftables monitor trace") to provide the conntrack view of the packet: - state (new, related, established) - direction (original, reply) - status (e.g., if connection is subject to dnat) - id (allows to query ctnetlink for remaining conntrack state info) Example: trace id a62 inet filter PRE_RAW packet: iif "enp0s3" ether [..] [..] trace id a62 inet filter PRE_MANGLE conntrack: ct direction original ct state new ct id 32 trace id a62 inet filter PRE_MANGLE packet: [..] [..] trace id a62 inet filter IN conntrack: ct direction original ct state new ct status dnat-done ct id 32 [..] In this case one can see that while NAT is active, the new connection isn't subject to a translation. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: conntrack: make nf_conntrack_id callable without a module dependency	Florian Westphal
	While nf_conntrack_id() doesn't need any functionality from conntrack, it does reside in nf_conntrack_core.c -- callers add a module dependency on conntrack. Followup patch will need to compute the conntrack id from nf_tables_trace.c to include it in nf_trace messages emitted to userspace via netlink. I don't want to introduce a module dependency between nf_tables and conntrack for this. Since trace is slowpath, the added indirection is ok. One alternative is to move nf_conntrack_id to the netfilter/core.c, but I don't see a compelling reason so far. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit	Sebastian Andrzej Siewior
	nf_dup_skb_recursion is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Move nf_dup_skb_recursion to struct netdev_xmit, provide wrappers. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx	Sebastian Andrzej Siewior
	nft_pcpu_tun_ctx is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Make a struct with a nft_inner_tun_ctx member (original nft_pcpu_tun_ctx) and a local_lock_t and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nf_dup{4, 6}: Move duplication check to task_struct	Sebastian Andrzej Siewior
	nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Due to the recursion involved, the simplest change is to make it a per-task variable. Move the per-CPU variable nf_skb_duplicated to task_struct and name it in_nf_duplicate. Add it to the existing bitfield so it doesn't use additional memory. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nft_tunnel: fix geneve_opt dump	Fernando Fernandez Mancera
	When dumping a nft_tunnel with more than one geneve_opt configured the netlink attribute hierarchy should be as follows:
	NFTA_TUNNEL_KEY_OPTS
	|
	|--NFTA_TUNNEL_KEY_OPTS_GENEVE
	|  |
	|  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
	|  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
	|  |--NFTA_TUNNEL_KEY_GENEVE_DATA
	|
	|--NFTA_TUNNEL_KEY_OPTS_GENEVE
	|  |
	|  |--NFTA_TUNNEL_KEY_GENEVE_CLASS
	|  |--NFTA_TUNNEL_KEY_GENEVE_TYPE
	|  |--NFTA_TUNNEL_KEY_GENEVE_DATA
	|
	|--NFTA_TUNNEL_KEY_OPTS_GENEVE
	...
	Otherwise, userspace tools won't be able to fetch the geneve options configured correctly. Fixes: 925d844696d9 ("netfilter: nft_tunnel: add support for geneve opts") Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs	Florian Westphal
	Replace the existing VRF test with a more comprehensive one. It tests the following combinations: - fib type (returns address type, e.g. unicast) - fib oif (route output interface index) - both with and without 'iif' keyword (changes result, e.g. 'fib daddr type local' will be true when the destination address is configured on the local machine, but 'fib daddr . iif type local' will only be true when the destination address is configured on the incoming interface). Add all types of addresses to test with for both ipv4 and ipv6: - local address on the incoming interface - local address on another interface - local address on another interface that's part of a vrf - address on another host The ruleset stores obtained results from 'fib' in nftables sets and then queries the sets to check that it has the expected results. Perform one pass while packets are coming in on interface NOT part of a VRF and then again when it was added and make sure fib returns the expected routes and address types for the various addresses in the setup. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23	netfilter: nf_tables: nft_fib: consistent l3mdev handling	Florian Westphal
	fib has two modes: 1. Obtain output device according to source or destination address 2. Obtain the type of the address, e.g. local, unicast, multicast. 'fib daddr type' should return 'local' if the address is configured in this netns or unicast otherwise. 'fib daddr . iif type' should return 'local' if the address is configured on the input interface or unicast otherwise, i.e. more restrictive. However, if the interface is part of a VRF, then 'fib daddr type' returns unicast even if the address is configured on the incoming interface. This is broken for both ipv4 and ipv6. In the ipv4 case, inet_dev_addr_type must only be used if the 'iif' or 'oif' (strict mode) was requested. Else inet_addr_type_dev_table() needs to be used and the correct dev argument must be passed as well so the correct fib (vrf) table is used. In the ipv6 case, the bug is similar, without strict mode, dev is NULL so .flowi6_l3mdev will be set to 0. Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that to init the .l3mdev structure member. For ipv6, use it from nft_fib6_flowi_init() which gets called from both the 'type' and the 'route' mode eval functions. This provides consistent behaviour for all modes for both ipv4 and ipv6: If strict matching is requested, the input respectively output device of the netfilter hooks is used. Otherwise, use skb->dev to obtain the l3mdev ifindex. Without this, most type checks in updated nft_fib.sh selftest fail: FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4 FAIL: did not find veth0 . dead:1::1 . local in fibtype6 FAIL: did not find veth0 . dead:9::1 . local in fibtype6 FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4 FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4 FAIL: did not find tvrf . dead:1::1 . local in fibtype6 FAIL: did not find tvrf . dead:9::1 . local in fibtype6 FAIL: fib expression address types match (iif in vrf) (fib erroneously returns 'unicast' for all of them, even though all of these addresses are local to the vrf). Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22	netfilter: nf_tables: nft_fib_ipv6: fix VRF ipv4/ipv6 result discrepancy	Florian Westphal
	With a VRF, the ipv4 and ipv6 FIB expressions behave differently: 'fib daddr . iif oif' will return the input interface name for ipv4, but the real device for ipv6. Example: the VRF device name is tvrf and the real (incoming) device is veth0. First round is ok, both ipv4 and ipv6 will yield 'veth0'. But in the second round (incoming device will be set to "tvrf"), ipv4 will yield "tvrf" whereas ipv6 returns "veth0" for the second round too. This makes ipv6 behave like ipv4. A followup patch will add a test case for this, without this change it will fail with: get element inet t fibif6iif { tvrf . dead:1::99 . tvrf } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif Alternatively we could either not do anything at all or change ipv4 to also return the lower/real device, however, nft (userspace) doc says "iif: if fib lookup provides a route then check its output interface is identical to the packets input interface." which is what the nft fib ipv4 behaviour is. Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22	selftests: netfilter: move fib vrf test to nft_fib.sh	Florian Westphal
	It was located in conntrack_vrf.sh because that already had the VRF bits. Let's not add to this and move it to nft_fib.sh where this belongs. No functional changes for the subtest intended. The subtest is limited, it only covered 'fib oif' (route output interface query) when the incoming interface is part of a VRF. Next we can extend it to cover 'fib type' for VRFs and also check fib results when there is an unrelated VRF in same netns. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22	selftests: netfilter: nft_fib.sh: add 'type' mode tests	Florian Westphal
	fib can either look up the interface id/name of the output interface that would be used for the given address, or it can check for the type of the address according to the fib, e.g. local, unicast, multicast and so on. This can be used to e.g. make a locally configured address only reachable through its interface. Example: given eth0:10.1.1.1 and eth1:10.1.2.1 then 'fib daddr type' for 10.1.1.1 arriving on eth1 will be 'local', but 'fib daddr . iif type' is expected to return 'unicast', whereas 'fib daddr' and 'fib daddr . iif' are expected to indicate 'local' if such a packet arrives on eth0. So far nft_fib.sh only covered oif/oifname, not type. Repeat tests both with default and a policy (ip rule) based setup. Also try to run all remaining tests even if a subtest has failed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22	netfilter: xtables: support arpt_mark and ipv6 optstrip for iptables-nft only builds	Florian Westphal
	It's now possible to build a kernel that has no support for the classic xtables get/setsockopt interfaces and builtin tables. In this case, we have CONFIG_IP6_NF_MANGLE=n and CONFIG_IP_NF_ARPTABLES=n. For optstrip, the ipv6 code is so small that we can enable it if netfilter ipv6 support exists. For mark, check if either classic arptables or NFT_ARP_COMPAT is set. Fixes: a9525c7f6219 ("netfilter: xtables: allow xtables-nft only builds") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22	selftests: netfilter: nft_concat_range.sh: add coverage for 4bit group representation	Florian Westphal
	Pipapo supports a more compact '4 bit group' format that is chosen when the memory needed for the default exceeds a threshold (2mb). Add coverage for those code paths, the existing tests use small sets that are handled by the default representation. This comes with a test script run-time increase, but I think it's ok: normal: 2m35s -> 3m9s debug: 3m24s -> 5m29s (with KSFT_MACHINE_SLOW=yes). Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-19	Merge branch 'queue_api-reduce-risk-of-name-collision-over-txq'	Jakub Kicinski
	Gur Stavi says: ==================== queue_api: reduce risk of name collision over txq Rename local variable in macros from txq to _txq. When macro parameter get_desc is expanded it is likely to have a txq token that refers to a different txq variable at the caller's site. ==================== Link: https://patch.msgid.link/cover.1747559621.git.gur.stavi@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19	queue_api: reduce risk of name collision over txq	Gur Stavi
	Rename local variable in macros from txq to _txq. When macro parameter get_desc is expanded it is likely to have a txq token that refers to a different txq variable at the caller's site. Signed-off-by: Gur Stavi <gur.stavi@huawei.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/95b60d218f004308486d92ed17c8cc6f28bac09d.1747559621.git.gur.stavi@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
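The collision this commit guards against can be reproduced outside the kernel. The sketch below is only an illustration: `struct queue`, `QUEUE_PEEK_BAD`, `QUEUE_PEEK_GOOD` and the two `peek_*` helpers are hypothetical names, not the macros from queue_api. It assumes a macro that declares a local pointer and then expands a caller-supplied expression next to it.

```c
#include <assert.h>

struct queue {
	int base;
	int head;
};

/* Hypothetical macro with a local named `txq`: any `txq` token inside
 * get_desc binds to this local, not to the caller's variable. */
#define QUEUE_PEEK_BAD(q, out, get_desc) do {	\
	struct queue *txq = (q);		\
	(out) = txq->base + (get_desc);		\
} while (0)

/* Same macro with the local renamed to `_txq`, as the commit does. */
#define QUEUE_PEEK_GOOD(q, out, get_desc) do {	\
	struct queue *_txq = (q);		\
	(out) = _txq->base + (get_desc);	\
} while (0)

static struct queue queue_a = { .base = 0, .head = 1 };
static struct queue queue_b = { .base = 100, .head = 7 };

int peek_bad(void)
{
	struct queue *txq = &queue_a;	/* the caller's own txq */
	int out;

	(void)txq;	/* never actually read: the macro local shadows it */
	QUEUE_PEEK_BAD(&queue_b, out, txq->head);
	return out;	/* queue_b.base + queue_b.head */
}

int peek_good(void)
{
	struct queue *txq = &queue_a;
	int out;

	QUEUE_PEEK_GOOD(&queue_b, out, txq->head);
	return out;	/* queue_b.base + queue_a.head, as the caller intended */
}
```

With these values `peek_bad()` silently reads `queue_b.head` where the caller meant `queue_a.head`. Renaming the macro local does not make the macro hygienic, it only makes a clash with caller identifiers less likely, which matches the "reduce risk" wording of the subject line.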
2025-05-19	Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue	Jakub Kicinski
	Tony Nguyen says: ==================== idpf: add initial PTP support Milena Olech says: This patch series introduces support for Precision Time Protocol (PTP) to Intel(R) Infrastructure Data Path Function (IDPF) driver. PTP feature is supported when the PTP capability is negotiated with the Control Plane (CP). IDPF creates a PTP clock and sets a set of supported functions. During the PTP initialization, IDPF requests a set of PTP capabilities and receives a writeback from the CP with the set of supported options. These options are: - get time of the PTP clock - set the time of the PTP clock - adjust the PTP clock - Tx timestamping Each feature is considered to have direct access, where the operations on PCIe BAR registers are allowed, or the mailbox access, where the virtchnl messages are used to perform any PTP action. Mailbox access means that PTP requests are sent to the CP through dedicated secondary mailbox and the CP reads/writes/modifies desired resource - PTP Clock or Tx timestamp registers. Tx timestamp capabilities are negotiated only for vports that have UPLINK_VPORT flag set by the CP. Capabilities provide information about the number of available Tx timestamp latches, their indexes and size of the Tx timestamp value. IDPF requests Tx timestamp by setting the TSYN bit and the requested timestamp index in the context descriptor for the PTP packets. When the completion tag for that packet is received, IDPF schedules a worker to read the Tx timestamp value. 
* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: idpf: add support for Rx timestamping idpf: add Tx timestamp flows idpf: add Tx timestamp capabilities negotiation idpf: add PTP clock configuration idpf: add mailbox access to read PTP clock time idpf: negotiate PTP capabilities and get PTP clock idpf: move virtchnl structures to the header file virtchnl: add PTP virtchnl definitions idpf: add initial PTP support idpf: change the method for mailbox workqueue allocation ==================== Link: https://patch.msgid.link/20250516170645.1172700-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19	selftests: drv-net: Fix "envirnoments" to "environments"	Sumanth Gavini
	Fix misspelling reported by codespell Signed-off-by: Sumanth Gavini <sumanth.gavini@yahoo.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250516225156.1122058-1-sumanth.gavini@yahoo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-19	net: netlink: reduce extack cookie size	Johannes Berg
	Seems like the extack cookie hasn't found any users outside of wireless, which always uses nl_set_extack_cookie_u64(). Thus, allocating 20 bytes for it is pointless, reduce that to 8 bytes, and add a BUILD_BUG_ON() to ensure it's enough (obviously it is, for a u64, but in case it changes again.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Link: https://patch.msgid.link/20250516115927.38209-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
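The size guard described here can be sketched in plain C. This is not the kernel code: `struct extack_sketch`, `COOKIE_MAX` and `set_cookie_u64()` are hypothetical stand-ins, and C11 `static_assert` is used in place of the kernel's `BUILD_BUG_ON()`. The only facts taken from the commit are the 8-byte buffer and the u64 cookie.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define COOKIE_MAX 8	/* hypothetical name; 8 bytes as stated in the commit */

struct extack_sketch {
	uint8_t cookie[COOKIE_MAX];
	uint8_t cookie_len;
};

/* Store a u64 cookie; the build breaks if the buffer ever shrinks below it. */
static inline void set_cookie_u64(struct extack_sketch *e, uint64_t cookie)
{
	static_assert(sizeof(e->cookie) >= sizeof(cookie),
		      "cookie buffer too small for a u64");

	memcpy(e->cookie, &cookie, sizeof(cookie));
	e->cookie_len = sizeof(cookie);
}
```

The check costs nothing at run time. It only matters if someone later changes the buffer size or the cookie type, which is the "in case it changes again" scenario the message mentions.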
2025-05-19	Merge tag 'ovpn-net-next-20250515' of https://github.com/OpenVPN/ovpn-net-next	David S. Miller
	Antonio Quartulli says: ==================== ovpn: pull request for net-next: ovpn 2025-05-15 this is a new version of the previous pull request. This time I have removed the fixes that we are still discussing, so that we don't hold the entire series back. There is a new fix though: it's about properly checking the return value of skb_to_sgvec_nomark(). I spotted the issue while testing pings larger than the iface's MTU on a TCP VPN connection. I have added various Closes and Link tags where applicable, so that we have references to GitHub tickets and other public discussions. Since I have resent the PR, I have also added Andrew's Reviewed-by to the first patch. Please pull or let me know if something should be changed! ==================== Signed-off-by: David S. Miller <davem@davemloft.net> Patchset highlights: - update MAINTAINERS entry for ovpn - extend selftest with more cases - avoid crash in selftest in case of getaddrinfo() failure - fix ndo_start_xmit return value on error - set ignore_df flag for IPv6 packets - drop useless reg_state check in keepalive worker - retain skb's dst when entering xmit function - fix check on skb_to_sgvec_nomark() return value
2025-05-16	Merge branch 'vsock-test-improve-sigpipe-test-reliability'	Jakub Kicinski
	Stefano Garzarella says: ==================== vsock/test: improve sigpipe test reliability Running the tests continuously I noticed that sometimes the sigpipe test would fail due to a race between the control message of the test and the vsock transport messages. While I was at it I also improved the test by checking the errno we expect. v1: https://lore.kernel.org/20250508142005.135857-1-sgarzare@redhat.com ==================== Link: https://patch.msgid.link/20250514141927.159456-1-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	vsock/test: check also expected errno on sigpipe test	Stefano Garzarella
	In the sigpipe test, we expect send() to fail, but we do not check if send() fails with the errno we expect (EPIPE). Add this check and repeat the send() in case of EINTR as we do in other tests. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250514141927.159456-4-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
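The pattern described here (retry on EINTR, then verify the specific errno) can be sketched as below. This is not the vsock test itself: `send_expect_errno()` and `demo_epipe()` are hypothetical helpers, and the demo uses an AF_UNIX socketpair with a closed peer instead of AF_VSOCK so that it runs without a hypervisor. `MSG_NOSIGNAL` is passed so the failure surfaces as an errno rather than a SIGPIPE.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send one byte, retrying on EINTR. Return 0 only if send() fails with
 * the expected errno; a successful send or any other errno returns -1. */
static int send_expect_errno(int fd, int flags, int expected)
{
	char byte = 'x';
	ssize_t ret;

	assert(fd >= 0);
	do {
		ret = send(fd, &byte, 1, flags);
	} while (ret < 0 && errno == EINTR);

	if (ret >= 0)
		return -1;	/* unexpectedly succeeded */
	return errno == expected ? 0 : -1;
}

/* Demo on a local stream socket whose peer has already been closed. */
int demo_epipe(void)
{
	int sv[2];
	int rc;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	close(sv[1]);
	rc = send_expect_errno(sv[0], MSG_NOSIGNAL, EPIPE);
	close(sv[0]);
	return rc;
}
```

Checking only that send() failed would also accept unrelated errors such as EBADF, which is the gap the commit closes.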
2025-05-16	vsock/test: retry send() to avoid occasional failure in sigpipe test	Stefano Garzarella
	When the other peer calls shutdown(SHUT_RD), there is a chance that the send() call could occur before the message carrying the close information arrives over the transport. In such cases, the send() might still succeed. To avoid this race, let's retry the send() call a few times, ensuring the test is more reliable. Sleep a little before trying again to avoid flooding the other peer and filling its receive buffer, causing false-negative. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250514141927.159456-3-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	vsock/test: add timeout_usleep() to allow sleeping in timeout sections	Stefano Garzarella
	The timeout API uses signals, so we have documented not to use sleep(), but we can use nanosleep(2) since POSIX.1 explicitly specifies that it does not interact with signals. Let's provide timeout_usleep() for that. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250514141927.159456-2-sgarzare@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
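A microsecond sleep built on nanosleep(2) might look like the sketch below. It is not the helper added by this commit: `sleep_usec()` is a hypothetical name, and resuming after EINTR with the remaining time is an assumption made for the example. The real timeout_usleep() may handle interruption differently, for instance by checking the test timeout first.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Sleep for `usecs` microseconds using nanosleep(2). If a signal
 * interrupts the sleep, continue with the remaining time. */
int sleep_usec(unsigned long usecs)
{
	struct timespec ts = {
		.tv_sec = usecs / 1000000UL,
		.tv_nsec = (long)(usecs % 1000000UL) * 1000L,
	};

	/* nanosleep() rejects tv_nsec values of one second or more. */
	assert(ts.tv_nsec < 1000000000L);

	/* On EINTR the kernel stores the unslept time in the second argument. */
	while (nanosleep(&ts, &ts) < 0) {
		if (errno != EINTR)
			return -1;
	}
	return 0;
}
```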
2025-05-16	Merge branch 'tools-ynl-gen-support-sub-messages-and-rt-link'	Jakub Kicinski
	Jakub Kicinski says: ==================== tools: ynl-gen: support sub-messages and rt-link Sub-messages are how we express "polymorphism" in YNL. Donald added the support to specs and Python a while back, support them in C, too. Sub-message is a nest, but the interpretation of the attribute types within that nest depends on a value of another attribute. For example in rt-link the "kind" attribute contains the link type (veth, bonding, etc.) and based on that the right enum has to be applied to interpret link-specific attributes. The last message is probably the most interesting to look at, as it adds a fairly advanced sample. This patch only contains enough support for rtnetlink, we will need a little more complexity to support TC, where sub-messages may contain fixed headers, and where the selector may be in a different nest than the submessage. ==================== Link: https://patch.msgid.link/20250515231650.1325372-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	tools: ynl: add a sample for rt-link	Jakub Kicinski
	Add a fairly complete example of rt-link usage. If run without any arguments it simply lists the interfaces and some of their attrs. If run with an arg it tries to create and delete a netkit device.
	 1 # ./tools/net/ynl/samples/rt-link 1
	 2 Trying to create a Netkit interface
	 3 Testing error message for policy being bad:
	 4 Kernel error: 'Provided default xmit policy not supported' (bad attribute: .linkinfo.data(netkit).policy)
	 5 1: lo: mtu 65536
	 6 2: wlp0s1: mtu 1500
	 7 3: enp0s13: mtu 1500
	 8 4: dummy0: mtu 1500 kind dummy altname one two
	 9 5: nk0: mtu 1500 kind netkit primary 0 policy forward
	10 6: nk1: mtu 1500 kind netkit primary 1 policy blackhole
	11 Trying to delete a Netkit interface (ifindex 6)
	Sample creates the device first, it sets an invalid value for a netkit attribute to trigger reverse parsing. Line 4 shows the error with the attribute path correctly generated by YNL. Then sample fixes the bad attribute and re-issues the request, with NLM_F_ECHO set. This flag causes the notification to be looped back to the initiating socket (our socket). Sample parses this notification to save the ifindex of the created netkit. Sample then proceeds to list the devices. Line 8 above shows a dummy device with two alt names. Lines 9 and 10 show the netkit devices the sample itself created. The "primary" and "policy" attrs are from inside the netkit submsg. The string values are auto-generated for the enums by YNL. To clean up sample deletes the interface it created (line 11). Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-10-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	tools: ynl: enable codegen for all rt- families	Jakub Kicinski
	Switch from including Classic netlink families one by one to excluding. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-9-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	tools: ynl: submsg: reverse parse / error reporting	Jakub Kicinski
	Reverse parsing lets YNL convert bad and missing attr pointers from extack into a string like "missing attribute nest1.nest2.attr_name". It's a feature that's unique to YNL C AFAIU (even the Python YNL can't do nested reverse parsing). Add support for reverse-parsing of sub-messages. To simplify the logic and the code annotate the type policies with extra metadata. Mark the selectors and the messages with the information we need. We assume that key / selector always precedes the sub-message while parsing (and also if there are multiple sub-messages like in rt-link they are interleaved selector 1 ... submsg 1 ... selector 2 .. submsg 2, not selector 1 ... selector 2 ... submsg 1 ... submsg 2). The rt-link sample in a subsequent change shows reverse parsing of sub-messages in action. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-8-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	tools: ynl-gen: submsg: support parsing and rendering sub-messages	Jakub Kicinski
	Adjust parsing and rendering appropriately to make sub-messages work. Rendering is pretty trivial, as the submsg -> netlink conversion looks like rendering a nest in which only one attr was set. Only trick is that we use the enum value of the sub-message rather than the nest as the type, and effectively skip one layer of nesting. A real double nested struct would look like this: [SELECTOR] [SUBMSG] [NEST] [MSG1-ATTR] A submsg "is" the nest so by skipping I mean: [SELECTOR] [SUBMSG] [MSG1-ATTR] There is no extra validation in YNL if caller has set the selector matching the submsg type (e.g. link type = "macvlan" but the nest attrs are set to carry "veth"). Let the kernel handle that. Parsing side is a little more specialized as we need to render and insert a new kind of function which switches between what to parse based on the selector. But code isn't too complicated. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-7-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
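The selector-driven parsing described here can be illustrated with a toy model. Nothing below is generated YNL code: `struct attr`, `struct linkinfo_data`, the attribute numbering and all function names are hypothetical. The sketch assumes a simplified wire format of (type, value) pairs and shows the two ideas from the message: the sub-message is stored as one struct with a member per possible variant, and a switching function picks the parser based on the selector.

```c
#include <assert.h>
#include <string.h>

/* One simplified wire attribute: a type number and an integer value. */
struct attr {
	int type;
	int value;
};

/* The sub-message stored as one struct holding every possible variant. */
struct linkinfo_data {
	struct {
		unsigned int has_peer:1;
		int peer;
	} veth;
	struct {
		unsigned int has_policy:1;
		int policy;
	} netkit;
};

static int parse_veth(struct linkinfo_data *d, const struct attr *a, int n)
{
	for (int i = 0; i < n; i++) {
		if (a[i].type == 1) {	/* type 1 means "peer" for veth */
			d->veth.has_peer = 1;
			d->veth.peer = a[i].value;
		}
	}
	return 0;
}

static int parse_netkit(struct linkinfo_data *d, const struct attr *a, int n)
{
	for (int i = 0; i < n; i++) {
		if (a[i].type == 1) {	/* type 1 means "policy" for netkit */
			d->netkit.has_policy = 1;
			d->netkit.policy = a[i].value;
		}
	}
	return 0;
}

/* The same attribute number is interpreted differently per selector. */
int parse_linkinfo_data(struct linkinfo_data *d, const char *kind,
			const struct attr *attrs, int n)
{
	assert(d && kind);
	memset(d, 0, sizeof(*d));

	if (!strcmp(kind, "veth"))
		return parse_veth(d, attrs, n);
	if (!strcmp(kind, "netkit"))
		return parse_netkit(d, attrs, n);
	return -1;	/* unknown selector */
}
```

Feeding the same attribute array with selector "veth" and then "netkit" fills different members, which is why the selector has to be known before the nest is parsed.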
2025-05-16	tools: ynl-gen: submsg: render the structs	Jakub Kicinski
	The easiest (or perhaps only sane) way to support submessages in C is to treat them as if they were nests. Build fake attributes to that effect in the codegen. Render the submsg as a big nest of all possible values. With this in place the main missing part is to hook in the switch which selects how to parse based on the key. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	tools: ynl-gen: submsg: plumb thru an empty type	Jakub Kicinski
	Hook in handling of sub-messages, for now treat them as ignored attrs. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	tools: ynl-gen: prepare for submsg structs	Jakub Kicinski
	Prepare for constructing Struct() instances which represent sub-messages rather than nested attributes. Restructure the code / indentation to more easily insert a case where nested reference comes from annotation other than the 'nested-attributes' property. Make sure we don't construct the Struct() object from scratch in multiple places as the constructor will soon have more arguments. This should cause no functional change. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	tools: ynl-gen: factor out the annotation of pure nested struct	Jakub Kicinski
	We're about to add some code here for sub-messages. Factor out the nest-related logic to make the code readable. No functional change. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	netlink: specs: rt-link: add C naming info for ovpn	Jakub Kicinski
	C naming info for OVPN which was added since I adjusted the existing attrs. Also add missing reference to a header needed for a bridge struct. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250515231650.1325372-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	net: phy: microchip: document where the LAN88xx PHYs are used	Oleksij Rempel
	The driver uses the name LAN88xx for PHYs with phy_id = 0x0007c132. But with this placeholder name no documentation can be found on the net. Document the fact that these PHYs are built into the LAN7800 and LAN7850 USB/Ethernet controllers. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250515082051.2644450-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	net: phy: fixed_phy: remove fixed_phy_register_with_gpiod	Heiner Kallweit
	Since its introduction 6 yrs ago this function has never had a user. So remove it. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/ccbeef28-65ae-4e28-b1db-816c44338dee@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	net: rfs: add sock_rps_delete_flow() helper	Eric Dumazet
	RFS can exhibit lower performance for workloads using short-lived flows and a small set of 4-tuple. This is often the case for load-testers, using a pair of hosts, if the server has a single listener port. Typical use case : Server : tcp_crr -T128 -F1000 -6 -U -l30 -R 14250 Client : tcp_crr -T128 -F1000 -6 -U -l30 -c -H server \| grep local_throughput This is because RFS global hash table contains stale information, when the same RSS key is recycled for another socket and another cpu. Make sure to undo the changes and go back to initial state when a flow is disconnected. Performance of the above test is increased by 22 %, going from 372604 transactions per second to 457773. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Octavian Purdila <tavip@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	r8169: add support for RTL8127A	ChunHao Lin
	This adds support for 10Gbs chip RTL8127A. Signed-off-by: ChunHao Lin <hau@realtek.com> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://patch.msgid.link/20250515095303.3138-1-hau@realtek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	net: dlink: add synchronization for stats update	Moon Yeounsu
	This patch synchronizes code that is accessed from both user-space and IRQ contexts. The `get_stats()` function can be called from both contexts. `dev->stats.tx_errors` and `dev->stats.collisions` are also updated in the `tx_errors()` function. Therefore, these fields must also be protected by synchronization. There is no code that accesses `dev->stats.tx_errors` between the previous and updated lines, so the updating point can be moved. Signed-off-by: Moon Yeounsu <yyyynoom@gmail.com> Link: https://patch.msgid.link/20250515075333.48290-1-yyyynoom@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	net/mlx5e: Reuse per-RQ XDP buffer to avoid stack zeroing overhead	Carolina Jubran
	CONFIG_INIT_STACK_ALL_ZERO introduces a performance cost by zero-initializing all stack variables on function entry. The mlx5 XDP RX path previously allocated a struct mlx5e_xdp_buff on the stack per received CQE, resulting in measurable performance degradation under this config. This patch reuses a mlx5e_xdp_buff stored in the mlx5e_rq struct, avoiding per-CQE stack allocations and repeated zeroing. With this change, XDP_DROP and XDP_TX performance matches that of kernels built without CONFIG_INIT_STACK_ALL_ZERO. Performance was measured on a ConnectX-6Dx using a single RX channel (1 CPU at 100% usage) at ~50 Mpps. The baseline results were taken from net-next-6.15. Stack zeroing disabled: - XDP_DROP: * baseline: 31.47 Mpps * baseline + per-RQ allocation: 32.31 Mpps (+2.68%) - XDP_TX: * baseline: 12.41 Mpps * baseline + per-RQ allocation: 12.95 Mpps (+4.30%) Stack zeroing enabled: - XDP_DROP: * baseline: 24.32 Mpps * baseline + per-RQ allocation: 32.27 Mpps (+32.7%) - XDP_TX: * baseline: 11.80 Mpps * baseline + per-RQ allocation: 12.24 Mpps (+3.72%) Reported-by: Sebastiano Miano <mianosebastiano@gmail.com> Reported-by: Samuel Dobron <sdobron@redhat.com> Link: https://lore.kernel.org/all/CAMENy5pb8ea+piKLg5q5yRTMZacQqYWAoVLE1FE9WhQPq92E0g@mail.gmail.com/ Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Link: https://patch.msgid.link/1747253032-663457-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
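The difference between the two allocation strategies can be sketched in plain C. This is an illustration under assumptions, not the mlx5 code: `struct xdp_scratch`, `struct rx_queue` and both handler functions are hypothetical. When a compiler is told to zero-initialize automatic variables (for example with `-ftrivial-auto-var-init=zero`), the stack variant may have to clear the whole scratch structure on every call, while the per-queue variant only writes the fields it uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scratch object, large enough that zeroing it has a cost. */
struct xdp_scratch {
    unsigned char meta[192];
    const void *data;
    size_t len;
};

/* Hypothetical receive queue that owns one long-lived scratch object. */
struct rx_queue {
    struct xdp_scratch scratch;
    unsigned long handled;
};

/* Stack variant: 'buf' is a fresh automatic variable on every call, so a
 * build that force-zeroes the stack may clear all of it each time. */
size_t handle_packet_stack(struct rx_queue *rq, const void *data, size_t len)
{
    struct xdp_scratch buf;

    assert(rq != NULL);
    buf.data = data;
    buf.len = len;
    rq->handled++;
    return buf.len;
}

/* Per-queue variant: reuse the scratch object embedded in the queue and
 * overwrite only the fields needed for this packet. */
size_t handle_packet_per_queue(struct rx_queue *rq, const void *data, size_t len)
{
    struct xdp_scratch *buf;

    assert(rq != NULL);
    buf = &rq->scratch;
    buf->data = data;
    buf->len = len;
    rq->handled++;
    return buf->len;
}
```

The trade-off in this sketch is that the per-queue object carries state from one packet to the next, so every field that is read must be rewritten for each packet.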
2025-05-16	net: phy: mediatek: do not require syscon compatible for pio property	Frank Wunderlich
	Current implementation requires syscon compatible for pio property which is used for driving the switch leds on mt7988. Replace syscon_regmap_lookup_by_phandle with of_parse_phandle and device_node_to_regmap to get the regmap already assigned by pinctrl driver. Signed-off-by: Frank Wunderlich <frank-w@public-files.de> Link: https://patch.msgid.link/20250510174933.154589-1-linux@fw-web.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16	idpf: add support for Rx timestamping	Milena Olech
	Add Rx timestamp function when the Rx timestamp value is read directly from the Rx descriptor. In order to extend the Rx timestamp value to 64 bit in hot path, the PHC time is cached in the receive groups. Add supported Rx timestamp modes. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Tested-by: YiFei Zhu <zhuyifei@google.com> Tested-by: Mina Almasry <almasrymina@google.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
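Extending a short hardware timestamp with a cached clock reading can be sketched as follows. This is a generic illustration, not the idpf implementation: the function name `extend_ts32`, the 32-bit timestamp width, and the assumption that the real timestamp lies within half the 32-bit range of the cached clock value are all choices made for the example.

```c
#include <stdint.h>

/* Extend a 32-bit timestamp to 64 bits using a recently cached 64-bit
 * clock value. Assumes the timestamp was taken within 2^31 ticks of the
 * cached value, either before or after it. */
uint64_t extend_ts32(uint64_t cached_phc, uint32_t ts32)
{
    uint32_t phc_lo = (uint32_t)cached_phc;   /* low 32 bits of cached time */
    uint32_t delta = ts32 - phc_lo;           /* distance modulo 2^32 */

    if (delta > UINT32_MAX / 2) {
        /* More than half the range ahead means the timestamp is really
         * behind the cached time, so step backwards instead. */
        delta = phc_lo - ts32;
        return cached_phc - delta;
    }

    /* Otherwise the timestamp is at or after the cached time. */
    return cached_phc + delta;
}
```

Caching the clock value keeps the hot path free of a clock read per packet; the cache only has to be refreshed often enough that the half-range assumption holds.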
2025-05-16	idpf: add Tx timestamp flows	Milena Olech
	Add functions to request Tx timestamp for the PTP packets, read the Tx timestamp when the completion tag for that packet is being received, extend the Tx timestamp value and set the supported timestamping modes. Tx timestamp is requested for the PTP packets by setting a TSYN bit and index value in the Tx context descriptor. The driver assumption is that the Tx timestamp value is ready to be read when the completion tag is received. Then the driver schedules delayed work and the Tx timestamp value read is requested through virtchnl message. At the end, the Tx timestamp value is extended to 64-bit and provided back to the skb. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Co-developed-by: Josh Hay <joshua.a.hay@intel.com> Signed-off-by: Josh Hay <joshua.a.hay@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16	idpf: add Tx timestamp capabilities negotiation	Milena Olech
	Tx timestamp capabilities are negotiated for the uplink Vport. Driver receives information about the number of available Tx timestamp latches, the size of Tx timestamp value and the set of indexes used for Tx timestamping. Add function to get the Tx timestamp capabilities and parse the uplink vport flag. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Co-developed-by: Emil Tantilov <emil.s.tantilov@intel.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Co-developed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Signed-off-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16	idpf: add PTP clock configuration	Milena Olech
	PTP clock configuration operations - set time, adjust time and adjust frequency are required to control the clock and maintain synchronization process. Extend get PTP capabilities function to request for the clock adjustments and add functions to enable these actions using dedicated virtchnl messages. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Tested-by: Mina Almasry <almasrymina@google.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16	idpf: add mailbox access to read PTP clock time	Milena Olech
	When the access to read PTP clock is specified as mailbox, the driver needs to send virtchnl message to perform PTP actions. Message is sent using idpf_mbq_opc_send_msg_to_peer_drv mailbox opcode, with the parameters received during PTP capabilities negotiation. Add functions to recognize PTP messages, move them to dedicated secondary mailbox, read the PTP clock time and cross timestamp using mailbox messages. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16	idpf: negotiate PTP capabilities and get PTP clock	Milena Olech
	PTP capabilities are negotiated using virtchnl command. Add get capabilities function, direct access to read the PTP clock. Set initial PTP capabilities exposed to the stack. Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Tested-by: Willem de Bruijn <willemb@google.com> Tested-by: Mina Almasry <almasrymina@google.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-05-16	idpf: move virtchnl structures to the header file	Milena Olech
	Move virtchnl structures to the header file to expose them for the PTP virtchnl file. Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Tested-by: Mina Almasry <almasrymina@google.com> Tested-by: Samuel Salin <Samuel.salin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>