summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-02-26net: move aRFS rmap management and CPU affinity to coreAhmed Zaki
A common task for most drivers is to remember the user-set CPU affinity to its IRQs. On each netdev reset, the driver should re-assign the user's settings to the IRQs. Unify this task across all drivers by moving the CPU affinity to napi->config. However, to move the CPU affinity to core, we also need to move aRFS rmap management since aRFS uses its own IRQ notifiers. For the aRFS, add a new netdev flag "rx_cpu_rmap_auto". Drivers supporting aRFS should set the flag via netif_enable_cpu_rmap() and core will allocate and manage the aRFS rmaps. Freeing the rmap is also done by core when the netdev is freed. For better IRQ affinity management, move the IRQ rmap notifier inside the napi_struct and add new notify.notify and notify.release functions: netif_irq_cpu_rmap_notify() and netif_napi_affinity_release(). Now we have the aRFS rmap management in core, add CPU affinity mask to napi_config. To delegate the CPU affinity management to the core, drivers must: 1 - set the new netdev flag "irq_affinity_auto": netif_enable_irq_affinity(netdev) 2 - create the napi with persistent config: netif_napi_add_config() 3 - bind an IRQ to the napi instance: netif_napi_set_irq() the core will then make sure to use re-assign affinity to the napi's IRQ. The default IRQ mask is set to one cpu starting from the closest NUMA. Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com> Link: https://patch.msgid.link/20250224232228.990783-2-ahmed.zaki@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26net: skb: free up one bit in tx_flagsWillem de Bruijn
The linked series wants to add skb tx completion timestamps. That needs a bit in skb_shared_info.tx_flags, but all are in use. A per-skb bit is only needed for features that are configured on a per packet basis. Per socket features can be read from sk->sk_tsflags. Per packet tsflags can be set in sendmsg using cmsg, but only those in SOF_TIMESTAMPING_TX_RECORD_MASK. Per packet tsflags can also be set without cmsg by sandwiching a send inbetween two setsockopts: val |= SOF_TIMESTAMPING_$FEATURE; setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); write(fd, buf, sz); val &= ~SOF_TIMESTAMPING_$FEATURE; setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val)); Changing a datapath test from skb_shinfo(skb)->tx_flags to skb->sk->sk_tsflags can change behavior in that case, as the tx_flags is written before the second setsockopt updates sk_tsflags. Therefore, only bits can be reclaimed that cannot be set by cmsg and are also highly unlikely to be used to target individual packets otherwise. Free up the bit currently used for SKBTX_HW_TSTAMP_USE_CYCLES. This selects between clock and free running counter source for HW TX timestamps. It is probable that all packets of the same socket will always use the same source. Link: https://lore.kernel.org/netdev/cover.1739988644.git.pav@iki.fi/ Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Link: https://patch.msgid.link/20250225023416.2088705-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26tcp: be less liberal in TSEcr received while in SYN_RECV stateEric Dumazet
Yong-Hao Zou mentioned that linux was not strict as other OS in 3WHS, for flows using TCP TS option (RFC 7323) As hinted by an old comment in tcp_check_req(), we can check the TSEcr value in the incoming packet corresponds to one of the SYNACK TSval values we have sent. In this patch, I record the oldest and most recent values that SYNACK packets have used. Send a challenge ACK if we receive a TSEcr outside of this range, and increase a new SNMP counter. nstat -az | grep TSEcrRejected TcpExtTSEcrRejected 0 0.0 Due to TCP fastopen implementation, do not apply yet these checks for fastopen flows. v2: No longer use req->num_timeout, but treq->snt_tsval_first to detect when first SYNACK is prepared. This means we make sure to not send an initial zero TSval. Make sure MPTCP and TCP selftests are passing. Change MIB name to TcpExtTSEcrRejected v1: https://lore.kernel.org/netdev/CADVnQykD8i4ArpSZaPKaoNxLJ2if2ts9m4As+=Jvdkrgx1qMHw@mail.gmail.com/T/ Reported-by: Yong-Hao Zou <yonghaoz1994@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250225171048.3105061-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25ethtool: Symmetric OR-XOR RSS hashGal Pressman
Add an additional type of symmetric RSS hash type: OR-XOR. The "Symmetric-OR-XOR" algorithm transforms the input as follows: (SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT) Change 'cap_rss_sym_xor_supported' to 'supported_input_xfrm', a bitmap of supported RXH_XFRM_* types. Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250224174416.499070-2-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: blackhole: avoid checking the state twiceMatthieu Baerts (NGI0)
A small cleanup, reordering the conditions to avoid checking things twice. The code here is called in case of timeout on a TCP connection, before triggering a retransmission. But it only acts on SYN + MPC packets. So the conditions can be re-order to exit early in case of non-MPTCP SYN + MPC. This also reduce the indentation levels. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-10-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: sched: reduce size for unused dataMatthieu Baerts (NGI0)
Thanks for the previous commit ("mptcp: sched: split get_subflow interface into two"), the mptcp_sched_data structure is now currently unused. This structure has been added to allow future extensions that are not ready yet. At the end, this structure will not even be used at all when mptcp_subflow bpf_iter will be supported [1]. Here is a first step to save 64 bytes on the stack for each scheduling operation. The structure is not removed yet not to break the WIP work on these extensions, but will be done when [1] will be ready and applied. Link: https://lore.kernel.org/6645ad6e-8874-44c5-8730-854c30673218@linux.dev [1] Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-9-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: sched: split get_subflow interface into twoGeliang Tang
get_retrans() interface of the burst packet scheduler invokes a sleeping function mptcp_pm_subflow_chk_stale(), which calls __lock_sock_fast(). So get_retrans() interface should be set with BPF_F_SLEEPABLE flag in BPF. But get_send() interface of this scheduler can't be set with BPF_F_SLEEPABLE flag since it's invoked in ack_update_msk() under mptcp data lock. So this patch has to split get_subflow() interface of packet scheduer into two interfaces: get_send() and get_retrans(). Then we can set get_retrans() interface alone with BPF_F_SLEEPABLE flag. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-8-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: use ipv6_addr_equal in addresses_equalGeliang Tang
Use ipv6_addr_equal() to check whether two IPv6 addresses are equal in mptcp_addresses_equal(). This is more appropriate than using !ipv6_addr_cmp(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-7-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: drop inet6_sk after inet_skGeliang Tang
In mptcp_event_add_subflow(), mptcp_event_pm_listener() and mptcp_nl_find_ssk(), 'issk' has already been got through inet_sk(). No need to use inet6_sk() to get 'ipv6_pinfo' again, just use issk->pinet6 instead. This patch also drops these 'ipv6_pinfo' variables. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-6-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: drop match in userspace_pm_append_new_local_addrGeliang Tang
The variable 'match' in mptcp_userspace_pm_append_new_local_addr() is a redundant one, and this patch drops it. No need to define 'match' as 'struct mptcp_pm_addr_entry *' type. In this function, it's only used to check whether it's NULL. It can be defined as a Boolean one. Also other variables 'addr_match' and 'id_match' make 'match' a redundant one, which can be replaced by directly checking 'addr_match && id_match'. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-5-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: add mptcp_pm_genl_fill_addr helperGeliang Tang
To save some redundant code in dump_addr() interfaces of both the netlink PM and userspace PM, the code that calls netlink message helpers (genlmsg_put/cancel/end) and mptcp_nl_fill_addr() is wrapped into a new helper mptcp_pm_genl_fill_addr(). Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-4-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: add a build check for userspace_pm_dump_addrGeliang Tang
This patch adds a build check for mptcp_userspace_pm_dump_addr() to make sure there is enough space in 'cb->ctx' to store an address id bitmap. Just in case info stored in 'cb->ctx' are increased later. Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-3-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: change to fullmesh only for 'subflow'Matthieu Baerts (NGI0)
If an endpoint doesn't have the 'subflow' flag -- in fact, has no type, so not 'subflow', 'signal', nor 'implicit' -- there are then no subflows created from this local endpoint to at least the initial destination address. In this case, no need to call mptcp_pm_nl_fullmesh() which is there to recreate the subflows to reflect the new value of the fullmesh attribute. Similarly, there is then no need to iterate over all connections to do nothing, if only the 'fullmesh' flag has been changed, and the endpoint doesn't have the 'subflow' one. So stop early when dealing with this specific case. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-2-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24mptcp: pm: remove unused ret value to set flagsMatthieu Baerts (NGI0)
The returned value is not used, it can then be dropped. Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20250221-net-next-mptcp-pm-misc-cleanup-3-v1-1-2b70ab1cee79@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net: Remove shadow variable in netdev_run_todo()Breno Leitao
Fix a shadow variable warning in net/core/dev.c when compiled with CONFIG_LOCKDEP enabled. The warning occurs because 'dev' is redeclared inside the while loop, shadowing the outer scope declaration. net/core/dev.c:11211:22: warning: declaration shadows a local variable [-Wshadow] struct net_device *dev = list_first_entry(&unlink_list, net/core/dev.c:11202:21: note: previous declaration is here struct net_device *dev, *tmp; Remove the redundant declaration since the variable is already defined in the outer scope and will be overwritten in the subsequent list_for_each_entry_safe() loop anyway. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250221-netcons_fix_shadow-v1-1-dee20c8658dd@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net: remove '__' from __skb_flow_get_ports()Nicolas Dichtel
Only one version of skb_flow_get_ports() exists after the previous commit, so let's remove the useless '__'. Suggested-by: Simon Horman <horms@kernel.org> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20250221110941.2041629-3-nicolas.dichtel@6wind.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-24net-sysfs: restore behavior for not running devicesEric Dumazet
modprobe dummy dumdummies=1 Old behavior : $ cat /sys/class/net/dummy0/carrier cat: /sys/class/net/dummy0/carrier: Invalid argument After blamed commit, an empty string is reported. $ cat /sys/class/net/dummy0/carrier $ In this commit, I restore the old behavior for carrier, speed and duplex attributes. Fixes: 79c61899b5ee ("net-sysfs: remove rtnl_trylock from device attributes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Marco Leogrande <leogrande@google.com> Reviewed-by: Antoine Tenart <atenart@kernel.org> Link: https://patch.msgid.link/20250221051223.576726-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: fib_rules: Enable DSCP mask usageIdo Schimmel
Allow user space to configure FIB rules that match on DSCP with a mask, now that support has been added to the IPv4 and IPv6 address families. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-5-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21ipv6: fib_rules: Add DSCP mask matchingIdo Schimmel
Extend IPv6 FIB rules to match on DSCP using a mask. Unlike IPv4, also initialize the DSCP mask when a non-zero 'tos' is specified as there is no difference in matching between 'tos' and 'dscp'. As a side effect, this makes it possible to match on 'dscp 0', like in IPv4. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21ipv4: fib_rules: Add DSCP mask matchingIdo Schimmel
Extend IPv4 FIB rules to match on DSCP using a mask. The mask is only set in rules that match on DSCP (not TOS) and initialized to cover the entire DSCP field if the mask attribute is not specified. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: fib_rules: Add DSCP mask attributeIdo Schimmel
Add an attribute that allows matching on DSCP with a mask. Matching on DSCP with a mask is needed in deployments where users encode path information into certain bits of the DSCP field. Temporarily set the type of the attribute to 'NLA_REJECT' while support is being added. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20250220080525.831924-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-02-20 We've added 19 non-merge commits during the last 8 day(s) which contain a total of 35 files changed, 1126 insertions(+), 53 deletions(-). The main changes are: 1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing 2) Add network TX timestamping support to BPF sock_ops, from Jason Xing 3) Add TX metadata Launch Time support, from Song Yoong Siang * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: igc: Add launch time support to XDP ZC igc: Refactor empty frame insertion for launch time support net: stmmac: Add launch time support to XDP ZC selftests/bpf: Add launch time request to xdp_hw_metadata xsk: Add launch time hardware offload support to XDP Tx metadata selftests/bpf: Add simple bpf tests in the tx path for timestamping feature bpf: Support selective sampling for bpf timestamping bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING bpf: Disable unsafe helpers in TX timestamping callbacks bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping bpf: Add networking timestamping support to bpf_get/setsockopt() selftests/bpf: Add rto max for bpf_setsockopt test bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt ==================== Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net/rds: Replace deprecated strncpy() with strscpy_pad()Thorsten Blum
strncpy() is deprecated for NUL-terminated destination buffers. Use strscpy_pad() instead and remove the manual NUL-termination. Compile-tested only. Link: https://github.com/KSPP/linux/issues/90 Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Tested-by: Allison Henderson <allison.henderson@oracle.com> Link: https://patch.msgid.link/20250219224730.73093-2-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Create link directly in target net namespaceXiao Liang
Make rtnl_newlink_create() create device in target namespace directly. Avoid extra netns change when link netns is provided. Device drivers has been converted to be aware of link netns, that is not assuming device netns is and link netns is the same when ops->newlink() is called. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-12-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Remove "net" from newlink paramsXiao Liang
Now that devices have been converted to use the specific netns instead of ambiguous "net", let's remove it from newlink parameters. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-11-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: xfrm: Use link netns in newlink() of rtnl_link_opsXiao Liang
When link_net is set, use it as link netns instead of dev_net(). This prepares for rtnetlink core to create device in target netns directly, in which case the two namespaces may be different. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-10-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ipv6: Use link netns in newlink() of rtnl_link_opsXiao Liang
When link_net is set, use it as link netns instead of dev_net(). This prepares for rtnetlink core to create device in target netns directly, in which case the two namespaces may be different. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-9-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ipv6: Init tunnel link-netns before registering devXiao Liang
Currently some IPv6 tunnel drivers set tnl->net to dev_net(dev) in ndo_init(), which is called in register_netdevice(). However, it lacks the context of link-netns when we enable cross-net tunnels at device registration time. Let's move the init of tunnel link-netns before register_netdevice(). ip6_gre has already initialized netns, so just remove the redundant assignment. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-8-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ip_tunnel: Use link netns in newlink() of rtnl_link_opsXiao Liang
When link_net is set, use it as link netns instead of dev_net(). This prepares for rtnetlink core to create device in target netns directly, in which case the two namespaces may be different. Convert common ip_tunnel_newlink() to accept an extra link netns argument. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-7-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: ip_tunnel: Don't set tunnel->net in ip_tunnel_init()Xiao Liang
ip_tunnel_init() is called from register_netdevice(). In all code paths reaching here, tunnel->net should already have been set (either in ip_tunnel_newlink() or __ip_tunnel_create()). So don't set it again. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-6-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21ieee802154: 6lowpan: Validate link netns in newlink() of rtnl_link_opsXiao Liang
Device denoted by IFLA_LINK is in link_net (IFLA_LINK_NETNSID) or source netns by design, but 6lowpan uses dev_net. Note dev->netns_local is set to true and currently link_net is implemented via a netns change. These together effectively reject IFLA_LINK_NETNSID. This patch adds a validation to ensure link_net is either NULL or identical to dev_net. Thus it would be fine to continue using dev_net when rtnetlink core begins to create devices directly in target netns. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-5-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21net: Use link/peer netns in newlink() of rtnl_link_opsXiao Liang
Add two helper functions - rtnl_newlink_link_net() and rtnl_newlink_peer_net() for netns fallback logic. Peer netns falls back to link netns, and link netns falls back to source netns. Convert the use of params->net in netdevice drivers to one of the helper functions for clarity. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-4-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Pack newlink() params into structXiao Liang
There are 4 net namespaces involved when creating links: - source netns - where the netlink socket resides, - target netns - where to put the device being created, - link netns - netns associated with the device (backend), - peer netns - netns of peer device. Currently, two nets are passed to newlink() callback - "src_net" parameter and "dev_net" (implicitly in net_device). They are set as follows, depending on netlink attributes in the request. +------------+-------------------+---------+---------+ | peer netns | IFLA_LINK_NETNSID | src_net | dev_net | +------------+-------------------+---------+---------+ | | absent | source | target | | absent +-------------------+---------+---------+ | | present | link | link | +------------+-------------------+---------+---------+ | | absent | peer | target | | present +-------------------+---------+---------+ | | present | peer | link | +------------+-------------------+---------+---------+ When IFLA_LINK_NETNSID is present, the device is created in link netns first and then moved to target netns. This has some side effects, including extra ifindex allocation, ifname validation and link events. These could be avoided if we create it in target netns from the beginning. On the other hand, the meaning of src_net parameter is ambiguous. It varies depending on how parameters are passed. It is the effective link (or peer netns) by design, but some drivers ignore it and use dev_net instead. To provide more netns context for drivers, this patch packs existing newlink() parameters, along with the source netns, link netns and peer netns, into a struct. The old "src_net" is renamed to "net" to avoid confusion with real source netns, and will be deprecated later. The use of src_net are converted to params->net trivially. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-3-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21rtnetlink: Lookup device in target netns when creating linkXiao Liang
When creating link, lookup for existing device in target net namespace instead of current one. For example, two links created by: # ip link add dummy1 type dummy # ip link add netns ns1 dummy1 type dummy should have no conflict since they are in different namespaces. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250219125039.18024-2-shaw.leon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20neighbour: Replace kvzalloc() with kzalloc() when GFP_ATOMIC is specifiedKohei Enju
kzalloc() uses page allocator when size is larger than KMALLOC_MAX_CACHE_SIZE, so the intention of commit ab101c553bc1 ("neighbour: use kvzalloc()/kvfree()") can be achieved by using kzalloc(). When using GFP_ATOMIC, kvzalloc() only tries the kmalloc path, since the vmalloc path does not support the flag. In this case, kvzalloc() is equivalent to kzalloc() in that neither try the vmalloc path, so this replacement brings no functional change. This is primarily a cleanup change, as the original code functions correctly. This patch replaces kvzalloc() introduced by commit 41b3caa7c076 ("neighbour: Add hlist_node to struct neighbour"), which is called in the same context and with the same gfp flag as the aforementioned commit ab101c553bc1 ("neighbour: use kvzalloc()/kvfree()"). Signed-off-by: Kohei Enju <enjuk@amazon.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Link: https://patch.msgid.link/20250219102227.72488-1-enjuk@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix access outside of user given buffer in pktgen_thread_write()Peter Seiderer
Honour the user given buffer size for the strn_len() calls (otherwise strn_len() will access memory outside of the user given buffer). Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-8-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix ctrl interface command parsingPeter Seiderer
Enable command writing without trailing '\n': - the good case $ echo "reset" > /proc/net/pktgen/pgctrl - the bad case (before the patch) $ echo -n "reset" > /proc/net/pktgen/pgctrl -bash: echo: write error: Invalid argument - with patch applied $ echo -n "reset" > /proc/net/pktgen/pgctrl Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-7-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix 'ratep 0' error handling (return -EINVAL)Peter Seiderer
Given an invalid 'ratep' command e.g. 'ratep 0' the return value is '1', leading to the following misleading output: - the good case $ echo "ratep 100" > /proc/net/pktgen/lo\@0 $ grep "Result:" /proc/net/pktgen/lo\@0 Result: OK: ratep=100 - the bad case (before the patch) $ echo "ratep 0" > /proc/net/pktgen/lo\@0" -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: No such parameter "atep" - with patch applied $ echo "ratep 0" > /proc/net/pktgen/lo\@0 -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: Idle Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-6-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix 'rate 0' error handling (return -EINVAL)Peter Seiderer
Given an invalid 'rate' command e.g. 'rate 0' the return value is '1', leading to the following misleading output: - the good case $ echo "rate 100" > /proc/net/pktgen/lo\@0 $ grep "Result:" /proc/net/pktgen/lo\@0 Result: OK: rate=100 - the bad case (before the patch) $ echo "rate 0" > /proc/net/pktgen/lo\@0" -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: No such parameter "ate" - with patch applied $ echo "rate 0" > /proc/net/pktgen/lo\@0 -bash: echo: write error: Invalid argument $ grep "Result:" /proc/net/pktgen/lo\@0 Result: Idle Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-5-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: fix hex32_arg parsing for short readsPeter Seiderer
Fix hex32_arg parsing for short reads (here 7 hex digits instead of the expected 8), shift result only on successful input parsing. - before the patch $ echo "mpls 0000123" > /proc/net/pktgen/lo\@0 $ grep mpls /proc/net/pktgen/lo\@0 mpls: 00001230 Result: OK: mpls=00001230 - with patch applied $ echo "mpls 0000123" > /proc/net/pktgen/lo\@0 $ grep mpls /proc/net/pktgen/lo\@0 mpls: 00000123 Result: OK: mpls=00000123 Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-4-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: enable 'param=value' parsingPeter Seiderer
Enable more flexible parameters syntax, allowing 'param=value' in addition to the already supported 'param value' pattern (additional this gives the skipping '=' in count_trail_chars() a purpose). Tested with: $ echo "min_pkt_size 999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size=999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size =999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size= 999" > /proc/net/pktgen/lo\@0 $ echo "min_pkt_size = 999" > /proc/net/pktgen/lo\@0 Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-3-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20net: pktgen: replace ENOTSUPP with EOPNOTSUPPPeter Seiderer
Replace ENOTSUPP with EOPNOTSUPP, fixes checkpatch hint WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP and e.g. $ echo "clone_skb 1" > /proc/net/pktgen/lo\@0 -bash: echo: write error: Unknown error 524 Signed-off-by: Peter Seiderer <ps.report@gmx.net> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250219084527.20488-2-ps.report@gmx.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20af_unix: Fix undefined 'other' errorPurva Yeshi
Fix an issue with the sparse static analysis tool where an "undefined 'other'" error occurs due to `__releases(&unix_sk(other)->lock)` being placed before 'other' is in scope. Remove the `__releases()` annotation from the `unix_wait_for_peer()` function to eliminate the sparse error. The annotation references `other` before it is declared, leading to a false positive error during static analysis. Since AF_UNIX does not use sparse annotations, this annotation is unnecessary and does not impact functionality. Reviewed-by: Joe Damato <jdamato@fastly.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250218141045.38947-1-purvayeshi550@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-20xsk: Add launch time hardware offload support to XDP Tx metadataSong Yoong Siang
Extend the XDP Tx metadata framework so that user can requests launch time hardware offload, where the Ethernet device will schedule the packet for transmission at a pre-determined time called launch time. The value of launch time is communicated from user space to Ethernet driver via launch_time field of struct xsk_tx_metadata. Suggested-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250216093430.957880-2-yoong.siang.song@intel.com
2025-02-20bpf: Support selective sampling for bpf timestampingJason Xing
Add the bpf_sock_ops_enable_tx_tstamp kfunc to allow BPF programs to selectively enable TX timestamping on a skb during tcp_sendmsg(). For example, BPF program will limit tracking X numbers of packets and then will stop there instead of tracing all the sendmsgs of matched flow all along. It would be helpful for users who cannot afford to calculate latencies from every sendmsg call probably due to the performance or storage space consideration. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-12-kerneljasonxing@gmail.com
2025-02-20bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callbackJason Xing
This patch introduces a new callback in tcp_tx_timestamp() to correlate tcp_sendmsg timestamp with timestamps from other tx timestamping callbacks (e.g., SND/SW/ACK). Without this patch, BPF program wouldn't know which timestamps belong to which flow because of no socket lock protection. This new callback is inserted in tcp_tx_timestamp() to address this issue because tcp_tx_timestamp() still owns the same socket lock with tcp_sendmsg_locked() in the meanwhile tcp_tx_timestamp() initializes the timestamping related fields for the skb, especially tskey. The tskey is the bridge to do the correlation. For TCP, BPF program hooks the beginning of tcp_sendmsg_locked() and then stores the sendmsg timestamp at the bpf_sk_storage, correlating this timestamp with its tskey that are later used in other sending timestamping callbacks. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-11-kerneljasonxing@gmail.com
2025-02-20bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callbackJason Xing
Support the ACK case for bpf timestamping. Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_ACK_CB. This callback will occur at the same timestamping point as the user space's SCM_TSTAMP_ACK. The BPF program can use it to get the same SCM_TSTAMP_ACK timestamp without modifying the user-space application. This patch extends txstamp_ack to two bits: 1 stands for SO_TIMESTAMPING mode, 2 bpf extension. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-10-kerneljasonxing@gmail.com
2025-02-20bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callbackJason Xing
Support hw SCM_TSTAMP_SND case for bpf timestamping. Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_HW_CB. This callback will occur at the same timestamping point as the user space's hardware SCM_TSTAMP_SND. The BPF program can use it to get the same SCM_TSTAMP_SND timestamp without modifying the user-space application. To avoid increasing the code complexity, replace SKBTX_HW_TSTAMP with SKBTX_HW_TSTAMP_NOBPF instead of changing numerous callers from driver side using SKBTX_HW_TSTAMP. The new definition of SKBTX_HW_TSTAMP means the combination tests of socket timestamping and bpf timestamping. After this patch, drivers can work under the bpf timestamping. Considering some drivers don't assign the skb with hardware timestamp, this patch does the assignment and then BPF program can acquire the hwstamp from skb directly. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-9-kerneljasonxing@gmail.com
2025-02-20bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callbackJason Xing
Support sw SCM_TSTAMP_SND case for bpf timestamping. Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SND_SW_CB. This callback will occur at the same timestamping point as the user space's software SCM_TSTAMP_SND. The BPF program can use it to get the same SCM_TSTAMP_SND timestamp without modifying the user-space application. Based on this patch, BPF program will get the software timestamp when the driver is ready to send the skb. In the sebsequent patch, the hardware timestamp will be supported. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-8-kerneljasonxing@gmail.com
2025-02-20bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callbackJason Xing
Support SCM_TSTAMP_SCHED case for bpf timestamping. Add a new sock_ops callback, BPF_SOCK_OPS_TSTAMP_SCHED_CB. This callback will occur at the same timestamping point as the user space's SCM_TSTAMP_SCHED. The BPF program can use it to get the same SCM_TSTAMP_SCHED timestamp without modifying the user-space application. A new SKBTX_BPF flag is added to mark skb_shinfo(skb)->tx_flags, ensuring that the new BPF timestamping and the current user space's SO_TIMESTAMPING do not interfere with each other. Signed-off-by: Jason Xing <kerneljasonxing@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250220072940.99994-7-kerneljasonxing@gmail.com