summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2025-07-10net: xsk: introduce XDP_MAX_TX_SKB_BUDGET setsockoptJason Xing
This patch provides a setsockopt method to let applications leverage to adjust how many descs to be handled at most in one send syscall. It mitigates the situation where the default value (32) that is too small leads to higher frequency of triggering send syscall. Considering the prosperity/complexity the applications have, there is no absolutely ideal suggestion fitting all cases. So keep 32 as its default value like before. The patch does the following things: - Add XDP_MAX_TX_SKB_BUDGET socket option. - Set max_tx_budget to 32 by default in the initialization phase as a per-socket granular control. - Set the range of max_tx_budget as [32, xs->tx->nentries]. The idea behind this comes out of real workloads in production. We use a user-level stack with xsk support to accelerate sending packets and minimize triggering syscalls. When the packets are aggregated, it's not hard to hit the upper bound (namely, 32). The moment user-space stack fetches the -EAGAIN error number passed from sendto(), it will loop to try again until all the expected descs from tx ring are sent out to the driver. Enlarging the XDP_MAX_TX_SKB_BUDGET value contributes to less frequency of sendto() and higher throughput/PPS. Here is what I did in production, along with some numbers as follows: For one application I saw lately, I suggested using 128 as max_tx_budget because I saw two limitations without changing any default configuration: 1) XDP_MAX_TX_SKB_BUDGET, 2) socket sndbuf which is 212992 decided by net.core.wmem_default. As to XDP_MAX_TX_SKB_BUDGET, the scenario behind this was I counted how many descs are transmitted to the driver at one time of sendto() based on [1] patch and then I calculated the possibility of hitting the upper bound. Finally I chose 128 as a suitable value because 1) it covers most of the cases, 2) a higher number would not bring evident results. After twisting the parameters, a stable improvement of around 4% for both PPS and throughput and less resources consumption were found to be observed by strace -c -p xxx: 1) %time was decreased by 7.8% 2) error counter was decreased from 18367 to 572 [1]: https://lore.kernel.org/all/20250619093641.70700-1-kerneljasonxing@gmail.com/ Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20250704160138.48677-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-09vsock: Add support for SIOCINQ ioctlXuewei Niu
Add support for SIOCINQ ioctl, indicating the length of bytes unread in the socket. The value is obtained from `vsock_stream_has_data()`. Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Luigi Leonardi <leonardi@redhat.com> Link: https://patch.msgid.link/20250708-siocinq-v6-2-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09hv_sock: Return the readable bytes in hvs_stream_has_data()Dexuan Cui
When hv_sock was originally added, __vsock_stream_recvmsg() and vsock_stream_has_data() actually only needed to know whether there is any readable data or not, so hvs_stream_has_data() was written to return 1 or 0 for simplicity. However, now hvs_stream_has_data() should return the readable bytes because vsock_data_ready() -> vsock_stream_has_data() needs to know the actual bytes rather than a boolean value of 1 or 0. The SIOCINQ ioctl support also needs hvs_stream_has_data() to return the readable bytes. Let hvs_stream_has_data() return the readable bytes of the payload in the next host-to-guest VMBus hv_sock packet. Note: there may be multiple incoming hv_sock packets pending in the VMBus channel's ringbuffer, but so far there is not a VMBus API that allows us to know all the readable bytes in total without reading and caching the payload of the multiple packets, so let's just return the readable bytes of the next single packet. In the future, we'll either add a VMBus API that allows us to know the total readable bytes without touching the data in the ringbuffer, or the hv_sock driver needs to understand the VMBus packet format and parse the packets directly. Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com> Acked-by: Stefano Garzarella <sgarzare@redhat.com> Acked-by: Wei Liu <wei.liu@kernel.org> Link: https://patch.msgid.link/20250708-siocinq-v6-1-3775f9a9e359@antgroup.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09skbuff: Add MSG_MORE flag to optimize tcp large packet transmissionFeng Yang
When using sockmap for forwarding, the average latency for different packet sizes after sending 10,000 packets is as follows: size old(us) new(us) 512 56 55 1472 58 58 1600 106 81 3000 145 105 5000 182 125 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net: ipconfig: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) While here, manually convert a couple timeouts denominated in seconds Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-2-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09net/smc: convert timeouts to secs_to_jiffies()Easwar Hariharan
Commit b35108a51cf7 ("jiffies: Define secs_to_jiffies()") introduced secs_to_jiffies(). As the value here is a multiple of 1000, use secs_to_jiffies() instead of msecs_to_jiffies to avoid the multiplication. This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with the following Coccinelle rules: @depends on patch@ expression E; @@ -msecs_to_jiffies(E * 1000) +secs_to_jiffies(E) -msecs_to_jiffies(E * MSEC_PER_SEC) +secs_to_jiffies(E) Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Link: https://patch.msgid.link/20250707-netdev-secs-to-jiffies-part-2-v2-1-b7817036342f@linux.microsoft.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09devlink: Add new "clock_id" generic device paramIvan Vecera
Add a new device generic parameter to specify clock ID that should be used by the device for registering DPLL devices and pins. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-5-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09devlink: Add support for u64 parametersIvan Vecera
Only 8, 16 and 32-bit integers are supported for numeric devlink parameters. The subsequent patch adds support for DPLL clock ID that is defined as 64-bit number. Add support for u64 parameter type. Signed-off-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250704182202.1641943-4-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08udp: remove udp_tunnel_gro_init()Eric Dumazet
Use DEFINE_MUTEX() to initialize udp_tunnel_gro_type_lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250707091634.311974-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: Remove setsockopt_needs_rtnl().Kuniyuki Iwashima
We no longer need to hold RTNL for IPv6 socket options. Let's remove setsockopt_needs_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-16-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST.Kuniyuki Iwashima
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock(). In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(), and __in6_dev_get() require RTNL. __dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be converted to RCU version. Let's replace RCU version helper and drop RTNL from IPV6_JOIN_ANYCAST. setsockopt_needs_rtnl() will be removed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Unify two error paths in ipv6_sock_ac_join().Kuniyuki Iwashima
The next patch will replace __dev_get_by_index() and __dev_get_by_flags() to RCU + refcount version. Then, we will need to call dev_put() in some error paths. Let's unify two error paths to make the next patch cleaner. Note that we add READ_ONCE() for net->ipv6.devconf_all->forwarding and idev->conf.forwarding as we will drop RTNL that protects them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-14-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Don't hold RTNL for IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM.Kuniyuki Iwashima
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock(). In ipv6_sock_ac_drop() and ipv6_sock_ac_close(), only __dev_get_by_index() and __in6_dev_get() requrie RTNL. Let's replace them with dev_get_by_index() and in6_dev_get() and drop RTNL from IPV6_LEAVE_ANYCAST and IPV6_ADDRFORM. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-13-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Don't use rtnl_dereference().Kuniyuki Iwashima
inet6_dev->ac_list is protected by inet6_dev->lock, so rtnl_dereference() is a bit rough annotation. As done in mcast.c, we can use ac_dereference() that checks if inet6_dev->lock is held. Let's replace rtnl_dereference() with a new helper ac_dereference(). Note that now addrconf_join_solict() / addrconf_leave_solict() in __ipv6_dev_ac_inc() / __ipv6_dev_ac_dec() does not need RTNL, so we can remove ASSERT_RTNL() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-12-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Remove unnecessary ASSERT_RTNL and comment.Kuniyuki Iwashima
Now, RTNL is not needed for mcast code, and what's commented in ip6_mc_msfget() is apparent by for_each_pmc_socklock(), which has lockdep annotation for lock_sock(). Let's remove the comment and ASSERT_RTNL() in ipv6_mc_rejoin_groups(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-11-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL for MCAST_ socket options.Kuniyuki Iwashima
In ip6_mc_source() and ip6_mc_msfilter(), per-socket mld data is protected by lock_sock() and inet6_dev->mc_lock is also held for some per-interface functions. ip6_mc_find_dev_rtnl() only depends on RTNL. If we want to remove it, we need to check inet6_dev->dead under mc_lock to close the race with addrconf_ifdown(), as mentioned earlier. Let's do that and drop RTNL for the rest of MCAST_ socket options. Note that ip6_mc_msfilter() has unnecessary lock dances and they are integrated into one to avoid the last-minute error and simplify the error handling. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-10-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL in ipv6_sock_mc_close().Kuniyuki Iwashima
In __ipv6_sock_mc_close(), per-socket mld data is protected by lock_sock(), and only __dev_get_by_index() and __in6_dev_get() require RTNL. Let's call __ipv6_sock_mc_drop() and drop RTNL in ipv6_sock_mc_close(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-9-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL for IPV6_DROP_MEMBERSHIP and MCAST_LEAVE_GROUP.Kuniyuki Iwashima
In __ipv6_sock_mc_drop(), per-socket mld data is protected by lock_sock(), and only __dev_get_by_index() and __in6_dev_get() require RTNL. Let's use dev_get_by_index() and in6_dev_get() and drop RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP. Note that __ipv6_sock_mc_drop() is factorised to reuse in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-8-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Don't hold RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.Kuniyuki Iwashima
In __ipv6_sock_mc_join(), per-socket mld data is protected by lock_sock(), and only __dev_get_by_index() requires RTNL. Let's use dev_get_by_index() and drop RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP. Note that we must call rt6_lookup() and dev_hold() under RCU. If rt6_lookup() returns an entry from the exception table, dst_dev_put() could change rt->dev.dst to loopback concurrently, and the original device could lose the refcount before dev_hold() and unblock device registration. dst_dev_put() is called from NETDEV_UNREGISTER and synchronize_net() follows it, so as long as rt6_lookup() and dev_hold() are called within the same RCU critical section, the dev is alive. Even if the race happens, they are synchronised by idev->dead and mcast addresses are cleaned up. For the racy access to rt->dst.dev, we use dst_dev(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-7-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Use in6_dev_get() in ipv6_dev_mc_dec().Kuniyuki Iwashima
As well as __ipv6_dev_mc_inc(), all code in __ipv6_dev_mc_dec() are protected by inet6_dev->mc_lock, and RTNL is not needed. Let's use in6_dev_get() in ipv6_dev_mc_dec() and remove ASSERT_RTNL() in __ipv6_dev_mc_dec(). Now, we can remove the RTNL comment above addrconf_leave_solict() too. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-6-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Remove mca_get().Kuniyuki Iwashima
Since commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data"), the newly allocated struct ifmcaddr6 cannot be removed until inet6_dev->mc_lock is released, so mca_get() and mc_put() are unnecessary. Let's remove the extra refcounting. Note that mca_get() was only used in __ipv6_dev_mc_inc(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-5-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Check inet6_dev->dead under idev->mc_lock in __ipv6_dev_mc_inc().Kuniyuki Iwashima
Since commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data"), every multicast resource is protected by inet6_dev->mc_lock. RTNL is unnecessary in terms of protection but still needed for synchronisation between addrconf_ifdown() and __ipv6_dev_mc_inc(). Once we removed RTNL, there would be a race below, where we could add a multicast address to a dead inet6_dev. CPU1 CPU2 ==== ==== addrconf_ifdown() __ipv6_dev_mc_inc() if (idev->dead) <-- false dead = true return -ENODEV; ipv6_mc_destroy_dev() / ipv6_mc_down() mutex_lock(&idev->mc_lock) ... mutex_unlock(&idev->mc_lock) mutex_lock(&idev->mc_lock) ... mutex_unlock(&idev->mc_lock) The race window can be easily closed by checking inet6_dev->dead under inet6_dev->mc_lock in __ipv6_dev_mc_inc() as addrconf_ifdown() will acquire it after marking inet6_dev dead. Let's check inet6_dev->dead under mc_lock in __ipv6_dev_mc_inc(). Note that now __ipv6_dev_mc_inc() no longer depends on RTNL and we can remove ASSERT_RTNL() there and the RTNL comment above addrconf_join_solict(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-4-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: mcast: Replace locking comments with lockdep annotations.Kuniyuki Iwashima
Commit 63ed8de4be81 ("mld: add mc_lock for protecting per-interface mld data") added the same comments regarding locking to many functions. Let's replace the comments with lockdep annotation, which is more helpful. Note that we just remove the comment for mld_clear_zeros() and mld_send_cr(), where mc_dereference() is used in the entry of the function. While at it, a comment for __ipv6_sock_mc_join() is moved back to the correct place. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-3-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: ndisc: Remove __in6_dev_get() in pndisc_{constructor,destructor}().Kuniyuki Iwashima
ipv6_dev_mc_{inc,dec}() has the same check. Let's remove __in6_dev_get() from pndisc_constructor() and pndisc_destructor(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-2-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: xsk: update tx queue consumer immediately after transmissionJason Xing
For afxdp, the return value of sendto() syscall doesn't reflect how many descs handled in the kernel. One of use cases is that when user-space application tries to know the number of transmitted skbs and then decides if it continues to send, say, is it stopped due to max tx budget? The following formular can be used after sending to learn how many skbs/descs the kernel takes care of: tx_queue.consumers_before - tx_queue.consumers_after Prior to the current patch, in non-zc mode, the consumer of tx queue is not immediately updated at the end of each sendto syscall when error occurs, which leads to the consumer value out-of-dated from the perspective of user space. So this patch requires store operation to pass the cached value to the shared value to handle the problem. More than those explicit errors appearing in the while() loop in __xsk_generic_xmit(), there are a few possible error cases that might be neglected in the following call trace: __xsk_generic_xmit() xskq_cons_peek_desc() xskq_cons_read_desc() xskq_cons_is_valid_desc() It will also cause the premature exit in the while() loop even if not all the descs are consumed. Based on the above analysis, using @sent_frame could cover all the possible cases where it might lead to out-of-dated global state of consumer after finishing __xsk_generic_xmit(). The patch also adds a common helper __xsk_tx_release() to keep align with the zc mode usage in xsk_tx_release(). Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250703141712.33190-2-kerneljasonxing@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08af_unix: Introduce SO_INQ.Kuniyuki Iwashima
We have an application that uses almost the same code for TCP and AF_UNIX (SOCK_STREAM). TCP can use TCP_INQ, but AF_UNIX doesn't have it and requires an extra syscall, ioctl(SIOCINQ) or getsockopt(SO_MEMINFO) as an alternative. Let's introduce the generic version of TCP_INQ. If SO_INQ is enabled, recvmsg() will put a cmsg of SCM_INQ that contains the exact value of ioctl(SIOCINQ). The cmsg is also included when msg->msg_get_inq is non-zero to make sockets io_uring-friendly. Note that SOCK_CUSTOM_SOCKOPT is flagged only for SOCK_STREAM to override setsockopt() for SOL_SOCKET. By having the flag in struct unix_sock, instead of struct sock, we can later add SO_INQ support for TCP and reuse tcp_sk(sk)->recvmsg_inq. Note also that supporting custom getsockopt() for SOL_SOCKET will need preparation for other SOCK_CUSTOM_SOCKOPT users (UDP, vsock, MPTCP). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250702223606.1054680-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08af_unix: Cache state->msg in unix_stream_read_generic().Kuniyuki Iwashima
In unix_stream_read_generic(), state->msg is fetched multiple times. Let's cache it in a local variable. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250702223606.1054680-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08af_unix: Use cached value for SOCK_STREAM in unix_inq_len().Kuniyuki Iwashima
Compared to TCP, ioctl(SIOCINQ) for AF_UNIX SOCK_STREAM socket is more expensive, as unix_inq_len() requires iterating through the receive queue and accumulating skb->len. Let's cache the value for SOCK_STREAM to a new field during sendmsg() and recvmsg(). The field is protected by the receive queue lock. Note that ioctl(SIOCINQ) for SOCK_DGRAM returns the length of the first skb in the queue. SOCK_SEQPACKET still requires iterating through the queue because we do not touch functions shared with unix_dgram_ops. But, if really needed, we can support it by switching __skb_try_recv_datagram() to a custom version. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250702223606.1054680-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08af_unix: Don't use skb_recv_datagram() in unix_stream_read_skb().Kuniyuki Iwashima
unix_stream_read_skb() calls skb_recv_datagram() with MSG_DONTWAIT, which is mostly equivalent to sock_error(sk) + skb_dequeue(). In the following patch, we will add a new field to cache the number of bytes in the receive queue. Then, we want to avoid introducing atomic ops in the fast path, so we will reuse the receive queue lock. As a preparation for the change, let's not use skb_recv_datagram() in unix_stream_read_skb(). Note that sock_error() is now moved out of the u->iolock mutex as the mutex does not synchronise the peer's close() at all. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250702223606.1054680-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08af_unix: Don't check SOCK_DEAD in unix_stream_read_skb().Kuniyuki Iwashima
unix_stream_read_skb() checks SOCK_DEAD only when the dequeued skb is OOB skb. unix_stream_read_skb() is called for a SOCK_STREAM socket in SOCKMAP when data is sent to it. The function is invoked via sk_psock_verdict_data_ready(), which is set to sk->sk_data_ready(). During sendmsg(), we check if the receiver has SOCK_DEAD, so there is no point in checking it again later in ->read_skb(). Also, unix_read_skb() for SOCK_DGRAM does not have the test either. Let's remove the SOCK_DEAD test in unix_stream_read_skb(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250702223606.1054680-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08af_unix: Don't hold unix_state_lock() in __unix_dgram_recvmsg().Kuniyuki Iwashima
When __skb_try_recv_datagram() returns NULL in __unix_dgram_recvmsg(), we hold unix_state_lock() unconditionally. This is because SOCK_SEQPACKET sk needs to return EOF in case its peer has been close()d concurrently. This behaviour totally depends on the timing of the peer's close() and reading sk->sk_shutdown, and taking the lock does not play a role. Let's drop the lock from __unix_dgram_recvmsg() and use READ_ONCE(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250702223606.1054680-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: ethtool: reduce indent for _rxfh_context opsJakub Kicinski
Now that we don't have the compat code we can reduce the indent a little. No functional changes. Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250707184115.2285277-6-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: ethtool: remove the compat code for _rxfh_context opsJakub Kicinski
All drivers are now converted to dedicated _rxfh_context ops. Remove the use of >set_rxfh() to manage additional contexts. Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250707184115.2285277-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08tcp: update the outdated ref draft-ietf-tcpm-rackXin Guo
As RACK-TLP was published as a standards-track RFC8985, so the outdated ref draft-ietf-tcpm-rack need to be updated. Signed-off-by: Xin Guo <guoxin0309@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20250705163647.301231-1-guoxin0309@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: account for encap headers in qdisc pkt lenFengyuan Gong
Refine qdisc_pkt_len_init to include headers up through the inner transport header when computing header size for encapsulations. Also refine net/sched/sch_cake.c borrowed from qdisc_pkt_len_init(). Signed-off-by: Fengyuan Gong <gfengyuan@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250702160741.1204919-1-gfengyuan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08netlink: spelling: fix appened -> appended in a commentFaisal Bukhari
Fix spelling mistake in net/netlink/af_netlink.c appened -> appended Signed-off-by: Faisal Bukhari <faisalbukhari523@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250705030841.353424-1-faisalbukhari523@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: skbuff: Drop unused @skbMichal Luczaj
Since its introduction in commit 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function"), pskb_carve_frag_list() never used the argument @skb. Drop it and adapt the only caller. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: skbuff: Drop unused @skbMichal Luczaj
Since its introduction in commit ce098da1497c ("skbuff: Introduce slab_build_skb()"), __slab_build_skb() never used the @skb argument. Remove it and adapt both callers. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: splice: Drop unused @gfpMichal Luczaj
Since its introduction in commit 2e910b95329c ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter() never used the @gfp argument. Remove it and adapt callers. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: splice: Drop unused @pipeMichal Luczaj
Since commit 41c73a0d44c9 ("net: speedup skb_splice_bits()"), __splice_segment() and spd_fill_page() do not use the @pipe argument. Drop it. While adapting the callers, move one line to enforce reverse xmas tree order. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-1-55f68b60d2b7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net/handshake: Add new parameter 'HANDSHAKE_A_ACCEPT_KEYRING'Hannes Reinecke
Add a new netlink parameter 'HANDSHAKE_A_ACCEPT_KEYRING' to provide the serial number of the keyring to use. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250701144657.104401-1-hare@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: replace ADDRLABEL with dynamic debugWang Liang
ADDRLABEL only works when it was set in compilation phase. Replace it with net_dbg_ratelimited(). Signed-off-by: Wang Liang <wangliang74@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702104417.1526138-1-wangliang74@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net/sched: acp_api: no longer acquire RTNL in tc_action_net_exit()Eric Dumazet
tc_action_net_exit() got an rtnl exclusion in commit a159d3c4b829 ("net_sched: acquire RTNL in tc_action_net_exit()") Since then, commit 16af6067392c ("net: sched: implement reference counted action release") made this RTNL exclusion obsolete for most cases. Only tcf_action_offload_del() might still require it. Move the rtnl locking into tcf_idrinfo_destroy() when an offload action is found. Most netns do not have actions, yet deleting them is adding a lot of pressure on RTNL, which is for many the most contended mutex in the kernel. We are moving to a per-netns 'rtnl', so tc_action_net_exit() will not be able to grab 'rtnl' a single time for a batch of netns. Before the patch: perf probe -a rtnl_lock perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1' [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.305 MB perf.data (25 samples) ] After the patch: perf record -e probe:rtnl_lock -a /bin/bash -c 'unshare -n "/bin/true"; sleep 1' [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.304 MB perf.data (9 samples) ] Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@nvidia.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Link: https://patch.msgid.link/20250702071230.1892674-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: test: Add tests for gateway routesJeremy Kerr
Add a few kunit tests for the gateway routing. Because we have multiple route types now (direct and gateway), rename mctp_test_create_route to mctp_test_create_route_direct, and add a _gateway variant too. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-14-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: add gateway routing supportJeremy Kerr
This change allows for gateway routing, where a route table entry may reference a routable endpoint (by network and EID), instead of routing directly to a netdevice. We add support for a RTM_GATEWAY attribute for netlink route updates, with an attribute format of: struct mctp_fq_addr { unsigned int net; mctp_eid_t eid; } - we need the net here to uniquely identify the target EID, as we no longer have the device reference directly (which would provide the net id in the case of direct routes). This makes route lookups recursive, as a route lookup that returns a gateway route must be resolved into a direct route (ie, to a device) eventually. We provide a limit to the route lookups, to prevent infinite loop routing. The route lookup populates a new 'nexthop' field in the dst structure, which now specifies the key for the neighbour table lookup on device output, rather than using the packet destination address directly. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-13-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: allow NL parsing directly into a struct mctp_routeJeremy Kerr
The netlink route parsing functions end up setting a bunch of output variables from the rt attributes. This will get messy when the routes become more complex. So, split the rt parsing into two types: a lookup (returning route target data suitable for a route lookup, like when deleting a route) and a populate (setting fields of a struct mctp_route). In doing this, we need to separate the route allocation from mctp_route_add, so add some comments on the lifetime semantics for the latter. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-12-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: remove routes by netid, not by deviceJeremy Kerr
In upcoming changes, a route may not have a device associated. Since the route is matched on the (network, eid) tuple, pass the netid itself into mctp_route_remove. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-11-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: pass net into route creationJeremy Kerr
We may not have a mdev pointer, from which we currently extract the net. Instead, pass the net directly. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-10-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: test: Add initial socket testsJeremy Kerr
Recent changes have modified the extaddr path a little, so add a couple of kunit tests to af-mctp.c. These check that we're correctly passing lladdr data between sendmsg/recvmsg and the routing layer. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-9-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-07-08net: mctp: test: add sock test infrastructureJeremy Kerr
Add a new test object, for use with the af_mctp socket code. This is intially empty, but we'll start populating actual tests in an upcoming change. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Link: https://patch.msgid.link/20250702-dev-forwarding-v5-8-1468191da8a4@codeconstruct.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>