summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2025-03-06net: ipv6: fix missing dst ref drop in ila lwtunnelJustin Iurman
Add missing skb_dst_drop() to drop reference to the old dst before adding the new dst to the skb. Fixes: 79ff2fc31e0f ("ila: Cache a route to translated address") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250305081655.19032-1-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-06net: ipv6: fix dst ref loop in ila lwtunnelJustin Iurman
This patch follows commit 92191dd10730 ("net: ipv6: fix dst ref loops in rpl, seg6 and ioam6 lwtunnels") and, on a second thought, the same patch is also needed for ila (even though the config that triggered the issue was pathological, but still, we don't want that to happen). Fixes: 79ff2fc31e0f ("ila: Cache a route to translated address") Cc: Tom Herbert <tom@herbertland.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250304181039.35951-1-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27net: ipv6: fix dst ref loop on input in rpl lwtJustin Iurman
Prevent a dst ref loop on input in rpl_iptunnel. Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel") Cc: Alexander Aring <alex.aring@gmail.com> Cc: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-27net: ipv6: fix dst ref loop on input in seg6 lwtJustin Iurman
Prevent a dst ref loop on input in seg6_iptunnel. Fixes: af4a2209b134 ("ipv6: sr: use dst_cache in seg6_input") Cc: David Lebrun <dlebrun@google.com> Cc: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-02-13ipv6: mcast: add RCU protection to mld_newpack()Eric Dumazet
mld_newpack() can be called without RTNL or RCU being held. Note that we no longer can use sock_alloc_send_skb() because ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep. Instead use alloc_skb() and charge the net->ipv6.igmp_sk socket under RCU protection. Fixes: b8ad0cbc58f7 ("[NETNS][IPV6] mcast - handle several network namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250212141021.1663666-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ipv6: mcast: extend RCU protection in igmp6_send()Eric Dumazet
igmp6_send() can be called without RTNL or RCU being held. Extend RCU protection so that we can safely fetch the net pointer and avoid a potential UAF. Note that we no longer can use sock_alloc_send_skb() because ipv6.igmp_sk uses GFP_KERNEL allocations which can sleep. Instead use alloc_skb() and charge the net->ipv6.igmp_sk socket under RCU protection. Fixes: b8ad0cbc58f7 ("[NETNS][IPV6] mcast - handle several network namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ndisc: extend RCU protection in ndisc_send_skb()Eric Dumazet
ndisc_send_skb() can be called without RTNL or RCU held. Acquire rcu_read_lock() earlier, so that we can use dev_net_rcu() and avoid a potential UAF. Fixes: 1762f7e88eb3 ("[NETNS][IPV6] ndisc - make socket control per namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-8-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ndisc: use RCU protection in ndisc_alloc_skb()Eric Dumazet
ndisc_alloc_skb() can be called without RTNL or RCU being held. Add RCU protection to avoid possible UAF. Fixes: de09334b9326 ("ndisc: Introduce ndisc_alloc_skb() helper.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10ndisc: ndisc_send_redirect() must use dev_get_by_index_rcu()Eric Dumazet
ndisc_send_redirect() is called under RCU protection, not RTNL. It must use dev_get_by_index_rcu() instead of __dev_get_by_index() Fixes: 2f17becfbea5 ("vrf: check the original netdevice for generating redirect") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250207135841.1948589-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06ipv6: Use RCU in ip6_input()Eric Dumazet
Instead of grabbing rcu_read_lock() from ip6_input_finish(), do it earlier in is caller, so that ip6_input() access to dev_net() can be validated by LOCKDEP. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250205155120.1676781-13-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06ipv6: icmp: convert to dev_net_rcu()Eric Dumazet
icmp6_send() must acquire rcu_read_lock() sooner to ensure the dev_net() call done from a safe context. Other ICMPv6 uses of dev_net() seem safe, change them to dev_net_rcu() to get LOCKDEP support to catch bugs. Fixes: 9a43b709a230 ("[NETNS][IPV6] icmp6 - make icmpv6_socket per namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250205155120.1676781-12-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06ipv6: use RCU protection in ip6_default_advmss()Eric Dumazet
ip6_default_advmss() needs rcu protection to make sure the net structure it reads does not disappear. Fixes: 5578689a4e3c ("[NETNS][IPV6] route6 - make route6 per namespace") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250205155120.1676781-11-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03udp: gso: do not drop small packets when PMTU reducesYan Zhai
Commit 4094871db1d6 ("udp: only do GSO if # of segs > 1") avoided GSO for small packets. But the kernel currently dismisses GSO requests only after checking MTU/PMTU on gso_size. This means any packets, regardless of their payload sizes, could be dropped when PMTU becomes smaller than requested gso_size. We encountered this issue in production and it caused a reliability problem that new QUIC connection cannot be established before PMTU cache expired, while non GSO sockets still worked fine at the same time. Ideally, do not check any GSO related constraints when payload size is smaller than requested gso_size, and return EMSGSIZE instead of EINVAL on MTU/PMTU check failure to be more specific on the error cause. Fixes: 4094871db1d6 ("udp: only do GSO if # of segs > 1") Signed-off-by: Yan Zhai <yan@cloudflare.com> Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-02-01net: ipv6: fix dst ref loops in rpl, seg6 and ioam6 lwtunnelsJakub Kicinski
Some lwtunnels have a dst cache for post-transformation dst. If the packet destination did not change we may end up recording a reference to the lwtunnel in its own cache, and the lwtunnel state will never be freed. Discovered by the ioam6.sh test, kmemleak was recently fixed to catch per-cpu memory leaks. I'm not sure if rpl and seg6 can actually hit this, but in principle I don't see why not. Fixes: 8cb3bf8bff3c ("ipv6: ioam: Add support for the ip6ip6 encapsulation") Fixes: 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels") Fixes: a7a29f9c361f ("net: ipv6: add rpl sr tunnel") Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250130031519.2716843-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-01net: ipv6: fix dst refleaks in rpl, seg6 and ioam6 lwtunnelsJakub Kicinski
dst_cache_get() gives us a reference, we need to release it. Discovered by the ioam6.sh test, kmemleak was recently fixed to catch per-cpu memory leaks. Fixes: 985ec6f5e623 ("net: ipv6: rpl_iptunnel: mitigate 2-realloc issue") Fixes: 40475b63761a ("net: ipv6: seg6_iptunnel: mitigate 2-realloc issue") Fixes: dce525185bc9 ("net: ipv6: ioam6_iptunnel: mitigate 2-realloc issue") Reviewed-by: Justin Iurman <justin.iurman@uliege.be> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250130031519.2716843-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-27Merge tag 'ipsec-2025-01-27' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-01-27 1) Fix incrementing the upper 32 bit sequence numbers for GSO skbs. From Jianbo Liu. 2) Fix an out-of-bounds read on xfrm state lookup. From Florian Westphal. 3) Fix secpath handling on packet offload mode. From Alexandre Cassen. 4) Fix the usage of skb->sk in the xfrm layer. 5) Don't disable preemption while looking up cache state to fix PREEMPT_RT. From Sebastian Sewior. * tag 'ipsec-2025-01-27' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Don't disable preemption while looking up cache state. xfrm: Fix the usage of skb->sk xfrm: delete intermediate secpath entry in packet offload mode xfrm: state: fix out-of-bounds read during lookup xfrm: replay: Fix the update of replay_esn->oseq_hi for GSO ==================== Link: https://patch.msgid.link/20250127060757.3946314-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.Kuniyuki Iwashima
Let's register inet6_rtm_deladdr() with RTNL_FLAG_DOIT_PERNET and hold rtnl_net_lock() before inet6_addr_del(). Now that inet6_addr_del() is always called under per-netns RTNL. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-12-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.Kuniyuki Iwashima
Let's register inet6_rtm_newaddr() with RTNL_FLAG_DOIT_PERNET and hold rtnl_net_lock() before __dev_get_by_index(). Now that inet6_addr_add() and inet6_addr_modify() are always called under per-netns RTNL. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-11-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Move lifetime validation to inet6_rtm_newaddr().Kuniyuki Iwashima
inet6_addr_add() and inet6_addr_modify() have the same code to validate IPv6 lifetime that is done under RTNL. Let's factorise it out to inet6_rtm_newaddr() so that we can validate the lifetime without RTNL later. Note that inet6_addr_add() is called from addrconf_add_ifaddr(), but the lifetime is INFINITY_LIFE_TIME in the path, so expires and flags are 0. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-10-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().Kuniyuki Iwashima
We will convert inet6_rtm_newaddr() to per-netns RTNL. Except for IFA_F_OPTIMISTIC, cfg.ifa_flags can be set before __dev_get_by_index(). Let's move ifa_flags setup before __dev_get_by_index() so that we can set ifa_flags without RTNL. Also, now it's moved before tb[IFA_CACHEINFO] in preparing for the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-9-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Pass dev to inet6_addr_add().Kuniyuki Iwashima
inet6_addr_add() is called from inet6_rtm_newaddr() and addrconf_add_ifaddr(). inet6_addr_add() looks up dev by __dev_get_by_index(), but it's already done in inet6_rtm_newaddr(). Let's move the 2nd lookup to addrconf_add_ifaddr() and pass dev to inet6_addr_add(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Convert inet6_ioctl() to per-netns RTNL.Kuniyuki Iwashima
These functions are called from inet6_ioctl() with a socket's netns and hold RTNL. * SIOCSIFADDR : addrconf_add_ifaddr() * SIOCDIFADDR : addrconf_del_ifaddr() * SIOCSIFDSTADDR : addrconf_set_dstaddr() Let's use rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().Kuniyuki Iwashima
addrconf_init() holds RTNL for blackhole_netdev, which is the global device in init_net. addrconf_cleanup() holds RTNL to clean up devices in init_net too. Let's use rtnl_net_lock(&init_net) there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Hold rtnl_net_lock() in addrconf_dad_work().Kuniyuki Iwashima
addrconf_dad_work() is per-address work and holds RTNL internally. We can fetch netns as dev_net(ifp->idev->dev). Let's use rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Hold rtnl_net_lock() in addrconf_verify_work().Kuniyuki Iwashima
addrconf_verify_work() is per-netns work to call addrconf_verify_rtnl() under RTNL. Let's use rtnl_net_lock(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.Kuniyuki Iwashima
net.ipv6.conf.${DEV}.XXX sysctl are changed under RTNL: * forwarding * ignore_routes_with_linkdown * disable_ipv6 * proxy_ndp * addr_gen_mode * stable_secret * disable_policy Let's use rtnl_net_lock() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250115080608.28127-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-20xfrm: Fix the usage of skb->skSteffen Klassert
xfrm assumed to always have a full socket at skb->sk. This is not always true, so fix it by converting to a full socket before it is used. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Reviewed-by: Eric Dumazet <edumazet@google.com>
2025-01-15inet: ipmr: fix data-racesEric Dumazet
Following fields of 'struct mr_mfc' can be updated concurrently (no lock protection) from ip_mr_forward() and ip6_mr_forward() - bytes - pkt - wrong_if - lastuse They also can be read from other functions. Convert bytes, pkt and wrong_if to atomic_long_t, and use READ_ONCE()/WRITE_ONCE() for lastuse. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250114221049.1190631-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-10Merge tag 'ipsec-next-2025-01-09' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== ipsec-next-2025-01-09 1) Implement the AGGFRAG protocol and basic IP-TFS (RFC9347) functionality. From Christian Hopps. 2) Support ESN context update to hardware for TX. From Jianbo Liu. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2025-01-09netlink: add IPv6 anycast join/leave notificationsYuyang Huang
This change introduces a mechanism for notifying userspace applications about changes to IPv6 anycast addresses via netlink. It includes: * Addition and deletion of IPv6 anycast addresses are reported using RTM_NEWANYCAST and RTM_DELANYCAST. * A new netlink group (RTNLGRP_IPV6_ACADDR) for subscribing to these notifications. This enables user space applications(e.g. ip monitor) to efficiently track anycast addresses through netlink messages, improving metrics collection and system monitoring. It also unlocks the potential for advanced anycast management in user space, such as hardware offload control and fine grained network control. Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250107114355.1766086-1-yuyanghuang@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc6). No conflicts. Adjacent changes: include/linux/if_vlan.h f91a5b808938 ("af_packet: fix vlan_get_protocol_dgram() vs MSG_PEEK") 3f330db30638 ("net: reformat kdoc return statements") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-02ila: serialize calls to nf_register_net_hooks()Eric Dumazet
syzbot found a race in ila_add_mapping() [1] commit 031ae72825ce ("ila: call nf_unregister_net_hooks() sooner") attempted to fix a similar issue. Looking at the syzbot repro, we have concurrent ILA_CMD_ADD commands. Add a mutex to make sure at most one thread is calling nf_register_net_hooks(). [1] BUG: KASAN: slab-use-after-free in rht_key_hashfn include/linux/rhashtable.h:159 [inline] BUG: KASAN: slab-use-after-free in __rhashtable_lookup.constprop.0+0x426/0x550 include/linux/rhashtable.h:604 Read of size 4 at addr ffff888028f40008 by task dhcpcd/5501 CPU: 1 UID: 0 PID: 5501 Comm: dhcpcd Not tainted 6.13.0-rc4-syzkaller-00054-gd6ef8b40d075 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xc3/0x620 mm/kasan/report.c:489 kasan_report+0xd9/0x110 mm/kasan/report.c:602 rht_key_hashfn include/linux/rhashtable.h:159 [inline] __rhashtable_lookup.constprop.0+0x426/0x550 include/linux/rhashtable.h:604 rhashtable_lookup include/linux/rhashtable.h:646 [inline] rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline] ila_lookup_wildcards net/ipv6/ila/ila_xlat.c:127 [inline] ila_xlat_addr net/ipv6/ila/ila_xlat.c:652 [inline] ila_nf_input+0x1ee/0x620 net/ipv6/ila/ila_xlat.c:185 nf_hook_entry_hookfn include/linux/netfilter.h:154 [inline] nf_hook_slow+0xbb/0x200 net/netfilter/core.c:626 nf_hook.constprop.0+0x42e/0x750 include/linux/netfilter.h:269 NF_HOOK include/linux/netfilter.h:312 [inline] ipv6_rcv+0xa4/0x680 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core+0x12e/0x1e0 net/core/dev.c:5672 __netif_receive_skb+0x1d/0x160 net/core/dev.c:5785 process_backlog+0x443/0x15f0 net/core/dev.c:6117 __napi_poll.constprop.0+0xb7/0x550 net/core/dev.c:6883 napi_poll net/core/dev.c:6952 [inline] net_rx_action+0xa94/0x1010 net/core/dev.c:7074 handle_softirqs+0x213/0x8f0 kernel/softirq.c:561 __do_softirq kernel/softirq.c:595 [inline] invoke_softirq kernel/softirq.c:435 [inline] __irq_exit_rcu+0x109/0x170 kernel/softirq.c:662 irq_exit_rcu+0x9/0x30 kernel/softirq.c:678 instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1049 [inline] sysvec_apic_timer_interrupt+0xa4/0xc0 arch/x86/kernel/apic/apic.c:1049 Fixes: 7f00feaf1076 ("ila: Add generic ILA translation facility") Reported-by: syzbot+47e761d22ecf745f72b9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6772c9ae.050a0220.2f3838.04c7.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Tom Herbert <tom@herbertland.com> Link: https://patch.msgid.link/20241230162849.2795486-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23netlink: correct nlmsg size for multicast notificationsYuyang Huang
Corrected the netlink message size calculation for multicast group join/leave notifications. The previous calculation did not account for the inclusion of both IPv4/IPv6 addresses and ifa_cacheinfo in the payload. This fix ensures that the allocated message size is sufficient to hold all necessary information. This patch also includes the following improvements: * Uses GFP_KERNEL instead of GFP_ATOMIC when holding the RTNL mutex. * Uses nla_total_size(sizeof(struct in6_addr)) instead of nla_total_size(16). * Removes unnecessary EXPORT_SYMBOL(). Fixes: 2c2b61d2138f ("netlink: add IGMP/MLD join/leave notifications") Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Link: https://patch.msgid.link/20241221100007.1910089-1-yuyanghuang@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-23udp: Deal with race between UDP socket address change and rehashStefano Brivio
If a UDP socket changes its local address while it's receiving datagrams, as a result of connect(), there is a period during which a lookup operation might fail to find it, after the address is changed but before the secondary hash (port and address) and the four-tuple hash (local and remote ports and addresses) are updated. Secondary hash chains were introduced by commit 30fff9231fad ("udp: bind() optimisation") and, as a result, a rehash operation became needed to make a bound socket reachable again after a connect(). This operation was introduced by commit 719f835853a9 ("udp: add rehash on connect()") which isn't however a complete fix: the socket will be found once the rehashing completes, but not while it's pending. This is noticeable with a socat(1) server in UDP4-LISTEN mode, and a client sending datagrams to it. After the server receives the first datagram (cf. _xioopen_ipdgram_listen()), it issues a connect() to the address of the sender, in order to set up a directed flow. Now, if the client, running on a different CPU thread, happens to send a (subsequent) datagram while the server's socket changes its address, but is not rehashed yet, this will result in a failed lookup and a port unreachable error delivered to the client, as apparent from the following reproducer: LEN=$(($(cat /proc/sys/net/core/wmem_default) / 4)) dd if=/dev/urandom bs=1 count=${LEN} of=tmp.in while :; do taskset -c 1 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc & sleep 0.1 || sleep 1 taskset -c 2 socat OPEN:tmp.in UDP4:localhost:1337,shut-null wait done where the client will eventually get ECONNREFUSED on a write() (typically the second or third one of a given iteration): 2024/11/13 21:28:23 socat[46901] E write(6, 0x556db2e3c000, 8192): Connection refused This issue was first observed as a seldom failure in Podman's tests checking UDP functionality while using pasta(1) to connect the container's network namespace, which leads us to a reproducer with the lookup error resulting in an ICMP packet on a tap device: LOCAL_ADDR="$(ip -j -4 addr show|jq -rM '.[] | .addr_info[0] | select(.scope == "global").local')" while :; do ./pasta --config-net -p pasta.pcap -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc & sleep 0.2 || sleep 1 socat OPEN:tmp.in UDP4:${LOCAL_ADDR}:1337,shut-null wait cmp tmp.in tmp.out done Once this fails: tmp.in tmp.out differ: char 8193, line 29 we can finally have a look at what's going on: $ tshark -r pasta.pcap 1 0.000000 :: ? ff02::16 ICMPv6 110 Multicast Listener Report Message v2 2 0.168690 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 3 0.168767 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 4 0.168806 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 5 0.168827 c6:47:05:8d:dc:04 ? Broadcast ARP 42 Who has 88.198.0.161? Tell 88.198.0.164 6 0.168851 9a:55:9a:55:9a:55 ? c6:47:05:8d:dc:04 ARP 42 88.198.0.161 is at 9a:55:9a:55:9a:55 7 0.168875 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 8 0.168896 88.198.0.164 ? 88.198.0.161 ICMP 590 Destination unreachable (Port unreachable) 9 0.168926 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 10 0.168959 88.198.0.161 ? 88.198.0.164 UDP 8234 60260 ? 1337 Len=8192 11 0.168989 88.198.0.161 ? 88.198.0.164 UDP 4138 60260 ? 1337 Len=4096 12 0.169010 88.198.0.161 ? 88.198.0.164 UDP 42 60260 ? 1337 Len=0 On the third datagram received, the network namespace of the container initiates an ARP lookup to deliver the ICMP message. In another variant of this reproducer, starting the client with: strace -f pasta --config-net -u 1337 socat UDP4-LISTEN:1337,null-eof OPEN:tmp.out,create,trunc 2>strace.log & and connecting to the socat server using a loopback address: socat OPEN:tmp.in UDP4:localhost:1337,shut-null we can more clearly observe a sendmmsg() call failing after the first datagram is delivered: [pid 278012] connect(173, 0x7fff96c95fc0, 16) = 0 [...] [pid 278012] recvmmsg(173, 0x7fff96c96020, 1024, MSG_DONTWAIT, NULL) = -1 EAGAIN (Resource temporarily unavailable) [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = 1 [...] [pid 278012] sendmmsg(173, 0x561c5ad0a720, 1, MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused) and, somewhat confusingly, after a connect() on the same socket succeeded. Until commit 4cdeeee9252a ("net: udp: prefer listeners bound to an address"), the race between receive address change and lookup didn't actually cause visible issues, because, once the lookup based on the secondary hash chain failed, we would still attempt a lookup based on the primary hash (destination port only), and find the socket with the outdated secondary hash. That change, however, dropped port-only lookups altogether, as side effect, making the race visible. To fix this, while avoiding the need to make address changes and rehash atomic against lookups, reintroduce primary hash lookups as fallback, if lookups based on four-tuple and secondary hashes fail. To this end, introduce a simplified lookup implementation, which doesn't take care of SO_REUSEPORT groups: if we have one, there are multiple sockets that would match the four-tuple or secondary hash, meaning that we can't run into this race at all. v2: - instead of synchronising lookup operations against address change plus rehash, reintroduce a simplified version of the original primary hash lookup as fallback v1: - fix build with CONFIG_IPV6=n: add ifdef around sk_v6_rcv_saddr usage (Kuniyuki Iwashima) - directly use sk_rcv_saddr for IPv4 receive addresses instead of fetching inet_rcv_saddr (Kuniyuki Iwashima) - move inet_update_saddr() to inet_hashtables.h and use that to set IPv4/IPv6 addresses as suitable (Kuniyuki Iwashima) - rebase onto net-next, update commit message accordingly Reported-by: Ed Santiago <santiago@redhat.com> Link: https://github.com/containers/podman/issues/24147 Analysed-by: David Gibson <david@gibson.dropbear.id.au> Fixes: 30fff9231fad ("udp: bind() optimisation") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-19ipv6: Add flow label to route get requestsIdo Schimmel
The default IPv6 multipath hash policy takes the flow label into account when calculating a multipath hash and previous patches added a flow label selector to IPv6 FIB rules. Allow user space to specify a flow label in route get requests by adding a new netlink attribute and using its value to populate the "flowlabel" field in the IPv6 flow info structure prior to a route lookup. Deny the attribute in RTM_{NEW,DEL}ROUTE requests by checking for it in rtm_to_fib6_config() and returning an error if present. A subsequent patch will use this capability to test the new flow label selector in IPv6 FIB rules. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-19ipv6: fib_rules: Add flow label supportIdo Schimmel
Implement support for the new flow label selector which allows IPv6 FIB rules to match on the flow label with a mask. Ensure that both flow label attributes are specified (or none) and that the mask is valid. Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-17inetpeer: do not get a refcount in inet_getpeer()Eric Dumazet
All inet_getpeer() callers except ip4_frag_init() don't need to acquire a permanent refcount on the inetpeer. They can switch to full RCU protection. Move the refcount_inc_not_zero() into ip4_frag_init(), so that all the other callers no longer have to perform a pair of expensive atomic operations on a possibly contended cache line. inet_putpeer() no longer needs to be exported. After this patch, my DUT can receive 8,400,000 UDP packets per second targeting closed ports, using 50% less cpu cycles than before. Also change two calls to l3mdev_master_ifindex() by l3mdev_master_ifindex_rcu() (Ido ideas) Fixes: 8c2bd38b95f7 ("icmp: change the order of rate limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241215175629.1248773-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-17inetpeer: remove create argument of inet_getpeer_v[46]()Eric Dumazet
All callers of inet_getpeer_v4() and inet_getpeer_v6() want to create an inetpeer. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20241215175629.1248773-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-16sock: support SO_PRIORITY cmsgAnna Emese Nyiri
The Linux socket API currently allows setting SO_PRIORITY at the socket level, applying a uniform priority to all packets sent through that socket. The exception to this is IP_TOS, when the priority value is calculated during the handling of ancillary data, as implemented in commit f02db315b8d8 ("ipv4: IP_TOS and IP_TTL can be specified as ancillary data"). However, this is a computed value, and there is currently no mechanism to set a custom priority via control messages prior to this patch. According to this patch, if SO_PRIORITY is specified as ancillary data, the packet is sent with the priority value set through sockc->priority, overriding the socket-level values set via the traditional setsockopt() method. This is analogous to the existing support for SO_MARK, as implemented in commit c6af0c227a22 ("ip: support SO_MARK cmsg"). If both cmsg SO_PRIORITY and IP_TOS are passed, then the one that takes precedence is the last one in the cmsg list. This patch has the side effect that raw_send_hdrinc now interprets cmsg IP_TOS. Reviewed-by: Willem de Bruijn <willemb@google.com> Suggested-by: Ferenc Fejes <fejes@inf.elte.hu> Signed-off-by: Anna Emese Nyiri <annaemesenyiri@gmail.com> Link: https://patch.msgid.link/20241213084457.45120-3-annaemesenyiri@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-15netlink: add IGMP/MLD join/leave notificationsYuyang Huang
This change introduces netlink notifications for multicast address changes. The following features are included: * Addition and deletion of multicast addresses are reported using RTM_NEWMULTICAST and RTM_DELMULTICAST messages with AF_INET and AF_INET6. * Two new notification groups: RTNLGRP_IPV4_MCADDR and RTNLGRP_IPV6_MCADDR are introduced for receiving these events. This change allows user space applications (e.g., ip monitor) to efficiently track multicast group memberships by listening for netlink events. Previously, applications relied on inefficient polling of procfs, introducing delays. With netlink notifications, applications receive realtime updates on multicast group membership changes, enabling more precise metrics collection and system monitoring.  This change also unlocks the potential for implementing a wide range of sophisticated multicast related features in user space by allowing applications to combine kernel provided multicast address information with user space data and communicate decisions back to the kernel for more fine grained control. This mechanism can be used for various purposes, including multicast filtering, IGMP/MLD offload, and IGMP/MLD snooping. Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Co-developed-by: Patrick Ruddy <pruddy@vyatta.att-mail.com> Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com> Link: https://lore.kernel.org/r/20180906091056.21109-1-pruddy@vyatta.att-mail.com Signed-off-by: Yuyang Huang <yuyanghuang@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-12-11ipv6: mcast: annotate data-race around psf->sf_count[MCAST_XXX]Eric Dumazet
psf->sf_count[MCAST_XXX] fields are read locklessly from ipv6_chk_mcast_addr() and igmp6_mcf_seq_show(). Add READ_ONCE() and WRITE_ONCE() annotations accordingly. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241210183352.86530-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11ipv6: mcast: annotate data-races around mc->mca_sfcount[MCAST_EXCLUDE]Eric Dumazet
mc->mca_sfcount[MCAST_EXCLUDE] is read locklessly from ipv6_chk_mcast_addr(). Add READ_ONCE() and WRITE_ONCE() annotations accordingly. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241210183352.86530-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11ipv6: mcast: reduce ipv6_chk_mcast_addr() indentationEric Dumazet
Add a label and two gotos to shorten lines by two tabulations, to ease code review of following patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20241210183352.86530-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.13-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-05net: ipv6: rpl_iptunnel: mitigate 2-realloc issueJustin Iurman
This patch mitigates the two-reallocations issue with rpl_iptunnel by providing the dst_entry (in the cache) to the first call to skb_cow_head(). As a result, the very first iteration would still trigger two reallocations (i.e., empty cache), while next iterations would only trigger a single reallocation. Performance tests before/after applying this patch, which clearly shows there is no impact (it even shows improvement): - before: https://ibb.co/nQJhqwc - after: https://ibb.co/4ZvW6wV Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Cc: Alexander Aring <aahringo@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-05net: ipv6: seg6_iptunnel: mitigate 2-realloc issueJustin Iurman
This patch mitigates the two-reallocations issue with seg6_iptunnel by providing the dst_entry (in the cache) to the first call to skb_cow_head(). As a result, the very first iteration would still trigger two reallocations (i.e., empty cache), while next iterations would only trigger a single reallocation. Performance tests before/after applying this patch, which clearly shows the improvement: - before: https://ibb.co/3Cg4sNH - after: https://ibb.co/8rQ350r Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Cc: David Lebrun <dlebrun@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-05net: ipv6: ioam6_iptunnel: mitigate 2-realloc issueJustin Iurman
This patch mitigates the two-reallocations issue with ioam6_iptunnel by providing the dst_entry (in the cache) to the first call to skb_cow_head(). As a result, the very first iteration may still trigger two reallocations (i.e., empty cache), while next iterations would only trigger a single reallocation. Performance tests before/after applying this patch, which clearly shows the improvement: - inline mode: - before: https://ibb.co/LhQ8V63 - after: https://ibb.co/x5YT2bS - encap mode: - before: https://ibb.co/3Cjm5m0 - after: https://ibb.co/TwpsxTC - encap mode with tunsrc: - before: https://ibb.co/Gpy9QPg - after: https://ibb.co/PW1bZFT This patch also fixes an incorrect behavior: after the insertion, the second call to skb_cow_head() makes sure that the dev has enough headroom in the skb for layer 2 and stuff. In that case, the "old" dst_entry was used, which is now fixed. After discussing with Paolo, it appears that both patches can be merged into a single one -this one- (for the sake of readability) and target net-next. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-12-05xfrm: add generic iptfs defines and functionalityChristian Hopps
Define `XFRM_MODE_IPTFS` and `IPSEC_MODE_IPTFS` constants, and add these to switch case and conditionals adjacent with the existing TUNNEL modes. Signed-off-by: Christian Hopps <chopps@labn.net> Tested-by: Antony Antony <antony.antony@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-12-04inet: add indirect call wrapper for getfrag() callsEric Dumazet
UDP send path suffers from one indirect call to ip_generic_getfrag() We can use INDIRECT_CALL_1() to avoid it. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Link: https://patch.msgid.link/20241203173617.2595451-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-04ipmr: tune the ipmr_can_free_table() checks.Paolo Abeni
Eric reported a syzkaller-triggered splat caused by recent ipmr changes: WARNING: CPU: 2 PID: 6041 at net/ipv6/ip6mr.c:419 ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419 Modules linked in: CPU: 2 UID: 0 PID: 6041 Comm: syz-executor183 Not tainted 6.12.0-syzkaller-10681-g65ae975e97d5 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:ip6mr_free_table+0xbd/0x120 net/ipv6/ip6mr.c:419 Code: 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 58 49 83 bc 24 c0 0e 00 00 00 74 09 e8 44 ef a9 f7 90 <0f> 0b 90 e8 3b ef a9 f7 48 8d 7b 38 e8 12 a3 96 f7 48 89 df be 0f RSP: 0018:ffffc90004267bd8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff88803c710000 RCX: ffffffff89e4d844 RDX: ffff88803c52c880 RSI: ffffffff89e4d87c RDI: ffff88803c578ec0 RBP: 0000000000000001 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000001 R12: ffff88803c578000 R13: ffff88803c710000 R14: ffff88803c710008 R15: dead000000000100 FS: 00007f7a855ee6c0(0000) GS:ffff88806a800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7a85689938 CR3: 000000003c492000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ip6mr_rules_exit+0x176/0x2d0 net/ipv6/ip6mr.c:283 ip6mr_net_exit_batch+0x53/0xa0 net/ipv6/ip6mr.c:1388 ops_exit_list+0x128/0x180 net/core/net_namespace.c:177 setup_net+0x4fe/0x860 net/core/net_namespace.c:394 copy_net_ns+0x2b4/0x6b0 net/core/net_namespace.c:500 create_new_namespaces+0x3ea/0xad0 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc0/0x1f0 kernel/nsproxy.c:228 ksys_unshare+0x45d/0xa40 kernel/fork.c:3334 __do_sys_unshare kernel/fork.c:3405 [inline] __se_sys_unshare kernel/fork.c:3403 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3403 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f7a856332d9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f7a855ee238 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f7a856bd308 RCX: 00007f7a856332d9 RDX: 00007f7a8560f8c6 RSI: 0000000000000000 RDI: 0000000062040200 RBP: 00007f7a856bd300 R08: 00007fff932160a7 R09: 00007f7a855ee6c0 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f7a856bd30c R13: 0000000000000000 R14: 00007fff93215fc0 R15: 00007fff932160a8 </TASK> The root cause is a network namespace creation failing after successful initialization of the ipmr subsystem. Such a case is not currently matched by the ipmr_can_free_table() helper. New namespaces are zeroed on allocation and inserted into net ns list only after successful creation; when deleting an ipmr table, the list next pointer can be NULL only on netns initialization failure. Update the ipmr_can_free_table() checks leveraging such condition. Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot+6e8cb445d4b43d006e0c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6e8cb445d4b43d006e0c Fixes: 11b6e701bce9 ("ipmr: add debug check for mr table cleanup") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/8bde975e21bbca9d9c27e36209b2dd4f1d7a3f00.1733212078.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>