summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2025-05-26Merge tag 'nf-next-25-05-23' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following batch contains Netfilter updates for net-next, specifically 26 patches: 5 patches adding/updating selftests, 4 fixes, 3 PREEMPT_RT fixes, and 14 patches to enhance nf_tables): 1) Improve selftest coverage for pipapo 4 bit group format, from Florian Westphal. 2) Fix incorrect dependencies when compiling a kernel without legacy ip{6}tables support, also from Florian. 3) Two patches to fix nft_fib vrf issues, including selftest updates to improve coverage, also from Florian Westphal. 4) Fix incorrect nesting in nft_tunnel's GENEVE support, from Fernando F. Mancera. 5) Three patches to fix PREEMPT_RT issues with nf_dup infrastructure and nft_inner to match in inner headers, from Sebastian Andrzej Siewior. 6) Integrate conntrack information into nft trace infrastructure, from Florian Westphal. 7) A series of 13 patches to allow to specify wildcard netdevice in netdev basechain and flowtables, eg. table netdev filter { chain ingress { type filter hook ingress devices = { eth0, eth1, vlan* } priority 0; policy accept; } } This also allows for runtime hook registration on NETDEV_{UN}REGISTER event, from Phil Sutter. netfilter pull request 25-05-23 * tag 'nf-next-25-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: (26 commits) selftests: netfilter: Torture nftables netdev hooks netfilter: nf_tables: Add notifications for hook changes netfilter: nf_tables: Support wildcard netdev hook specs netfilter: nf_tables: Sort labels in nft_netdev_hook_alloc() netfilter: nf_tables: Handle NETDEV_CHANGENAME events netfilter: nf_tables: Wrap netdev notifiers netfilter: nf_tables: Respect NETDEV_REGISTER events netfilter: nf_tables: Prepare for handling NETDEV_REGISTER events netfilter: nf_tables: Have a list of nf_hook_ops in nft_hook netfilter: nf_tables: Pass nf_hook_ops to nft_unregister_flowtable_hook() netfilter: nf_tables: Introduce nft_register_flowtable_ops() netfilter: nf_tables: Introduce nft_hook_find_ops{,_rcu}() netfilter: nf_tables: Introduce functions freeing nft_hook objects netfilter: nf_tables: add packets conntrack state to debug trace info netfilter: conntrack: make nf_conntrack_id callable without a module dependency netfilter: nf_dup_netdev: Move the recursion counter struct netdev_xmit netfilter: nft_inner: Use nested-BH locking for nft_pcpu_tun_ctx netfilter: nf_dup{4, 6}: Move duplication check to task_struct netfilter: nft_tunnel: fix geneve_opt dump selftests: netfilter: nft_fib.sh: add type and oif tests with and without VRFs ... ==================== Link: https://patch.msgid.link/20250523132712.458507-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-23netfilter: nf_dup{4, 6}: Move duplication check to task_structSebastian Andrzej Siewior
nf_skb_duplicated is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Due to the recursion involved, the simplest change is to make it a per-task variable. Move the per-CPU variable nf_skb_duplicated to task_struct and name it in_nf_duplicate. Add it to the existing bitfield so it doesn't use additional memory. Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-23netfilter: nf_tables: nft_fib: consistent l3mdev handlingFlorian Westphal
fib has two modes: 1. Obtain output device according to source or destination address 2. Obtain the type of the address, e.g. local, unicast, multicast. 'fib daddr type' should return 'local' if the address is configured in this netns or unicast otherwise. 'fib daddr . iif type' should return 'local' if the address is configured on the input interface or unicast otherwise, i.e. more restrictive. However, if the interface is part of a VRF, then 'fib daddr type' returns unicast even if the address is configured on the incoming interface. This is broken for both ipv4 and ipv6. In the ipv4 case, inet_dev_addr_type must only be used if the 'iif' or 'oif' (strict mode) was requested. Else inet_addr_type_dev_table() needs to be used and the correct dev argument must be passed as well so the correct fib (vrf) table is used. In the ipv6 case, the bug is similar, without strict mode, dev is NULL so .flowi6_l3mdev will be set to 0. Add a new 'nft_fib_l3mdev_master_ifindex_rcu()' helper and use that to init the .l3mdev structure member. For ipv6, use it from nft_fib6_flowi_init() which gets called from both the 'type' and the 'route' mode eval functions. This provides consistent behaviour for all modes for both ipv4 and ipv6: If strict matching is requested, the input respectively output device of the netfilter hooks is used. Otherwise, use skb->dev to obtain the l3mdev ifindex. Without this, most type checks in updated nft_fib.sh selftest fail: FAIL: did not find veth0 . 10.9.9.1 . local in fibtype4 FAIL: did not find veth0 . dead:1::1 . local in fibtype6 FAIL: did not find veth0 . dead:9::1 . local in fibtype6 FAIL: did not find tvrf . 10.0.1.1 . local in fibtype4 FAIL: did not find tvrf . 10.9.9.1 . local in fibtype4 FAIL: did not find tvrf . dead:1::1 . local in fibtype6 FAIL: did not find tvrf . dead:9::1 . local in fibtype6 FAIL: fib expression address types match (iif in vrf) (fib errounously returns 'unicast' for all of them, even though all of these addresses are local to the vrf). Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc8). Conflicts: 80f2ab46c2ee ("irdma: free iwdev->rf after removing MSI-X") 4bcc063939a5 ("ice, irdma: fix an off by one in error handling code") c24a65b6a27c ("iidc/ice/irdma: Update IDC to support multiple consumers") https://lore.kernel.org/20250513130630.280ee6c5@canb.auug.org.au No extra adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-22netfilter: nf_tables: nft_fib_ipv6: fix VRF ipv4/ipv6 result discrepancyFlorian Westphal
With a VRF, ipv4 and ipv6 FIB expression behave differently. fib daddr . iif oif Will return the input interface name for ipv4, but the real device for ipv6. Example: If VRF device name is tvrf and real (incoming) device is veth0. First round is ok, both ipv4 and ipv6 will yield 'veth0'. But in the second round (incoming device will be set to "tvrf"), ipv4 will yield "tvrf" whereas ipv6 returns "veth0" for the second round too. This makes ipv6 behave like ipv4. A followup patch will add a test case for this, without this change it will fail with: get element inet t fibif6iif { tvrf . dead:1::99 . tvrf } ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ FAIL: did not find tvrf . dead:1::99 . tvrf in fibif6iif Alternatively we could either not do anything at all or change ipv4 to also return the lower/real device, however, nft (userspace) doc says "iif: if fib lookup provides a route then check its output interface is identical to the packets input interface." which is what the nft fib ipv4 behaviour is. Fixes: f6d0cbcf09c5 ("netfilter: nf_tables: add fib expression") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-22Merge tag 'ipsec-2025-05-21' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2025-05-21 1) Fix some missing kfree_skb in the error paths of espintcp. From Sabrina Dubroca. 2) Fix a reference leak in espintcp. From Sabrina Dubroca. 3) Fix UDP GRO handling for ESPINUDP. From Tobias Brunner. 4) Fix ipcomp truesize computation on the receive path. From Sabrina Dubroca. 5) Sanitize marks before policy/state insertation. From Paul Chaignon. * tag 'ipsec-2025-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Sanitize marks before insert xfrm: ipcomp: fix truesize computation on receive xfrm: Fix UDP GRO handling for some corner cases espintcp: remove encap socket caching to avoid reference leak espintcp: fix skb leaks ==================== Link: https://patch.msgid.link/20250521054348.4057269-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-20ipv6: Revert two per-cpu var allocation for RTM_NEWROUTE.Kuniyuki Iwashima
These two commits preallocated two per-cpu variables in ip6_route_info_create() as fib_nh_common_init() and fib6_nh_init() were expected to be called under RCU. * commit d27b9c40dbd6 ("ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().") * commit 5720a328c3e9 ("ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().") Now these functions can be called without RCU and can use GFP_KERNEL. Let's revert the commits. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20ipv6: Pass gfp_flags down to ip6_route_info_create_nh().Kuniyuki Iwashima
Since commit c4837b9853e5 ("ipv6: Split ip6_route_info_create()."), ip6_route_info_create_nh() uses GFP_ATOMIC as it was expected to be called under RCU. Now, we can call it without RCU and use GFP_KERNEL. Let's pass gfp_flags to ip6_route_info_create_nh(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-7-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20Revert "ipv6: Factorise ip6_route_multipath_add()."Kuniyuki Iwashima
Commit 71c0efb6d12f ("ipv6: Factorise ip6_route_multipath_add().") split a loop in ip6_route_multipath_add() so that we can put rcu_read_lock() between ip6_route_info_create() and ip6_route_info_create_nh(). We no longer need to do so as ip6_route_info_create_nh() does not require RCU now. Let's revert the commit to simplify the code. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-6-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20Revert "ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during ↵Kuniyuki Iwashima
seg6local LWT setup" The previous patch fixed the same issue mentioned in commit 14a0087e7236 ("ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT setup"). Let's revert it. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250516022759.44392-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20ipv6: Narrow down RCU critical section in inet6_rtm_newroute().Kuniyuki Iwashima
Commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") added rcu_read_lock() covering ip6_route_info_create_nh() and __ip6_ins_rt() to guarantee that nexthop and netdev will not go away. However, as reported by syzkaller [0], ip_tun_build_state() calls dst_cache_init() with GFP_KERNEL during the RCU critical section. ip6_route_info_create_nh() fetches nexthop or netdev depending on whether RTA_NH_ID is set, and struct fib6_info holds a refcount of either of them by nexthop_get() or netdev_get_by_index(). netdev_get_by_index() looks up a dev and calls dev_hold() under RCU. So, we need RCU only around nexthop_find_by_id() and nexthop_get() ( and a few more nexthop code). Let's add rcu_read_lock() there and remove rcu_read_lock() in ip6_route_add() and ip6_route_multipath_add(). Now these functions called from fib6_add() need RCU: - inet6_rt_notify() - fib6_drop_pcpu_from() (via fib6_purge_rt()) - rt6_flush_exceptions() (via fib6_purge_rt()) - ip6_ignore_linkdown() (via rt6_multipath_rebalance()) All callers of inet6_rt_notify() need RCU, so rcu_read_lock() is added there. [0]: [ BUG: Invalid wait context ] 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 Tainted: G W ---------------------------- syz-executor234/5832 is trying to lock: ffffffff8e021688 (pcpu_alloc_mutex){+.+.}-{4:4}, at: pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782 other info that might help us debug this: context-{5:5} 1 lock held by syz-executor234/5832: 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:331 [inline] 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:841 [inline] 0: ffffffff8df3b860 (rcu_read_lock){....}-{1:3}, at: ip6_route_add+0x4d/0x2f0 net/ipv6/route.c:3913 stack backtrace: CPU: 0 UID: 0 PID: 5832 Comm: syz-executor234 Tainted: G W 6.15.0-rc4-syzkaller-00746-g836b313a14a3 #0 PREEMPT(full) Tainted: [W]=WARN Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/29/2025 Call Trace: <TASK> dump_stack_lvl+0x189/0x250 lib/dump_stack.c:120 print_lock_invalid_wait_context kernel/locking/lockdep.c:4831 [inline] check_wait_context kernel/locking/lockdep.c:4903 [inline] __lock_acquire+0xbcf/0xd20 kernel/locking/lockdep.c:5185 lock_acquire+0x120/0x360 kernel/locking/lockdep.c:5866 __mutex_lock_common kernel/locking/mutex.c:601 [inline] __mutex_lock+0x182/0xe80 kernel/locking/mutex.c:746 pcpu_alloc_noprof+0x284/0x16b0 mm/percpu.c:1782 dst_cache_init+0x37/0xc0 net/core/dst_cache.c:145 ip_tun_build_state+0x193/0x6b0 net/ipv4/ip_tunnel_core.c:687 lwtunnel_build_state+0x381/0x4c0 net/core/lwtunnel.c:137 fib_nh_common_init+0x129/0x460 net/ipv4/fib_semantics.c:635 fib6_nh_init+0x15e4/0x2030 net/ipv6/route.c:3669 ip6_route_info_create_nh+0x139/0x870 net/ipv6/route.c:3866 ip6_route_add+0xf6/0x2f0 net/ipv6/route.c:3915 inet6_rtm_newroute+0x284/0x1c50 net/ipv6/route.c:5732 rtnetlink_rcv_msg+0x7cc/0xb70 net/core/rtnetlink.c:6955 netlink_rcv_skb+0x219/0x490 net/netlink/af_netlink.c:2534 netlink_unicast_kernel net/netlink/af_netlink.c:1313 [inline] netlink_unicast+0x758/0x8d0 net/netlink/af_netlink.c:1339 netlink_sendmsg+0x805/0xb30 net/netlink/af_netlink.c:1883 sock_sendmsg_nosec net/socket.c:712 [inline] __sock_sendmsg+0x219/0x270 net/socket.c:727 ____sys_sendmsg+0x505/0x830 net/socket.c:2566 ___sys_sendmsg+0x21f/0x2a0 net/socket.c:2620 __sys_sendmsg net/socket.c:2652 [inline] __do_sys_sendmsg net/socket.c:2657 [inline] __se_sys_sendmsg net/socket.c:2655 [inline] __x64_sys_sendmsg+0x19b/0x260 net/socket.c:2655 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf6/0x210 arch/x86/entry/syscall_64.c:94 Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") Reported-by: syzbot+bcc12d6799364500fbec@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bcc12d6799364500fbec Reported-by: Eric Dumazet <edumazet@google.com> Closes: https://lore.kernel.org/netdev/CANn89i+r1cGacVC_6n3-A-WSkAa_Nr+pmxJ7Gt+oP-P9by2aGw@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-4-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20inet: Remove rtnl_is_held arg of lwtunnel_valid_encap_type(_attr)?().Kuniyuki Iwashima
Commit f130a0cc1b4f ("inet: fix lwtunnel_valid_encap_type() lock imbalance") added the rtnl_is_held argument as a temporary fix while I'm converting nexthop and IPv6 routing table to per-netns RTNL or RCU. Now all callers of lwtunnel_valid_encap_type() do not hold RTNL. Let's remove the argument. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-3-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20ipv6: Remove rcu_read_lock() in fib6_get_table().Kuniyuki Iwashima
Once allocated, the IPv6 routing table is not freed until netns is dismantled. fib6_get_table() uses rcu_read_lock() while iterating net->ipv6.fib_table_hash[], but it's not needed and rather confusing. Because some callers have this pattern, table = fib6_get_table(); rcu_read_lock(); /* ... use table here ... */ rcu_read_unlock(); [ See: addrconf_get_prefix_route(), ip6_route_del(), rt6_get_route_info(), rt6_get_dflt_router() ] and this looks illegal but is actually safe. Let's remove rcu_read_lock() in fib6_get_table() and pass true to the last argument of hlist_for_each_entry_rcu() to bypass the RCU check. Note that protection is not needed but RCU helper is used to avoid data-race. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250516022759.44392-2-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-16mr: consolidate the ipmr_can_free_table() checks.Paolo Abeni
Guoyu Yin reported a splat in the ipmr netns cleanup path: WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_free_table net/ipv4/ipmr.c:440 [inline] WARNING: CPU: 2 PID: 14564 at net/ipv4/ipmr.c:440 ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361 Modules linked in: CPU: 2 UID: 0 PID: 14564 Comm: syz.4.838 Not tainted 6.14.0 #1 Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:ipmr_free_table net/ipv4/ipmr.c:440 [inline] RIP: 0010:ipmr_rules_exit+0x135/0x1c0 net/ipv4/ipmr.c:361 Code: ff df 48 c1 ea 03 80 3c 02 00 75 7d 48 c7 83 60 05 00 00 00 00 00 00 5b 5d 41 5c 41 5d 41 5e e9 71 67 7f 00 e8 4c 2d 8a fd 90 <0f> 0b 90 eb 93 e8 41 2d 8a fd 0f b6 2d 80 54 ea 01 31 ff 89 ee e8 RSP: 0018:ffff888109547c58 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888108c12dc0 RCX: ffffffff83e09868 RDX: ffff8881022b3300 RSI: ffffffff83e098d4 RDI: 0000000000000005 RBP: ffff888104288000 R08: 0000000000000000 R09: ffffed10211825c9 R10: 0000000000000001 R11: ffff88801816c4a0 R12: 0000000000000001 R13: ffff888108c13320 R14: ffff888108c12dc0 R15: fffffbfff0b74058 FS: 00007f84f39316c0(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f84f3930f98 CR3: 0000000113b56000 CR4: 0000000000350ef0 Call Trace: <TASK> ipmr_net_exit_batch+0x50/0x90 net/ipv4/ipmr.c:3160 ops_exit_list+0x10c/0x160 net/core/net_namespace.c:177 setup_net+0x47d/0x8e0 net/core/net_namespace.c:394 copy_net_ns+0x25d/0x410 net/core/net_namespace.c:516 create_new_namespaces+0x3f6/0xaf0 kernel/nsproxy.c:110 unshare_nsproxy_namespaces+0xc3/0x180 kernel/nsproxy.c:228 ksys_unshare+0x78d/0x9a0 kernel/fork.c:3342 __do_sys_unshare kernel/fork.c:3413 [inline] __se_sys_unshare kernel/fork.c:3411 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:3411 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xa6/0x1a0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f84f532cc29 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f84f3931038 EFLAGS: 00000246 ORIG_RAX: 0000000000000110 RAX: ffffffffffffffda RBX: 00007f84f5615fa0 RCX: 00007f84f532cc29 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000400 RBP: 00007f84f53fba18 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 00007f84f5615fa0 R15: 00007fff51c5f328 </TASK> The running kernel has CONFIG_IP_MROUTE_MULTIPLE_TABLES disabled, and the sanity check for such build is still too loose. Address the issue consolidating the relevant sanity check in a single helper regardless of the kernel configuration. Also share it between the ipv4 and ipv6 code. Reported-by: Guoyu Yin <y04609127@gmail.com> Fixes: 50b94204446e ("ipmr: tune the ipmr_can_free_table() checks.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/372dc261e1bf12742276e1b984fc5a071b7fc5a8.1747321903.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15ipv6: sr: Use nested-BH locking for hmac_storageSebastian Andrzej Siewior
hmac_storage is a per-CPU variable and relies on disabled BH for its locking. Without per-CPU locking in local_bh_disable() on PREEMPT_RT this data structure requires explicit locking. Add a local_lock_t to the data structure and use local_lock_nested_bh() for locking. This change adds only lockdep coverage and does not alter the functional behaviour for !PREEMPT_RT. Cc: David Ahern <dsahern@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20250512092736.229935-5-bigeasy@linutronix.de Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13net: devmem: Implement TX pathMina Almasry
Augment dmabuf binding to be able to handle TX. Additional to all the RX binding, we also create tx_vec needed for the TX path. Provide API for sendmsg to be able to send dmabufs bound to this device: - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from. - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf. Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY implementation, while disabling instances where MSG_ZEROCOPY falls back to copying. We additionally pipe the binding down to the new zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems instead of the traditional page netmems. We also special case skb_frag_dma_map to return the dma-address of these dmabuf net_iovs instead of attempting to map pages. The TX path may release the dmabuf in a context where we cannot wait. This happens when the user unbinds a TX dmabuf while there are still references to its netmems in the TX path. In that case, the netmems will be put_netmem'd from a context where we can't unmap the dmabuf, Resolve this by making __net_devmem_dmabuf_binding_free schedule_work'd. Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of the implementation came from devmem TCP RFC v1[1], which included the TX path, but Stan did all the rebasing on top of netmem/net_iov. Cc: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com> Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc6). No conflicts. Adjacent changes: net/core/dev.c: 08e9f2d584c4 ("net: Lock netdevices during dev_shutdown") a82dc19db136 ("net: avoid potential race between netdev_get_by_index_lock() and netns switch") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05gre: Fix again IPv6 link-local address generation.Guillaume Nault
Use addrconf_addr_gen() to generate IPv6 link-local addresses on GRE devices in most cases and fall back to using add_v4_addrs() only in case the GRE configuration is incompatible with addrconf_addr_gen(). GRE used to use addrconf_addr_gen() until commit e5dd729460ca ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address") restricted this use to gretap and ip6gretap devices, and created add_v4_addrs() (borrowed from SIT) for non-Ethernet GRE ones. The original problem came when commit 9af28511be10 ("addrconf: refuse isatap eui64 for INADDR_ANY") made __ipv6_isatap_ifid() fail when its addr parameter was 0. The commit says that this would create an invalid address, however, I couldn't find any RFC saying that the generated interface identifier would be wrong. Anyway, since gre over IPv4 devices pass their local tunnel address to __ipv6_isatap_ifid(), that commit broke their IPv6 link-local address generation when the local address was unspecified. Then commit e5dd729460ca ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address") tried to fix that case by defining add_v4_addrs() and calling it to generate the IPv6 link-local address instead of using addrconf_addr_gen() (apart for gretap and ip6gretap devices, which would still use the regular addrconf_addr_gen(), since they have a MAC address). That broke several use cases because add_v4_addrs() isn't properly integrated into the rest of IPv6 Neighbor Discovery code. Several of these shortcomings have been fixed over time, but add_v4_addrs() remains broken on several aspects. In particular, it doesn't send any Router Sollicitations, so the SLAAC process doesn't start until the interface receives a Router Advertisement. Also, add_v4_addrs() mostly ignores the address generation mode of the interface (/proc/sys/net/ipv6/conf/*/addr_gen_mode), thus breaking the IN6_ADDR_GEN_MODE_RANDOM and IN6_ADDR_GEN_MODE_STABLE_PRIVACY cases. Fix the situation by using add_v4_addrs() only in the specific scenario where the normal method would fail. That is, for interfaces that have all of the following characteristics: * run over IPv4, * transport IP packets directly, not Ethernet (that is, not gretap interfaces), * tunnel endpoint is INADDR_ANY (that is, 0), * device address generation mode is EUI64. In all other cases, revert back to the regular addrconf_addr_gen(). Also, remove the special case for ip6gre interfaces in add_v4_addrs(), since ip6gre devices now always use addrconf_addr_gen() instead. Note: This patch was originally applied as commit 183185a18ff9 ("gre: Fix IPv6 link-local address generation."). However, it was then reverted by commit fc486c2d060f ("Revert "gre: Fix IPv6 link-local address generation."") because it uncovered another bug that ended up breaking net/forwarding/ip6gre_custom_multipath_hash.sh. That other bug has now been fixed by commit 4d0ab3a6885e ("ipv6: Start path selection from the first nexthop"). Therefore we can now revive this GRE patch (no changes since original commit 183185a18ff9 ("gre: Fix IPv6 link-local address generation."). Fixes: e5dd729460ca ("ip/ip6_gre: use the same logic as SIT interfaces when computing v6LL address") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/a88cc5c4811af36007645d610c95102dccb360a6.1746225214.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-05netfilter: bridge: Move specific fragmented packet to slow_path instead of ↵Huajian Yang
dropping it The config NF_CONNTRACK_BRIDGE will change the bridge forwarding for fragmented packets. The original bridge does not know that it is a fragmented packet and forwards it directly, after NF_CONNTRACK_BRIDGE is enabled, function nf_br_ip_fragment and br_ip6_fragment will check the headroom. In original br_forward, insufficient headroom of skb may indeed exist, but there's still a way to save the skb in the device driver after dev_queue_xmit.So droping the skb will change the original bridge forwarding in some cases. Fixes: 3c171f496ef5 ("netfilter: bridge: add connection tracking system") Signed-off-by: Huajian Yang <huajianyang@asrmicro.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-05-02ipv6: Restore fib6_config validation for SIOCADDRT.Kuniyuki Iwashima
syzkaller reported out-of-bounds read in ipv6_addr_prefix(), where the prefix length was over 128. The cited commit accidentally removed some fib6_config validation from the ioctl path. Let's restore the validation. [0]: BUG: KASAN: slab-out-of-bounds in ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814) Read of size 1 at addr ff11000138020ad4 by task repro/261 CPU: 3 UID: 0 PID: 261 Comm: repro Not tainted 6.15.0-rc3-00614-g0d15a26b247d #87 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:123) print_report (mm/kasan/report.c:409 mm/kasan/report.c:521) kasan_report (mm/kasan/report.c:636) ip6_route_info_create (./include/net/ipv6.h:616 net/ipv6/route.c:3814) ip6_route_add (net/ipv6/route.c:3902) ipv6_route_ioctl (net/ipv6/route.c:4523) inet6_ioctl (net/ipv6/af_inet6.c:577) sock_do_ioctl (net/socket.c:1190) sock_ioctl (net/socket.c:1314) __x64_sys_ioctl (fs/ioctl.c:51 fs/ioctl.c:906 fs/ioctl.c:892 fs/ioctl.c:892) do_syscall_64 (arch/x86/entry/syscall_64.c:63 arch/x86/entry/syscall_64.c:94) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) RIP: 0033:0x7f518fb2de5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 9f 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007fff14f38d18 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f518fb2de5d RDX: 00000000200015c0 RSI: 000000000000890b RDI: 0000000000000003 RBP: 00007fff14f38d30 R08: 0000000000000800 R09: 0000000000000800 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fff14f38e48 R13: 0000000000401136 R14: 0000000000403df0 R15: 00007f518fd3c000 </TASK> Fixes: fa76c1674f2e ("ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Yi Lai <yi1.lai@linux.intel.com> Closes: https://lore.kernel.org/netdev/aBAcKDEFoN%2FLntBF@ly-workstation/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250501005335.53683-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01ipv6: sr: switch to GFP_ATOMIC flag to allocate memory during seg6local LWT ↵Andrea Mayer
setup Recent updates to the locking mechanism that protects IPv6 routing tables [1] have affected the SRv6 networking subsystem. Such changes cause problems with some SRv6 Endpoints behaviors, like End.B6.Encaps and also impact SRv6 counters. Starting from commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE."), the inet6_rtm_newroute() function no longer needs to acquire the RTNL lock for creating and configuring IPv6 routes and set up lwtunnels. The RTNL lock can be avoided because the ip6_route_add() function finishes setting up a new route in a section protected by RCU. This makes sure that no dev/nexthops can disappear during the operation. Because of this, the steps for setting up lwtunnels - i.e., calling lwtunnel_build_state() - are now done in a RCU lock section and not under the RTNL lock anymore. However, creating and configuring a lwtunnel instance in an RCU-protected section can be problematic when that tunnel needs to allocate memory using the GFP_KERNEL flag. For example, the following trace shows what happens when an SRv6 End.B6.Encaps behavior is instantiated after commit 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE."): [ 3061.219696] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 [ 3061.226136] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 445, name: ip [ 3061.232101] preempt_count: 0, expected: 0 [ 3061.235414] RCU nest depth: 1, expected: 0 [ 3061.238622] 1 lock held by ip/445: [ 3061.241458] #0: ffffffff83ec64a0 (rcu_read_lock){....}-{1:3}, at: ip6_route_add+0x41/0x1e0 [ 3061.248520] CPU: 1 UID: 0 PID: 445 Comm: ip Not tainted 6.15.0-rc3-micro-vm-dev-00590-ge527e891492d #2058 PREEMPT(full) [ 3061.248532] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 3061.248549] Call Trace: [ 3061.248620] <TASK> [ 3061.248633] dump_stack_lvl+0xa9/0xc0 [ 3061.248846] __might_resched+0x218/0x360 [ 3061.248871] __kmalloc_node_track_caller_noprof+0x332/0x4e0 [ 3061.248889] ? rcu_is_watching+0x3a/0x70 [ 3061.248902] ? parse_nla_srh+0x56/0xa0 [ 3061.248938] kmemdup_noprof+0x1c/0x40 [ 3061.248952] parse_nla_srh+0x56/0xa0 [ 3061.248969] seg6_local_build_state+0x2e0/0x580 [ 3061.248992] ? __lock_acquire+0xaff/0x1cd0 [ 3061.249013] ? do_raw_spin_lock+0x111/0x1d0 [ 3061.249027] ? __pfx_seg6_local_build_state+0x10/0x10 [ 3061.249068] ? lwtunnel_build_state+0xe1/0x3a0 [ 3061.249274] lwtunnel_build_state+0x10d/0x3a0 [ 3061.249303] fib_nh_common_init+0xce/0x1e0 [ 3061.249337] ? __pfx_fib_nh_common_init+0x10/0x10 [ 3061.249352] ? in6_dev_get+0xaf/0x1f0 [ 3061.249369] ? __rcu_read_unlock+0x64/0x2e0 [ 3061.249392] fib6_nh_init+0x290/0xc30 [ 3061.249422] ? __pfx_fib6_nh_init+0x10/0x10 [ 3061.249447] ? __lock_acquire+0xaff/0x1cd0 [ 3061.249459] ? _raw_spin_unlock_irqrestore+0x22/0x70 [ 3061.249624] ? ip6_route_info_create+0x423/0x520 [ 3061.249641] ? rcu_is_watching+0x3a/0x70 [ 3061.249683] ip6_route_info_create_nh+0x190/0x390 [ 3061.249715] ip6_route_add+0x71/0x1e0 [ 3061.249730] ? __pfx_inet6_rtm_newroute+0x10/0x10 [ 3061.249743] inet6_rtm_newroute+0x426/0xc50 [ 3061.249764] ? avc_has_perm_noaudit+0x13d/0x360 [ 3061.249853] ? __pfx_inet6_rtm_newroute+0x10/0x10 [ 3061.249905] ? __lock_acquire+0xaff/0x1cd0 [ 3061.249962] ? rtnetlink_rcv_msg+0x52f/0x890 [ 3061.249996] ? __pfx_inet6_rtm_newroute+0x10/0x10 [ 3061.250012] rtnetlink_rcv_msg+0x551/0x890 [ 3061.250040] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 3061.250065] ? __lock_acquire+0xaff/0x1cd0 [ 3061.250092] netlink_rcv_skb+0xbd/0x1f0 [ 3061.250108] ? __pfx_rtnetlink_rcv_msg+0x10/0x10 [ 3061.250124] ? __pfx_netlink_rcv_skb+0x10/0x10 [ 3061.250179] ? netlink_deliver_tap+0x10b/0x700 [ 3061.250210] netlink_unicast+0x2e7/0x410 [ 3061.250232] ? __pfx_netlink_unicast+0x10/0x10 [ 3061.250241] ? __lock_acquire+0xaff/0x1cd0 [ 3061.250280] netlink_sendmsg+0x366/0x670 [ 3061.250306] ? __pfx_netlink_sendmsg+0x10/0x10 [ 3061.250313] ? find_held_lock+0x2d/0xa0 [ 3061.250344] ? import_ubuf+0xbc/0xf0 [ 3061.250370] ? __pfx_netlink_sendmsg+0x10/0x10 [ 3061.250381] __sock_sendmsg+0x13e/0x150 [ 3061.250420] ____sys_sendmsg+0x33d/0x450 [ 3061.250442] ? __pfx_____sys_sendmsg+0x10/0x10 [ 3061.250453] ? __pfx_copy_msghdr_from_user+0x10/0x10 [ 3061.250489] ? __pfx_slab_free_after_rcu_debug+0x10/0x10 [ 3061.250514] ___sys_sendmsg+0xe5/0x160 [ 3061.250530] ? __pfx____sys_sendmsg+0x10/0x10 [ 3061.250568] ? __lock_acquire+0xaff/0x1cd0 [ 3061.250617] ? find_held_lock+0x2d/0xa0 [ 3061.250678] ? __virt_addr_valid+0x199/0x340 [ 3061.250704] ? preempt_count_sub+0xf/0xc0 [ 3061.250736] __sys_sendmsg+0xca/0x140 [ 3061.250750] ? __pfx___sys_sendmsg+0x10/0x10 [ 3061.250786] ? syscall_exit_to_user_mode+0xa2/0x1e0 [ 3061.250825] do_syscall_64+0x62/0x140 [ 3061.250844] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3061.250855] RIP: 0033:0x7f0b042ef914 [ 3061.250868] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53 [ 3061.250876] RSP: 002b:00007ffc2d113ef8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 3061.250885] RAX: ffffffffffffffda RBX: 00000000680f93fa RCX: 00007f0b042ef914 [ 3061.250891] RDX: 0000000000000000 RSI: 00007ffc2d113f60 RDI: 0000000000000003 [ 3061.250897] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008 [ 3061.250902] R10: fffffffffffff26d R11: 0000000000000246 R12: 0000000000000001 [ 3061.250907] R13: 000055a961f8a520 R14: 000055a961f63eae R15: 00007ffc2d115270 [ 3061.250952] </TASK> To solve this issue, we replace the GFP_KERNEL flag with the GFP_ATOMIC one in those SRv6 Endpoints that need to allocate memory during the setup phase. This change makes sure that memory allocations are handled in a way that works with RCU critical sections. [1] - https://lore.kernel.org/all/20250418000443.43734-1-kuniyu@amazon.com/ Fixes: 169fd62799e8 ("ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.") Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20250429132453.31605-1-andrea.mayer@uniroma2.it Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc5). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01net: use sock_gen_put() when sk_state is TCP_TIME_WAITJibin Zhang
It is possible for a pointer of type struct inet_timewait_sock to be returned from the functions __inet_lookup_established() and __inet6_lookup_established(). This can cause a crash when the returned pointer is of type struct inet_timewait_sock and sock_put() is called on it. The following is a crash call stack that shows sk->sk_wmem_alloc being accessed in sk_free() during the call to sock_put() on a struct inet_timewait_sock pointer. To avoid this issue, use sock_gen_put() instead of sock_put() when sk->sk_state is TCP_TIME_WAIT. mrdump.ko ipanic() + 120 vmlinux notifier_call_chain(nr_to_call=-1, nr_calls=0) + 132 vmlinux atomic_notifier_call_chain(val=0) + 56 vmlinux panic() + 344 vmlinux add_taint() + 164 vmlinux end_report() + 136 vmlinux kasan_report(size=0) + 236 vmlinux report_tag_fault() + 16 vmlinux do_tag_recovery() + 16 vmlinux __do_kernel_fault() + 88 vmlinux do_bad_area() + 28 vmlinux do_tag_check_fault() + 60 vmlinux do_mem_abort() + 80 vmlinux el1_abort() + 56 vmlinux el1h_64_sync_handler() + 124 vmlinux > 0xFFFFFFC080011294() vmlinux __lse_atomic_fetch_add_release(v=0xF2FFFF82A896087C) vmlinux __lse_atomic_fetch_sub_release(v=0xF2FFFF82A896087C) vmlinux arch_atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C) + 8 vmlinux raw_atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C) + 8 vmlinux atomic_fetch_sub_release(i=1, v=0xF2FFFF82A896087C) + 8 vmlinux __refcount_sub_and_test(i=1, r=0xF2FFFF82A896087C, oldp=0) + 8 vmlinux __refcount_dec_and_test(r=0xF2FFFF82A896087C, oldp=0) + 8 vmlinux refcount_dec_and_test(r=0xF2FFFF82A896087C) + 8 vmlinux sk_free(sk=0xF2FFFF82A8960700) + 28 vmlinux sock_put() + 48 vmlinux tcp6_check_fraglist_gro() + 236 vmlinux tcp6_gro_receive() + 624 vmlinux ipv6_gro_receive() + 912 vmlinux dev_gro_receive() + 1116 vmlinux napi_gro_receive() + 196 ccmni.ko ccmni_rx_callback() + 208 ccmni.ko ccmni_queue_recv_skb() + 388 ccci_dpmaif.ko dpmaif_rxq_push_thread() + 1088 vmlinux kthread() + 268 vmlinux 0xFFFFFFC08001F30C() Fixes: c9d1d23e5239 ("net: add heuristic for enabling TCP fraglist GRO") Signed-off-by: Jibin Zhang <jibin.zhang@mediatek.com> Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/20250429020412.14163-1-shiming.cheng@mediatek.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-29ip: load balance tcp connections to single dst addr and portWillem de Bruijn
Load balance new TCP connections across nexthops also when they connect to the same service at a single remote address and port. This affects only port-based multipath hashing: fib_multipath_hash_policy 1 or 3. Local connections must choose both a source address and port when connecting to a remote service, in ip_route_connect. This "chicken-and-egg problem" (commit 2d7192d6cbab ("ipv4: Sanitize and simplify ip_route_{connect,newports}()")) is resolved by first selecting a source address, by looking up a route using the zero wildcard source port and address. As a result multiple connections to the same destination address and port have no entropy in fib_multipath_hash. This is not a problem when forwarding, as skb-based hashing has a 4-tuple. Nor when establishing UDP connections, as autobind there selects a port before reaching ip_route_connect. Load balance also TCP, by using a random port in fib_multipath_hash. Port assignment in inet_hash_connect is not atomic with ip_route_connect. Thus ports are unpredictable, effectively random. Implementation details: Do not actually pass a random fl4_sport, as that affects not only hashing, but routing more broadly, and can match a source port based policy route, which existing wildcard port 0 will not. Instead, define a new wildcard flowi flag that is used only for hashing. Selecting a random source is equivalent to just selecting a random hash entirely. But for code clarity, follow the normal 4-tuple hash process and only update this field. fib_multipath_hash can be reached with zero sport from other code paths, so explicitly pass this flowi flag, rather than trying to infer this case in the function itself. Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20250424143549.669426-3-willemdebruijn.kernel@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.Kuniyuki Iwashima
Now we are ready to remove RTNL from SIOCADDRT and RTM_NEWROUTE. The remaining things to do are 1. pass false to lwtunnel_valid_encap_type_attr() 2. use rcu_dereference_rtnl() in fib6_check_nexthop() 3. place rcu_read_lock() before ip6_route_info_create_nh(). Let's complete the RTNL-free conversion. When each CPU-X adds 100000 routes on table-X in a batch concurrently on c7a.metal-48xl EC2 instance with 192 CPUs, without this series: $ sudo ./route_test.sh ... added 19200000 routes (100000 routes * 192 tables). time elapsed: 191577 milliseconds. with this series: $ sudo ./route_test.sh ... added 19200000 routes (100000 routes * 192 tables). time elapsed: 62854 milliseconds. I changed the number of routes in each table (1000 ~ 100000) and consistently saw it finish 3x faster with this series. Note that now every caller of lwtunnel_valid_encap_type() passes false as the last argument, and this can be removed later. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-16-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Protect nh->f6i_list with spinlock and flag.Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. Then, we may be going to add a route tied to a dying nexthop. The nexthop itself is not freed during the RCU grace period, but if we link a route after __remove_nexthop_fib() is called for the nexthop, the route will be leaked. To avoid the race between IPv6 route addition under RCU vs nexthop deletion under RTNL, let's add a dead flag and protect it and nh->f6i_list with a spinlock. __remove_nexthop_fib() acquires the nexthop's spinlock and sets false to nh->dead, then calls ip6_del_rt() for the linked route one by one without the spinlock because fib6_purge_rt() acquires it later. While adding an IPv6 route, fib6_add() acquires the nexthop lock and checks the dead flag just before inserting the route. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-15-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Defer fib6_purge_rt() in fib6_add_rt2node() to fib6_add().Kuniyuki Iwashima
The next patch adds per-nexthop spinlock which protects nh->f6i_list. When rt->nh is not NULL, fib6_add_rt2node() will be called under the lock. fib6_add_rt2node() could call fib6_purge_rt() for another route, which could holds another nexthop lock. Then, deadlock could happen between two nexthops. Let's defer fib6_purge_rt() after fib6_add_rt2node(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-14-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Protect fib6_link_table() with spinlock.Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. If the request specifies a new table ID, fib6_new_table() is called to create a new routing table. Two concurrent requests could specify the same table ID, so we need a lock to protect net->ipv6.fib_table_hash[h]. Let's add a spinlock to protect the hash bucket linkage. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-13-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Factorise ip6_route_multipath_add().Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely on RCU to guarantee dev and nexthop lifetime. Then, the RCU section will start before ip6_route_info_create_nh() in ip6_route_multipath_add(), but ip6_route_info_create() is called in the same loop and will sleep. Let's split the loop into ip6_route_mpath_info_create() and ip6_route_mpath_info_create_nh(). Note that ip6_route_info_append() is now integrated into ip6_route_mpath_info_create_nh() because we need to call different free functions for nexthops that passed ip6_route_info_create_nh(). In case of failure, the remaining nexthops that ip6_route_info_create_nh() has not been called for will be freed by ip6_route_mpath_info_cleanup(). OTOH, if a nexthop passes ip6_route_info_create_nh(), it will be linked to a local temporary list, which will be spliced back to rt6_nh_list. In case of failure, these nexthops will be released by fib6_info_release() in ip6_route_multipath_add(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-12-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Rename rt6_nh.next to rt6_nh.list.Kuniyuki Iwashima
ip6_route_multipath_add() allocates struct rt6_nh for each config of multipath routes to link them to a local list rt6_nh_list. struct rt6_nh.next is the list node of each config, so the name is quite misleading. Let's rename it to list. Suggested-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/netdev/c9bee472-c94e-4878-8cc2-1512b2c54db5@redhat.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-11-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Don't pass net to ip6_route_info_append().Kuniyuki Iwashima
net is not used in ip6_route_info_append() after commit 36f19d5b4f99 ("net/ipv6: Remove extra call to ip6_convert_metrics for multipath case"). Let's remove the argument. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-10-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().Kuniyuki Iwashima
ip6_route_info_create_nh() will be called under RCU. It calls fib_nh_common_init() and allocates nhc->nhc_pcpu_rth_output. As with the reason for rt->fib6_nh->rt6i_pcpu, we want to avoid GFP_ATOMIC allocation for nhc->nhc_pcpu_rth_output under RCU. Let's preallocate it in ip6_route_info_create(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-9-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().Kuniyuki Iwashima
ip6_route_info_create_nh() will be called under RCU. Then, fib6_nh_init() is also under RCU, but per-cpu memory allocation is very likely to fail with GFP_ATOMIC while bulk-adding IPv6 routes and we would see a bunch of this message in dmesg. percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left percpu: allocation failed, size=8 align=8 atomic=1, atomic alloc failed, no space left Let's preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create(). If something fails before the original memory allocation in fib6_nh_init(), ip6_route_info_create_nh() calls fib6_info_release(), which releases the preallocated per-cpu memory. Note that rt->fib6_nh->rt6i_pcpu is not preallocated when called via ipv6_stub, so we still need alloc_percpu_gfp() in fib6_nh_init(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-8-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Split ip6_route_info_create().Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT and rely on RCU to guarantee dev and nexthop lifetime. Then, we want to allocate as much as possible before entering the RCU section. The RCU section will start in the middle of ip6_route_info_create(), and this is problematic for ip6_route_multipath_add() that calls ip6_route_info_create() multiple times. Let's split ip6_route_info_create() into two parts; one for memory allocation and another for nexthop setup. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-7-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Move nexthop_find_by_id() after fib6_info_alloc().Kuniyuki Iwashima
We will get rid of RTNL from RTM_NEWROUTE and SIOCADDRT. Then, we must perform two lookups for nexthop and dev under RCU to guarantee their lifetime. ip6_route_info_create() calls nexthop_find_by_id() first if RTA_NH_ID is specified, and then allocates struct fib6_info. nexthop_find_by_id() must be called under RCU, but we do not want to use GFP_ATOMIC for memory allocation here, which will be likely to fail in ip6_route_multipath_add(). Let's move nexthop_find_by_id() after the memory allocation so that we can later split ip6_route_info_create() into two parts: the sleepable part and the RCU part. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-6-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().Kuniyuki Iwashima
In ip6_route_multipath_add(), we call rt6_qualify_for_ecmp() for each entry. If it returns false, the request fails. rt6_qualify_for_ecmp() returns false if either of the conditions below is true: 1. f6i->fib6_flags has RTF_ADDRCONF 2. f6i->nh is not NULL 3. f6i->fib6_nh->fib_nh_gw_family is AF_UNSPEC 1 is unnecessary because rtm_to_fib6_config() never sets RTF_ADDRCONF to cfg->fc_flags. 2. is equivalent with cfg->fc_nh_id. 3. can be replaced by checking RTF_GATEWAY in the base and each multipath entry because AF_INET6 is set to f6i->fib6_nh->fib_nh_gw_family only when cfg.fc_is_fdb is true or RTF_GATEWAY is set, but the former is always false. These checks do not require RCU and can be done earlier. Let's perform the equivalent checks in rtm_to_fib6_multipath_config(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-5-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Move some validation from ip6_route_info_create() to rtm_to_fib6_config().Kuniyuki Iwashima
ip6_route_info_create() is called from 3 functions: * ip6_route_add() * ip6_route_multipath_add() * addrconf_f6i_alloc() addrconf_f6i_alloc() does not need validation for struct fib6_config in ip6_route_info_create(). ip6_route_multipath_add() calls ip6_route_info_create() for multiple routes with slightly different fib6_config instances, which is copied from the base config passed from userspace. So, we need not validate the same config repeatedly. Let's move such validation into rtm_to_fib6_config(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20250418000443.43734-4-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.Kuniyuki Iwashima
Basically, removing an IPv6 route does not require RTNL because the IPv6 routing tables are protected by per table lock. inet6_rtm_delroute() calls nexthop_find_by_id() to check if the nexthop specified by RTA_NH_ID exists. nexthop uses rbtree and the top-down walk can be safely performed under RCU. ip6_route_del() already relies on RCU and the table lock, but we need to extend the RCU critical section a bit more to cover __ip6_del_rt(). For example, nexthop_for_each_fib6_nh() and inet6_rt_notify() needs RCU. Let's call nexthop_find_by_id() and __ip6_del_rt() under RCU and get rid of RTNL from inet6_rtm_delroute() and SIOCDELRT. Even if the nexthop is removed after rcu_read_unlock() in inet6_rtm_delroute(), __remove_nexthop_fib() cleans up the routes tied to the nexthop, and ip6_route_del() returns -ESRCH. So the request was at least valid as of nexthop_find_by_id(), and it's just a matter of timing. Note that we need to pass false to lwtunnel_valid_encap_type_attr(). The following patches also use the newroute bool. Note also that fib6_get_table() does not require RCU because once allocated fib6_table is not freed until netns dismantle. I will post a follow-up series to convert such callers to RCU-lockless version. [0] Link: https://lore.kernel.org/netdev/20250417174557.65721-1-kuniyu@amazon.com/ #[0] Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-3-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-24ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().Kuniyuki Iwashima
We will perform RTM_NEWROUTE and RTM_DELROUTE under RCU, and then we want to perform some validation out of the RCU scope. When creating / removing an IPv6 route with RTA_MULTIPATH, inet6_rtm_newroute() / inet6_rtm_delroute() validates RTA_GATEWAY in each multipath entry. Let's do that in rtm_to_fib6_config(). Note that now RTM_DELROUTE returns an error for RTA_MULTIPATH with 0 entries, which was accepted but should result in -EINVAL as RTM_NEWROUTE. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250418000443.43734-2-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py 4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place") 7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17net: ipv6: ioam6: fix double reallocationJustin Iurman
If the dst_entry is the same post transformation (which is a valid use case for IOAM), we don't add it to the cache to avoid a reference loop. Instead, we use a "fake" dst_entry and add it to the cache as a signal. When we read the cache, we compare it with our "fake" dst_entry and therefore detect if we're in the special case. Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250415112554.23823-3-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17net: ipv6: ioam6: use consistent dst namesJustin Iurman
Be consistent and use the same terminology as other lwt users: orig_dst is the dst_entry before the transformation, while dst is either the dst_entry in the cache or the dst_entry after the transformation Signed-off-by: Justin Iurman <justin.iurman@uliege.be> Link: https://patch.msgid.link/20250415112554.23823-2-justin.iurman@uliege.be Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17ovpn: implement TCP transportAntonio Quartulli
With this change ovpn is allowed to communicate to peers also via TCP. Parsing of incoming messages is implemented through the strparser API. Note that ovpn redefines sk_prot and sk_socket->ops for the TCP socket used to communicate with the peer. For this reason it needs to access inet6_stream_ops, which is declared as extern in the IPv6 module, but it is not fully exported. Therefore this patch is also adding EXPORT_SYMBOL_GPL(inet6_stream_ops) to net/ipv6/af_inet6.c. Cc: David Ahern <dsahern@kernel.org> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Signed-off-by: Antonio Quartulli <antonio@openvpn.net> Link: https://patch.msgid.link/20250415-b4-ovpn-v26-11-577f6097b964@openvpn.net Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-04-17xfrm: Fix UDP GRO handling for some corner casesTobias Brunner
This fixes an issue that's caused if there is a mismatch between the data offset in the GRO header and the length fields in the regular sk_buff due to the pskb_pull()/skb_push() calls. That's because the UDP GRO layer stripped off the UDP header via skb_gro_pull() already while the UDP header was explicitly not pulled/pushed in this function. For example, an IKE packet that triggered this had len=data_len=1268 and the data_offset in the GRO header was 28 (IPv4 + UDP). So pskb_pull() was called with an offset of 28-8=20, which reduced len to 1248 and via pskb_may_pull() and __pskb_pull_tail() it also set data_len to 1248. As the ESP offload module was not loaded, the function bailed out and called skb_push(), which restored len to 1268, however, data_len remained at 1248. So while skb_headlen() was 0 before, it was now 20. The latter caused a difference of 8 instead of 28 (or 0 if pskb_pull()/skb_push() was called with the complete GRO data_offset) in gro_try_pull_from_frag0() that triggered a call to gro_pull_from_frag0() that corrupted the packet. This change uses a more GRO-like approach seen in other GRO receivers via skb_gro_header() to just read the actual data we are interested in and does not try to "restore" the UDP header at this point to call the existing function. If the offload module is not loaded, it immediately bails out, otherwise, it only does a quick check to see if the packet is an IKE or keepalive packet instead of calling the existing function. Fixes: 172bf009c18d ("xfrm: Support GRO for IPv4 ESP in UDP encapsulation") Fixes: 221ddb723d90 ("xfrm: Support GRO for IPv6 ESP in UDP encapsulation") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16ipv6: Use nlmsg_payload in route fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-3-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv6: Use nlmsg_payload in addrconf fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-2-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16ipv6: Use nlmsg_payload in addrlabel fileBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. This changes function ip6addrlbl_valid_get_req() and ip6addrlbl_valid_dump_req(). Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250415-nlmsg_v2-v1-1-a1c75d493fd7@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15ipv6: Use nlmsg_payload in inet6_rtm_valid_getaddr_reqBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-7-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15ipv6: Use nlmsg_payload in inet6_valid_dump_ifaddr_reqBreno Leitao
Leverage the new nlmsg_payload() helper to avoid checking for message size and then reading the nlmsg data. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20250414-nlmsg-v2-6-3d90cb42c6af@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-14ipv6: Convert tunnel devices' ->exit_batch_rtnl() to ->exit_rtnl().Kuniyuki Iwashima
The following functions iterates the dying netns list and performs the same operations for each netns. * ip6gre_exit_batch_rtnl() * ip6_tnl_exit_batch_rtnl() * vti6_exit_batch_rtnl() * sit_exit_batch_rtnl() Let's use ->exit_rtnl(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250411205258.63164-8-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>