summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-08-20ipv6: do not match device when remove source routeHangbin Liu
After deleting an IPv6 address on an interface and cleaning up the related preferred source entries, it is important to ensure that all routes associated with the deleted address are properly cleared. The current implementation of rt6_remove_prefsrc() only checks the preferred source addresses bound to the current device. However, there may be routes that are bound to other devices but still utilize the same preferred source address. To address this issue, it is necessary to also delete entries that are bound to other interfaces but share the same source address with the current device. Failure to delete these entries would leave routes that are bound to the deleted address unclear. Here is an example reproducer (I have omitted unrelated routes): + ip link add dummy1 type dummy + ip link add dummy2 type dummy + ip link set dummy1 up + ip link set dummy2 up + ip addr add 1:2:3:4::5/64 dev dummy1 + ip route add 7:7:7:0::1 dev dummy1 src 1:2:3:4::5 + ip route add 7:7:7:0::2 dev dummy2 src 1:2:3:4::5 + ip -6 route show 1:2:3:4::/64 dev dummy1 proto kernel metric 256 pref medium 7:7:7::1 dev dummy1 src 1:2:3:4::5 metric 1024 pref medium 7:7:7::2 dev dummy2 src 1:2:3:4::5 metric 1024 pref medium + ip addr del 1:2:3:4::5/64 dev dummy1 + ip -6 route show 7:7:7::1 dev dummy1 metric 1024 pref medium 7:7:7::2 dev dummy2 src 1:2:3:4::5 metric 1024 pref medium As Ido reminds, in IPv6, the preferred source address is looked up in the same VRF as the first nexthop device, which is different with IPv4. So, while removing the device checking, we also need to add an ipv6_chk_addr() check to make sure the address does not exist on the other devices of the rt nexthop device's VRF. After fix: + ip addr del 1:2:3:4::5/64 dev dummy1 + ip -6 route show 7:7:7::1 dev dummy1 metric 1024 pref medium 7:7:7::2 dev dummy2 metric 1024 pref medium Reported-by: Thomas Haller <thaller@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2170513 Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-20net: release reference to inet6_dev pointerPatrick Rohr
addrconf_prefix_rcv returned early without releasing the inet6_dev pointer when the PIO lifetime is less than accept_ra_min_lft. Fixes: 5027d54a9c30 ("net: change accept_ra_min_rtr_lft to affect all RA lifetimes") Cc: Maciej Żenczykowski <maze@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: David Ahern <dsahern@kernel.org> Cc: Simon Horman <horms@kernel.org> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Patrick Rohr <prohr@google.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-20net: selectively purge error queue in IP_RECVERR / IPV6_RECVERREric Dumazet
Setting IP_RECVERR and IPV6_RECVERR options to zero currently purges the socket error queue, which was probably not expected for zerocopy and tx_timestamp users. I discovered this issue while preparing commit 6b5f43ea0815 ("inet: move inet->recverr to inet->inet_flags"), I presume this change does not need to be backported to stable kernels. Add skb_errqueue_purge() helper to purge error messages only. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-20Merge commit b320441c04c9 ("Merge tag 'tty-6.5-rc7' of ↵Greg Kroah-Hartman
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty") into tty-next We need the serial-core fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-20ipv4: fix data-races around inet->inet_idEric Dumazet
UDP sendmsg() is lockless, so ip_select_ident_segs() can very well be run from multiple cpus [1] Convert inet->inet_id to an atomic_t, but implement a dedicated path for TCP, avoiding cost of a locked instruction (atomic_add_return()) Note that this patch will cause a trivial merge conflict because we added inet->flags in net-next tree. v2: added missing change in drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c (David Ahern) [1] BUG: KCSAN: data-race in __ip_make_skb / __ip_make_skb read-write to 0xffff888145af952a of 2 bytes by task 7803 on cpu 1: ip_select_ident_segs include/net/ip.h:542 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x844/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff888145af952a of 2 bytes by task 7804 on cpu 0: ip_select_ident_segs include/net/ip.h:541 [inline] ip_select_ident include/net/ip.h:556 [inline] __ip_make_skb+0x817/0xc70 net/ipv4/ip_output.c:1446 ip_make_skb+0x233/0x2c0 net/ipv4/ip_output.c:1560 udp_sendmsg+0x1199/0x1250 net/ipv4/udp.c:1260 inet_sendmsg+0x63/0x80 net/ipv4/af_inet.c:830 sock_sendmsg_nosec net/socket.c:725 [inline] sock_sendmsg net/socket.c:748 [inline] ____sys_sendmsg+0x37c/0x4d0 net/socket.c:2494 ___sys_sendmsg net/socket.c:2548 [inline] __sys_sendmmsg+0x269/0x500 net/socket.c:2634 __do_sys_sendmmsg net/socket.c:2663 [inline] __se_sys_sendmmsg net/socket.c:2660 [inline] __x64_sys_sendmmsg+0x57/0x60 net/socket.c:2660 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x184d -> 0x184e Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 7804 Comm: syz-executor.1 Not tainted 6.5.0-rc6-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 ================================================================== Fixes: 23f57406b82d ("ipv4: avoid using shared IP generator for connected sockets") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-20net: validate veth and vxcan peer ifindexesJakub Kicinski
veth and vxcan need to make sure the ifindexes of the peer are not negative, core does not validate this. Using iproute2 with user-space-level checking removed: Before: # ./ip link add index 10 type veth peer index -1 # ip link show 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 52:54:00:74:b2:03 brd ff:ff:ff:ff:ff:ff 10: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 8a:90:ff:57:6d:5d brd ff:ff:ff:ff:ff:ff -1: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether ae:ed:18:e6:fa:7f brd ff:ff:ff:ff:ff:ff Now: $ ./ip link add index 10 type veth peer index -1 Error: ifindex can't be negative. This problem surfaced in net-next because an explicit WARN() was added, the root cause is older. Fixes: e6f8f1a739b6 ("veth: Allow to create peer link with given ifindex") Fixes: a8f820a380a2 ("can: add Virtual CAN Tunnel driver (vxcan)") Reported-by: syzbot+5ba06978f34abb058571@syzkaller.appspotmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19net: add skb_queue_purge_reason and __skb_queue_purge_reasonEric Dumazet
skb_queue_purge() and __skb_queue_purge() become wrappers around the new generic functions. New SKB_DROP_REASON_QUEUE_PURGE drop reason is added, but users can start adding more specific reasons. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19xprtrdma: Remap Receive buffers after a reconnectChuck Lever
On server-initiated disconnect, rpcrdma_xprt_disconnect() was DMA- unmapping the Receive buffers, but rpcrdma_post_recvs() neglected to remap them after a new connection had been established. The result was immediate failure of the new connection with the Receives flushing with LOCAL_PROT_ERR. Fixes: 671c450b6fe0 ("xprtrdma: Fix oops in Receive handler after device removal") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2023-08-19net/smc: Extend SMCR v2 linkgroup netlink attributeGuangguan Wang
Add SMC_NLA_LGR_R_V2_MAX_CONNS and SMC_NLA_LGR_R_V2_MAX_LINKS to SMCR v2 linkgroup netlink attribute SMC_NLA_LGR_R_V2 for linkgroup's detail info showing. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19net/smc: support max links per lgr negotiation in clc handshakeGuangguan Wang
Support max links per lgr negotiation in clc handshake for SMCR v2.1, which is one of smc v2.1 features. Server makes decision for the final value of max links based on the client preferred max links and self-preferred max links. Here use the minimum value of the client preferred max links and server preferred max links. Client Server Proposal(max links(client preferred)) --------------------------------------> Accept(max links(accepted value)) accepted value=min(client preferred, server preferred) <------------------------------------- Confirm(max links(accepted value)) -------------------------------------> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19net/smc: support max connections per lgr negotiationGuangguan Wang
Support max connections per lgr negotiation for SMCR v2.1, which is one of smc v2.1 features. Server makes decision for the final value of max conns based on the client preferred max conns and self-preferred max conns. Here use the minimum value of client preferred max conns and server preferred max conns. Client Server Proposal(max conns(client preferred)) ------------------------------------> Accept(max conns(accepted value)) accepted value=min(client preferred, server preferred) <----------------------------------- Confirm(max conns(accepted value)) -----------------------------------> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19net/smc: support smc v2.x features validateGuangguan Wang
Support SMC v2.x features validate for SMC v2.1. This is the frame code for SMC v2.x features validate, and will take effects only when the negotiated release version is v2.1 or later. For Server, v2.x features' validation should be done in smc_clc_srv_ v2x_features_validate when receiving v2.1 or later CLC Proposal Message, such as max conns, max links negotiation, the decision of the final value of max conns and max links should be made in this function. And final check for server when receiving v2.1 or later CLC Confirm Message should be done in smc_clc_v2x_features_confirm_check. For client, v2.x features' validation should be done in smc_clc_clnt_ v2x_features_validate when receiving v2.1 or later CLC Accept Message, for example, the decision to accpt the accepted value or to decline should be made in this function. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19net/smc: add vendor unique experimental options area in clc handshakeGuangguan Wang
Add vendor unique experimental options area in clc handshake. In clc accept and confirm msg, vendor unique experimental options use the 16-Bytes reserved field, which defined in struct smc_clc_fce_gid_ext in previous version. Because of the struct smc_clc_first_contact_ext is widely used and limit the scope of modification, this patch moves the 16-Bytes reserved field out of struct smc_clc_fce_gid_ext, and followed with the struct smc_clc_first_contact_ext in a new struct names struct smc_clc_first_contact_ext_v2x. For SMC-R first connection, in previous version, the struct smc_clc_ first_contact_ext and the 16-Bytes reserved field has already been included in clc accept and confirm msg. Thus, this patch use struct smc_clc_first_contact_ext_v2x instead of the struct smc_clc_first_ contact_ext and the 16-Bytes reserved field in SMC-R clc accept and confirm msg is compatible with previous version. For SMC-D first connection, in previous version, only the struct smc_ clc_first_contact_ext is included in clc accept and confirm msg, and the 16-Bytes reserved field is not included. Thus, when the negotiated smc release version is the version before v2.1, we still use struct smc_clc_first_contact_ext for compatible consideration. If the negotiated smc release version is v2.1 or later, use struct smc_clc_first_contact_ ext_v2x instead. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-19net/smc: support smc release version negotiation in clc handshakeGuangguan Wang
Support smc release version negotiation in clc handshake based on SMC v2, where no negotiation process for different releases, but for different versions. The latest smc release version was updated to v2.1. And currently there are two release versions of SMCv2, v2.0 and v2.1. In the release version negotiation, client sends the preferred release version by CLC Proposal Message, server makes decision for which release version to use based on the client preferred release version and self-supported release version (here choose the minimum release version of the client preferred and server latest supported), then the decision returns to client by CLC Accept Message. Client confirms the decision by CLC Confirm Message. Client Server Proposal(preferred release version) ------------------------------------> Accept(accpeted release version) min(client preferred, server latest supported) <------------------------------------ Confirm(accpeted release version) ------------------------------------> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-18dccp: annotate data-races in dccp_poll()Eric Dumazet
We changed tcp_poll() over time, bug never updated dccp. Note that we also could remove dccp instead of maintaining it. Fixes: 7c657876b63c ("[DCCP]: Initial implementation") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230818015820.2701595-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18tcp: refine skb->ooo_okay settingEric Dumazet
Enabling BIG TCP on a low end platform apparently increased chances of getting flows locked on one busy TX queue. A similar problem was handled in commit 9b462d02d6dd ("tcp: TCP Small Queues and strange attractors"), but the strategy worked for either bulk flows, or 'large enough' RPC. BIG TCP changed how large RPC needed to be to enable the work around: If RPC fits in a single skb, TSQ never triggers. Root cause for the problem is a busy TX queue, with delayed TX completions. This patch changes how we set skb->ooo_okay to detect the case TX completion was not done, but incoming ACK already was processed and emptied rtx queue. Update the comment to explain the tricky details. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230817182353.2523746-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18sock: annotate data-races around prot->memory_pressureEric Dumazet
*prot->memory_pressure is read/writen locklessly, we need to add proper annotations. A recent commit added a new race, it is time to audit all accesses. Fixes: 2d0c88e84e48 ("sock: Fix misuse of sk_under_memory_pressure()") Fixes: 4d93df0abd50 ("[SCTP]: Rewrite of sctp buffer management code") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Abel Wu <wuyun.abel@bytedance.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20230818015132.2699348-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18devlink: add missing unregister linecard notificationJiri Pirko
Cited fixes commit introduced linecard notifications for register, however it didn't add them for unregister. Fix that by adding them. Fixes: c246f9b5fd61 ("devlink: add support to create line card and expose to user") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230817125240.2144794-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18Merge tag 'batadv-next-pullrequest-20230816' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== This cleanup patchset includes the following patches: - bump version strings, by Simon Wunderlich - Remove unused declarations, by Yue Haibing - Clean up MTU handling, by Sven Eckelmann (2 patches) - Clean up/remove (obsolete) functions, by Sven Eckelmann (3 patches) * tag 'batadv-next-pullrequest-20230816' of git://git.open-mesh.org/linux-merge: batman-adv: Drop per algo GW section class code batman-adv: Keep batadv_netlink_notify_* static batman-adv: Drop unused function batadv_gw_bandwidth_set batman-adv: Check hardif MTU against runtime MTU batman-adv: Avoid magic value for minimum MTU batman-adv: Remove unused declarations batman-adv: Start new development cycle ==================== Link: https://lore.kernel.org/r/20230816164000.190884-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18Merge tag 'batadv-net-pullrequest-20230816' of ↵Jakub Kicinski
git://git.open-mesh.org/linux-merge Simon Wunderlich says: ==================== Here are some batman-adv bugfixes: - Fix issues with adjusted MTUs (2 patches), by Sven Eckelmann - Fix header access for memory reallocation case, by Remi Pommarel - Fix two memory leaks (2 patches), by Remi Pommarel * tag 'batadv-net-pullrequest-20230816' of git://git.open-mesh.org/linux-merge: batman-adv: Fix batadv_v_ogm_aggr_send memory leak batman-adv: Fix TT global entry leak when client roamed back batman-adv: Do not get eth header before batadv_check_management_packet batman-adv: Don't increase MTU when set by user batman-adv: Trigger events for auto adjusted MTU ==================== Link: https://lore.kernel.org/r/20230816163318.189996-1-sw@simonwunderlich.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cacheJesper Dangaard Brouer
Since v6.5-rc1 MM-tree is merged and contains a new flag SLAB_NO_MERGE in commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache flag SLAB_NO_MERGE") now is the time to use this flag for networking as proposed earlier see link. The SKB (sk_buff) kmem_cache slab is critical for network performance. Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain performance by amortising the alloc/free cost. For the bulk API to perform efficiently the slub fragmentation need to be low. Especially for the SLUB allocator, the efficiency of bulk free API depend on objects belonging to the same slab (page). When running different network performance microbenchmarks, I started to notice that performance was reduced (slightly) when machines had longer uptimes. I believe the cause was 'skbuff_head_cache' got aliased/merged into the general slub for 256 bytes sized objects (with my kernel config, without CONFIG_HARDENED_USERCOPY). For SKB kmem_cache network stack have other various reasons for not merging, but it varies depending on kernel config (e.g. CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE for this kmem_cache to get most out of kmem_cache_{alloc,free}_bulk APIs. When CONFIG_SLUB_TINY is configured the bulk APIs are essentially disabled. Thus, for this case drop the SLAB_NO_MERGE flag. Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/ Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/169211265663.1491038.8580163757548985946.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/sfc/tc.c fa165e194997 ("sfc: don't unregister flow_indr if it was never registered") 3bf969e88ada ("sfc: add MAE table machinery for conntrack table") https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-18mm: allow per-VMA locks on file-backed VMAsMatthew Wilcox (Oracle)
Remove the TCP layering violation by allowing per-VMA locks on all VMAs. The fault path will immediately fail in handle_mm_fault(). There may be a small performance reduction from this patch as a little unnecessary work will be done on each page fault. See later patches for the improvement. Link: https://lkml.kernel.org/r/20230724185410.1124082-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18lwt: Check LWTUNNEL_XMIT_CONTINUE strictlyYan Zhai
LWTUNNEL_XMIT_CONTINUE is implicitly assumed in ip(6)_finish_output2, such that any positive return value from a xmit hook could cause unexpected continue behavior, despite that related skb may have been freed. This could be error-prone for future xmit hook ops. One of the possible errors is to return statuses of dst_output directly. To make the code safer, redefine LWTUNNEL_XMIT_CONTINUE value to distinguish from dst_output statuses and check the continue condition explicitly. Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure") Suggested-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Yan Zhai <yan@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/96b939b85eda00e8df4f7c080f770970a4c5f698.1692326837.git.yan@cloudflare.com
2023-08-18lwt: Fix return values of BPF xmit opsYan Zhai
BPF encap ops can return different types of positive values, such like NET_RX_DROP, NET_XMIT_CN, NETDEV_TX_BUSY, and so on, from function skb_do_redirect and bpf_lwt_xmit_reroute. At the xmit hook, such return values would be treated implicitly as LWTUNNEL_XMIT_CONTINUE in ip(6)_finish_output2. When this happens, skbs that have been freed would continue to the neighbor subsystem, causing use-after-free bug and kernel crashes. To fix the incorrect behavior, skb_do_redirect return values can be simply discarded, the same as tc-egress behavior. On the other hand, bpf_lwt_xmit_reroute returns useful errors to local senders, e.g. PMTU information. Thus convert its return values to avoid the conflict with LWTUNNEL_XMIT_CONTINUE. Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure") Reported-by: Jordan Griege <jgriege@cloudflare.com> Suggested-by: Martin KaFai Lau <martin.lau@linux.dev> Suggested-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/0d2b878186cfe215fec6b45769c1cd0591d3628d.1692326837.git.yan@cloudflare.com
2023-08-18Merge tag 'net-6.5-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from ipsec and netfilter. No known outstanding regressions. Fixes to fixes: - virtio-net: set queues after driver_ok, avoid a potential race added by recent fix - Revert "vlan: Fix VLAN 0 memory leak", it may lead to a warning when VLAN 0 is registered explicitly - nf_tables: - fix false-positive lockdep splat in recent fixes - don't fail inserts if duplicate has expired (fix test failures) - fix races between garbage collection and netns dismantle Current release - new code bugs: - mlx5: Fix mlx5_cmd_update_root_ft() error flow Previous releases - regressions: - phy: fix IRQ-based wake-on-lan over hibernate / power off Previous releases - always broken: - sock: fix misuse of sk_under_memory_pressure() preventing system from exiting global TCP memory pressure if a single cgroup is under pressure - fix the RTO timer retransmitting skb every 1ms if linear option is enabled - af_key: fix sadb_x_filter validation, amment netlink policy - ipsec: fix slab-use-after-free in decode_session6() - macb: in ZynqMP resume always configure PS GTR for non-wakeup source Misc: - netfilter: set default timeout to 3 secs for sctp shutdown send and recv state (from 300ms), align with protocol timers" * tag 'net-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits) ice: Block switchdev mode when ADQ is active and vice versa qede: fix firmware halt over suspend and resume net: do not allow gso_size to be set to GSO_BY_FRAGS sock: Fix misuse of sk_under_memory_pressure() sfc: don't fail probe if MAE/TC setup fails sfc: don't unregister flow_indr if it was never registered net: dsa: mv88e6xxx: Wait for EEPROM done before HW reset net/mlx5: Fix mlx5_cmd_update_root_ft() error flow net/mlx5e: XDP, Fix fifo overrun on XDP_REDIRECT i40e: fix misleading debug logs iavf: fix FDIR rule fields masks validation ipv6: fix indentation of a config attribute mailmap: add entries for Simon Horman broadcom: b44: Use b44_writephy() return value net: openvswitch: reject negative ifindex team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves net: phy: broadcom: stub c45 read/write for 54810 netfilter: nft_dynset: disallow object maps netfilter: nf_tables: GC transaction race with netns dismantle netfilter: nf_tables: fix GC transaction races with netns and netlink event exit path ...
2023-08-17netem: use seeded PRNG for correlated loss eventsFrançois Michel
Use prandom_u32_state() instead of get_random_u32() to generate the correlated loss events of netem. Signed-off-by: François Michel <francois.michel@uclouvain.be> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20230815092348.1449179-4-francois.michel@uclouvain.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17netem: use a seeded PRNG for generating random lossesFrançois Michel
Use prandom_u32_state() instead of get_random_u32() to generate the random loss events of netem. The state of the prng is part of the prng attribute of struct netem_sched_data. Signed-off-by: François Michel <francois.michel@uclouvain.be> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20230815092348.1449179-3-francois.michel@uclouvain.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17netem: add prng attribute to netem_sched_dataFrançois Michel
Add prng attribute to struct netem_sched_data and allows setting the seed of the PRNG through netlink using the new TCA_NETEM_PRNG_SEED attribute. The PRNG attribute is not actually used yet. Signed-off-by: François Michel <francois.michel@uclouvain.be> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/r/20230815092348.1449179-2-francois.michel@uclouvain.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-17sock: Fix misuse of sk_under_memory_pressure()Abel Wu
The status of global socket memory pressure is updated when: a) __sk_mem_raise_allocated(): enter: sk_memory_allocated(sk) > sysctl_mem[1] leave: sk_memory_allocated(sk) <= sysctl_mem[0] b) __sk_mem_reduce_allocated(): leave: sk_under_memory_pressure(sk) && sk_memory_allocated(sk) < sysctl_mem[0] So the conditions of leaving global pressure are inconstant, which may lead to the situation that one pressured net-memcg prevents the global pressure from being cleared when there is indeed no global pressure, thus the global constrains are still in effect unexpectedly on the other sockets. This patch fixes this by ignoring the net-memcg's pressure when deciding whether should leave global memory pressure. Fixes: e1aab161e013 ("socket: initial cgroup code.") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20230816091226.1542-1-wuyun.abel@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-08-16 We've added 17 non-merge commits during the last 6 day(s) which contain a total of 20 files changed, 1179 insertions(+), 37 deletions(-). The main changes are: 1) Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy applications, from Geliang Tang. 2) Follow-up/fallout fix from the SO_REUSEPORT + bpf_sk_assign work to fix a splat on non-fullsock sks in inet[6]_steal_sock, from Lorenz Bauer. 3) Improvements to struct_ops links to avoid forcing presence of update/validate callbacks. Also add bpf_struct_ops fields documentation, from David Vernet. 4) Ensure libbpf sets close-on-exec flag on gzopen, from Marco Vedovati. 5) Several new tcx selftest additions and bpftool link show support for tcx and xdp links, from Daniel Borkmann. 6) Fix a smatch warning on uninitialized symbol in bpf_perf_link_fill_kprobe, from Yafang Shao. 7) BPF selftest fixes e.g. misplaced break in kfunc_call test, from Yipeng Zou. 8) Small cleanup to remove unused declaration bpf_link_new_file, from Yue Haibing. 9) Small typo fix to bpftool's perf help message, from Daniel T. Lee. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Add mptcpify test selftests/bpf: Fix error checks of mptcp open_and_load selftests/bpf: Add two mptcp netns helpers bpf: Add update_socket_protocol hook bpftool: Implement link show support for xdp bpftool: Implement link show support for tcx selftests/bpf: Add selftest for fill_link_info bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe() net: Fix slab-out-of-bounds in inet[6]_steal_sock bpf: Document struct bpf_struct_ops fields bpf: Support default .validate() and .update() behavior for struct_ops links selftests/bpf: Add various more tcx test cases selftests/bpf: Clean up fmod_ret in bench_rename test script selftests/bpf: Fix repeat option when kfunc_call verification fails libbpf: Set close-on-exec flag on gzopen bpftool: fix perf help message bpf: Remove unused declaration bpf_link_new_file() ==================== Link: https://lore.kernel.org/r/20230816212840.1539-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16bpf: Add update_socket_protocol hookGeliang Tang
Add a hook named update_socket_protocol in __sys_socket(), for bpf progs to attach to and update socket protocol. One user case is to force legacy TCP apps to create and use MPTCP sockets instead of TCP ones. Define a fmod_ret set named bpf_mptcp_fmodret_ids, add the hook update_socket_protocol into this set, and register it in bpf_mptcp_kfunc_init(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79 Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/ac84be00f97072a46f8a72b4e2be46cbb7fa5053.1692147782.git.geliang.tang@suse.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-16net/ipv6: Remove expired routes with a separated list of routes.Kui-Feng Lee
FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree can be expensive if the number of routes in a table is big, even if most of them are permanent. Checking routes in a separated list of routes having expiration will avoid this potential issue. Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge tag 'nf-23-08-16' of ↵David S. Miller
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florisn Westphal says: ==================== These are netfilter fixes for the *net* tree. First patch resolves a false-positive lockdep splat: rcu_dereference is used outside of rcu read lock. Let lockdep validate that the transaction mutex is locked. Second patch fixes a kdoc warning added in previous PR. Third patch fixes a memory leak: The catchall element isn't disabled correctly, this allows userspace to deactivate the element again. This results in refcount underflow which in turn prevents memory release. This was always broken since the feature was added in 5.13. Patch 4 fixes an incorrect change in the previous pull request: Adding a duplicate key to a set should work if the duplicate key has expired, restore this behaviour. All from myself. Patch #5 resolves an old historic artifact in sctp conntrack: a 300ms timeout for shutdown_ack. Increase this to 3s. From Xin Long. Patch #6 fixes a sysctl data race in ipvs, two threads can clobber the sysctl value, from Sishuai Gong. This is a day-0 bug that predates git history. Patches 7, 8 and 9, from Pablo Neira Ayuso, are also followups for the previous GC rework in nf_tables: The netlink notifier and the netns exit path must both increment the gc worker seqcount, else worker may encounter stale (free'd) pointers. ================ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_MINTTLEric Dumazet
inet->min_ttl is already read with READ_ONCE(). Implementing IP_MINTTL socket option set/read without holding the socket lock is easy. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: implement lockless IP_TTLEric Dumazet
ip_select_ttl() is racy, because it reads inet->uc_ttl without proper locking. Add READ_ONCE()/WRITE_ONCE() annotations while allowing IP_TTL socket option to be set/read without holding the socket lock. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->defer_connect to inet->inet_flagsEric Dumazet
Make room in struct inet_sock by removing this bit field, using one available bit in inet_flags instead. Also move local_port_range to fill the resulting hole, saving 8 bytes on 64bit arches. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->bind_address_no_port to inet->inet_flagsEric Dumazet
IP_BIND_ADDRESS_NO_PORT socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->nodefrag to inet->inet_flagsEric Dumazet
IP_NODEFRAG socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->is_icsk to inet->inet_flagsEric Dumazet
We move single bit fields to inet->inet_flags to avoid races. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->transparent to inet->inet_flagsEric Dumazet
IP_TRANSPARENT socket option can now be set/read without locking the socket. v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent() v4: rebased after commit 3f326a821b99 ("mptcp: change the mpc check helper to return a sk") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->mc_all to inet->inet_fragsEric Dumazet
IP_MULTICAST_ALL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->mc_loop to inet->inet_fragsEric Dumazet
IP_MULTICAST_LOOP socket option can now be set/read without locking the socket. v3: fix build bot error reported in ipvs set_mcast_loop() Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->hdrincl to inet->inet_flagsEric Dumazet
IP_HDRINCL socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->freebind to inet->inet_flagsEric Dumazet
IP_FREEBIND socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->recverr_rfc4884 to inet->inet_flagsEric Dumazet
IP_RECVERR_RFC4884 socket option can now be set/read without locking the socket. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: move inet->recverr to inet->inet_flagsEric Dumazet
IP_RECVERR socket option can now be set/get without locking the socket. This patch potentially avoid data-races around inet->recverr. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: set/get simple options locklesslyEric Dumazet
Now we have inet->inet_flags, we can set following options without having to hold the socket lock: IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS, IP_PASSSEC, IP_RECVORIGDSTADDR, IP_RECVFRAGSIZE. ip_sock_set_pktinfo() no longer hold the socket lock. Similarly we can get the following options whithout holding the socket lock: IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, IP_RECVOPTS, IP_RETOPTS, IP_PASSSEC, IP_RECVORIGDSTADDR, IP_CHECKSUM, IP_RECVFRAGSIZE. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16inet: introduce inet->inet_flagsEric Dumazet
Various inet fields are currently racy. do_ip_setsockopt() and do_ip_getsockopt() are mostly holding the socket lock, but some (fast) paths do not. Use a new inet->inet_flags to hold atomic bits in the series. Remove inet->cmsg_flags, and use instead 9 bits from inet_flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16ipv6: fix indentation of a config attributePrasad Pandit
Fix indentation of a type attribute of IPV6_VTI config entry. Signed-off-by: Prasad Pandit <pjp@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>