path: root/net
Age  Commit message  Author
2024-01-31  netfilter: conntrack: check SCTP_CID_SHUTDOWN_ACK for vtag setting in sctp_new  Xin Long
The annotation in sctp_new() says: "If it is a shutdown ack OOTB packet, we expect a return shutdown complete, otherwise an ABORT Sec 8.4 (5) and (8)". However, it does not check SCTP_CID_SHUTDOWN_ACK before setting vtag[REPLY] in the conntrack entry (ct).

Because of that, if the ct in Router disappears for some reason at [1] with a packet sequence like the one below:

  Client > Server: sctp (1) [INIT] [init tag: 3201533963]
  Server > Client: sctp (1) [INIT ACK] [init tag: 972498433]
  Client > Server: sctp (1) [COOKIE ECHO]
  Server > Client: sctp (1) [COOKIE ACK]
  Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057809]
  Server > Client: sctp (1) [SACK] [cum ack 3075057809]
  Server > Client: sctp (1) [HB REQ]
  (the ct in Router disappears somehow)  <-------- [1]
  Client > Server: sctp (1) [HB ACK]
  Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057810]
  Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057810]
  Client > Server: sctp (1) [HB REQ]
  Client > Server: sctp (1) [DATA] (B)(E) [TSN: 3075057810]
  Client > Server: sctp (1) [HB REQ]
  Client > Server: sctp (1) [ABORT]

then, when processing the HB ACK packet, Router calls sctp_new() to initialize the new ct with vtag[REPLY] set to the HB ACK packet's vtag. Later, when Client sends DATA, all the SACKs from Server get dropped in Router, as the SACK packet's vtag does not match vtag[REPLY] in the ct. Worst of all, the vtag in this ct will never get fixed by the upcoming packets from Server.

This patch fixes it by checking SCTP_CID_SHUTDOWN_ACK before setting vtag[REPLY] in the ct in sctp_new(), as the annotation says. With this fix, vtag[REPLY] in the ct is left at 0 in the case above, and the next HB REQ/ACK from Server is able to fix the vtag, as its value is 0 in nf_conntrack_sctp_packet().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
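The vtag handling described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not the kernel code: `toy_ct`, `toy_ct_new()` and `toy_ct_packet()` are hypothetical names, and the rule that only heartbeat chunks may fill in a zero vtag is a simplification of what the commit message says about nf_conntrack_sctp_packet(). The chunk type numbers are the ones assigned in RFC 4960.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Chunk type values from RFC 4960. */
enum { CID_SACK = 3, CID_HEARTBEAT = 4, CID_HEARTBEAT_ACK = 5, CID_SHUTDOWN_ACK = 8 };

enum { DIR_ORIGINAL = 0, DIR_REPLY = 1 };

/* Hypothetical stand-in for the SCTP state kept in a conntrack entry. */
struct toy_ct {
	uint32_t vtag[2];
};

/*
 * Model of creating an entry from a packet that does not start an
 * association: the packet's vtag is recorded for the reply direction
 * only for a SHUTDOWN ACK; for any other chunk it stays 0.
 */
void toy_ct_new(struct toy_ct *ct, uint8_t chunk_type, uint32_t pkt_vtag)
{
	ct->vtag[DIR_ORIGINAL] = 0;
	ct->vtag[DIR_REPLY] = (chunk_type == CID_SHUTDOWN_ACK) ? pkt_vtag : 0;
}

/*
 * Model of the per-packet check (assumption: only heartbeat chunks may
 * fill in a vtag that is still 0; every other packet must match).
 */
bool toy_ct_packet(struct toy_ct *ct, int dir, uint8_t chunk_type, uint32_t pkt_vtag)
{
	bool is_hb = chunk_type == CID_HEARTBEAT || chunk_type == CID_HEARTBEAT_ACK;

	assert(dir == DIR_ORIGINAL || dir == DIR_REPLY);
	if (is_hb && ct->vtag[dir] == 0) {
		ct->vtag[dir] = pkt_vtag;
		return true;
	}
	return ct->vtag[dir] == pkt_vtag;
}
```

In this model, an entry created from the HB ACK keeps vtag[REPLY] at 0, so a later heartbeat from Server can set it and the SACKs that follow are accepted.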
2024-01-31  netfilter: nf_tables: restrict tunnel object to NFPROTO_NETDEV  Pablo Neira Ayuso
Bail out when the tunnel dst template is used from a family other than netdev. Add the infrastructure to check for the family in objects. Fixes: af308b94a2a4 ("netfilter: nf_tables: add tunnel support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-31  netfilter: conntrack: correct window scaling with retransmitted SYN  Ryan Schaefer
Commit c7aab4f17021 ("netfilter: nf_conntrack_tcp: re-init for syn packets only") introduced a bug where SYNs in the ORIGINAL direction on a reused 5-tuple result in incorrect window scale negotiation. That commit merged the SYN re-initialization case with the simultaneous open or SYN retransmit cases. Merging this block added the logic in tcp_init_sender() that performs window scale negotiation to the retransmitted SYN case. Previously, this would only result in updating the sender's scale and flags. After the merge, the additional logic results in improperly clearing the scale in the ORIGINAL direction before any packets in the REPLY direction are received. This results in packets incorrectly being marked invalid for being out-of-window.

This can be reproduced with the following trace:

Packet Sequence:
> Flags [S], seq 1687765604, win 62727, options [.. wscale 7], length 0
> Flags [S], seq 1944817196, win 62727, options [.. wscale 7], length 0

In order to fix the issue, only evaluate window negotiation for packets in the REPLY direction. This was tested with simultaneous open, fast open, and the above reproduction.

Fixes: c7aab4f17021 ("netfilter: nf_conntrack_tcp: re-init for syn packets only")
Signed-off-by: Ryan Schaefer <ryanschf@amazon.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24  netfilter: nf_tables: validate NFPROTO_* family  Pablo Neira Ayuso
Several expressions explicitly refer to NF_INET_* hook definitions from expr->ops->validate, however, family is not validated. Bail out with EOPNOTSUPP in case they are used from unsupported families. Fixes: 0ca743a55991 ("netfilter: nf_tables: add compatibility layer for x_tables") Fixes: a3c90f7a2323 ("netfilter: nf_tables: flow offload expression") Fixes: 2fa841938c64 ("netfilter: nf_tables: introduce routing expression") Fixes: 554ced0a6e29 ("netfilter: nf_tables: add support for native socket matching") Fixes: ad49d86e07a4 ("netfilter: nf_tables: Add synproxy support") Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Fixes: 6c47260250fc ("netfilter: nf_tables: add xfrm expression") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24  netfilter: nf_tables: reject QUEUE/DROP verdict parameters  Florian Westphal
This reverts commit e0abdadcc6e1.

core.c:nf_hook_slow assumes that the upper 16 bits of NF_DROP verdicts contain a valid errno, i.e. -EPERM, -EHOSTUNREACH or similar, or 0. Due to the reverted commit, it's possible to provide a positive value, e.g. NF_ACCEPT (1), which results in use-after-free.

It's not clear to me why this commit was made. NF_QUEUE is not used by nftables; "queue" rules in nftables will result in use of the "nft_queue" expression.

If we later need to allow specifying errno values from userspace (do not know why), this has to call NF_DROP_GETERR and check that "err <= 0" holds true.

Fixes: e0abdadcc6e1 ("netfilter: nf_tables: accept QUEUE/DROP verdict parameters")
Cc: stable@vger.kernel.org
Reported-by: Notselwyn <notselwyn@pwning.tech>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
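To see how a crafted verdict can decode to a positive value, here is a small userspace model. It is a sketch, not the kernel implementation: all `toy_` names are hypothetical, the constants mirror the netfilter UAPI values (NF_DROP 0, NF_ACCEPT 1, NF_QUEUE 3, parameters in the upper 16 bits), and the decode helper assumes the upper half is interpreted as a signed quantity and negated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NF_DROP      0u
#define TOY_NF_ACCEPT    1u
#define TOY_NF_QUEUE     3u
#define TOY_VERDICT_MASK 0x000000ffu

/*
 * Decode the upper 16 bits as a signed value and negate it, written
 * without relying on implementation-defined shifts of negative numbers.
 */
int toy_drop_geterr(uint32_t verdict)
{
	uint32_t hi = verdict >> 16;
	int32_t s = hi >= 0x8000u ? (int32_t)hi - 0x10000 : (int32_t)hi;

	return -s;
}

/* Strict check: only a bare verdict code is accepted, no parameter bits. */
bool toy_verdict_code_ok(uint32_t code)
{
	switch (code) {
	case TOY_NF_ACCEPT:
	case TOY_NF_DROP:
	case TOY_NF_QUEUE:
		return true;
	default:
		return false;
	}
}

/*
 * Hypothetical relaxed check, following the "err <= 0" hint in the
 * commit message, in case errno parameters were ever allowed again.
 */
bool toy_drop_err_ok(uint32_t verdict)
{
	return (verdict & TOY_VERDICT_MASK) == TOY_NF_DROP &&
	       toy_drop_geterr(verdict) <= 0;
}
```

With these definitions, a drop verdict whose upper half is 0xffff decodes to 1, which is the NF_ACCEPT value, while an upper half of 1 decodes to -1.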
2024-01-24  netfilter: nf_tables: restrict anonymous set and map names to 16 bytes  Florian Westphal
nftables has two types of sets/maps: one where userspace defines the name, and anonymous sets/maps, where userspace defines a template name. For the latter, the kernel requires the presence of exactly one "%d". nftables uses "__set%d" and "__map%d" for this. The kernel expands the format specifier and replaces it with the smallest unused number.

As-is, userspace could define a template name that allows the set name to move past the 256-byte upper limit (post-expansion). I don't see how this could be a problem, but I would prefer that userspace cannot do this, so add a limit of 16 bytes for the '%d' template name. 16 bytes is the old total upper limit for set names that existed when nf_tables was merged initially.

Fixes: 387454901bd6 ("netfilter: nf_tables: Allow set names of up to 255 chars")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
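The two constraints on an anonymous template name (exactly one "%d", and a 16-byte limit) can be sketched as a standalone check. The function and macro names below are made up for illustration, and counting the terminating NUL inside the 16 bytes is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* 16-byte limit from the commit message; the macro name is hypothetical. */
#define TOY_ANON_NAME_MAX 16

bool toy_anon_name_ok(const char *name)
{
	const char *p;

	assert(name != NULL);

	/* Assumption: the template plus its terminating NUL must fit in 16 bytes. */
	if (strlen(name) >= TOY_ANON_NAME_MAX)
		return false;

	/* There must be a '%', and it must introduce "%d". */
	p = strchr(name, '%');
	if (p == NULL || p[1] != 'd')
		return false;

	/* No second '%' may follow the "%d". */
	return strchr(p + 2, '%') == NULL;
}
```

The templates quoted above, "__set%d" and "__map%d", pass this check, while a name without "%d" or a template of 16 or more characters does not.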
2024-01-24  netfilter: nft_limit: reject configurations that cause integer overflow  Florian Westphal
Reject bogus configs where the internal token counter wraps around. This only occurs with very large requests, such as 17gbyte/s. It's better to reject this than to have an incorrect ratelimit. Fixes: d2168e849ebf ("netfilter: nft_limit: add per-byte limiting") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
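A wrap-around of a 64-bit token counter can be detected before it happens by testing the operands against UINT64_MAX. The sketch below shows the idea in portable C; the token formula (rate times the unit expressed in nanoseconds) and all `toy_` names are assumptions made for illustration and are not taken from nft_limit itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000ull

/* Store a * b in *out and return true, or return false if it would wrap. */
bool toy_mul_u64(uint64_t a, uint64_t b, uint64_t *out)
{
	assert(out != NULL);
	if (a != 0 && b > UINT64_MAX / a)
		return false;
	*out = a * b;
	return true;
}

/*
 * Hypothetical token computation: rate * unit_seconds * NSEC_PER_SEC.
 * Returns false for configurations whose counter does not fit in 64 bits.
 */
bool toy_limit_tokens(uint64_t rate, uint64_t unit_seconds, uint64_t *tokens)
{
	uint64_t nsecs;

	if (rate == 0 || !toy_mul_u64(unit_seconds, TOY_NSEC_PER_SEC, &nsecs))
		return false;
	return toy_mul_u64(rate, nsecs, tokens);
}
```

Rejecting the configuration up front means the caller never sees a silently truncated token count.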
2024-01-24  netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain  Pablo Neira Ayuso
Remove netdevice from inet/ingress basechain in case NETDEV_UNREGISTER event is reported, otherwise a stale reference to netdevice remains in the hook list. Fixes: 60a3815da702 ("netfilter: add inet ingress support") Cc: stable@vger.kernel.org Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-24  net/sched: flower: Fix chain template offload  Ido Schimmel
When a qdisc is deleted from a net device the stack instructs the underlying driver to remove its flow offload callback from the associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack then continues to replay the removal of the filters in the block for this driver by iterating over the chains in the block and invoking the 'reoffload' operation of the classifier being used. In turn, the classifier in its 'reoffload' operation prepares and emits a 'FLOW_CLS_DESTROY' command for each filter. However, the stack does not do the same for chain templates and the underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when a qdisc is deleted. This results in a memory leak [1] which can be reproduced using [2]. Fix by introducing a 'tmplt_reoffload' operation and have the stack invoke it with the appropriate arguments as part of the replay. Implement the operation in the sole classifier that supports chain templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}' command based on whether a flow offload callback is being bound to a filter block or being unbound from one. As far as I can tell, the issue happens since cited commit which reordered tcf_block_offload_unbind() before tcf_block_flush_all_chains() in __tcf_block_put(). The order cannot be reversed as the filter block is expected to be freed after flushing all the chains. [1] unreferenced object 0xffff888107e28800 (size 2048): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff ..|......[...... 01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff ................ 
backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab374e>] __kmalloc+0x4e/0x90 [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0 [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0 [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0 unreferenced object 0xffff88816d2c0400 (size 1024): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00 @.......W.8..... 10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff ..,m......,m.... 
backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90 [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0 [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460 [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0 [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [2] # tc qdisc add dev swp1 clsact # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32 # tc qdisc del dev swp1 clsact # devlink dev reload pci/0000:06:00.0 Fixes: bbf73830cd48 ("net: sched: traverse chains in block with tcf_get_next_chain()") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-23  Merge tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless  Jakub Kicinski
Kalle Valo says:

====================
wireless fixes for v6.8-rc2

The most visible fix here is for an ath11k crash that was introduced in v6.7. We also have a fix for an iwlwifi memory corruption and a few smaller fixes in the stack.

* tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
  wifi: mac80211: fix race condition on enabling fast-xmit
  wifi: iwlwifi: fix a memory corruption
  wifi: mac80211: fix potential sta-link leak
  wifi: cfg80211/mac80211: remove dependency on non-existing option
  wifi: cfg80211: fix missing interfaces when dumping
  wifi: ath11k: rely on mac80211 debugfs handling for vif
  wifi: p54: fix GCC format truncation warning with wiphy->fw_version
====================

Link: https://lore.kernel.org/r/20240122153434.E0254C433C7@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-23  ipv6: init the accept_queue's spinlocks in inet6_create  Zhengchao Shao
In commit 198bc90e0e73("tcp: make sure init the accept_queue's spinlocks once"), the spinlocks of accept_queue are initialized only when socket is created in the inet4 scenario. The locks are not initialized when socket is created in the inet6 scenario. The kernel reports the following error: INFO: trying to register non-static key. The code is fine but needs lockdep annotation, or maybe you didn't initialize this object before use? turning off the locking correctness validator. Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: <TASK> dump_stack_lvl (lib/dump_stack.c:107) register_lock_class (kernel/locking/lockdep.c:1289) __lock_acquire (kernel/locking/lockdep.c:5015) lock_acquire.part.0 (kernel/locking/lockdep.c:5756) _raw_spin_lock_bh (kernel/locking/spinlock.c:178) inet_csk_listen_stop (net/ipv4/inet_connection_sock.c:1386) tcp_disconnect (net/ipv4/tcp.c:2981) inet_shutdown (net/ipv4/af_inet.c:935) __sys_shutdown (./include/linux/file.h:32 net/socket.c:2438) __x64_sys_shutdown (net/socket.c:2445) do_syscall_64 (arch/x86/entry/common.c:52) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) RIP: 0033:0x7f52ecd05a3d Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48 RSP: 002b:00007f52ecf5dde8 EFLAGS: 00000293 ORIG_RAX: 0000000000000030 RAX: ffffffffffffffda RBX: 00007f52ecf5e640 RCX: 00007f52ecd05a3d RDX: 00007f52ecc8b188 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00007f52ecf5de20 R08: 00007ffdae45c69f R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 00007f52ecf5e640 R13: 0000000000000000 R14: 00007f52ecc8b060 R15: 00007ffdae45c6e0 Fixes: 198bc90e0e73 ("tcp: make sure init the accept_queue's spinlocks once") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: 
https://lore.kernel.org/r/20240122102001.2851701-1-shaozhengchao@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23  netlink: fix potential sleeping issue in mqueue_flush_file  Zhengchao Shao
I analyzed the potential sleeping issue in the following processes:

Thread A                                Thread B
...                                     netlink_create  //ref = 1
do_mq_notify                            ...
  sock = netlink_getsockbyfilp          ...     //ref = 2
  info->notify_sock = sock;             ...
...                                     netlink_sendmsg
...                                       skb = netlink_alloc_large_skb  //skb->head is vmalloced
...                                       netlink_unicast
...                                         sk = netlink_getsockbyportid //ref = 3
...                                         netlink_sendskb
...                                           __netlink_sendskb
...                                             skb_queue_tail //put skb to sk_receive_queue
...                                         sock_put //ref = 2
...                                     ...
...                                     netlink_release
...                                       deferred_put_nlk_sk //ref = 1
mqueue_flush_file
  spin_lock
  remove_notification
    netlink_sendskb
      sock_put  //ref = 0
        sk_free
          ...
          __sk_destruct
            netlink_sock_destruct
              skb_queue_purge  //get skb from sk_receive_queue
                ...
                __skb_queue_purge_reason
                  kfree_skb_reason
                    __kfree_skb
                    ...
                    skb_release_all
                      skb_release_head_state
                        netlink_skb_destructor
                          vfree(skb->head)  //sleeping while holding spinlock

In netlink_sendmsg, if the memory pointed to by skb->head is allocated by vmalloc and the skb is put on sk_receive_queue without being freed, the sleeping bug occurs when the mqueue executes flush. Use vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.

Fixes: c05cdb1b864f ("netlink: allow large data transfers from user-space")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20240122011807.2110357-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-23  tcp: Add memory barrier to tcp_push()  Salvatore Dipietro
On CPUs with weak memory models, reads and updates performed by tcp_push to the sk variables can get reordered, leaving the socket throttled when it should not be. The tasklet running tcp_wfree() may also not observe the memory updates in time and will skip flushing any packets throttled by tcp_push(), delaying the sending. This can pathologically cause 40ms of extra latency due to bad interactions with delayed acks.

Adding a memory barrier in tcp_push removes the bug, similarly to the previous commit bf06200e732d ("tcp: tsq: fix nonagle handling"). smp_mb__after_atomic() is used so as not to incur unnecessary overhead on x86, which is not affected.

The patch has been tested using an AWS c7g.2xlarge instance with Ubuntu 22.04 and Apache Tomcat 9.0.83 running the basic servlet below:

import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloWorldServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
        response.setContentType("text/html;charset=utf-8");
        OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
        String s = "a".repeat(3096);
        osw.write(s,0,s.length());
        osw.flush();
    }
}

Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS c6i.8xlarge instance. Before the patch, an additional 40ms of latency is observed from the P99.99 values upward; with the patch, the extra latency disappears.

No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.91ms
 75.000%    1.13ms
 90.000%    1.46ms
 99.000%    1.74ms
 99.900%    1.89ms
 99.990%   41.95ms  <<< 40+ ms extra latency
 99.999%   48.32ms
100.000%   48.96ms

With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
  ...
 50.000%    0.90ms
 75.000%    1.13ms
 90.000%    1.45ms
 99.000%    1.72ms
 99.900%    1.83ms
 99.990%    2.11ms  <<< no 40+ ms extra latency
 99.999%    2.53ms
100.000%    2.62ms

The patch has also been tested on x86 (m7i.2xlarge instance), which is not affected by this issue, and the patch doesn't introduce any additional delay.

Fixes: 7aa5470c2c09 ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
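The ordering problem in the tcp_push() commit above follows the classic "set a flag, then read a counter" pattern. The sketch below expresses it with C11 atomics in userspace. It is an analogy under stated assumptions, not the kernel code: `toy_sock`, `toy_try_cork()` and `TOY_THROTTLED` are hypothetical, and a sequentially consistent fence stands in for the kernel's smp_mb__after_atomic().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_THROTTLED 0x1u

/* Hypothetical stand-in for the two socket fields involved. */
struct toy_sock {
	atomic_uint flags;       /* plays the role of the throttle flag word */
	atomic_uint wmem_alloc;  /* bytes still owned by the transmit path */
};

/*
 * Returns true when the caller may leave the data corked.
 *
 * The fence after setting the flag orders the flag store before the
 * counter load. Provided the completion side uses a matching barrier
 * between updating the counter and reading the flag, at least one side
 * observes the other's write, so the data cannot stay corked with
 * nobody left to flush it.
 */
bool toy_try_cork(struct toy_sock *sk, unsigned int skb_truesize)
{
	unsigned int cur = atomic_load_explicit(&sk->flags, memory_order_relaxed);

	if (!(cur & TOY_THROTTLED)) {
		atomic_fetch_or_explicit(&sk->flags, TOY_THROTTLED,
					 memory_order_relaxed);
		atomic_thread_fence(memory_order_seq_cst);
	}

	/* Completion may already have run before the flag was set. */
	return atomic_load_explicit(&sk->wmem_alloc, memory_order_relaxed) >
	       skb_truesize;
}
```

On strongly ordered hardware such a fence is usually cheap or free after an atomic read-modify-write, which matches the commit's remark that x86 is not affected.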
2024-01-22  net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recv  Sharath Srinivasan
A syzkaller UBSAN crash occurs in rds_cmsg_recv(), which reads inc->i_rx_lat_trace[j + 1] with index 4 (3 + 1), while the array has size 4 (RDS_RX_MAX_TRACES). Here 'j' is assigned from rs->rs_rx_trace[i], which in turn comes from trace.rx_trace_pos[i] in rds_recv_track_latency(); both arrays are sized 3 (RDS_MSG_RX_DGRAM_TRACE_MAX). Fix the off-by-one bounds check in rds_recv_track_latency() to prevent a potential crash in rds_cmsg_recv().

Found by syzkaller:
=================================================================
UBSAN: array-index-out-of-bounds in net/rds/recv.c:585:39
index 4 is out of range for type 'u64 [4]'
CPU: 1 PID: 8058 Comm: syz-executor228 Not tainted 6.6.0-gd2f51b3516da #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106
 ubsan_epilogue lib/ubsan.c:217 [inline]
 __ubsan_handle_out_of_bounds+0xd5/0x130 lib/ubsan.c:348
 rds_cmsg_recv+0x60d/0x700 net/rds/recv.c:585
 rds_recvmsg+0x3fb/0x1610 net/rds/recv.c:716
 sock_recvmsg_nosec net/socket.c:1044 [inline]
 sock_recvmsg+0xe2/0x160 net/socket.c:1066
 __sys_recvfrom+0x1b6/0x2f0 net/socket.c:2246
 __do_sys_recvfrom net/socket.c:2264 [inline]
 __se_sys_recvfrom net/socket.c:2260 [inline]
 __x64_sys_recvfrom+0xe0/0x1b0 net/socket.c:2260
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x63/0x6b
==================================================================

Fixes: 3289025aedc0 ("RDS: add receive message trace used by application")
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
Closes: https://lore.kernel.org/linux-rdma/CALGdzuoVdq-wtQ4Az9iottBqC5cv9ZhcE5q8N7LfYFvkRsOVcw@mail.gmail.com/
Signed-off-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
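The off-by-one can be reproduced with plain arithmetic: a position of 3 passes a "greater than 3" test, yet reading element `pos + 1` then touches index 4 of a 4-element array. The sketch below contrasts a loose and a strict check. The `toy_` names are hypothetical, and the assumption that the original comparison used `>` is an illustrative guess; the commit message only says the check was off by one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_DGRAM_TRACE_MAX 3  /* size of the trace position arrays */
#define TOY_RX_MAX_TRACES   4  /* size of the latency array */

/* Assumed off-by-one variant: accepts pos == 3, so pos + 1 == 4 is read. */
bool toy_pos_ok_loose(uint8_t pos)
{
	return !(pos > TOY_DGRAM_TRACE_MAX);
}

/* Tightened variant: pos + 1 always stays below TOY_RX_MAX_TRACES. */
bool toy_pos_ok_strict(uint8_t pos)
{
	return !(pos >= TOY_DGRAM_TRACE_MAX);
}

/* Reads the latency slot that follows pos; only valid for strict positions. */
uint64_t toy_read_latency(const uint64_t lat[TOY_RX_MAX_TRACES], uint8_t pos)
{
	assert(toy_pos_ok_strict(pos));
	return lat[pos + 1];
}
```

With the strict check, the largest accepted position is 2, so the highest index read is 3, the last valid element.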
2024-01-22  net: fix removing a namespace with conflicting altnames  Jakub Kicinski
Mark reports a BUG() when a net namespace is removed. kernel BUG at net/core/dev.c:11520! Physical interfaces moved outside of init_net get "refunded" to init_net when that namespace disappears. The main interface name may get overwritten in the process if it would have conflicted. We need to also discard all conflicting altnames. Recent fixes addressed ensuring that altnames get moved with the main interface, which surfaced this problem. Reported-by: Марк Коренберг <socketpair@gmail.com> Link: https://lore.kernel.org/all/CAEmTpZFZ4Sv3KwqFOY2WKDHeZYdi0O7N5H1nTvcGp=SAEavtDg@mail.gmail.com/ Fixes: 7663d522099e ("net: check for altname conflicts when changing netdev's netns") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-21  udp: fix busy polling  Eric Dumazet
The generic sk_busy_loop_end() only looks at sk->sk_receive_queue for the presence of packets. The problem is that, for UDP sockets after the blamed commit, some packets could be present in another queue: udp_sk(sk)->reader_queue. In some cases, a busy poller could spin until timeout expiration, even if some packets are available in udp_sk(sk)->reader_queue.

v3: - make sk_busy_loop_end() nicer (Willem)

v2: - add a READ_ONCE(sk->sk_family) in sk_is_inet() to avoid KCSAN splats.
    - add a sk_is_inet() check in sk_is_udp() (Willem feedback)
    - add a sk_is_inet() check in sk_is_tcp().

Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
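The busy-polling condition can be illustrated with a two-queue toy socket. This is a hedged sketch, not the kernel's data structures: `toy_udp_sock` and both predicates are hypothetical, and queue lengths stand in for the real skb lists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue: only its length matters for this sketch. */
struct toy_queue {
	unsigned int qlen;
};

/* Hypothetical UDP socket with the two receive-side queues. */
struct toy_udp_sock {
	struct toy_queue receive_queue;
	struct toy_queue reader_queue;
};

/* Generic end-of-poll test: looks at the main receive queue only. */
bool toy_busy_loop_end_generic(const struct toy_udp_sock *sk)
{
	return sk->receive_queue.qlen != 0;
}

/* UDP-aware test: packets waiting in either queue end the poll. */
bool toy_busy_loop_end_udp(const struct toy_udp_sock *sk)
{
	return sk->receive_queue.qlen != 0 || sk->reader_queue.qlen != 0;
}
```

When packets sit only in the reader queue, the generic test keeps spinning while the UDP-aware test stops the poll immediately.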
2024-01-19  llc: Drop support for ETH_P_TR_802_2.  Kuniyuki Iwashima
syzbot reported an uninit-value bug below. [0] llc supports ETH_P_802_2 (0x0004) and used to support ETH_P_TR_802_2 (0x0011), and syzbot abused the latter to trigger the bug. write$tun(r0, &(0x7f0000000040)={@val={0x0, 0x11}, @val, @mpls={[], @llc={@snap={0xaa, 0x1, ')', "90e5dd"}}}}, 0x16) llc_conn_handler() initialises local variables {saddr,daddr}.mac based on skb in llc_pdu_decode_sa()/llc_pdu_decode_da() and passes them to __llc_lookup(). However, the initialisation is done only when skb->protocol is htons(ETH_P_802_2), otherwise, __llc_lookup_established() and __llc_lookup_listener() will read garbage. The missing initialisation existed prior to commit 211ed865108e ("net: delete all instances of special processing for token ring"). It removed the part to kick out the token ring stuff but forgot to close the door allowing ETH_P_TR_802_2 packets to sneak into llc_rcv(). Let's remove llc_tr_packet_type and complete the deprecation. [0]: BUG: KMSAN: uninit-value in __llc_lookup_established+0xe9d/0xf90 __llc_lookup_established+0xe9d/0xf90 __llc_lookup net/llc/llc_conn.c:611 [inline] llc_conn_handler+0x4bd/0x1360 net/llc/llc_conn.c:791 llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206 __netif_receive_skb_one_core net/core/dev.c:5527 [inline] __netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5641 netif_receive_skb_internal net/core/dev.c:5727 [inline] netif_receive_skb+0x58/0x660 net/core/dev.c:5786 tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555 tun_get_user+0x53af/0x66d0 drivers/net/tun.c:2002 tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048 call_write_iter include/linux/fs.h:2020 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x8ef/0x1490 fs/read_write.c:584 ksys_write+0x20f/0x4c0 fs/read_write.c:637 __do_sys_write fs/read_write.c:649 [inline] __se_sys_write fs/read_write.c:646 [inline] __x64_sys_write+0x93/0xd0 fs/read_write.c:646 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 
entry_SYSCALL_64_after_hwframe+0x63/0x6b Local variable daddr created at: llc_conn_handler+0x53/0x1360 net/llc/llc_conn.c:783 llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206 CPU: 1 PID: 5004 Comm: syz-executor994 Not tainted 6.6.0-syzkaller-14500-g1c41041124bd #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 Fixes: 211ed865108e ("net: delete all instances of special processing for token ring") Reported-by: syzbot+b5ad66046b913bc04c6f@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b5ad66046b913bc04c6f Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240119015515.61898-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-19  llc: make llc_ui_sendmsg() more robust against bonding changes  Eric Dumazet
syzbot was able to trick llc_ui_sendmsg(), allocating an skb with no headroom, but subsequently trying to push 14 bytes of Ethernet header [1] Like some others, llc_ui_sendmsg() releases the socket lock before calling sock_alloc_send_skb(). Then it acquires it again, but does not redo all the sanity checks that were performed. This fix: - Uses LL_RESERVED_SPACE() to reserve space. - Check all conditions again after socket lock is held again. - Do not account Ethernet header for mtu limitation. [1] skbuff: skb_under_panic: text:ffff800088baa334 len:1514 put:14 head:ffff0000c9c37000 data:ffff0000c9c36ff2 tail:0x5dc end:0x6c0 dev:bond0 kernel BUG at net/core/skbuff.c:193 ! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 6875 Comm: syz-executor.0 Not tainted 6.7.0-rc8-syzkaller-00101-g0802e17d9aca-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : skb_panic net/core/skbuff.c:189 [inline] pc : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203 lr : skb_panic net/core/skbuff.c:189 [inline] lr : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203 sp : ffff800096f97000 x29: ffff800096f97010 x28: ffff80008cc8d668 x27: dfff800000000000 x26: ffff0000cb970c90 x25: 00000000000005dc x24: ffff0000c9c36ff2 x23: ffff0000c9c37000 x22: 00000000000005ea x21: 00000000000006c0 x20: 000000000000000e x19: ffff800088baa334 x18: 1fffe000368261ce x17: ffff80008e4ed000 x16: ffff80008a8310f8 x15: 0000000000000001 x14: 1ffff00012df2d58 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000001 x10: 0000000000ff0100 x9 : e28a51f1087e8400 x8 : e28a51f1087e8400 x7 : ffff80008028f8d0 x6 : 0000000000000000 x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff800082b78714 x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000089 Call trace: skb_panic net/core/skbuff.c:189 [inline] skb_under_panic+0x13c/0x140 net/core/skbuff.c:203 
skb_push+0xf0/0x108 net/core/skbuff.c:2451 eth_header+0x44/0x1f8 net/ethernet/eth.c:83 dev_hard_header include/linux/netdevice.h:3188 [inline] llc_mac_hdr_init+0x110/0x17c net/llc/llc_output.c:33 llc_sap_action_send_xid_c+0x170/0x344 net/llc/llc_s_ac.c:85 llc_exec_sap_trans_actions net/llc/llc_sap.c:153 [inline] llc_sap_next_state net/llc/llc_sap.c:182 [inline] llc_sap_state_process+0x1ec/0x774 net/llc/llc_sap.c:209 llc_build_and_send_xid_pkt+0x12c/0x1c0 net/llc/llc_sap.c:270 llc_ui_sendmsg+0x7bc/0xb1c net/llc/af_llc.c:997 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg net/socket.c:745 [inline] sock_sendmsg+0x194/0x274 net/socket.c:767 splice_to_socket+0x7cc/0xd58 fs/splice.c:881 do_splice_from fs/splice.c:933 [inline] direct_splice_actor+0xe4/0x1c0 fs/splice.c:1142 splice_direct_to_actor+0x2a0/0x7e4 fs/splice.c:1088 do_splice_direct+0x20c/0x348 fs/splice.c:1194 do_sendfile+0x4bc/0xc70 fs/read_write.c:1254 __do_sys_sendfile64 fs/read_write.c:1322 [inline] __se_sys_sendfile64 fs/read_write.c:1308 [inline] __arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1308 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155 el0_svc+0x54/0x158 arch/arm64/kernel/entry-common.c:678 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595 Code: aa1803e6 aa1903e7 a90023f5 94792f6a (d4210000) Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-and-tested-by: syzbot+2a7024e9502df538e8ef@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240118183625.4007013-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-19  vlan: skip nested type that is not IFLA_VLAN_QOS_MAPPING  Lin Ma
In the vlan_changelink function, a loop is used to parse the nested attributes IFLA_VLAN_EGRESS_QOS and IFLA_VLAN_INGRESS_QOS in order to obtain the struct ifla_vlan_qos_mapping. These two nested attributes are checked in the vlan_validate_qos_map function, which calls nla_validate_nested_deprecated with the vlan_map_policy. However, this deprecated validator applies a LIBERAL strictness, allowing the presence of an attribute with the type IFLA_VLAN_QOS_UNSPEC. Consequently, the loop in vlan_changelink may parse an attribute of type IFLA_VLAN_QOS_UNSPEC and believe it carries a payload of struct ifla_vlan_qos_mapping, which is not necessarily true. To address this issue and ensure compatibility, this patch introduces two type checks that skip attributes whose type is not IFLA_VLAN_QOS_MAPPING. Fixes: 07b5b17e157b ("[VLAN]: Use rtnl_link API") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240118130306.1644001-1-linma@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
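The type check described in this commit amounts to filtering the nested attributes before interpreting their payload. The sketch below shows that filter on a simplified attribute array. All `toy_` names and the flat `toy_attr` layout are hypothetical (real netlink attributes are length-prefixed and walked with the kernel's nla helpers); the type values 0 and 1 mirror IFLA_VLAN_QOS_UNSPEC and IFLA_VLAN_QOS_MAPPING.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_QOS_UNSPEC = 0, TOY_QOS_MAPPING = 1 };

/* Hypothetical mirror of a from/to priority mapping. */
struct toy_mapping {
	uint32_t from;
	uint32_t to;
};

/* Hypothetical, simplified nested attribute: a type plus a payload pointer. */
struct toy_attr {
	uint16_t type;
	const struct toy_mapping *payload; /* meaningful only for TOY_QOS_MAPPING */
};

/*
 * Apply only the attributes that really carry a mapping and return how
 * many were applied. The "& 7" masking is an assumption of this sketch
 * to keep the index inside the 8-entry table.
 */
size_t toy_apply_mappings(const struct toy_attr *attrs, size_t n, uint32_t prio[8])
{
	size_t applied = 0;

	for (size_t i = 0; i < n; i++) {
		if (attrs[i].type != TOY_QOS_MAPPING)
			continue; /* skip UNSPEC and any other type */
		assert(attrs[i].payload != NULL);
		prio[attrs[i].payload->from & 7] = attrs[i].payload->to;
		applied++;
	}
	return applied;
}
```

Skipping, rather than rejecting, unexpected types keeps requests that older tools may already send working, which is the compatibility point the commit message makes.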
2024-01-19  tcp: make sure init the accept_queue's spinlocks once  Zhengchao Shao
When I run syz's reproduction C program locally, it causes the following issue: pvqspinlock: lock 0xffff9d181cd5c660 has corrupted value 0x0! WARNING: CPU: 19 PID: 21160 at __pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508) Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:__pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508) Code: 73 56 3a ff 90 c3 cc cc cc cc 8b 05 bb 1f 48 01 85 c0 74 05 c3 cc cc cc cc 8b 17 48 89 fe 48 c7 c7 30 20 ce 8f e8 ad 56 42 ff <0f> 0b c3 cc cc cc cc 0f 0b 0f 1f 40 00 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffa8d200604cb8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9d1ef60e0908 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d1ef60e0900 RBP: ffff9d181cd5c280 R08: 0000000000000000 R09: 00000000ffff7fff R10: ffffa8d200604b68 R11: ffffffff907dcdc8 R12: 0000000000000000 R13: ffff9d181cd5c660 R14: ffff9d1813a3f330 R15: 0000000000001000 FS: 00007fa110184640(0000) GS:ffff9d1ef60c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 000000011f65e000 CR4: 00000000000006f0 Call Trace: <IRQ> _raw_spin_unlock (kernel/locking/spinlock.c:186) inet_csk_reqsk_queue_add (net/ipv4/inet_connection_sock.c:1321) inet_csk_complete_hashdance (net/ipv4/inet_connection_sock.c:1358) tcp_check_req (net/ipv4/tcp_minisocks.c:868) tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2260) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205) ip_local_deliver_finish (net/ipv4/ip_input.c:234) __netif_receive_skb_one_core (net/core/dev.c:5529) process_backlog (./include/linux/rcupdate.h:779) __napi_poll (net/core/dev.c:6533) net_rx_action (net/core/dev.c:6604) __do_softirq (./arch/x86/include/asm/jump_label.h:27) do_softirq (kernel/softirq.c:454 kernel/softirq.c:441) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:381) __dev_queue_xmit (net/core/dev.c:4374) ip_finish_output2 (./include/net/neighbour.h:540 net/ipv4/ip_output.c:235) __ip_queue_xmit 
(net/ipv4/ip_output.c:535) __tcp_transmit_skb (net/ipv4/tcp_output.c:1462) tcp_rcv_synsent_state_process (net/ipv4/tcp_input.c:6469) tcp_rcv_state_process (net/ipv4/tcp_input.c:6657) tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929) __release_sock (./include/net/sock.h:1121 net/core/sock.c:2968) release_sock (net/core/sock.c:3536) inet_wait_for_connect (net/ipv4/af_inet.c:609) __inet_stream_connect (net/ipv4/af_inet.c:702) inet_stream_connect (net/ipv4/af_inet.c:748) __sys_connect (./include/linux/file.h:45 net/socket.c:2064) __x64_sys_connect (net/socket.c:2073 net/socket.c:2070 net/socket.c:2070) do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:82) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) RIP: 0033:0x7fa10ff05a3d Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48 RSP: 002b:00007fa110183de8 EFLAGS: 00000202 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000020000054 RCX: 00007fa10ff05a3d RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00007fa110183e20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000202 R12: 00007fa110184640 R13: 0000000000000000 R14: 00007fa10fe8b060 R15: 00007fff73e23b20 </TASK>

The sequence that triggers the issue is as follows:

Thread A                                       Thread B
tcp_v4_rcv  //receive ack TCP packet           inet_shutdown
  tcp_check_req                                  tcp_disconnect //disconnect sock
  ...                                              tcp_set_state(sk, TCP_CLOSE)
    inet_csk_complete_hashdance                ...
      inet_csk_reqsk_queue_add                 inet_listen  //start listen
        spin_lock(&queue->rskq_lock)             inet_csk_listen_start
        ...                                        reqsk_queue_alloc
        ...                                          spin_lock_init
        spin_unlock(&queue->rskq_lock)  //warning

When the socket receives the ACK packet during the three-way handshake, it takes the spinlock. If the user then shuts down the socket and immediately listens on it again, the spinlock is re-initialized while it is still held. When the socket then releases the spinlock, the warning is generated. The same issue applies to fastopenq.lock. Move the spinlock initialization to inet_create and inet_accept to make sure the accept_queue's spinlocks are initialized only once.

Fixes: fff1f3001cc5 ("tcp: add a spinlock to protect struct request_sock_queue")
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Reported-by: Ming Shu <sming56@aliyun.com>
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240118012019.1751966-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
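The hazard in the tcp commit above (a lock being initialized again while another context still holds it) can be sketched in userspace. This is only a single-threaded toy model under stated assumptions: `toy_lock` and its helpers are hypothetical names, and the kernel's qspinlock is far more involved than an integer flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a spinlock: held == 1 while taken, 0 while free.
 * Hypothetical type, used only to illustrate the ordering problem. */
struct toy_lock {
    int held;
};

static void toy_lock_init(struct toy_lock *l)
{
    l->held = 0;
}

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(l->held == 0);   /* single-threaded sketch: lock must be free */
    l->held = 1;
}

/* Returns false when the lock word no longer says "held", the analogue
 * of the "lock has corrupted value 0x0" warning in the report. */
static bool toy_lock_release(struct toy_lock *l)
{
    bool was_held = (l->held == 1);

    l->held = 0;
    return was_held;
}

/* Thread A takes the lock, thread B re-runs the listen path, thread A
 * unlocks. reinit_on_listen selects the old behaviour, where listen
 * initializes the lock again. */
static bool unlock_is_clean(bool reinit_on_listen)
{
    struct toy_lock rskq_lock;

    toy_lock_init(&rskq_lock);            /* once, at socket creation */
    toy_lock_acquire(&rskq_lock);         /* A: queue add takes the lock */
    if (reinit_on_listen)
        toy_lock_init(&rskq_lock);        /* B: listen wipes the lock word */
    return toy_lock_release(&rskq_lock);  /* A: unlock sees the damage */
}
```

With initialization done only at creation time, `unlock_is_clean(false)` reports a clean unlock; with the re-initialization in the listen path, `unlock_is_clean(true)` does not.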
2024-01-19net/smc: fix illegal rmb_desc access in SMC-D connection dumpWen Gu
A crash was found when dumping SMC-D connections. It can be reproduced by the following steps:

- run nginx/wrk test:
  smc_run nginx
  smc_run wrk -t 16 -c 1000 -d <duration> -H 'Connection: Close' <URL>

- continuously dump SMC-D connections in parallel:
  watch -n 1 'smcss -D'

BUG: kernel NULL pointer dereference, address: 0000000000000030
CPU: 2 PID: 7204 Comm: smcss Kdump: loaded Tainted: G E 6.7.0+ #55
RIP: 0010:__smc_diag_dump.constprop.0+0x5e5/0x620 [smc_diag]
Call Trace:
 <TASK>
 ? __die+0x24/0x70
 ? page_fault_oops+0x66/0x150
 ? exc_page_fault+0x69/0x140
 ? asm_exc_page_fault+0x26/0x30
 ? __smc_diag_dump.constprop.0+0x5e5/0x620 [smc_diag]
 ? __kmalloc_node_track_caller+0x35d/0x430
 ? __alloc_skb+0x77/0x170
 smc_diag_dump_proto+0xd0/0xf0 [smc_diag]
 smc_diag_dump+0x26/0x60 [smc_diag]
 netlink_dump+0x19f/0x320
 __netlink_dump_start+0x1dc/0x300
 smc_diag_handler_dump+0x6a/0x80 [smc_diag]
 ? __pfx_smc_diag_dump+0x10/0x10 [smc_diag]
 sock_diag_rcv_msg+0x121/0x140
 ? __pfx_sock_diag_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x5a/0x110
 sock_diag_rcv+0x28/0x40
 netlink_unicast+0x22a/0x330
 netlink_sendmsg+0x1f8/0x420
 __sock_sendmsg+0xb0/0xc0
 ____sys_sendmsg+0x24e/0x300
 ? copy_msghdr_from_user+0x62/0x80
 ___sys_sendmsg+0x7c/0xd0
 ? __do_fault+0x34/0x160
 ? do_read_fault+0x5f/0x100
 ? do_fault+0xb0/0x110
 ? __handle_mm_fault+0x2b0/0x6c0
 __sys_sendmsg+0x4d/0x80
 do_syscall_64+0x69/0x180
 entry_SYSCALL_64_after_hwframe+0x6e/0x76

It is possible that the connection is in the process of being established when we dump it. Assume that the connection has been registered in a link group by smc_conn_create() but the rmb_desc has not yet been initialized by smc_buf_create(); the dump then makes an illegal access to conn->rmb_desc. Fix it by checking before the dump.

Fixes: 4b1b7d3b30a6 ("net/smc: add SMC-D diag support")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-18Merge tag 'net-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from bpf and netfilter.

  Previous releases - regressions:

   - Revert "net: rtnetlink: Enslave device before bringing it up", breaks the case inverse to the one it was trying to fix
   - net: dsa: fix oob access in DSA's netdevice event handler, which dereferenced netdev_priv() before checking that it is a DSA port
   - sched: track device in tcf_block_get/put_ext() only for clsact binder types
   - net: tls, fix WARNING in __sk_msg_free when record becomes full during splice and MORE hint set
   - sfp-bus: fix SFP mode detect from bitrate
   - drv: stmmac: prevent DSA tags from breaking COE

  Previous releases - always broken:

   - bpf: fix no forward progress in bpf_iter_udp if output buffer is too small
   - bpf: reject variable offset alu on registers with a type of PTR_TO_FLOW_KEYS to prevent oob access
   - netfilter: tighten input validation
   - net: add more sanity check in virtio_net_hdr_to_skb()
   - rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid infinite loop
   - amt: do not use the portion of skb->cb area which may get clobbered
   - mptcp: improve validation of the MPTCPOPT_MP_JOIN MPTCP option

  Misc:

   - spring cleanup of inactive maintainers"

* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  i40e: Include types.h to some headers
  ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
  selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
  selftests: mlxsw: qos_pfc: Remove wrong description
  mlxsw: spectrum_router: Register netdevice notifier before nexthop
  mlxsw: spectrum_acl_tcam: Fix stack corruption
  mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
  mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
  ethtool: netlink: Add missing ethnl_ops_begin/complete
  selftests: bonding: Add more missing config options
  selftests: netdevsim: add a config file
  libbpf: warn on unexpected __arg_ctx type when rewriting BTF
  selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
  bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
  bpf: extract bpf_ctx_convert_map logic and make it more reusable
  libbpf: feature-detect arg:ctx tag support in kernel
  ipvs: avoid stat macros calls from preemptible context
  netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
  netfilter: nf_tables: skip dead set elements in netlink dump
  netfilter: nf_tables: do not allow mismatch field size and set key length
  ...
2024-01-18Merge tag 'nf-24-01-18' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains Netfilter fixes for net. Slightly larger than usual because this batch includes several patches to tighten the nf_tables control plane to reject inconsistent configuration:

1) Restrict NFTA_SET_POLICY to NFT_SET_POL_PERFORMANCE and NFT_SET_POL_MEMORY.

2) Bail out if a nf_tables expression registers more than 16 netlink attributes, which is what struct nft_expr_info allows.

3) Bail out if NFT_EXPR_STATEFUL provides no .clone interface, remove existing fallback to memcpy() when cloning, which might accidentally duplicate a memory reference to the same object.

4) Fix br_netfilter interaction with neighbour layer. This requires three preparation patches:

   - Use nf_bridge_get_physinif() in nfnetlink_log
   - Use nf_bridge_info_exists() to check if br_netfilter context is available in nf_queue.
   - Pass net to nf_bridge_get_physindev()

   And finally, the fix which replaces physindev with physinif in nf_bridge_info. Patches from Pavel Tikhomirov.

5) Catch-all deactivation happens in the transaction, hence this oneliner to check for the next generation. This bug was uncovered after the removal of the _BUSY bit, which happened in set elements back in summer 2023.

6) Ensure set (total) key length size and concat field length description is consistent, otherwise bail out.

7) Skip set element with the _DEAD flag on from the netlink dump path. A test occasionally shows that the dump is mismatching because GC might lose the race to get rid of this element while a netlink dump is in progress.

8) Reject NFT_SET_CONCAT for field_count < 1.

9) Use IP6_INC_STATS in ipvs to fix preemption BUG splat, patch from Fedor Pchelkin.

* tag 'nf-24-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  ipvs: avoid stat macros calls from preemptible context
  netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
  netfilter: nf_tables: skip dead set elements in netlink dump
  netfilter: nf_tables: do not allow mismatch field size and set key length
  netfilter: nf_tables: check if catch-all set element is active in next generation
  netfilter: bridge: replace physindev with physinif in nf_bridge_info
  netfilter: propagate net to nf_bridge_get_physindev
  netfilter: nf_queue: remove excess nf_bridge variable
  netfilter: nfnetlink_log: use proper helper for fetching physinif
  netfilter: nft_limit: do not ignore unsupported flags
  netfilter: nf_tables: bail out if stateful expression provides no .clone
  netfilter: nf_tables: validate .maxattr at expression registration
  netfilter: nf_tables: reject invalid set policy
====================

Link: https://lore.kernel.org/r/20240118161726.14838-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-18Merge tag 'tty-6.8-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty / serial updates from Greg KH:
 "Here is the big set of tty and serial driver changes for 6.8-rc1.

  As usual, Jiri has a bunch of refactoring and cleanups for the tty core and drivers in here, along with the usual set of rs485 updates (someday this might work properly...)

  Along with those, in here are changes for:

   - sc16is7xx serial driver updates
   - platform driver removal api updates
   - amba-pl011 driver updates
   - tty driver binding updates
   - other small tty/serial driver updates and changes

  All of these have been in linux-next for a while with no reported issues"

* tag 'tty-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (197 commits)
  serial: sc16is7xx: refactor EFR lock
  serial: sc16is7xx: reorder code to remove prototype declarations
  serial: sc16is7xx: refactor FIFO access functions to increase commonality
  serial: sc16is7xx: drop unneeded MODULE_ALIAS
  serial: sc16is7xx: replace hardcoded divisor value with BIT() macro
  serial: sc16is7xx: add explicit return for some switch default cases
  serial: sc16is7xx: add macro for max number of UART ports
  serial: sc16is7xx: add driver name to struct uart_driver
  serial: sc16is7xx: use i2c_get_match_data()
  serial: sc16is7xx: use spi_get_device_match_data()
  serial: sc16is7xx: use DECLARE_BITMAP for sc16is7xx_lines bitfield
  serial: sc16is7xx: improve do/while loop in sc16is7xx_irq()
  serial: sc16is7xx: remove obsolete loop in sc16is7xx_port_irq()
  serial: sc16is7xx: set safe default SPI clock frequency
  serial: sc16is7xx: add check for unsupported SPI modes during probe
  serial: sc16is7xx: fix invalid sc16is7xx_lines bitfield in case of probe error
  serial: 8250_exar: Set missing rs485_supported flag
  serial: omap: do not override settings for RS485 support
  serial: core, imx: do not set RS485 enabled if it is not supported
  serial: core: make sure RS485 cannot be enabled when it is not supported
  ...
2024-01-18Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-01-18

We've added 10 non-merge commits during the last 5 day(s) which contain a total of 12 files changed, 806 insertions(+), 51 deletions(-).

The main changes are:

1) Fix an issue in bpf_iter_udp under backward progress which prevents user space process from finishing iteration, from Martin KaFai Lau.

2) Fix BPF verifier to reject variable offset alu on registers with a type of PTR_TO_FLOW_KEYS to prevent oob access, from Hao Sun.

3) Follow up fixes for kernel- and libbpf-side logic around handling arg:ctx tagged arguments of BPF global subprogs, from Andrii Nakryiko.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  libbpf: warn on unexpected __arg_ctx type when rewriting BTF
  selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
  bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
  bpf: extract bpf_ctx_convert_map logic and make it more reusable
  libbpf: feature-detect arg:ctx tag support in kernel
  selftests/bpf: Add test for alu on PTR_TO_FLOW_KEYS
  bpf: Reject variable offset alu on PTR_TO_FLOW_KEYS
  selftests/bpf: Test udp and tcp iter batching
  bpf: Avoid iter->offset making backward progress in bpf_iter_udp
  bpf: iter_udp: Retry with a larger batch size without going back to the previous bucket
====================

Link: https://lore.kernel.org/r/20240118153936.11769-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-18ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_workNikita Zhandarovich
idev->mc_ifc_count can be written over without proper locking. Originally found by syzbot [1], fix this issue by encapsulating calls to mld_ifc_stop_work() (and mld_gq_stop_work() for good measure) with mutex_lock() and mutex_unlock() accordingly as these functions should only be called with mc_lock per their declarations. [1] BUG: KCSAN: data-race in ipv6_mc_down / mld_ifc_work write to 0xffff88813a80c832 of 1 bytes by task 3771 on cpu 0: mld_ifc_stop_work net/ipv6/mcast.c:1080 [inline] ipv6_mc_down+0x10a/0x280 net/ipv6/mcast.c:2725 addrconf_ifdown+0xe32/0xf10 net/ipv6/addrconf.c:3949 addrconf_notify+0x310/0x980 notifier_call_chain kernel/notifier.c:93 [inline] raw_notifier_call_chain+0x6b/0x1c0 kernel/notifier.c:461 __dev_notify_flags+0x205/0x3d0 dev_change_flags+0xab/0xd0 net/core/dev.c:8685 do_setlink+0x9f6/0x2430 net/core/rtnetlink.c:2916 rtnl_group_changelink net/core/rtnetlink.c:3458 [inline] __rtnl_newlink net/core/rtnetlink.c:3717 [inline] rtnl_newlink+0xbb3/0x1670 net/core/rtnetlink.c:3754 rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6558 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2545 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6576 netlink_unicast_kernel net/netlink/af_netlink.c:1342 [inline] netlink_unicast+0x589/0x650 net/netlink/af_netlink.c:1368 netlink_sendmsg+0x66e/0x770 net/netlink/af_netlink.c:1910 ... write to 0xffff88813a80c832 of 1 bytes by task 22 on cpu 1: mld_ifc_work+0x54c/0x7b0 net/ipv6/mcast.c:2653 process_one_work kernel/workqueue.c:2627 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2700 worker_thread+0x525/0x730 kernel/workqueue.c:2781 ... 
Fixes: 2d9a93b4902b ("mld: convert from timer to delayed work") Reported-by: syzbot+a9400cabb1d784e49abf@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000994e09060ebcdffb@google.com/ Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Acked-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20240117172102.12001-1-n.zhandarovich@fintech.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-18wifi: mac80211: fix race condition on enabling fast-xmitFelix Fietkau
fast-xmit must only be enabled after the sta has been uploaded to the driver, otherwise it could end up passing the not-yet-uploaded sta via drv_tx calls to the driver, leading to potential crashes because of uninitialized drv_priv data. Add a missing sta->uploaded check and re-check fast xmit after inserting a sta. Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://msgid.link/20240104181059.84032-1-nbd@nbd.name Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-18wifi: mac80211: fix potential sta-link leakJohannes Berg
When a station is allocated, links are added but not set to valid yet (e.g. during connection to an AP MLD), we might remove the station without ever marking links valid, and leak them. Fix that. Fixes: cb71f1d136a6 ("wifi: mac80211: add sta link addition/removal") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240111181514.6573998beaf8.I09ac2e1d41c80f82a5a616b8bd1d9d8dd709a6a6@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-18wifi: cfg80211/mac80211: remove dependency on non-existing optionLukas Bulwahn
Commit ffbd0c8c1e7f ("wifi: mac80211: add an element parsing unit test") and commit 730eeb17bbdd ("wifi: cfg80211: add first kunit tests, for element defrag") add new configs that depend on !KERNEL_6_2, but the config option KERNEL_6_2 does not exist in the tree. This dependency is used for handling backporting to restrict the option to certain kernels but this really should not be carried around the mainline kernel tree. Clean up this needless dependency on the non-existing option KERNEL_6_2. Link: https://lore.kernel.org/lkml/CAKXUXMyfrM6amOR7Ysim3WNQ-Ckf9HJDqRhAoYmLXujo1UV+yA@mail.gmail.com/ Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-01-18wifi: cfg80211: fix missing interfaces when dumpingMichal Kazior
The nl80211_dump_interface() supports resumption in case nl80211_send_iface() doesn't have the resources to complete its work. The logic would store the progress as iteration offsets for rdev and wdev loops. However, the logic did not properly handle resumption for a non-last rdev. Assuming a system with 2 rdevs, with 2 wdevs each, this could happen:

  dump(cb=[0, 0]):
    if_start=cb[1] (=0)
    send rdev0.wdev0 -> ok
    send rdev0.wdev1 -> yield
    cb[1] = 1

  dump(cb=[0, 1]):
    if_start=cb[1] (=1)
    send rdev0.wdev1 -> ok
    // since if_start=1 the rdev1.wdev0 got skipped
    // through if_idx < if_start
    send rdev1.wdev1 -> ok

The if_start needs to be reset back to 0 upon wdev loop end.

The problem is actually hard to hit on a desktop, and even on most routers. The prerequisites for this to manifest were:

- more than 1 wiphy
- a handful of interfaces
- dump without rdev or wdev filter

I was seeing this with 4 wiphys with 9 interfaces each. It'd miss 6 interfaces from the last wiphy reported to userspace.

Signed-off-by: Michal Kazior <michal@plume.com>
Link: https://msgid.link/20240116142340.89678-1-kazikcz@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
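The resumption logic described in the cfg80211 commit above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the nl80211 code: `dump_state`, `dump_once`, `dump_all` and the fixed `budget` per call are hypothetical, and only the `cb[]`, `if_start` and `if_idx` names follow the commit message.

```c
#include <assert.h>

#define N_RDEV 2
#define N_WDEV 2

/* Resume state, modelled on the cb[] array in the commit message:
 * cb[0] = rdev index to resume at, cb[1] = wdev index to resume at. */
struct dump_state {
    int cb[2];
};

/* Sends up to `budget` interfaces and returns how many were sent.
 * reset_if_start selects the fixed behaviour. */
static int dump_once(struct dump_state *st, int budget,
                     int sent[N_RDEV][N_WDEV], int reset_if_start)
{
    int wp_start = st->cb[0];
    int if_start = st->cb[1];
    int n = 0;

    for (int wp_idx = 0; wp_idx < N_RDEV; wp_idx++) {
        if (wp_idx < wp_start)
            continue;                /* rdev finished in an earlier call */
        for (int if_idx = 0; if_idx < N_WDEV; if_idx++) {
            if (if_idx < if_start)
                continue;            /* treated as already sent */
            if (n == budget) {       /* out of room: remember position */
                st->cb[0] = wp_idx;
                st->cb[1] = if_idx;
                return n;
            }
            sent[wp_idx][if_idx] = 1;
            n++;
        }
        if (reset_if_start)
            if_start = 0;            /* the fix: next rdev starts at wdev 0 */
    }
    st->cb[0] = N_RDEV;              /* walk complete */
    st->cb[1] = 0;
    return n;
}

/* Repeats the dump until the walk completes; returns interfaces delivered. */
static int dump_all(int budget, int reset_if_start)
{
    struct dump_state st = { { 0, 0 } };
    int sent[N_RDEV][N_WDEV] = { { 0 } };
    int total = 0;

    assert(budget > 0);
    while (st.cb[0] < N_RDEV)
        total += dump_once(&st, budget, sent, reset_if_start);
    return total;
}
```

With a budget of one interface per call, the variant without the reset delivers only three of the four interfaces (rdev1.wdev0 is skipped), while the variant with the reset delivers all four.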
2024-01-18ethtool: netlink: Add missing ethnl_ops_begin/completeLudvig Pärsson
Accessing an ethernet device that is powered off or clock gated might cause the CPU to hang. Add ethnl_ops_begin/complete in ethnl_set_features() to protect against this. Fixes: 0980bfcd6954 ("ethtool: set netdev features with FEATURES_SET request") Signed-off-by: Ludvig Pärsson <ludvig.parsson@axis.com> Link: https://lore.kernel.org/r/20240117-etht2-v2-1-1a96b6e8c650@axis.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-17ipvs: avoid stat macros calls from preemptible contextFedor Pchelkin
Inside decrement_ttl() upon discovering that the packet ttl has exceeded, __IP_INC_STATS and __IP6_INC_STATS macros can be called from preemptible context having the following backtrace: check_preemption_disabled: 48 callbacks suppressed BUG: using __this_cpu_add() in preemptible [00000000] code: curl/1177 caller is decrement_ttl+0x217/0x830 CPU: 5 PID: 1177 Comm: curl Not tainted 6.7.0+ #34 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xbd/0xe0 check_preemption_disabled+0xd1/0xe0 decrement_ttl+0x217/0x830 __ip_vs_get_out_rt+0x4e0/0x1ef0 ip_vs_nat_xmit+0x205/0xcd0 ip_vs_in_hook+0x9b1/0x26a0 nf_hook_slow+0xc2/0x210 nf_hook+0x1fb/0x770 __ip_local_out+0x33b/0x640 ip_local_out+0x2a/0x490 __ip_queue_xmit+0x990/0x1d10 __tcp_transmit_skb+0x288b/0x3d10 tcp_connect+0x3466/0x5180 tcp_v4_connect+0x1535/0x1bb0 __inet_stream_connect+0x40d/0x1040 inet_stream_connect+0x57/0xa0 __sys_connect_file+0x162/0x1a0 __sys_connect+0x137/0x160 __x64_sys_connect+0x72/0xb0 do_syscall_64+0x6f/0x140 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7fe6dbbc34e0 Use the corresponding preemption-aware variants: IP_INC_STATS and IP6_INC_STATS. Found by Linux Verification Center (linuxtesting.org). Fixes: 8d8e20e2d7bb ("ipvs: Decrement ttl") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nf_tables: reject NFT_SET_CONCAT with not field length descriptionPablo Neira Ayuso
It is still possible to set the NFT_SET_CONCAT flag by specifying a set size and no field description; report EINVAL in that case. Fixes: 1b6345d4160e ("netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nf_tables: skip dead set elements in netlink dumpPablo Neira Ayuso
Delete from packet path relies on the garbage collector to purge elements with NFT_SET_ELEM_DEAD_BIT on. Skip these dead elements in the nf_tables_dump_setelem() path. I very rarely see tests/shell/testcases/maps/typeof_maps_add_delete report [DUMP FAILED], showing a mismatch in the expected output with an element that should not be there. If the netlink dump happens before the GC worker runs, it might show dead elements in the ruleset listing. nft_rhash_get() already skips dead elements in nft_rhash_cmp(); therefore, it already does not show the element when getting a single element via the netlink control plane. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nf_tables: do not allow mismatch field size and set key lengthPablo Neira Ayuso
The set description provides the size of each field in the set; the sum of these sizes must match the set key length, bail out otherwise. I did not manage to crash nft_set_pipapo with mismatched fields and set key length so far, but this is UB which must be disallowed. Fixes: f3a2181e16f1 ("netfilter: nf_tables: Support for sets with multiple ranged fields") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
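The consistency rule in the commit above (the field sizes must add up to the key length) amounts to a small validation step. The following is a minimal sketch, assuming a byte-for-byte comparison; `check_concat_desc` is a hypothetical helper, and the kernel may well compare the lengths in different units (for example, rounded up to registers).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Returns 0 when the per-field lengths add up to the key length and
 * -EINVAL otherwise. Hypothetical helper for illustration only. */
static int check_concat_desc(const unsigned int *field_len, size_t field_count,
                             unsigned int key_len)
{
    unsigned int sum = 0;

    assert(field_len != NULL || field_count == 0);

    for (size_t i = 0; i < field_count; i++)
        sum += field_len[i];

    return sum == key_len ? 0 : -EINVAL;
}
```

A description of two fields with 4 and 2 bytes would pass for a 6-byte key and be rejected for an 8-byte key.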
2024-01-17netfilter: nf_tables: check if catch-all set element is active in next ↵Pablo Neira Ayuso
generation

When deactivating the catch-all set element, check the state in the next generation that represents this transaction. This bug was uncovered after the recent removal of the element busy mark in a2dd0233cbc4 ("netfilter: nf_tables: remove busy mark and gc batch API"). Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Cc: stable@vger.kernel.org Reported-by: lonial con <kongln9170@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: bridge: replace physindev with physinif in nf_bridge_infoPavel Tikhomirov
An skb can be added to a neigh->arp_queue while waiting for an arp reply, and the original skb's skb->dev can be different from the neigh's neigh->dev. For instance, when bridging a dnated skb from one veth to another, the skb would be added to a neigh->arp_queue of the bridge.

As skb->dev can be reset back to nf_bridge->physindev and used, and as there is no explicit mechanism that prevents this physindev from being freed under us (for instance, neigh_flush_dev doesn't clean up skbs from a different device's neigh queue), we can crash on e.g. this stack:

arp_process
  neigh_update
    skb = __skb_dequeue(&neigh->arp_queue)
      neigh_resolve_output(..., skb)
        ...
          br_nf_dev_xmit
            br_nf_pre_routing_finish_bridge_slow
              skb->dev = nf_bridge->physindev
              br_handle_frame_finish

Let's use a plain ifindex instead of a net_device link. To peek into the original net_device we will use dev_get_by_index_rcu(). Thus either we get the device and are safe to use it, or we don't get it and drop the skb.

Fixes: c4e70a87d975 ("netfilter: bridge: rename br_netfilter.c to br_netfilter_hooks.c")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: propagate net to nf_bridge_get_physindevPavel Tikhomirov
This is a preparation patch for replacing physindev with physinif on nf_bridge_info structure. We will use dev_get_by_index_rcu to resolve device, when needed, and it requires net to be available. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nf_queue: remove excess nf_bridge variablePavel Tikhomirov
We don't really need the nf_bridge variable here, and nf_bridge_info_exists is a better replacement for nf_bridge_info_get when we are only checking for existence. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nfnetlink_log: use proper helper for fetching physinifPavel Tikhomirov
We don't use physindev in __build_packet_message except for getting physinif from it. So let's switch to nf_bridge_get_physinif to get what we want directly. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nft_limit: do not ignore unsupported flagsPablo Neira Ayuso
Bail out if userspace provides unsupported flags, otherwise future extensions to the limit expression will be silently ignored by the kernel. Fixes: c7862a5f0de5 ("netfilter: nft_limit: allow to invert matching criteria") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nf_tables: bail out if stateful expression provides no .clonePablo Neira Ayuso
All existing NFT_EXPR_STATEFUL expressions provide a .clone interface. Remove the fallback that copies the content of the stateful expression, since it is never exercised, and bail out if the .clone interface is not defined. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nf_tables: validate .maxattr at expression registrationPablo Neira Ayuso
struct nft_expr_info allows storing up to NFT_EXPR_MAXATTR (16) attributes when parsing netlink attributes. Raise a warning in case there is ever an nft expression whose .maxattr goes beyond this number of attributes; in such a case, struct nft_expr_info needs to be updated. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17netfilter: nf_tables: reject invalid set policyPablo Neira Ayuso
Report -EINVAL in case userspace provides an unsupported set backend policy. Fixes: c50b960ccc59 ("netfilter: nf_tables: implement proper set selection") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-01-17mptcp: relax check on MPC passive fallbackPaolo Abeni
While testing the blamed commit below, I was able to miss (!) packetdrill failures in the fastopen test-cases. On passive fastopen the child socket is created by incoming TCP MPC syn, allow for both MPC_SYN and MPC_ACK header. Fixes: 724b00c12957 ("mptcp: refine opt_mp_capable determination") Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-14net: tls, fix WARNIING in __sk_msg_freeJohn Fastabend
A splice with MSG_SPLICE_PAGES will cause tls code to use the tls_sw_sendmsg_splice path in the TLS sendmsg code to move the user provided pages from the msg into the msg_pl. This will loop over the msg until msg_pl is full, checked by sk_msg_full(msg_pl). The user can also set the MORE flag to hint the stack to delay sending until receiving more pages and ideally a full buffer.

If the user adds more pages to the msg than can fit in the msg_pl scatterlist (MAX_MSG_FRAGS), we should ignore the MORE flag and send the buffer anyway.

What actually happens, though, is that we abort the msg to msg_pl scatterlist setup and then, because we forget to set 'full record' (indicating we can no longer consume data without a send), we fall through to the 'continue' path, which checks whether msg_data_left(msg) has more bytes to send and then attempts to fit them into the already full msg_pl. The next iteration of the sender doing a send will then encounter a full msg_pl and throw the warning in the syzbot report.

To fix this, simply check if we have a full_record in the splice code path and, if not, send the msg regardless of the MORE flag.

Reported-and-tested-by: syzbot+f2977222e0e95cec15c8@syzkaller.appspotmail.com
Reported-by: Edward Adam Davis <eadavis@qq.com>
Fixes: fe1e81d4f73b ("tls/sw: Support MSG_SPLICE_PAGES")
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-13bpf: Avoid iter->offset making backward progress in bpf_iter_udpMartin KaFai Lau
There is a bug in the bpf_iter_udp_batch() function that stops userspace from making forward progress.

The case that triggers the bug is when userspace passes in a very small read buffer. When the bpf prog does bpf_seq_printf, the userspace read buffer is not enough to capture the whole bucket.

When the read buffer is not large enough, the kernel will remember the offset of the bucket in iter->offset such that the next userspace read() can continue from where it left off.

The kernel will skip the number (== "iter->offset") of sockets in the next read(). However, the code decrements iter->offset directly ("--iter->offset"). This is incorrect because the next read() may not consume the whole bucket either, and then the read() after that will start from offset 0. The net effect is that userspace will keep reading from the beginning of a bucket and the process will never finish. "iter->offset" must always go forward until the whole bucket is consumed.

This patch fixes it by using the local variables "resume_offset" and "resume_bucket". "iter->offset" is always reset to 0 before it may be used. "iter->offset" will be advanced to the "resume_offset" when it continues from the "resume_bucket" (i.e. "state->bucket == resume_bucket"). This brings it closer to the bpf_iter_tcp's offset handling, which does not suffer from the same bug.

Cc: Aditi Ghag <aditi.ghag@isovalent.com>
Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20240112190530.3751661-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
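The offset handling described in the bpf_iter_udp commit above can be illustrated with a one-bucket model. This is a simplified sketch, not the iterator code: `reads_to_drain`, `cap` (how many sockets fit into one read) and `max_reads` are hypothetical, and only the `offset` and `resume_offset` names follow the commit message.

```c
#include <assert.h>

/* One bucket holding `n` sockets; each read() can show at most `cap`.
 * Returns the number of read() calls needed to reach the end of the
 * bucket, or -1 if that does not happen within `max_reads` calls.
 * keep_resume_offset selects the fixed behaviour. */
static int reads_to_drain(int n, int cap, int keep_resume_offset, int max_reads)
{
    int offset = 0;                  /* models iter->offset across reads */

    assert(n > 0 && cap > 0);

    for (int reads = 1; reads <= max_reads; reads++) {
        int resume_offset = offset;  /* where this read has to start */
        int shown = 0;

        /* Skipping the sockets that were already shown. The old code
         * counted iter->offset down to 0 here and lost the position;
         * the fixed variant keeps it at resume_offset. */
        offset = keep_resume_offset ? resume_offset : 0;

        for (int i = resume_offset; i < n && shown < cap; i++)
            shown++;
        offset += shown;             /* one bump per socket shown */

        if (resume_offset + shown == n)
            return reads;            /* whole bucket consumed */
    }
    return -1;
}
```

For a bucket of five sockets and room for two per read, the fixed variant finishes in three reads, while the decrementing variant keeps resuming from offset 2 and never reaches the fifth socket.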
2024-01-13bpf: iter_udp: Retry with a larger batch size without going back to the ↵Martin KaFai Lau
previous bucket

The current logic is to use a default size of 16 to batch the whole bucket. If it is too small, it will retry with a larger batch size. The current code accidentally does a state->bucket-- before retrying. This goes back to retry with the previous bucket, which has already been done. This patch fixes it.

It is hard to create a selftest. I added a WARN_ON(state->bucket < 0), forced a particular port to be hashed to the first bucket, created >16 sockets, and observed that the for-loop went back to the "-1" bucket.

Cc: Aditi Ghag <aditi.ghag@isovalent.com>
Fixes: c96dac8d369f ("bpf: udp: Implement batching for sockets iterator")
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Aditi Ghag <aditi.ghag@isovalent.com>
Link: https://lore.kernel.org/r/20240112190530.3751661-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-01-13net: sched: track device in tcf_block_get/put_ext() only for clsact binder typesJiri Pirko
The clsact/ingress qdisc is not the only one using shared blocks; red uses them too. The device tracking was originally introduced by commit 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra") for clsact/ingress only. Commit 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") mistakenly enabled it for red as well.

Fix that by adding a check for the binder type being clsact when adding the device to the block->ports xarray.

Reported-by: Ido Schimmel <idosch@idosch.org>
Closes: https://lore.kernel.org/all/ZZ6JE0odnu1lLPtu@shredder/
Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()")
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Tested-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
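The binder type check can be pictured with the following userspace sketch. All names here (enum toy_binder_type, struct toy_block, toy_block_get(), toy_block_put()) are hypothetical stand-ins, and the block->ports xarray is reduced to a counter; the real helper may check more than the binder type.

```c
#include <stdbool.h>

/* Hypothetical binder types, loosely modeled on the clsact and red users. */
enum toy_binder_type {
	TOY_BINDER_UNSPEC,
	TOY_BINDER_CLSACT_INGRESS,
	TOY_BINDER_CLSACT_EGRESS,
	TOY_BINDER_RED_EARLY_DROP,
	TOY_BINDER_RED_MARK,
};

/* Stand-in for a shared block; nports models the size of block->ports. */
struct toy_block {
	int nports;
};

/* Only clsact binders take part in device tracking. */
static bool toy_tracks_dev(enum toy_binder_type type)
{
	return type == TOY_BINDER_CLSACT_INGRESS ||
	       type == TOY_BINDER_CLSACT_EGRESS;
}

static void toy_block_get(struct toy_block *block, enum toy_binder_type type)
{
	if (toy_tracks_dev(type))
		block->nports++;	/* would insert the device into ports */
}

static void toy_block_put(struct toy_block *block, enum toy_binder_type type)
{
	if (toy_tracks_dev(type))
		block->nports--;	/* would erase the device from ports */
}
```

Using the same predicate on both the get and the put side keeps the two paths symmetric, so a red binder never adds or removes a tracked device.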
2024-01-13udp: annotate data-races around up->pendingEric Dumazet
up->pending can be read without holding the socket lock, as pointed out by syzbot [1]

Add READ_ONCE() in lockless contexts, and WRITE_ONCE() on write side.

[1]
BUG: KCSAN: data-race in udpv6_sendmsg / udpv6_sendmsg

write to 0xffff88814e5eadf0 of 4 bytes by task 15547 on cpu 1:
udpv6_sendmsg+0x1405/0x1530 net/ipv6/udp.c:1596
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x257/0x310 net/socket.c:2192
__do_sys_sendto net/socket.c:2204 [inline]
__se_sys_sendto net/socket.c:2200 [inline]
__x64_sys_sendto+0x78/0x90 net/socket.c:2200
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b

read to 0xffff88814e5eadf0 of 4 bytes by task 15551 on cpu 0:
udpv6_sendmsg+0x22c/0x1530 net/ipv6/udp.c:1373
inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:657
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
____sys_sendmsg+0x37c/0x4d0 net/socket.c:2586
___sys_sendmsg net/socket.c:2640 [inline]
__sys_sendmmsg+0x269/0x500 net/socket.c:2726
__do_sys_sendmmsg net/socket.c:2755 [inline]
__se_sys_sendmmsg net/socket.c:2752 [inline]
__x64_sys_sendmmsg+0x57/0x60 net/socket.c:2752
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x63/0x6b

value changed: 0x00000000 -> 0x0000000a

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 15551 Comm: syz-executor.1 Tainted: G W 6.7.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+8d482d0e407f665d9d10@syzkaller.appspotmail.com
Link: https://lore.kernel.org/netdev/0000000000009e46c3060ebcdffd@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
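The annotation pattern can be sketched in userspace as below. TOY_READ_ONCE() and TOY_WRITE_ONCE() are simplified, hypothetical stand-ins for the kernel macros (a volatile access only, relying on the GCC/Clang __typeof__ extension), and struct toy_udp_sock is not the kernel structure. The sketch shows where the marked accesses go; it does not reproduce or fix a race by itself.

```c
/* Simplified stand-ins: force a single, untorn access via volatile. */
#define TOY_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define TOY_WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical socket state with one field read outside the lock. */
struct toy_udp_sock {
	int pending;	/* 0 when idle, non-zero while a send is pending */
};

/* Lockless fast-path check: the read is marked. */
static int toy_pending_lockless(struct toy_udp_sock *up)
{
	return TOY_READ_ONCE(up->pending);
}

/* Write side, assumed to run under the socket lock: the write is marked. */
static void toy_set_pending(struct toy_udp_sock *up, int val)
{
	TOY_WRITE_ONCE(up->pending, val);
}
```

Marking both sides documents that the concurrent access is intentional and keeps the compiler from tearing, fusing, or re-reading the access.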