summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-06-09mm/gup: remove vmas parameter from pin_user_pages()Lorenzo Stoakes
We are now in a position where no caller of pin_user_pages() requires the vmas parameter at all, so eliminate this parameter from the function and all callers. This clears the way to removing the vmas parameter from GUP altogether. Link: https://lkml.kernel.org/r/195a99ae949c9f5cb589d2222b736ced96ec199a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> [qib] Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> [drivers/media] Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian König <christian.koenig@amd.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-25Merge tag 'net-6.4-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bluetooth and bpf. Current release - regressions: - net: fix skb leak in __skb_tstamp_tx() - eth: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs Current release - new code bugs: - handshake: - fix sock->file allocation - fix handshake_dup() ref counting - bluetooth: - fix potential double free caused by hci_conn_unlink - fix UAF in hci_conn_hash_flush Previous releases - regressions: - core: fix stack overflow when LRO is disabled for virtual interfaces - tls: fix strparser rx issues - bpf: - fix many sockmap/TCP related issues - fix a memory leak in the LRU and LRU_PERCPU hash maps - init the offload table earlier - eth: mlx5e: - do as little as possible in napi poll when budget is 0 - fix using eswitch mapping in nic mode - fix deadlock in tc route query code Previous releases - always broken: - udplite: fix NULL pointer dereference in __sk_mem_raise_allocated() - raw: fix output xfrm lookup wrt protocol - smc: reset connection when trying to use SMCRv2 fails - phy: mscc: enable VSC8501/2 RGMII RX clock - eth: octeontx2-pf: fix TSOv6 offload - eth: cdc_ncm: deal with too low values of dwNtbOutMaxSize" * tag 'net-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (79 commits) udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated(). net: phy: mscc: enable VSC8501/2 RGMII RX clock net: phy: mscc: remove unnecessary phydev locking net: phy: mscc: add support for VSC8501 net: phy: mscc: add VSC8502 to MODULE_DEVICE_TABLE net/handshake: Enable the SNI extension to work properly net/handshake: Unpin sock->file if a handshake is cancelled net/handshake: handshake_genl_notify() shouldn't ignore @flags net/handshake: Fix uninitialized local variable net/handshake: Fix handshake_dup() ref counting net/handshake: Remove unneeded check from handshake_dup() ipv6: Fix out-of-bounds access in ipv6_find_tlv() net: ethernet: mtk_eth_soc: fix QoS on DSA MAC on non MTK_NETSYS_V2 SoCs docs: netdev: document the existence of the mail bot net: fix skb leak in __skb_tstamp_tx() r8169: Use a raw_spinlock_t for the register locks. page_pool: fix inconsistency for page_pool_ring_[un]lock() bpf, sockmap: Test progs verifier error with latest clang bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer ...
2023-05-25udplite: Fix NULL pointer dereference in __sk_mem_raise_allocated().Kuniyuki Iwashima
syzbot reported [0] a null-ptr-deref in sk_get_rmem0() while using IPPROTO_UDPLITE (0x88): 14:25:52 executing program 1: r0 = socket$inet6(0xa, 0x80002, 0x88) We had a similar report [1] for probably sk_memory_allocated_add() in __sk_mem_raise_allocated(), and commit c915fe13cbaa ("udplite: fix NULL pointer dereference") fixed it by setting .memory_allocated for udplite_prot and udplitev6_prot. To fix the variant, we need to set either .sysctl_wmem_offset or .sysctl_rmem. Now UDP and UDPLITE share the same value for .memory_allocated, so we use the same .sysctl_wmem_offset for UDP and UDPLITE. [0]: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 6829 Comm: syz-executor.1 Not tainted 6.4.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/28/2023 RIP: 0010:sk_get_rmem0 include/net/sock.h:2907 [inline] RIP: 0010:__sk_mem_raise_allocated+0x806/0x17a0 net/core/sock.c:3006 Code: c1 ea 03 80 3c 02 00 0f 85 23 0f 00 00 48 8b 44 24 08 48 8b 98 38 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <0f> b6 14 02 48 89 d8 83 e0 07 83 c0 03 38 d0 0f 8d 6f 0a 00 00 8b RSP: 0018:ffffc90005d7f450 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90004d92000 RDX: 0000000000000000 RSI: ffffffff88066482 RDI: ffffffff8e2ccbb8 RBP: ffff8880173f7000 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000030000 R13: 0000000000000001 R14: 0000000000000340 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff8880b9800000(0063) knlGS:00000000f7f1cb40 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CR2: 000000002e82f000 CR3: 0000000034ff0000 CR4: 00000000003506f0 Call Trace: <TASK> __sk_mem_schedule+0x6c/0xe0 net/core/sock.c:3077 udp_rmem_schedule net/ipv4/udp.c:1539 [inline] __udp_enqueue_schedule_skb+0x776/0xb30 net/ipv4/udp.c:1581 __udpv6_queue_rcv_skb net/ipv6/udp.c:666 [inline] udpv6_queue_rcv_one_skb+0xc39/0x16c0 net/ipv6/udp.c:775 udpv6_queue_rcv_skb+0x194/0xa10 net/ipv6/udp.c:793 __udp6_lib_mcast_deliver net/ipv6/udp.c:906 [inline] __udp6_lib_rcv+0x1bda/0x2bd0 net/ipv6/udp.c:1013 ip6_protocol_deliver_rcu+0x2e7/0x1250 net/ipv6/ip6_input.c:437 ip6_input_finish+0x150/0x2f0 net/ipv6/ip6_input.c:482 NF_HOOK include/linux/netfilter.h:303 [inline] NF_HOOK include/linux/netfilter.h:297 [inline] ip6_input+0xa0/0xd0 net/ipv6/ip6_input.c:491 ip6_mc_input+0x40b/0xf50 net/ipv6/ip6_input.c:585 dst_input include/net/dst.h:468 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] NF_HOOK include/linux/netfilter.h:303 [inline] NF_HOOK include/linux/netfilter.h:297 [inline] ipv6_rcv+0x250/0x380 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5491 __netif_receive_skb+0x1f/0x1c0 net/core/dev.c:5605 netif_receive_skb_internal net/core/dev.c:5691 [inline] netif_receive_skb+0x133/0x7a0 net/core/dev.c:5750 tun_rx_batched+0x4b3/0x7a0 drivers/net/tun.c:1553 tun_get_user+0x2452/0x39c0 drivers/net/tun.c:1989 tun_chr_write_iter+0xdf/0x200 drivers/net/tun.c:2035 call_write_iter include/linux/fs.h:1868 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x945/0xd50 fs/read_write.c:584 ksys_write+0x12b/0x250 fs/read_write.c:637 do_syscall_32_irqs_on arch/x86/entry/common.c:112 [inline] __do_fast_syscall_32+0x65/0xf0 arch/x86/entry/common.c:178 do_fast_syscall_32+0x33/0x70 arch/x86/entry/common.c:203 entry_SYSENTER_compat_after_hwframe+0x70/0x82 RIP: 0023:0xf7f21579 Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00 RSP: 002b:00000000f7f1c590 EFLAGS: 00000282 ORIG_RAX: 0000000000000004 RAX: ffffffffffffffda RBX: 00000000000000c8 RCX: 0000000020000040 RDX: 0000000000000083 RSI: 00000000f734e000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000296 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: Link: https://lore.kernel.org/netdev/CANaxB-yCk8hhP68L4Q2nFOJht8sqgXGGQO2AftpHs0u1xyGG5A@mail.gmail.com/ [1] Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema") Reported-by: syzbot+444ca0907e96f7c5e48b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=444ca0907e96f7c5e48b Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230523163305.66466-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-24net/handshake: Enable the SNI extension to work properlyChuck Lever
Enable the upper layer protocol to specify the SNI peername. This avoids the need for tlshd to use a DNS lookup, which can return a hostname that doesn't match the incoming certificate's SubjectName. Fixes: 2fd5532044a8 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net/handshake: Unpin sock->file if a handshake is cancelledChuck Lever
If user space never calls DONE, sock->file's reference count remains elevated. Enable sock->file to be freed eventually in this case. Reported-by: Jakub Kacinski <kuba@kernel.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net/handshake: handshake_genl_notify() shouldn't ignore @flagsChuck Lever
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net/handshake: Fix uninitialized local variableChuck Lever
trace_handshake_cmd_done_err() simply records the pointer in @req, so initializing it to NULL is sufficient and safe. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net/handshake: Fix handshake_dup() ref countingChuck Lever
If get_unused_fd_flags() fails, we ended up calling fput(sock->file) twice. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Suggested-by: Paolo Abeni <pabeni@redhat.com> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24net/handshake: Remove unneeded check from handshake_dup()Chuck Lever
handshake_req_submit() now verifies that the socket has a file. Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-05-24 We've added 19 non-merge commits during the last 10 day(s) which contain a total of 20 files changed, 738 insertions(+), 448 deletions(-). The main changes are: 1) Batch of BPF sockmap fixes found when running against NGINX TCP tests, from John Fastabend. 2) Fix a memleak in the LRU{,_PERCPU} hash map when bucket locking fails, from Anton Protopopov. 3) Init the BPF offload table earlier than just late_initcall, from Jakub Kicinski. 4) Fix ctx access mask generation for 32-bit narrow loads of 64-bit fields, from Will Deacon. 5) Remove a now unsupported __fallthrough in BPF samples, from Andrii Nakryiko. 6) Fix a typo in pkg-config call for building sign-file, from Jeremy Sowden. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf, sockmap: Test progs verifier error with latest clang bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer with drops bpf, sockmap: Test FIONREAD returns correct bytes in rx buffer bpf, sockmap: Test shutdown() correctly exits epoll and recv()=0 bpf, sockmap: Build helper to create connected socket pair bpf, sockmap: Pull socket helpers out of listen test for general use bpf, sockmap: Incorrectly handling copied_seq bpf, sockmap: Wake up polling after data copy bpf, sockmap: TCP data stall on recv before accept bpf, sockmap: Handle fin correctly bpf, sockmap: Improved check for empty queue bpf, sockmap: Reschedule is now done through backlog bpf, sockmap: Convert schedule_work into delayed_work bpf, sockmap: Pass skb ownership through read_skb bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps bpf: Fix mask generation for 32-bit narrow loads of 64-bit fields samples/bpf: Drop unnecessary fallthrough bpf: netdev: init the offload table earlier selftests/bpf: Fix pkg-config call building sign-file ==================== Link: https://lore.kernel.org/r/20230524170839.13905-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-24ipv6: Fix out-of-bounds access in ipv6_find_tlv()Gavrilov Ilia
optlen is fetched without checking whether there is more than one byte to parse. It can lead to out-of-bounds access. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Fixes: c61a40432509 ("[IPV6]: Find option offset by type.") Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-23net: fix skb leak in __skb_tstamp_tx()Pratyush Yadav
Commit 50749f2dd685 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.") added a call to skb_orphan_frags_rx() to fix leaks with zerocopy skbs. But it ended up adding a leak of its own. When skb_orphan_frags_rx() fails, the function just returns, leaking the skb it just cloned. Free it before returning. This bug was discovered and resolved using Coverity Static Analysis Security Testing (SAST) by Synopsys, Inc. Fixes: 50749f2dd685 ("tcp/udp: Fix memleaks of sk and zerocopy skbs with TX timestamp.") Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230522153020.32422-1-ptyadav@amazon.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23page_pool: fix inconsistency for page_pool_ring_[un]lock()Yunsheng Lin
page_pool_ring_[un]lock() use in_softirq() to decide which spin lock variant to use, and when they are called in the context with in_softirq() being false, spin_lock_bh() is called in page_pool_ring_lock() while spin_unlock() is called in page_pool_ring_unlock(), because spin_lock_bh() has disabled the softirq in page_pool_ring_lock(), which causes inconsistency for spin lock pair calling. This patch fixes it by returning in_softirq state from page_pool_producer_lock(), and use it to decide which spin lock variant to use in page_pool_producer_unlock(). As pool->ring has both producer and consumer lock, so rename it to page_pool_producer_[un]lock() to reflect the actual usage. Also move them to page_pool.c as they are only used there, and remove the 'inline' as the compiler may have better idea to do inlining or not. Fixes: 7886244736a4 ("net: page_pool: Add bulk support for ptr_ring") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20230522031714.5089-1-linyunsheng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-23bpf, sockmap: Incorrectly handling copied_seqJohn Fastabend
The read_skb() logic is incrementing the tcp->copied_seq which is used for among other things calculating how many outstanding bytes can be read by the application. This results in application errors, if the application does an ioctl(FIONREAD) we return zero because this is calculated from the copied_seq value. To fix this we move tcp->copied_seq accounting into the recv handler so that we update these when the recvmsg() hook is called and data is in fact copied into user buffers. This gives an accurate FIONREAD value as expected and improves ACK handling. Before we were calling the tcp_rcv_space_adjust() which would update 'number of bytes copied to user in last RTT' which is wrong for programs returning SK_PASS. The bytes are only copied to the user when recvmsg is handled. Doing the fix for recvmsg is straightforward, but fixing redirect and SK_DROP pkts is a bit tricker. Build a tcp_psock_eat() helper and then call this from skmsg handlers. This fixes another issue where a broken socket with a BPF program doing a resubmit could hang the receiver. This happened because although read_skb() consumed the skb through sock_drop() it did not update the copied_seq. Now if a single reccv socket is redirecting to many sockets (for example for lb) the receiver sk will be hung even though we might expect it to continue. The hang comes from not updating the copied_seq numbers and memory pressure resulting from that. We have a slight layer problem of calling tcp_eat_skb even if its not a TCP socket. To fix we could refactor and create per type receiver handlers. I decided this is more work than we want in the fix and we already have some small tweaks depending on caller that use the helper skb_bpf_strparser(). So we extend that a bit and always set the strparser bit when it is in use and then we can gate the seq_copied updates on this. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-9-john.fastabend@gmail.com
2023-05-23bpf, sockmap: Wake up polling after data copyJohn Fastabend
When TCP stack has data ready to read sk_data_ready() is called. Sockmap overwrites this with its own handler to call into BPF verdict program. But, the original TCP socket had sock_def_readable that would additionally wake up any user space waiters with sk_wake_async(). Sockmap saved the callback when the socket was created so call the saved data ready callback and then we can wake up any epoll() logic waiting on the read. Note we call on 'copied >= 0' to account for returning 0 when a FIN is received because we need to wake up user for this as well so they can do the recvmsg() -> 0 and detect the shutdown. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-8-john.fastabend@gmail.com
2023-05-23bpf, sockmap: TCP data stall on recv before acceptJohn Fastabend
A common mechanism to put a TCP socket into the sockmap is to hook the BPF_SOCK_OPS_{ACTIVE_PASSIVE}_ESTABLISHED_CB event with a BPF program that can map the socket info to the correct BPF verdict parser. When the user adds the socket to the map the psock is created and the new ops are assigned to ensure the verdict program will 'see' the sk_buffs as they arrive. Part of this process hooks the sk_data_ready op with a BPF specific handler to wake up the BPF verdict program when data is ready to read. The logic is simple enough (posted here for easy reading) static void sk_psock_verdict_data_ready(struct sock *sk) { struct socket *sock = sk->sk_socket; if (unlikely(!sock || !sock->ops || !sock->ops->read_skb)) return; sock->ops->read_skb(sk, sk_psock_verdict_recv); } The oversight here is sk->sk_socket is not assigned until the application accepts() the new socket. However, its entirely ok for the peer application to do a connect() followed immediately by sends. The socket on the receiver is sitting on the backlog queue of the listening socket until its accepted and the data is queued up. If the peer never accepts the socket or is slow it will eventually hit data limits and rate limit the session. But, important for BPF sockmap hooks when this data is received TCP stack does the sk_data_ready() call but the read_skb() for this data is never called because sk_socket is missing. The data sits on the sk_receive_queue. Then once the socket is accepted if we never receive more data from the peer there will be no further sk_data_ready calls and all the data is still on the sk_receive_queue(). Then user calls recvmsg after accept() and for TCP sockets in sockmap we use the tcp_bpf_recvmsg_parser() handler. The handler checks for data in the sk_msg ingress queue expecting that the BPF program has already run from the sk_data_ready hook and enqueued the data as needed. So we are stuck. To fix do an unlikely check in recvmsg handler for data on the sk_receive_queue and if it exists wake up data_ready. We have the sock locked in both read_skb and recvmsg so should avoid having multiple runners. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-7-john.fastabend@gmail.com
2023-05-23bpf, sockmap: Handle fin correctlyJohn Fastabend
The sockmap code is returning EAGAIN after a FIN packet is received and no more data is on the receive queue. Correct behavior is to return 0 to the user and the user can then close the socket. The EAGAIN causes many apps to retry which masks the problem. Eventually the socket is evicted from the sockmap because its released from sockmap sock free handling. The issue creates a delay and can cause some errors on application side. To fix this check on sk_msg_recvmsg side if length is zero and FIN flag is set then set return to zero. A selftest will be added to check this condition. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-6-john.fastabend@gmail.com
2023-05-23bpf, sockmap: Improved check for empty queueJohn Fastabend
We noticed some rare sk_buffs were stepping past the queue when system was under memory pressure. The general theory is to skip enqueueing sk_buffs when its not necessary which is the normal case with a system that is properly provisioned for the task, no memory pressure and enough cpu assigned. But, if we can't allocate memory due to an ENOMEM error when enqueueing the sk_buff into the sockmap receive queue we push it onto a delayed workqueue to retry later. When a new sk_buff is received we then check if that queue is empty. However, there is a problem with simply checking the queue length. When a sk_buff is being processed from the ingress queue but not yet on the sockmap msg receive queue its possible to also recv a sk_buff through normal path. It will check the ingress queue which is zero and then skip ahead of the pkt being processed. Previously we used sock lock from both contexts which made the problem harder to hit, but not impossible. To fix instead of popping the skb from the queue entirely we peek the skb from the queue and do the copy there. This ensures checks to the queue length are non-zero while skb is being processed. Then finally when the entire skb has been copied to user space queue or another socket we pop it off the queue. This way the queue length check allows bypassing the queue only after the list has been completely processed. To reproduce issue we run NGINX compliance test with sockmap running and observe some flakes in our testing that we attributed to this issue. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Suggested-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-5-john.fastabend@gmail.com
2023-05-23bpf, sockmap: Reschedule is now done through backlogJohn Fastabend
Now that the backlog manages the reschedule() logic correctly we can drop the partial fix to reschedule from recvmsg hook. Rescheduling on recvmsg hook was added to address a corner case where we still had data in the backlog state but had nothing to kick it and reschedule the backlog worker to run and finish copying data out of the state. This had a couple limitations, first it required user space to kick it introducing an unnecessary EBUSY and retry. Second it only handled the ingress case and egress redirects would still be hung. With the correct fix, pushing the reschedule logic down to where the enomem error occurs we can drop this fix. Fixes: bec217197b412 ("skmsg: Schedule psock work if the cached skb exists on the psock") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-4-john.fastabend@gmail.com
2023-05-23bpf, sockmap: Convert schedule_work into delayed_workJohn Fastabend
Sk_buffs are fed into sockmap verdict programs either from a strparser (when the user might want to decide how framing of skb is done by attaching another parser program) or directly through tcp_read_sock. The tcp_read_sock is the preferred method for performance when the BPF logic is a stream parser. The flow for Cilium's common use case with a stream parser is, tcp_read_sock() sk_psock_verdict_recv ret = bpf_prog_run_pin_on_cpu() sk_psock_verdict_apply(sock, skb, ret) // if system is under memory pressure or app is slow we may // need to queue skb. Do this queuing through ingress_skb and // then kick timer to wake up handler skb_queue_tail(ingress_skb, skb) schedule_work(work); The work queue is wired up to sk_psock_backlog(). This will then walk the ingress_skb skb list that holds our sk_buffs that could not be handled, but should be OK to run at some later point. However, its possible that the workqueue doing this work still hits an error when sending the skb. When this happens the skbuff is requeued on a temporary 'state' struct kept with the workqueue. This is necessary because its possible to partially send an skbuff before hitting an error and we need to know how and where to restart when the workqueue runs next. Now for the trouble, we don't rekick the workqueue. This can cause a stall where the skbuff we just cached on the state variable might never be sent. This happens when its the last packet in a flow and no further packets come along that would cause the system to kick the workqueue from that side. To fix we could do simple schedule_work(), but while under memory pressure it makes sense to back off some instead of continue to retry repeatedly. So instead to fix convert schedule_work to schedule_delayed_work and add backoff logic to reschedule from backlog queue on errors. Its not obvious though what a good backoff is so use '1'. To test we observed some flakes whil running NGINX compliance test with sockmap we attributed these failed test to this bug and subsequent issue. >From on list discussion. This commit bec217197b41("skmsg: Schedule psock work if the cached skb exists on the psock") was intended to address similar race, but had a couple cases it missed. Most obvious it only accounted for receiving traffic on the local socket so if redirecting into another socket we could still get an sk_buff stuck here. Next it missed the case where copied=0 in the recv() handler and then we wouldn't kick the scheduler. Also its sub-optimal to require userspace to kick the internal mechanisms of sockmap to wake it up and copy data to user. It results in an extra syscall and requires the app to actual handle the EAGAIN correctly. Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-3-john.fastabend@gmail.com
2023-05-23bpf, sockmap: Pass skb ownership through read_skbJohn Fastabend
The read_skb hook calls consume_skb() now, but this means that if the recv_actor program wants to use the skb it needs to inc the ref cnt so that the consume_skb() doesn't kfree the sk_buff. This is problematic because in some error cases under memory pressure we may need to linearize the sk_buff from sk_psock_skb_ingress_enqueue(). Then we get this, skb_linearize() __pskb_pull_tail() pskb_expand_head() BUG_ON(skb_shared(skb)) Because we incremented users refcnt from sk_psock_verdict_recv() we hit the bug on with refcnt > 1 and trip it. To fix lets simply pass ownership of the sk_buff through the skb_read call. Then we can drop the consume from read_skb handlers and assume the verdict recv does any required kfree. Bug found while testing in our CI which runs in VMs that hit memory constraints rather regularly. William tested TCP read_skb handlers. [ 106.536188] ------------[ cut here ]------------ [ 106.536197] kernel BUG at net/core/skbuff.c:1693! [ 106.536479] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 106.536726] CPU: 3 PID: 1495 Comm: curl Not tainted 5.19.0-rc5 #1 [ 106.537023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.16.0-1 04/01/2014 [ 106.537467] RIP: 0010:pskb_expand_head+0x269/0x330 [ 106.538585] RSP: 0018:ffffc90000138b68 EFLAGS: 00010202 [ 106.538839] RAX: 000000000000003f RBX: ffff8881048940e8 RCX: 0000000000000a20 [ 106.539186] RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff8881048940e8 [ 106.539529] RBP: ffffc90000138be8 R08: 00000000e161fd1a R09: 0000000000000000 [ 106.539877] R10: 0000000000000018 R11: 0000000000000000 R12: ffff8881048940e8 [ 106.540222] R13: 0000000000000003 R14: 0000000000000000 R15: ffff8881048940e8 [ 106.540568] FS: 00007f277dde9f00(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000 [ 106.540954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 106.541227] CR2: 00007f277eeede64 CR3: 000000000ad3e000 CR4: 00000000000006e0 [ 106.541569] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 106.541915] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 106.542255] Call Trace: [ 106.542383] <IRQ> [ 106.542487] __pskb_pull_tail+0x4b/0x3e0 [ 106.542681] skb_ensure_writable+0x85/0xa0 [ 106.542882] sk_skb_pull_data+0x18/0x20 [ 106.543084] bpf_prog_b517a65a242018b0_bpf_skskb_http_verdict+0x3a9/0x4aa9 [ 106.543536] ? migrate_disable+0x66/0x80 [ 106.543871] sk_psock_verdict_recv+0xe2/0x310 [ 106.544258] ? sk_psock_write_space+0x1f0/0x1f0 [ 106.544561] tcp_read_skb+0x7b/0x120 [ 106.544740] tcp_data_queue+0x904/0xee0 [ 106.544931] tcp_rcv_established+0x212/0x7c0 [ 106.545142] tcp_v4_do_rcv+0x174/0x2a0 [ 106.545326] tcp_v4_rcv+0xe70/0xf60 [ 106.545500] ip_protocol_deliver_rcu+0x48/0x290 [ 106.545744] ip_local_deliver_finish+0xa7/0x150 Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Reported-by: William Findlay <will@isovalent.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: William Findlay <will@isovalent.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20230523025618.113937-2-john.fastabend@gmail.com
2023-05-23ipv{4,6}/raw: fix output xfrm lookup wrt protocolNicolas Dichtel
With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the protocol field of the flow structure, build by raw_sendmsg() / rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy lookup when some policies are defined with a protocol in the selector. For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to specify the protocol. Just accept all values for IPPROTO_RAW socket. For ipv4, the sin_port field of 'struct sockaddr_in' could not be used without breaking backward compatibility (the value of this field was never checked). Let's add a new kind of control message, so that the userland could specify which protocol is used. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-22net/handshake: Fix sock->file allocationChuck Lever
sock->file = sock_alloc_file(sock, O_NONBLOCK, NULL); ^^^^ ^^^^ sock_alloc_file() calls release_sock() on error but the left hand side of the assignment dereferences "sock". This isn't the bug and I didn't report this earlier because there is an assert that it doesn't fail. net/handshake/handshake-test.c:221 handshake_req_submit_test4() error: dereferencing freed memory 'sock' net/handshake/handshake-test.c:233 handshake_req_submit_test4() warn: 'req' was already freed. net/handshake/handshake-test.c:254 handshake_req_submit_test5() error: dereferencing freed memory 'sock' net/handshake/handshake-test.c:290 handshake_req_submit_test6() error: dereferencing freed memory 'sock' net/handshake/handshake-test.c:321 handshake_req_cancel_test1() error: dereferencing freed memory 'sock' net/handshake/handshake-test.c:355 handshake_req_cancel_test2() error: dereferencing freed memory 'sock' net/handshake/handshake-test.c:367 handshake_req_cancel_test2() warn: 'req' was already freed. net/handshake/handshake-test.c:395 handshake_req_cancel_test3() error: dereferencing freed memory 'sock' net/handshake/handshake-test.c:407 handshake_req_cancel_test3() warn: 'req' was already freed. net/handshake/handshake-test.c:451 handshake_req_destroy_test1() error: dereferencing freed memory 'sock' net/handshake/handshake-test.c:463 handshake_req_destroy_test1() warn: 'req' was already freed. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 88232ec1ec5e ("net/handshake: Add Kunit tests for the handshake consumer API") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/168451609436.45209.15407022385441542980.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-22net/handshake: Squelch allocation warning during Kunit testChuck Lever
The "handshake_req_alloc excessive privsize" kunit test is intended to check what happens when the maximum privsize is exceeded. The WARN_ON_ONCE_GFP at mm/page_alloc.c:4744 can be disabled safely for this test. Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Fixes: 88232ec1ec5e ("net/handshake: Add Kunit tests for the handshake consumer API") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/168451636052.47152.9600443326570457947.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-22Merge tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client fixes from Anna Schumaker: "Stable Fix: - Don't change task->tk_status after the call to rpc_exit_task Other Bugfixes: - Convert kmap_atomic() to kmap_local_folio() - Fix a potential double free with READ_PLUS" * tag 'nfs-for-6.4-2' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFSv4.2: Fix a potential double free with READ_PLUS SUNRPC: Don't change task->tk_status after the call to rpc_exit_task NFS: Convert kmap_atomic() to kmap_local_folio()
2023-05-22sctp: fix an issue that plpmtu can never go to complete stateXin Long
When doing plpmtu probe, the probe size is growing every time when it receives the ACK during the Search state until the probe fails. When the failure occurs, pl.probe_high is set and it goes to the Complete state. However, if the link pmtu is huge, like 65535 in loopback_dev, the probe eventually keeps using SCTP_MAX_PLPMTU as the probe size and never fails. Because of that, pl.probe_high can not be set, and the plpmtu probe can never go to the Complete state. Fix it by setting pl.probe_high to SCTP_MAX_PLPMTU when the probe size grows to SCTP_MAX_PLPMTU in sctp_transport_pl_recv(). Also, not allow the probe size greater than SCTP_MAX_PLPMTU in the Complete state. Fixes: b87641aff9e7 ("sctp: do state transition when a probe succeeds on HB ACK recv path") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19Bluetooth: Unlink CISes when LE disconnects in hci_conn_delRuihan Li
Currently, hci_conn_del calls hci_conn_unlink for BR/EDR, (e)SCO, and CIS connections, i.e., everything except LE connections. However, if (e)SCO connections are unlinked when BR/EDR disconnects, CIS connections should also be unlinked when LE disconnects. In terms of disconnection behavior, CIS and (e)SCO connections are not too different. One peculiarity of CIS is that when CIS connections are disconnected, the CIS handle isn't deleted, as per [BLUETOOTH CORE SPECIFICATION Version 5.4 | Vol 4, Part E] 7.1.6 Disconnect command: All SCO, eSCO, and CIS connections on a physical link should be disconnected before the ACL connection on the same physical connection is disconnected. If it does not, they will be implicitly disconnected as part of the ACL disconnection. ... Note: As specified in Section 7.7.5, on the Central, the handle for a CIS remains valid even after disconnection and, therefore, the Host can recreate a disconnected CIS at a later point in time using the same connection handle. Since hci_conn_link invokes both hci_conn_get and hci_conn_hold, hci_conn_unlink should perform both hci_conn_put and hci_conn_drop as well. However, currently it performs only hci_conn_put. This patch makes hci_conn_unlink call hci_conn_drop as well, which simplifies the logic in hci_conn_del a bit and may benefit future users of hci_conn_unlink. But it is noted that this change additionally implies that hci_conn_unlink can queue disc_work on conn itself, with the following call stack: hci_conn_unlink(conn) [conn->parent == NULL] -> hci_conn_unlink(child) [child->parent == conn] -> hci_conn_drop(child->parent) -> queue_delayed_work(&conn->disc_work) Queued disc_work after hci_conn_del can be spurious, so during the process of hci_conn_del, it is necessary to make the call to cancel_delayed_work(&conn->disc_work) after invoking hci_conn_unlink. Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn> Co-developed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-05-19Bluetooth: Fix UAF in hci_conn_hash_flush againRuihan Li
Commit 06149746e720 ("Bluetooth: hci_conn: Add support for linking multiple hcon") reintroduced a previously fixed bug [1] ("KASAN: slab-use-after-free Read in hci_conn_hash_flush"). This bug was originally fixed by commit 5dc7d23e167e ("Bluetooth: hci_conn: Fix possible UAF"). The hci_conn_unlink function was added to avoid invalidating the link traversal caused by successive hci_conn_del operations releasing extra connections. However, currently hci_conn_unlink itself also releases extra connections, resulted in the reintroduced bug. This patch follows a more robust solution for cleaning up all connections, by repeatedly removing the first connection until there are none left. This approach does not rely on the inner workings of hci_conn_del and ensures proper cleanup of all connections. Meanwhile, we need to make sure that hci_conn_del never fails. Indeed it doesn't, as it now always returns zero. To make this a bit clearer, this patch also changes its return type to void. Reported-by: syzbot+8bb72f86fc823817bc5d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-bluetooth/000000000000aa920505f60d25ad@google.com/ Fixes: 06149746e720 ("Bluetooth: hci_conn: Add support for linking multiple hcon") Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn> Co-developed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-05-19Bluetooth: Refcnt drop must be placed last in hci_conn_unlinkRuihan Li
If hci_conn_put(conn->parent) reduces conn->parent's reference count to zero, it can immediately deallocate conn->parent. At the same time, conn->link->list has its head in conn->parent, causing use-after-free problems in the latter list_del_rcu(&conn->link->list). This problem can be easily solved by reordering the two operations, i.e., first performing the list removal with list_del_rcu and then decreasing the refcnt with hci_conn_put. Reported-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Closes: https://lore.kernel.org/linux-bluetooth/CABBYNZ+1kce8_RJrLNOXd_8=Mdpb=2bx4Nto-hFORk=qiOkoCg@mail.gmail.com/ Fixes: 06149746e720 ("Bluetooth: hci_conn: Add support for linking multiple hcon") Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-05-19Bluetooth: Fix potential double free caused by hci_conn_unlinkRuihan Li
The hci_conn_unlink function is being called by hci_conn_del, which means it should not call hci_conn_del with the input parameter conn again. If it does, conn may have already been released when hci_conn_unlink returns, leading to potential UAF and double-free issues. This patch resolves the problem by modifying hci_conn_unlink to release only conn's child links when necessary, but never release conn itself. Reported-by: syzbot+690b90b14f14f43f4688@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-bluetooth/000000000000484a8205faafe216@google.com/ Fixes: 06149746e720 ("Bluetooth: hci_conn: Add support for linking multiple hcon") Signed-off-by: Ruihan Li <lrh2000@pku.edu.cn> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Reported-by: syzbot+690b90b14f14f43f4688@syzkaller.appspotmail.com Reported-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Reported-by: syzbot+8bb72f86fc823817bc5d@syzkaller.appspotmail.com
2023-05-19SUNRPC: Don't change task->tk_status after the call to rpc_exit_taskTrond Myklebust
Some calls to rpc_exit_task() may deliberately change the value of task->tk_status, for instance because it gets checked by the RPC call's rpc_release() callback. That makes it wrong to reset the value to task->tk_rpc_status. In particular this causes a bug where the rpc_call_done() callback tries to fail over a set of pNFS/flexfiles writes to a different IP address, but the reset of task->tk_status causes nfs_commit_release_pages() to immediately mark the file as having a fatal error. Fixes: 39494194f93b ("SUNRPC: Fix races with rpc_killall_tasks()") Cc: stable@vger.kernel.org # 6.1.x Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2023-05-19net/smc: Reset connection when trying to use SMCRv2 fails.Wen Gu
We found a crash when using SMCRv2 with 2 Mellanox ConnectX-4. It can be reproduced by: - smc_run nginx - smc_run wrk -t 32 -c 500 -d 30 http://<ip>:<port> BUG: kernel NULL pointer dereference, address: 0000000000000014 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 8000000108713067 P4D 8000000108713067 PUD 151127067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 4 PID: 2441 Comm: kworker/4:249 Kdump: loaded Tainted: G W E 6.4.0-rc1+ #42 Workqueue: smc_hs_wq smc_listen_work [smc] RIP: 0010:smc_clc_send_confirm_accept+0x284/0x580 [smc] RSP: 0018:ffffb8294b2d7c78 EFLAGS: 00010a06 RAX: ffff8f1873238880 RBX: ffffb8294b2d7dc8 RCX: 0000000000000000 RDX: 00000000000000b4 RSI: 0000000000000001 RDI: 0000000000b40c00 RBP: ffffb8294b2d7db8 R08: ffff8f1815c5860c R09: 0000000000000000 R10: 0000000000000400 R11: 0000000000000000 R12: ffff8f1846f56180 R13: ffff8f1815c5860c R14: 0000000000000001 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff8f1aefd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000014 CR3: 00000001027a0001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? mlx5_ib_map_mr_sg+0xa1/0xd0 [mlx5_ib] ? smcr_buf_map_link+0x24b/0x290 [smc] ? __smc_buf_create+0x4ee/0x9b0 [smc] smc_clc_send_accept+0x4c/0xb0 [smc] smc_listen_work+0x346/0x650 [smc] ? __schedule+0x279/0x820 process_one_work+0x1e5/0x3f0 worker_thread+0x4d/0x2f0 ? __pfx_worker_thread+0x10/0x10 kthread+0xe5/0x120 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x2c/0x50 </TASK> During the CLC handshake, server sequentially tries available SMCRv2 and SMCRv1 devices in smc_listen_work(). If an SMCRv2 device is found. SMCv2 based link group and link will be assigned to the connection. Then assumed that some buffer assignment errors happen later in the CLC handshake, such as RMB registration failure, server will give up SMCRv2 and try SMCRv1 device instead. But the resources assigned to the connection won't be reset. When server tries SMCRv1 device, the connection creation process will be executed again. Since conn->lnk has been assigned when trying SMCRv2, it will not be set to the correct SMCRv1 link in smcr_lgr_conn_assign_link(). So in such situation, conn->lgr points to correct SMCRv1 link group but conn->lnk points to the SMCRv2 link mistakenly. Then in smc_clc_send_confirm_accept(), conn->rmb_desc->mr[link->link_idx] will be accessed. Since the link->link_idx is not correct, the related MR may not have been initialized, so crash happens. | Try SMCRv2 device first | |-> conn->lgr: assign existed SMCRv2 link group; | |-> conn->link: assign existed SMCRv2 link (link_idx may be 1 in SMC_LGR_SYMMETRIC); | |-> sndbuf & RMB creation fails, quit; | | Try SMCRv1 device then | |-> conn->lgr: create SMCRv1 link group and assign; | |-> conn->link: keep SMCRv2 link mistakenly; | |-> sndbuf & RMB creation succeed, only RMB->mr[link_idx = 0] | initialized. | | Then smc_clc_send_confirm_accept() accesses | conn->rmb_desc->mr[conn->link->link_idx, which is 1], then crash. v This patch tries to fix this by cleaning conn->lnk before assigning link. In addition, it is better to reset the connection and clean the resources assigned if trying SMCRv2 failed in buffer creation or registration. Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2") Link: https://lore.kernel.org/r/20220523055056.2078994-1-liuyacan@corp.netease.com/ Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19tls: rx: strp: don't use GFP_KERNEL in softirq contextJakub Kicinski
When receive buffer is small, or the TCP rx queue looks too complicated to bother using it directly - we allocate a new skb and copy data into it. We already use sk->sk_allocation... but nothing actually sets it to GFP_ATOMIC on the ->sk_data_ready() path. Users of HW offload are far more likely to experience problems due to scheduling while atomic. "Copy mode" is very rarely triggered with SW crypto. Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19tls: rx: strp: preserve decryption status of skbs when neededJakub Kicinski
When receive buffer is small we try to copy out the data from TCP into a skb maintained by TLS to prevent connection from stalling. Unfortunately if a single record is made up of a mix of decrypted and non-decrypted skbs combining them into a single skb leads to loss of decryption status, resulting in decryption errors or data corruption. Similarly when trying to use TCP receive queue directly we need to make sure that all the skbs within the record have the same status. If we don't the mixed status will be detected correctly but we'll CoW the anchor, again collapsing it into a single paged skb without decrypted status preserved. So the "fixup" code will not know which parts of skb to re-encrypt. Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19tls: rx: strp: factor out copying skb dataJakub Kicinski
We'll need to copy input skbs individually in the next patch. Factor that code out (without assuming we're copying a full record). Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19tls: rx: strp: fix determining record length in copy modeJakub Kicinski
We call tls_rx_msg_size(skb) before doing skb->len += chunk. So the tls_rx_msg_size() code will see old skb->len, most likely leading to an over-read. Worst case we will over read an entire record, next iteration will try to trim the skb but may end up turning frag len negative or discarding the subsequent record (since we already told TCP we've read it during previous read but now we'll trim it out of the skb). Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19tls: rx: strp: force mixed decrypted records into copy modeJakub Kicinski
If a record is partially decrypted we'll have to CoW it, anyway, so go into copy mode and allocate a writable skb right away. This will make subsequent fix simpler because we won't have to teach tls_strp_msg_make_copy() how to copy skbs while preserving decrypt status. Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19tls: rx: strp: set the skb->len of detached / CoW'ed skbsJakub Kicinski
alloc_skb_with_frags() fills in page frag sizes but does not set skb->len and skb->data_len. Set those correctly otherwise device offload will most likely generate an empty skb and hit the BUG() at the end of __skb_nsg(). Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-19tls: rx: device: fix checking decryption statusJakub Kicinski
skb->len covers the entire skb, including the frag_list. In fact we're guaranteed that rxm->full_len <= skb->len, so since the change under Fixes we were not checking decrypt status of any skb but the first. Note that the skb_pagelen() added here may feel a bit costly, but it's removed by subsequent fixes, anyway. Reported-by: Tariq Toukan <tariqt@nvidia.com> Fixes: 86b259f6f888 ("tls: rx: device: bound the frag walk") Tested-by: Shai Amiram <samiram@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-18Merge tag 'net-6.4-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can, xfrm, bluetooth and netfilter. Current release - regressions: - ipv6: fix RCU splat in ipv6_route_seq_show() - wifi: iwlwifi: disable RFI feature Previous releases - regressions: - tcp: fix possible sk_priority leak in tcp_v4_send_reset() - tipc: do not update mtu if msg_max is too small in mtu negotiation - netfilter: fix null deref on element insertion - devlink: change per-devlink netdev notifier to static one - phylink: fix ksettings_set() ethtool call - wifi: mac80211: fortify the spinlock against deadlock by interrupt - wifi: brcmfmac: check for probe() id argument being NULL - eth: ice: - fix undersized tx_flags variable - fix ice VF reset during iavf initialization - eth: hns3: fix sending pfc frames after reset issue Previous releases - always broken: - xfrm: release all offloaded policy memory - nsh: use correct mac_offset to unwind gso skb in nsh_gso_segment() - vsock: avoid to close connected socket after the timeout - dsa: rzn1-a5psw: enable management frames for CPU port - eth: virtio_net: fix error unwinding of XDP initialization - eth: tun: fix memory leak for detached NAPI queue. Misc: - MAINTAINERS: sctp: move Neil to CREDITS" * tag 'net-6.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits) MAINTAINERS: skip CCing netdev for Bluetooth patches mdio_bus: unhide mdio_bus_init prototype bridge: always declare tunnel functions atm: hide unused procfs functions net: isa: include net/Space.h Revert "ARM: dts: stm32: add CAN support on stm32f746" netfilter: nft_set_rbtree: fix null deref on element insertion netfilter: nf_tables: fix nft_trans type confusion netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with CONFIG_NF_NAT net: wwan: t7xx: Ensure init is completed before system sleep net: selftests: Fix optstring net: pcs: xpcs: fix C73 AN not getting enabled net: wwan: iosm: fix NULL pointer dereference when removing device vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit() mailmap: add entries for Nikolay Aleksandrov igb: fix bit_shift to be in [1..8] range net: dsa: mv88e6xxx: Fix mv88e6393x EPC write command offset cassini: Fix a memory leak in the error handling path of cas_init_one() tun: Fix memory leak for detached NAPI queue. can: kvaser_pciefd: Disable interrupts in probe error path ...
2023-05-17Merge tag 'nf-23-05-17' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== Netfilter fixes for net 1. Silence warning about unused variable when CONFIG_NF_NAT=n, from Tom Rix. 2. nftables: Fix possible out-of-bounds access, from myself. 3. nftables: fix null deref+UAF during element insertion into rbtree, also from myself. * tag 'nf-23-05-17' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_set_rbtree: fix null deref on element insertion netfilter: nf_tables: fix nft_trans type confusion netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with CONFIG_NF_NAT ==================== Link: https://lore.kernel.org/r/20230517123756.7353-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17Merge tag 'wireless-2023-05-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Kalle Valo says: ==================== wireless fixes for v6.4 A lot of fixes this time, for both the stack and the drivers. The brcmfmac resume fix has been reported by several people so I would say it's the most important here. The iwlwifi RFI workaround is also something which was reported as a regression recently. * tag 'wireless-2023-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (31 commits) wifi: b43: fix incorrect __packed annotation wifi: rtw88: sdio: Always use two consecutive bytes for word operations mac80211_hwsim: fix memory leak in hwsim_new_radio_nl wifi: iwlwifi: mvm: Add locking to the rate read flow wifi: iwlwifi: Don't use valid_links to iterate sta links wifi: iwlwifi: mvm: don't trust firmware n_channels wifi: iwlwifi: mvm: fix OEM's name in the tas approved list wifi: iwlwifi: fix OEM's name in the ppag approved list wifi: iwlwifi: mvm: fix initialization of a return value wifi: iwlwifi: mvm: fix access to fw_id_to_mac_id wifi: iwlwifi: fw: fix DBGI dump wifi: iwlwifi: mvm: fix number of concurrent link checks wifi: iwlwifi: mvm: fix cancel_delayed_work_sync() deadlock wifi: iwlwifi: mvm: don't double-init spinlock wifi: iwlwifi: mvm: always free dup_data wifi: mac80211: recalc chanctx mindef before assigning wifi: mac80211: consider reserved chanctx for mindef wifi: mac80211: simplify chanctx allocation wifi: mac80211: Abort running color change when stopping the AP wifi: mac80211: fix min center freq offset tracing ... ==================== Link: https://lore.kernel.org/r/20230517151914.B0AF6C433EF@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17bridge: always declare tunnel functionsArnd Bergmann
When CONFIG_BRIDGE_VLAN_FILTERING is disabled, two functions are still defined but have no prototype or caller. This causes a W=1 warning for the missing prototypes: net/bridge/br_netlink_tunnel.c:29:6: error: no previous prototype for 'vlan_tunid_inrange' [-Werror=missing-prototypes] net/bridge/br_netlink_tunnel.c:199:5: error: no previous prototype for 'br_vlan_tunnel_info' [-Werror=missing-prototypes] The functions are already contitional on CONFIG_BRIDGE_VLAN_FILTERING, and I coulnd't easily figure out the right set of #ifdefs, so just move the declarations out of the #ifdef to avoid the warning, at a small cost in code size over a more elaborate fix. Fixes: 188c67dd1906 ("net: bridge: vlan options: add support for tunnel id dumping") Fixes: 569da0822808 ("net: bridge: vlan options: add support for tunnel mapping set/del") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20230516194625.549249-3-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17atm: hide unused procfs functionsArnd Bergmann
When CONFIG_PROC_FS is disabled, the function declarations for some procfs functions are hidden, but the definitions are still build, as shown by this compiler warning: net/atm/resources.c:403:7: error: no previous prototype for 'atm_dev_seq_start' [-Werror=missing-prototypes] net/atm/resources.c:409:6: error: no previous prototype for 'atm_dev_seq_stop' [-Werror=missing-prototypes] net/atm/resources.c:414:7: error: no previous prototype for 'atm_dev_seq_next' [-Werror=missing-prototypes] Add another #ifdef to leave these out of the build. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20230516194625.549249-2-arnd@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-17Merge tag 'nfsd-6.4-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - A collection of minor bug fixes * tag 'nfsd-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Remove open coding of string copy SUNRPC: Fix trace_svc_register() call site SUNRPC: always free ctxt when freeing deferred request SUNRPC: double free xprt_ctxt while still in use SUNRPC: Fix error handling in svc_setup_socket() SUNRPC: Fix encoding of accepted but unsuccessful RPC replies lockd: define nlm_port_min,max with CONFIG_SYSCTL nfsd: define exports_proc_ops with CONFIG_PROC_FS SUNRPC: Avoid relying on crypto API to derive CBC-CTS output IV
2023-05-17netfilter: nft_set_rbtree: fix null deref on element insertionFlorian Westphal
There is no guarantee that rb_prev() will not return NULL in nft_rbtree_gc_elem(): general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] nft_add_set_elem+0x14b0/0x2990 nf_tables_newsetelem+0x528/0xb30 Furthermore, there is a possible use-after-free while iterating, 'node' can be free'd so we need to cache the next value to use. Fixes: c9e6978e2725 ("netfilter: nft_set_rbtree: Switch to node list walk for overlap detection") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-17netfilter: nf_tables: fix nft_trans type confusionFlorian Westphal
nft_trans_FOO objects all share a common nft_trans base structure, but trailing fields depend on the real object size. Access is only safe after trans->msg_type check. Check for rule type first. Found by code inspection. Fixes: 1a94e38d254b ("netfilter: nf_tables: add NFTA_RULE_ID attribute") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-17netfilter: conntrack: define variables exp_nat_nla_policy and any_addr with ↵Tom Rix
CONFIG_NF_NAT gcc with W=1 and ! CONFIG_NF_NAT net/netfilter/nf_conntrack_netlink.c:3463:32: error: ‘exp_nat_nla_policy’ defined but not used [-Werror=unused-const-variable=] 3463 | static const struct nla_policy exp_nat_nla_policy[CTA_EXPECT_NAT_MAX+1] = { | ^~~~~~~~~~~~~~~~~~ net/netfilter/nf_conntrack_netlink.c:2979:33: error: ‘any_addr’ defined but not used [-Werror=unused-const-variable=] 2979 | static const union nf_inet_addr any_addr; | ^~~~~~~~ These variables use is controlled by CONFIG_NF_NAT, so should their definitions. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-05-17vlan: fix a potential uninit-value in vlan_dev_hard_start_xmit()Eric Dumazet
syzbot triggered the following splat [1], sending an empty message through pppoe_sendmsg(). When VLAN_FLAG_REORDER_HDR flag is set, vlan_dev_hard_header() does not push extra bytes for the VLAN header, because vlan is offloaded. Unfortunately vlan_dev_hard_start_xmit() first reads veth->h_vlan_proto before testing (vlan->flags & VLAN_FLAG_REORDER_HDR). We need to swap the two conditions. [1] BUG: KMSAN: uninit-value in vlan_dev_hard_start_xmit+0x171/0x7f0 net/8021q/vlan_dev.c:111 vlan_dev_hard_start_xmit+0x171/0x7f0 net/8021q/vlan_dev.c:111 __netdev_start_xmit include/linux/netdevice.h:4883 [inline] netdev_start_xmit include/linux/netdevice.h:4897 [inline] xmit_one net/core/dev.c:3580 [inline] dev_hard_start_xmit+0x253/0xa20 net/core/dev.c:3596 __dev_queue_xmit+0x3c7f/0x5ac0 net/core/dev.c:4246 dev_queue_xmit include/linux/netdevice.h:3053 [inline] pppoe_sendmsg+0xa93/0xb80 drivers/net/ppp/pppoe.c:900 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0xa24/0xe40 net/socket.c:2501 ___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2555 __sys_sendmmsg+0x411/0xa50 net/socket.c:2641 __do_sys_sendmmsg net/socket.c:2670 [inline] __se_sys_sendmmsg net/socket.c:2667 [inline] __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2667 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Uninit was created at: slab_post_alloc_hook+0x12d/0xb60 mm/slab.h:774 slab_alloc_node mm/slub.c:3452 [inline] kmem_cache_alloc_node+0x543/0xab0 mm/slub.c:3497 kmalloc_reserve+0x148/0x470 net/core/skbuff.c:520 __alloc_skb+0x3a7/0x850 net/core/skbuff.c:606 alloc_skb include/linux/skbuff.h:1277 [inline] sock_wmalloc+0xfe/0x1a0 net/core/sock.c:2583 pppoe_sendmsg+0x3af/0xb80 drivers/net/ppp/pppoe.c:867 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0xa24/0xe40 net/socket.c:2501 ___sys_sendmsg+0x2a1/0x3f0 net/socket.c:2555 __sys_sendmmsg+0x411/0xa50 net/socket.c:2641 __do_sys_sendmmsg net/socket.c:2670 [inline] __se_sys_sendmmsg net/socket.c:2667 [inline] __x64_sys_sendmmsg+0xbc/0x120 net/socket.c:2667 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd CPU: 0 PID: 29770 Comm: syz-executor.0 Not tainted 6.3.0-rc6-syzkaller-gc478e5b17829 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/30/2023 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-16Merge tag 'ipsec-2023-05-16' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2023-05-16 1) Don't check the policy default if we have an allow policy. Fix from Sabrina Dubroca. 2) Fix netdevice refount usage on offload. From Leon Romanovsky. 3) Use netdev_put instead of dev_puti to correctly release the netdev on failure in xfrm_dev_policy_add. From Leon Romanovsky. 4) Revert "Fix XFRM-I support for nested ESP tunnels" This broke Netfilter policy matching. From Martin Willi. 5) Reject optional tunnel/BEET mode templates in outbound policies on netlink and pfkey sockets. From Tobias Brunner. 6) Check if_id in inbound policy/secpath match to make it symetric to the outbound codepath. From Benedict Wong. * tag 'ipsec-2023-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: Check if_id in inbound policy/secpath match af_key: Reject optional tunnel/BEET mode templates in outbound policies xfrm: Reject optional tunnel/BEET mode templates in outbound policies Revert "Fix XFRM-I support for nested ESP tunnels" xfrm: Fix leak of dev tracker xfrm: release all offloaded policy memory xfrm: don't check the default policy if the policy allows the packet ==================== Link: https://lore.kernel.org/r/20230516052405.2677554-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>