summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-05-15net: set FMODE_NOWAIT for socketsJens Axboe
The socket read/write functions deal with O_NONBLOCK and IOCB_NOWAIT just fine, so we can flag them as being FMODE_NOWAIT compliant. With this, we can remove socket special casing in io_uring when checking if a file type is sane for nonblocking IO, and it's also the defined way to flag file types as such in the kernel. Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Reviewed-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/20230509151910.183637-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-05-15nfc: llcp: fix possible use of uninitialized variable in nfc_llcp_send_connect()Krzysztof Kozlowski
If sock->service_name is NULL, the local variable service_name_tlv_length will not be assigned by nfc_llcp_build_tlv(), later leading to using value frmo the stack. Smatch warning: net/nfc/llcp_commands.c:442 nfc_llcp_send_connect() error: uninitialized symbol 'service_name_tlv_length'. Fixes: de9e5aeb4f40 ("NFC: llcp: Fix usage of llcp_add_tlv()") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15tipc: check the bearer min mtu properly when setting it by netlinkXin Long
Checking the bearer min mtu with tipc_udp_mtu_bad() only works for IPv4 UDP bearer, and IPv6 UDP bearer has a different value for the min mtu. This patch checks with encap_hlen + TIPC_MIN_BEARER_MTU for min mtu, which works for both IPv4 and IPv6 UDP bearer. Note that tipc_udp_mtu_bad() is still used to check media min mtu in __tipc_nl_media_set(), as m->mtu currently is only used by the IPv4 UDP bearer as its default mtu value. Fixes: 682cd3cf946b ("tipc: confgiure and apply UDP bearer MTU on running links") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15tipc: do not update mtu if msg_max is too small in mtu negotiationXin Long
When doing link mtu negotiation, a malicious peer may send Activate msg with a very small mtu, e.g. 4 in Shuang's testing, without checking for the minimum mtu, l->mtu will be set to 4 in tipc_link_proto_rcv(), then n->links[bearer_id].mtu is set to 4294967228, which is a overflow of '4 - INT_H_SIZE - EMSG_OVERHEAD' in tipc_link_mss(). With tipc_link.mtu = 4, tipc_link_xmit() kept printing the warning: tipc: Too large msg, purging xmit list 1 5 0 40 4! tipc: Too large msg, purging xmit list 1 15 0 60 4! And with tipc_link_entry.mtu 4294967228, a huge skb was allocated in named_distribute(), and when purging it in tipc_link_xmit(), a crash was even caused: general protection fault, probably for non-canonical address 0x2100001011000dd: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 6.3.0.neta #19 RIP: 0010:kfree_skb_list_reason+0x7e/0x1f0 Call Trace: <IRQ> skb_release_data+0xf9/0x1d0 kfree_skb_reason+0x40/0x100 tipc_link_xmit+0x57a/0x740 [tipc] tipc_node_xmit+0x16c/0x5c0 [tipc] tipc_named_node_up+0x27f/0x2c0 [tipc] tipc_node_write_unlock+0x149/0x170 [tipc] tipc_rcv+0x608/0x740 [tipc] tipc_udp_recv+0xdc/0x1f0 [tipc] udp_queue_rcv_one_skb+0x33e/0x620 udp_unicast_rcv_skb.isra.72+0x75/0x90 __udp4_lib_rcv+0x56d/0xc20 ip_protocol_deliver_rcu+0x100/0x2d0 This patch fixes it by checking the new mtu against tipc_bearer_min_mtu(), and not updating mtu if it is too small. Fixes: ed193ece2649 ("tipc: simplify link mtu negotiation") Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15tipc: add tipc_bearer_min_mtu to calculate min mtuXin Long
As different media may requires different min mtu, and even the same media with different net family requires different min mtu, add tipc_bearer_min_mtu() to calculate min mtu accordingly. This API will be used to check the new mtu when doing the link mtu negotiation in the next patch. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15sch_htb: Allow HTB priority parameter in offload modeNaveen Mamindlapalli
The current implementation of HTB offload returns the EINVAL error for unsupported parameters like prio and quantum. This patch removes the error returning checks for 'prio' parameter and populates its value to tc_htb_qopt_offload structure such that driver can use the same. Add prio parameter check in mlx5 driver, as mlx5 devices are not capable of supporting the prio parameter when htb offload is used. Report error if prio parameter is set to a non-default value. Signed-off-by: Naveen Mamindlapalli <naveenm@marvell.com> Co-developed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15net: Remove low_thresh in ip defragAngus Chen
As low_thresh has no work in fragment reassembles,del it. And Mark it deprecated in sysctl Document. Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-15net: nsh: Use correct mac_offset to unwind gso skb in nsh_gso_segment()Dong Chenchen
As the call trace shows, skb_panic was caused by wrong skb->mac_header in nsh_gso_segment(): invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 3 PID: 2737 Comm: syz Not tainted 6.3.0-next-20230505 #1 RIP: 0010:skb_panic+0xda/0xe0 call Trace: skb_push+0x91/0xa0 nsh_gso_segment+0x4f3/0x570 skb_mac_gso_segment+0x19e/0x270 __skb_gso_segment+0x1e8/0x3c0 validate_xmit_skb+0x452/0x890 validate_xmit_skb_list+0x99/0xd0 sch_direct_xmit+0x294/0x7c0 __dev_queue_xmit+0x16f0/0x1d70 packet_xmit+0x185/0x210 packet_snd+0xc15/0x1170 packet_sendmsg+0x7b/0xa0 sock_sendmsg+0x14f/0x160 The root cause is: nsh_gso_segment() use skb->network_header - nhoff to reset mac_header in skb_gso_error_unwind() if inner-layer protocol gso fails. However, skb->network_header may be reset by inner-layer protocol gso function e.g. mpls_gso_segment. skb->mac_header reset by the inaccurate network_header will be larger than skb headroom. nsh_gso_segment nhoff = skb->network_header - skb->mac_header; __skb_pull(skb,nsh_len) skb_mac_gso_segment mpls_gso_segment skb_reset_network_header(skb);//skb->network_header+=nsh_len return -EINVAL; skb_gso_error_unwind skb_push(skb, nsh_len); skb->mac_header = skb->network_header - nhoff; // skb->mac_header > skb->headroom, cause skb_push panic Use correct mac_offset to restore mac_header and get rid of nhoff. Fixes: c411ed854584 ("nsh: add GSO support") Reported-by: syzbot+632b5d9964208bfef8c0@syzkaller.appspotmail.com Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Dong Chenchen <dongchenchen2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-14SUNRPC: Fix trace_svc_register() call siteChuck Lever
The trace event recorded incorrect values for the registered family, protocol, and port because the arguments are in the wrong order. Fixes: b4af59328c25 ("SUNRPC: Trace server-side rpcbind registration events") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14SUNRPC: always free ctxt when freeing deferred requestNeilBrown
Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not been sufficient to use kfree() to free a deferred request. We may need to free the ctxt as well. As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose it to explicit do that even when the ctxt is not stored in an rqst. So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt, which may have been taken either from an rqst or from a dreq. The caller is now responsible for clearing that pointer after the call to ->xpo_release_ctxt. We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when revisiting a deferred request. This ensures there is only one pointer to the ctxt, so the risk of double freeing in future is reduced. The new code in svc_xprt_release which releases both the ctxt and any rq_deferred depends on this. Fixes: 773f91b2cf3f ("SUNRPC: Fix NFSD's request deferral on RDMA transports") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-14SUNRPC: double free xprt_ctxt while still in useNeilBrown
When an RPC request is deferred, the rq_xprt_ctxt pointer is moved out of the svc_rqst into the svc_deferred_req. When the deferred request is revisited, the pointer is copied into the new svc_rqst - and also remains in the svc_deferred_req. In the (rare?) case that the request is deferred a second time, the old svc_deferred_req is reused - it still has all the correct content. However in that case the rq_xprt_ctxt pointer is NOT cleared so that when xpo_release_xprt is called, the ctxt is freed (UDP) or possible added to a free list (RDMA). When the deferred request is revisited for a second time, it will reference this ctxt which may be invalid, and the free the object a second time which is likely to oops. So change svc_defer() to *always* clear rq_xprt_ctxt, and assert that the value is now stored in the svc_deferred_req. Fixes: 773f91b2cf3f ("SUNRPC: Fix NFSD's request deferral on RDMA transports") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-13ping: Convert hlist_nulls to plain hlist.Kuniyuki Iwashima
Since introduced in commit c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind"), ping socket does not use SLAB_TYPESAFE_BY_RCU nor check nulls marker in loops. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13net: introduce and use skb_frag_fill_page_desc()Yunsheng Lin
Most users use __skb_frag_set_page()/skb_frag_off_set()/ skb_frag_size_set() to fill the page desc for a skb frag. Introduce skb_frag_fill_page_desc() to do that. net/bpf/test_run.c does not call skb_frag_off_set() to set the offset, "copy_from_user(page_address(page), ...)" and 'shinfo' being part of the 'data' kzalloced in bpf_test_init() suggest that it is assuming offset to be initialized as zero, so call skb_frag_fill_page_desc() with offset being zero for this case. Also, skb_frag_set_page() is not used anymore, so remove it. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-13erspan: get the proto with the md version for collect_mdXin Long
In commit 20704bd1633d ("erspan: build the header with the right proto according to erspan_ver"), it gets the proto with t->parms.erspan_ver, but t->parms.erspan_ver is not used by collect_md branch, and instead it should get the proto with md->version for collect_md. Thanks to Kevin for pointing this out. Fixes: 20704bd1633d ("erspan: build the header with the right proto according to erspan_ver") Fixes: 94d7d8f29287 ("ip6_gre: add erspan v2 support") Reported-by: Kevin Traynor <ktraynor@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12sctp: add bpf_bypass_getsockopt proto callbackAlexander Mikhalitsyn
Implement ->bpf_bypass_getsockopt proto callback and filter out SCTP_SOCKOPT_PEELOFF, SCTP_SOCKOPT_PEELOFF_FLAGS and SCTP_SOCKOPT_CONNECTX3 socket options from running eBPF hook on them. SCTP_SOCKOPT_PEELOFF and SCTP_SOCKOPT_PEELOFF_FLAGS options do fd_install(), and if BPF_CGROUP_RUN_PROG_GETSOCKOPT hook returns an error after success of the original handler sctp_getsockopt(...), userspace will receive an error from getsockopt syscall and will be not aware that fd was successfully installed into a fdtable. As pointed by Marcelo Ricardo Leitner it seems reasonable to skip bpf getsockopt hook for SCTP_SOCKOPT_CONNECTX3 sockopt too. Because internaly, it triggers connect() and if error is masked then userspace will be confused. This patch was born as a result of discussion around a new SCM_PIDFD interface: https://lore.kernel.org/all/20230413133355.350571-3-aleksandr.mikhalitsyn@canonical.com/ Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks") Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Christian Brauner <brauner@kernel.org> Cc: Stanislav Fomichev <sdf@google.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Xin Long <lucien.xin@gmail.com> Cc: linux-sctp@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: netdev@vger.kernel.org Suggested-by: Stanislav Fomichev <sdf@google.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Acked-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12tcp: fix possible sk_priority leak in tcp_v4_send_reset()Eric Dumazet
When tcp_v4_send_reset() is called with @sk == NULL, we do not change ctl_sk->sk_priority, which could have been set from a prior invocation. Change tcp_v4_send_reset() to set sk_priority and sk_mark fields before calling ip_send_unicast_reply(). This means tcp_v4_send_reset() and tcp_v4_send_ack() no longer have to clear ctl_sk->sk_mark after their call to ip_send_unicast_reply(). Fixes: f6c0f5d209fa ("tcp: honor SO_PRIORITY in TIME_WAIT state") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Antoine Tenart <atenart@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12vsock: avoid to close connected socket after the timeoutZhuang Shengen
When client and server establish a connection through vsock, the client send a request to the server to initiate the connection, then start a timer to wait for the server's response. When the server's RESPONSE message arrives, the timer also times out and exits. The server's RESPONSE message is processed first, and the connection is established. However, the client's timer also times out, the original processing logic of the client is to directly set the state of this vsock to CLOSE and return ETIMEDOUT. It will not notify the server when the port is released, causing the server port remain. when client's vsock_connect timeout,it should check sk state is ESTABLISHED or not. if sk state is ESTABLISHED, it means the connection is established, the client should not set the sk state to CLOSE Note: I encountered this issue on kernel-4.18, which can be fixed by this patch. Then I checked the latest code in the community and found similar issue. Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Signed-off-by: Zhuang Shengen <zhuangshengen@huawei.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Enable the SNI extension to work properlyChuck Lever
Enable the upper layer protocol to specify the SNI peername. This avoids the need for tlshd to use a DNS lookup, which can return a hostname that doesn't match the incoming certificate's SubjectName. Fixes: 2fd5532044a8 ("net/handshake: Add a kernel API for requesting a TLSv1.3 handshake") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Unpin sock->file if a handshake is cancelledChuck Lever
If user space never calls DONE, sock->file's reference count remains elevated. Enable sock->file to be freed eventually in this case. Reported-by: Jakub Kacinski <kuba@kernel.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: handshake_genl_notify() shouldn't ignore @flagsChuck Lever
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Fix uninitialized local variableChuck Lever
trace_handshake_cmd_done_err() simply records the pointer in @req, so initializing it to NULL is sufficient and safe. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Fix handshake_dup() ref countingChuck Lever
If get_unused_fd_flags() fails, we ended up calling fput(sock->file) twice. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Suggested-by: Paolo Abeni <pabeni@redhat.com> Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-12net/handshake: Remove unneeded check from handshake_dup()Chuck Lever
handshake_req_submit() now verifies that the socket has a file. Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests") Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-11ipv6: remove nexthop_fib6_nh_bh()Eric Dumazet
After blamed commit, nexthop_fib6_nh_bh() and nexthop_fib6_nh() are the same. Delete nexthop_fib6_nh_bh(), and convert /proc/net/ipv6_route to standard rcu to avoid this splat: [ 5723.180080] WARNING: suspicious RCU usage [ 5723.180083] ----------------------------- [ 5723.180084] include/net/nexthop.h:516 suspicious rcu_dereference_check() usage! [ 5723.180086] other info that might help us debug this: [ 5723.180087] rcu_scheduler_active = 2, debug_locks = 1 [ 5723.180089] 2 locks held by cat/55856: [ 5723.180091] #0: ffff9440a582afa8 (&p->lock){+.+.}-{3:3}, at: seq_read_iter (fs/seq_file.c:188) [ 5723.180100] #1: ffffffffaac07040 (rcu_read_lock_bh){....}-{1:2}, at: rcu_lock_acquire (include/linux/rcupdate.h:326) [ 5723.180109] stack backtrace: [ 5723.180111] CPU: 14 PID: 55856 Comm: cat Tainted: G S I 6.3.0-dbx-DEV #528 [ 5723.180115] Call Trace: [ 5723.180117] <TASK> [ 5723.180119] dump_stack_lvl (lib/dump_stack.c:107) [ 5723.180124] dump_stack (lib/dump_stack.c:114) [ 5723.180126] lockdep_rcu_suspicious (include/linux/context_tracking.h:122) [ 5723.180132] ipv6_route_seq_show (include/net/nexthop.h:?) [ 5723.180135] ? ipv6_route_seq_next (net/ipv6/ip6_fib.c:2605) [ 5723.180140] seq_read_iter (fs/seq_file.c:272) [ 5723.180145] seq_read (fs/seq_file.c:163) [ 5723.180151] proc_reg_read (fs/proc/inode.c:316 fs/proc/inode.c:328) [ 5723.180155] vfs_read (fs/read_write.c:468) [ 5723.180160] ? up_read (kernel/locking/rwsem.c:1617) [ 5723.180164] ksys_read (fs/read_write.c:613) [ 5723.180168] __x64_sys_read (fs/read_write.c:621) [ 5723.180170] do_syscall_64 (arch/x86/entry/common.c:?) [ 5723.180174] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) [ 5723.180177] RIP: 0033:0x7fa455677d2a Fixes: 09eed1192cec ("neighbour: switch to standard rcu, instead of rcu_bh") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230510154646.370659-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11devlink: change per-devlink netdev notifier to static oneJiri Pirko
The commit 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") changed original per-net notifier to be per-devlink instance. That fixed the issue of non-receiving events of netdev uninit if that moved to a different namespace. That worked fine in -net tree. However, later on when commit ee75f1fc44dd ("net/mlx5e: Create separate devlink instance for ethernet auxiliary device") and commit 72ed5d5624af ("net/mlx5: Suspend auxiliary devices only in case of PCI device suspend") were merged, a deadlock was introduced when removing a namespace with devlink instance with another nested instance. Here there is the bad flow example resulting in deadlock with mlx5: net_cleanup_work -> cleanup_net (takes down_read(&pernet_ops_rwsem) -> devlink_pernet_pre_exit() -> devlink_reload() -> mlx5_devlink_reload_down() -> mlx5_unload_one_devl_locked() -> mlx5_detach_device() -> del_adev() -> mlx5e_remove() -> mlx5e_destroy_devlink() -> devlink_free() -> unregister_netdevice_notifier() (takes down_write(&pernet_ops_rwsem) Steps to reproduce: $ modprobe mlx5_core $ ip netns add ns1 $ devlink dev reload pci/0000:08:00.0 netns ns1 $ ip netns del ns1 Resolve this by converting the notifier from per-devlink instance to a static one registered during init phase and leaving it registered forever. Use this notifier for all devlink port instances created later on. Note what a tree needs this fix only in case all of the cited fixes commits are present. Reported-by: Moshe Shemesh <moshe@nvidia.com> Fixes: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") Fixes: ee75f1fc44dd ("net/mlx5e: Create separate devlink instance for ethernet auxiliary device") Fixes: 72ed5d5624af ("net/mlx5: Suspend auxiliary devices only in case of PCI device suspend") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230510144621.932017-1-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes. No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-11tcp: make the first N SYN RTO backoffs linearDavid Morley
Currently the SYN RTO schedule follows an exponential backoff scheme, which can be unnecessarily conservative in cases where there are link failures. In such cases, it's better to aggressively try to retransmit packets, so it takes routers less time to find a repath with a working link. We chose a default value for this sysctl of 4, to follow the macOS and IOS backoff scheme of 1,1,1,1,1,2,4,8, ... MacOS and IOS have used this backoff schedule for over a decade, since before this 2009 IETF presentation discussed the behavior: https://www.ietf.org/proceedings/75/slides/tcpm-1.pdf This commit makes the SYN RTO schedule start with a number of linear backoffs given by the following sysctl: * tcp_syn_linear_timeouts This changes the SYN RTO scheme to be: init_rto_val for tcp_syn_linear_timeouts, exp backoff starting at init_rto_val For example if init_rto_val = 1 and tcp_syn_linear_timeouts = 2, our backoff scheme would be: 1, 1, 1, 2, 4, 8, 16, ... Signed-off-by: David Morley <morleyd@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Tested-by: David Morley <morleyd@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230509180558.2541885-1-morleyd.kernel@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-05-10Merge tag 'nf-23-05-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter updates for net The following patchset contains Netfilter fixes for net: 1) Fix UAF when releasing netnamespace, from Florian Westphal. 2) Fix possible BUG_ON when nf_conntrack is enabled with enable_hooks, from Florian Westphal. 3) Fixes for nft_flowtable.sh selftest, from Boris Sukholitko. 4) Extend nft_flowtable.sh selftest to cover integration with ingress/egress hooks, from Florian Westphal. * tag 'nf-23-05-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: selftests: nft_flowtable.sh: check ingress/egress chain too selftests: nft_flowtable.sh: monitor result file sizes selftests: nft_flowtable.sh: wait for specific nc pids selftests: nft_flowtable.sh: no need for ps -x option selftests: nft_flowtable.sh: use /proc for pid checking netfilter: conntrack: fix possible bug_on with enable_hooks=1 netfilter: nf_tables: always release netdev hooks from notifier ==================== Link: https://lore.kernel.org/r/20230510083313.152961-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10af_unix: Fix data races around sk->sk_shutdown.Kuniyuki Iwashima
KCSAN found a data race around sk->sk_shutdown where unix_release_sock() and unix_shutdown() update it under unix_state_lock(), OTOH unix_poll() and unix_dgram_poll() read it locklessly. We need to annotate the writes and reads with WRITE_ONCE() and READ_ONCE(). BUG: KCSAN: data-race in unix_poll / unix_release_sock write to 0xffff88800d0f8aec of 1 bytes by task 264 on cpu 0: unix_release_sock+0x75c/0x910 net/unix/af_unix.c:631 unix_release+0x59/0x80 net/unix/af_unix.c:1042 __sock_release+0x7d/0x170 net/socket.c:653 sock_close+0x19/0x30 net/socket.c:1397 __fput+0x179/0x5e0 fs/file_table.c:321 ____fput+0x15/0x20 fs/file_table.c:349 task_work_run+0x116/0x1a0 kernel/task_work.c:179 resume_user_mode_work include/linux/resume_user_mode.h:49 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0x174/0x180 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x1a/0x30 kernel/entry/common.c:297 do_syscall_64+0x4b/0x90 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x72/0xdc read to 0xffff88800d0f8aec of 1 bytes by task 222 on cpu 1: unix_poll+0xa3/0x2a0 net/unix/af_unix.c:3170 sock_poll+0xcf/0x2b0 net/socket.c:1385 vfs_poll include/linux/poll.h:88 [inline] ep_item_poll.isra.0+0x78/0xc0 fs/eventpoll.c:855 ep_send_events fs/eventpoll.c:1694 [inline] ep_poll fs/eventpoll.c:1823 [inline] do_epoll_wait+0x6c4/0xea0 fs/eventpoll.c:2258 __do_sys_epoll_wait fs/eventpoll.c:2270 [inline] __se_sys_epoll_wait fs/eventpoll.c:2265 [inline] __x64_sys_epoll_wait+0xcc/0x190 fs/eventpoll.c:2265 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc value changed: 0x00 -> 0x03 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 222 Comm: dbus-broker Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 3c73419c09a5 ("af_unix: fix 'poll for write'/ connected DGRAM sockets") Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10af_unix: Fix a data race of sk->sk_receive_queue->qlen.Kuniyuki Iwashima
KCSAN found a data race of sk->sk_receive_queue->qlen where recvmsg() updates qlen under the queue lock and sendmsg() checks qlen under unix_state_sock(), not the queue lock, so the reader side needs READ_ONCE(). BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_wait_for_peer write (marked) to 0xffff888019fe7c68 of 4 bytes by task 49792 on cpu 0: __skb_unlink include/linux/skbuff.h:2347 [inline] __skb_try_recv_from_queue+0x3de/0x470 net/core/datagram.c:197 __skb_try_recv_datagram+0xf7/0x390 net/core/datagram.c:263 __unix_dgram_recvmsg+0x109/0x8a0 net/unix/af_unix.c:2452 unix_dgram_recvmsg+0x94/0xa0 net/unix/af_unix.c:2549 sock_recvmsg_nosec net/socket.c:1019 [inline] ____sys_recvmsg+0x3a3/0x3b0 net/socket.c:2720 ___sys_recvmsg+0xc8/0x150 net/socket.c:2764 do_recvmmsg+0x182/0x560 net/socket.c:2858 __sys_recvmmsg net/socket.c:2937 [inline] __do_sys_recvmmsg net/socket.c:2960 [inline] __se_sys_recvmmsg net/socket.c:2953 [inline] __x64_sys_recvmmsg+0x153/0x170 net/socket.c:2953 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc read to 0xffff888019fe7c68 of 4 bytes by task 49793 on cpu 1: skb_queue_len include/linux/skbuff.h:2127 [inline] unix_recvq_full net/unix/af_unix.c:229 [inline] unix_wait_for_peer+0x154/0x1a0 net/unix/af_unix.c:1445 unix_dgram_sendmsg+0x13bc/0x14b0 net/unix/af_unix.c:2048 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg+0x148/0x160 net/socket.c:747 ____sys_sendmsg+0x20e/0x620 net/socket.c:2503 ___sys_sendmsg+0xc6/0x140 net/socket.c:2557 __sys_sendmmsg+0x11d/0x370 net/socket.c:2643 __do_sys_sendmmsg net/socket.c:2672 [inline] __se_sys_sendmmsg net/socket.c:2669 [inline] __x64_sys_sendmmsg+0x58/0x70 net/socket.c:2669 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x72/0xdc value changed: 0x0000000b -> 0x00000001 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 49793 Comm: syz-executor.0 Not tainted 6.3.0-rc7-02330-gca6270c12e20 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10net: datagram: fix data-races in datagram_poll()Eric Dumazet
datagram_poll() runs locklessly, we should add READ_ONCE() annotations while reading sk->sk_err, sk->sk_shutdown and sk->sk_state. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230509173131.3263780-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-10sctp: fix a potential OOB access in sctp_sched_set_sched()Ilia.Gavrilov
The 'sched' index value must be checked before accessing an element of the 'sctp_sched_ops' array. Otherwise, it can lead to OOB access. Note that it's harmless since the 'sched' parameter is checked before calling 'sctp_sched_set_sched'. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Ilia.Gavrilov <Ilia.Gavrilov@infotecs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10tcp: add annotations around sk->sk_shutdown accessesEric Dumazet
Now sk->sk_shutdown is no longer a bitfield, we can add standard READ_ONCE()/WRITE_ONCE() annotations to silence KCSAN reports like the following: BUG: KCSAN: data-race in tcp_disconnect / tcp_poll write to 0xffff88814588582c of 1 bytes by task 3404 on cpu 1: tcp_disconnect+0x4d6/0xdb0 net/ipv4/tcp.c:3121 __inet_stream_connect+0x5dd/0x6e0 net/ipv4/af_inet.c:715 inet_stream_connect+0x48/0x70 net/ipv4/af_inet.c:727 __sys_connect_file net/socket.c:2001 [inline] __sys_connect+0x19b/0x1b0 net/socket.c:2018 __do_sys_connect net/socket.c:2028 [inline] __se_sys_connect net/socket.c:2025 [inline] __x64_sys_connect+0x41/0x50 net/socket.c:2025 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff88814588582c of 1 bytes by task 3374 on cpu 0: tcp_poll+0x2e6/0x7d0 net/ipv4/tcp.c:562 sock_poll+0x253/0x270 net/socket.c:1383 vfs_poll include/linux/poll.h:88 [inline] io_poll_check_events io_uring/poll.c:281 [inline] io_poll_task_func+0x15a/0x820 io_uring/poll.c:333 handle_tw_list io_uring/io_uring.c:1184 [inline] tctx_task_work+0x1fe/0x4d0 io_uring/io_uring.c:1246 task_work_run+0x123/0x160 kernel/task_work.c:179 get_signal+0xe64/0xff0 kernel/signal.c:2635 arch_do_signal_or_restart+0x89/0x2a0 arch/x86/kernel/signal.c:306 exit_to_user_mode_loop+0x6f/0xe0 kernel/entry/common.c:168 exit_to_user_mode_prepare+0x6c/0xb0 kernel/entry/common.c:204 __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline] syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297 do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x03 -> 0x00 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10net: add vlan_get_protocol_and_depth() helperEric Dumazet
Before blamed commit, pskb_may_pull() was used instead of skb_header_pointer() in __vlan_get_protocol() and friends. Few callers depended on skb->head being populated with MAC header, syzbot caught one of them (skb_mac_gso_segment()) Add vlan_get_protocol_and_depth() to make the intent clearer and use it where sensible. This is a more generic fix than commit e9d3f80935b6 ("net/af_packet: make sure to pull mac header") which was dealing with a similar issue. kernel BUG at include/linux/skbuff.h:2655 ! invalid opcode: 0000 [#1] SMP KASAN CPU: 0 PID: 1441 Comm: syz-executor199 Not tainted 6.1.24-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023 RIP: 0010:__skb_pull include/linux/skbuff.h:2655 [inline] RIP: 0010:skb_mac_gso_segment+0x68f/0x6a0 net/core/gro.c:136 Code: fd 48 8b 5c 24 10 44 89 6b 70 48 c7 c7 c0 ae 0d 86 44 89 e6 e8 a1 91 d0 00 48 c7 c7 00 af 0d 86 48 89 de 31 d2 e8 d1 4a e9 ff <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 RSP: 0018:ffffc90001bd7520 EFLAGS: 00010286 RAX: ffffffff8469736a RBX: ffff88810f31dac0 RCX: ffff888115a18b00 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffc90001bd75e8 R08: ffffffff84697183 R09: fffff5200037adf9 R10: 0000000000000000 R11: dffffc0000000001 R12: 0000000000000012 R13: 000000000000fee5 R14: 0000000000005865 R15: 000000000000fed7 FS: 000055555633f300(0000) GS:ffff8881f6a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 0000000116fea000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> [<ffffffff847018dd>] __skb_gso_segment+0x32d/0x4c0 net/core/dev.c:3419 [<ffffffff8470398a>] skb_gso_segment include/linux/netdevice.h:4819 [inline] [<ffffffff8470398a>] validate_xmit_skb+0x3aa/0xee0 net/core/dev.c:3725 [<ffffffff84707042>] __dev_queue_xmit+0x1332/0x3300 net/core/dev.c:4313 [<ffffffff851a9ec7>] dev_queue_xmit+0x17/0x20 include/linux/netdevice.h:3029 [<ffffffff851b4a82>] packet_snd net/packet/af_packet.c:3111 [inline] [<ffffffff851b4a82>] packet_sendmsg+0x49d2/0x6470 net/packet/af_packet.c:3142 [<ffffffff84669a12>] sock_sendmsg_nosec net/socket.c:716 [inline] [<ffffffff84669a12>] sock_sendmsg net/socket.c:736 [inline] [<ffffffff84669a12>] __sys_sendto+0x472/0x5f0 net/socket.c:2139 [<ffffffff84669c75>] __do_sys_sendto net/socket.c:2151 [inline] [<ffffffff84669c75>] __se_sys_sendto net/socket.c:2147 [inline] [<ffffffff84669c75>] __x64_sys_sendto+0xe5/0x100 net/socket.c:2147 [<ffffffff8551d40f>] do_syscall_x64 arch/x86/entry/common.c:50 [inline] [<ffffffff8551d40f>] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80 [<ffffffff85600087>] entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: 469aceddfa3e ("vlan: consolidate VLAN parsing code and limit max parsing depth") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10net: deal with most data-races in sk_wait_event()Eric Dumazet
__condition is evaluated twice in sk_wait_event() macro. First invocation is lockless, and reads can race with writes, as spotted by syzbot. BUG: KCSAN: data-race in sk_stream_wait_connect / tcp_disconnect write to 0xffff88812d83d6a0 of 4 bytes by task 9065 on cpu 1: tcp_disconnect+0x2cd/0xdb0 inet_shutdown+0x19e/0x1f0 net/ipv4/af_inet.c:911 __sys_shutdown_sock net/socket.c:2343 [inline] __sys_shutdown net/socket.c:2355 [inline] __do_sys_shutdown net/socket.c:2363 [inline] __se_sys_shutdown+0xf8/0x140 net/socket.c:2361 __x64_sys_shutdown+0x31/0x40 net/socket.c:2361 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff88812d83d6a0 of 4 bytes by task 9040 on cpu 0: sk_stream_wait_connect+0x1de/0x3a0 net/core/stream.c:75 tcp_sendmsg_locked+0x2e4/0x2120 net/ipv4/tcp.c:1266 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1484 inet6_sendmsg+0x63/0x80 net/ipv6/af_inet6.c:651 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] __sys_sendto+0x246/0x300 net/socket.c:2142 __do_sys_sendto net/socket.c:2154 [inline] __se_sys_sendto net/socket.c:2150 [inline] __x64_sys_sendto+0x78/0x90 net/socket.c:2150 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00000000 -> 0x00000068 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10net: annotate sk->sk_err write from do_recvmmsg()Eric Dumazet
do_recvmmsg() can write to sk->sk_err from multiple threads. As said before, many other points reading or writing sk_err need annotations. Fixes: 34b88a68f26a ("net: Fix use after free in the recvmmsg exit path") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10netlink: annotate accesses to nlk->cb_runningEric Dumazet
Both netlink_recvmsg() and netlink_native_seq_show() read nlk->cb_running locklessly. Use READ_ONCE() there. Add corresponding WRITE_ONCE() to netlink_dump() and __netlink_dump_start() syzbot reported: BUG: KCSAN: data-race in __netlink_dump_start / netlink_recvmsg write to 0xffff88813ea4db59 of 1 bytes by task 28219 on cpu 0: __netlink_dump_start+0x3af/0x4d0 net/netlink/af_netlink.c:2399 netlink_dump_start include/linux/netlink.h:308 [inline] rtnetlink_rcv_msg+0x70f/0x8c0 net/core/rtnetlink.c:6130 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2577 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6192 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1942 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] sock_write_iter+0x1aa/0x230 net/socket.c:1138 call_write_iter include/linux/fs.h:1851 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x463/0x760 fs/read_write.c:584 ksys_write+0xeb/0x1a0 fs/read_write.c:637 __do_sys_write fs/read_write.c:649 [inline] __se_sys_write fs/read_write.c:646 [inline] __x64_sys_write+0x42/0x50 fs/read_write.c:646 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff88813ea4db59 of 1 bytes by task 28222 on cpu 1: netlink_recvmsg+0x3b4/0x730 net/netlink/af_netlink.c:2022 sock_recvmsg_nosec+0x4c/0x80 net/socket.c:1017 ____sys_recvmsg+0x2db/0x310 net/socket.c:2718 ___sys_recvmsg net/socket.c:2762 [inline] do_recvmmsg+0x2e5/0x710 net/socket.c:2856 __sys_recvmmsg net/socket.c:2935 [inline] __do_sys_recvmmsg net/socket.c:2958 [inline] __se_sys_recvmmsg net/socket.c:2951 [inline] __x64_sys_recvmmsg+0xe2/0x160 net/socket.c:2951 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x00 -> 0x01 Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10net: ipconfig: Allow DNS to be overwritten by DHCPACKMartin Wetterwald
Some DHCP server implementations only send the important requested DHCP options in the final BOOTP reply (DHCPACK). One example is systemd-networkd. However, RFC2131, in section 4.3.1 states: > The server MUST return to the client: > [...] > o Parameters requested by the client, according to the following > rules: > > -- IF the server has been explicitly configured with a default > value for the parameter, the server MUST include that value > in an appropriate option in the 'option' field, ELSE I've reported the issue here: https://github.com/systemd/systemd/issues/27471 Linux PNP DHCP client implementation only takes into account the DNS servers received in the first BOOTP reply (DHCPOFFER). This usually isn't an issue as servers are required to put the same values in the DHCPOFFER and DHCPACK. However, RFC2131, in section 4.3.2 states: > Any configuration parameters in the DHCPACK message SHOULD NOT > conflict with those in the earlier DHCPOFFER message to which the > client is responding. The client SHOULD use the parameters in the > DHCPACK message for configuration. When making Linux PNP DHCP client (cmdline ip=dhcp) interact with systemd-networkd DHCP server, an interesting "protocol misunderstanding" happens: Because DNS servers were only specified in the DHCPACK and not in the DHCPOFFER, Linux will not catch the correct DNS servers: in the first BOOTP reply (DHCPOFFER), it sees that there is no DNS, and sets as fallback the IP of the DHCP server itself. When the second BOOTP reply comes (DHCPACK), it's already too late: the kernel will not overwrite the fallback setting it has set previously. This patch makes the kernel overwrite its DNS fallback by DNS servers specified in the DHCPACK if any. Signed-off-by: Martin Wetterwald <martin@wetterwald.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-10netfilter: conntrack: fix possible bug_on with enable_hooks=1Florian Westphal
I received a bug report (no reproducer so far) where we trip over 712 rcu_read_lock(); 713 ct_hook = rcu_dereference(nf_ct_hook); 714 BUG_ON(ct_hook == NULL); // here In nf_conntrack_destroy(). First turn this BUG_ON into a WARN. I think it was triggered via enable_hooks=1 flag. When this flag is turned on, the conntrack hooks are registered before nf_ct_hook pointer gets assigned. This opens a short window where packets enter the conntrack machinery, can have skb->_nfct set up and a subsequent kfree_skb might occur before nf_ct_hook is set. Call nf_conntrack_init_end() to set nf_ct_hook before we register the pernet ops. Fixes: ba3fbe663635 ("netfilter: nf_conntrack: provide modparam to always register conntrack hooks") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-05-10netfilter: nf_tables: always release netdev hooks from notifierFlorian Westphal
This reverts "netfilter: nf_tables: skip netdev events generated on netns removal". The problem is that when a veth device is released, the veth release callback will also queue the peer netns device for removal. Its possible that the peer netns is also slated for removal. In this case, the device memory is already released before the pre_exit hook of the peer netns runs: BUG: KASAN: slab-use-after-free in nf_hook_entry_head+0x1b8/0x1d0 Read of size 8 at addr ffff88812c0124f0 by task kworker/u8:1/45 Workqueue: netns cleanup_net Call Trace: nf_hook_entry_head+0x1b8/0x1d0 __nf_unregister_net_hook+0x76/0x510 nft_netdev_unregister_hooks+0xa0/0x220 __nft_release_hook+0x184/0x490 nf_tables_pre_exit_net+0x12f/0x1b0 .. Order is: 1. First netns is released, veth_dellink() queues peer netns device for removal 2. peer netns is queued for removal 3. peer netns device is released, unreg event is triggered 4. unreg event is ignored because netns is going down 5. pre_exit hook calls nft_netdev_unregister_hooks but device memory might be free'd already. Fixes: 68a3765c659f ("netfilter: nf_tables: skip netdev events generated on netns removal") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-05-10xfrm: Check if_id in inbound policy/secpath matchBenedict Wong
This change ensures that if configured in the policy, the if_id set in the policy and secpath states match during the inbound policy check. Without this, there is potential for ambiguity where entries in the secpath differing by only the if_id could be mismatched. Notably, this is checked in the outbound direction when resolving templates to SAs, but not on the inbound path when matching SAs and policies. Test: Tested against Android kernel unit tests & CTS Signed-off-by: Benedict Wong <benedictwong@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-05-10af_key: Reject optional tunnel/BEET mode templates in outbound policiesTobias Brunner
xfrm_state_find() uses `encap_family` of the current template with the passed local and remote addresses to find a matching state. If an optional tunnel or BEET mode template is skipped in a mixed-family scenario, there could be a mismatch causing an out-of-bounds read as the addresses were not replaced to match the family of the next template. While there are theoretical use cases for optional templates in outbound policies, the only practical one is to skip IPComp states in inbound policies if uncompressed packets are received that are handled by an implicitly created IPIP state instead. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-05-10xfrm: Reject optional tunnel/BEET mode templates in outbound policiesTobias Brunner
xfrm_state_find() uses `encap_family` of the current template with the passed local and remote addresses to find a matching state. If an optional tunnel or BEET mode template is skipped in a mixed-family scenario, there could be a mismatch causing an out-of-bounds read as the addresses were not replaced to match the family of the next template. While there are theoretical use cases for optional templates in outbound policies, the only practical one is to skip IPComp states in inbound policies if uncompressed packets are received that are handled by an implicitly created IPIP state instead. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-05-09net: skbuff: remove special handling for SLOBLukas Bulwahn
Commit c9929f0e344a ("mm/slob: remove CONFIG_SLOB") removes CONFIG_SLOB. Now, we can also remove special handling for socket buffers with the SLOB allocator. The code with HAVE_SKB_SMALL_HEAD_CACHE=1 is now the default behavior for all allocators. Remove an unnecessary distinction between SLOB and SLAB/SLUB allocator after the SLOB allocator is gone. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230509071207.28942-1-lukas.bulwahn@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-07net: skb_partial_csum_set() fix against transport header magic valueEric Dumazet
skb->transport_header uses the special 0xFFFF value to mark if the transport header was set or not. We must prevent callers to accidentaly set skb->transport_header to 0xFFFF. Note that only fuzzers can possibly do this today. syzbot reported: WARNING: CPU: 0 PID: 2340 at include/linux/skbuff.h:2847 skb_transport_offset include/linux/skbuff.h:2956 [inline] WARNING: CPU: 0 PID: 2340 at include/linux/skbuff.h:2847 virtio_net_hdr_to_skb+0xbcc/0x10c0 include/linux/virtio_net.h:103 Modules linked in: CPU: 0 PID: 2340 Comm: syz-executor.0 Not tainted 6.3.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/14/2023 RIP: 0010:skb_transport_header include/linux/skbuff.h:2847 [inline] RIP: 0010:skb_transport_offset include/linux/skbuff.h:2956 [inline] RIP: 0010:virtio_net_hdr_to_skb+0xbcc/0x10c0 include/linux/virtio_net.h:103 Code: 41 39 df 0f 82 c3 04 00 00 48 8b 7c 24 10 44 89 e6 e8 08 6e 59 ff 48 85 c0 74 54 e8 ce 36 7e fc e9 37 f8 ff ff e8 c4 36 7e fc <0f> 0b e9 93 f8 ff ff 44 89 f7 44 89 e6 e8 32 38 7e fc 45 39 e6 0f RSP: 0018:ffffc90004497880 EFLAGS: 00010293 RAX: ffffffff84fea55c RBX: 000000000000ffff RCX: ffff888120be2100 RDX: 0000000000000000 RSI: 000000000000ffff RDI: 000000000000ffff RBP: ffffc90004497990 R08: ffffffff84fe9de5 R09: 0000000000000034 R10: ffffea00048ebd80 R11: 0000000000000034 R12: ffff88811dc2d9c8 R13: dffffc0000000000 R14: ffff88811dc2d9ae R15: 1ffff11023b85b35 FS: 00007f9211a59700(0000) GS:ffff8881f6c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000200002c0 CR3: 00000001215a5000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> packet_snd net/packet/af_packet.c:3076 [inline] packet_sendmsg+0x4590/0x61a0 net/packet/af_packet.c:3115 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] __sys_sendto+0x472/0x630 net/socket.c:2144 __do_sys_sendto net/socket.c:2156 [inline] __se_sys_sendto net/socket.c:2152 [inline] __x64_sys_sendto+0xe5/0x100 net/socket.c:2152 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2f/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f9210c8c169 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f9211a59168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f9210dabf80 RCX: 00007f9210c8c169 RDX: 000000000000ffed RSI: 00000000200000c0 RDI: 0000000000000003 RBP: 00007f9210ce7ca1 R08: 0000000020000540 R09: 0000000000000014 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe135d65cf R14: 00007f9211a59300 R15: 0000000000022000 Fixes: 66e4c8d95008 ("net: warn if transport header was not set") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-05Merge tag 'net-6.4-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter. Current release - regressions: - sched: act_pedit: free pedit keys on bail from offset check Current release - new code bugs: - pds_core: - Kconfig fixes (DEBUGFS and AUXILIARY_BUS) - fix mutex double unlock in error path Previous releases - regressions: - sched: cls_api: remove block_cb from driver_list before freeing - nf_tables: fix ct untracked match breakage - eth: mtk_eth_soc: drop generic vlan rx offload - sched: flower: fix error handler on replace Previous releases - always broken: - tcp: fix skb_copy_ubufs() vs BIG TCP - ipv6: fix skb hash for some RST packets - af_packet: don't send zero-byte data in packet_sendmsg_spkt() - rxrpc: timeout handling fixes after moving client call connection to the I/O thread - ixgbe: fix panic during XDP_TX with > 64 CPUs - igc: RMW the SRRCTL register to prevent losing timestamp config - dsa: mt7530: fix corrupt frames using TRGMII on 40 MHz XTAL MT7621 - r8152: - fix flow control issue of RTL8156A - fix the poor throughput for 2.5G devices - move setting r8153b_rx_agg_chg_indicate() to fix coalescing - enable autosuspend - ncsi: clear Tx enable mode when handling a Config required AEN - octeontx2-pf: macsec: fixes for CN10KB ASIC rev Misc: - 9p: remove INET dependency" * tag 'net-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (69 commits) net: bcmgenet: Remove phy_stop() from bcmgenet_netif_stop() pds_core: fix mutex double unlock in error path net/sched: flower: fix error handler on replace Revert "net/sched: flower: Fix wrong handle assignment during filter change" net/sched: flower: fix filter idr initialization net: fec: correct the counting of XDP sent frames bonding: add xdp_features support net: enetc: check the index of the SFI rather than the handle sfc: Add back mailing list virtio_net: suppress cpu stall when free_unused_bufs ice: block LAN in case of VF to VF offload net: dsa: mt7530: fix network connectivity with multiple CPU ports net: dsa: mt7530: fix corrupt frames using trgmii on 40 MHz XTAL MT7621 9p: Remove INET dependency netfilter: nf_tables: fix ct untracked match breakage af_packet: Don't send zero-byte data in packet_sendmsg_spkt(). igc: read before write to SRRCTL register pds_core: add AUXILIARY_BUS and NET_DEVLINK to Kconfig pds_core: remove CONFIG_DEBUG_FS from makefile ionic: catch failure from devlink_alloc ...
2023-05-05SUNRPC: Fix error handling in svc_setup_socket()Chuck Lever
Dan points out that sock_alloc_file() releases @sock on error, but so do all of svc_setup_socket's callers, resulting in a double- release if sock_alloc_file() returns an error. Rather than allocating a struct file for all new sockets, allocate one only for sockets created during a TCP accept. For the moment, those are the only ones that will ever be used with RPC-with-TLS. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: ae0d77708aae ("SUNRPC: Ensure server-side sockets have a sock->file") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-05-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextJakub Kicinski
There's a fix which landed in net-next, pull it in along with the couple of minor cleanups. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-05-05net/sched: flower: fix error handler on replaceVlad Buslov
When replacing a filter (i.e. 'fold' pointer is not NULL) the insertion of new filter to idr is postponed until later in code since handle is already provided by the user. However, the error handling code in fl_change() always assumes that the new filter had been inserted into idr. If error handler is reached when replacing existing filter it may remove it from idr therefore making it unreachable for delete or dump afterwards. Fix the issue by verifying that 'fold' argument wasn't provided by caller before calling idr_remove(). Fixes: 08a0063df3ae ("net/sched: flower: Move filter handle initialization earlier") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-05-05Revert "net/sched: flower: Fix wrong handle assignment during filter change"Vlad Buslov
This reverts commit 32eff6bacec2cb574677c15378169a9fa30043ef. Superseded by the following commit in this series. Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>