path: root/net
Age  Commit message  Author
2023-08-02  libceph: fix potential hang in ceph_osdc_notify()  Ilya Dryomov
If the cluster becomes unavailable, ceph_osdc_notify() may hang even with osd_request_timeout option set because linger_notify_finish_wait() waits for MWatchNotify NOTIFY_COMPLETE message with no associated OSD request in flight -- it's completely asynchronous. Introduce an additional timeout, derived from the specified notify timeout. While at it, switch both waits to killable which is more correct. Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Reviewed-by: Xiubo Li <xiubli@redhat.com>
2023-08-01  net: dcb: choose correct policy to parse DCB_ATTR_BCN  Lin Ma
dcbnl_bcn_setcfg() uses an erroneous policy to parse tb[DCB_ATTR_BCN], a bug introduced in commit 859ee3c43812 ("DCB: Add support for DCB BCN"). Please see the comments in the code below:

    static int dcbnl_bcn_setcfg(...)
    {
        ...
        ret = nla_parse_nested_deprecated(..., dcbnl_pfc_up_nest, .. )
        // !!! dcbnl_pfc_up_nest for attributes
        //  DCB_PFC_UP_ATTR_0 to DCB_PFC_UP_ATTR_ALL in enum dcbnl_pfc_up_attrs
        ...
        for (i = DCB_BCN_ATTR_RP_0; i <= DCB_BCN_ATTR_RP_7; i++) {
        // !!! DCB_BCN_ATTR_RP_0 to DCB_BCN_ATTR_RP_7 in enum dcbnl_bcn_attrs
            ...
            value_byte = nla_get_u8(data[i]);
            ...
        }
        ...
        for (i = DCB_BCN_ATTR_BCNA_0; i <= DCB_BCN_ATTR_RI; i++) {
        // !!! DCB_BCN_ATTR_BCNA_0 to DCB_BCN_ATTR_RI in enum dcbnl_bcn_attrs
            ...
            value_int = nla_get_u32(data[i]);
            ...
        }
        ...
    }

That is, nla_parse_nested_deprecated() uses the dcbnl_pfc_up_nest policy to parse nlattrs as defined in dcbnl_pfc_up_attrs, but the subsequent access code fetches each nlattr as a dcbnl_bcn_attrs attribute. Looking up the associated nla_policy for dcbnl_bcn_attrs, we find that the beginning parts of these two policies are the "same".

    static const struct nla_policy dcbnl_pfc_up_nest[...] = {
        [DCB_PFC_UP_ATTR_0]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_1]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_2]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_3]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_4]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_5]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_6]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_7]   = {.type = NLA_U8},
        [DCB_PFC_UP_ATTR_ALL] = {.type = NLA_FLAG},
    };

    static const struct nla_policy dcbnl_bcn_nest[...] = {
        [DCB_BCN_ATTR_RP_0]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_1]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_2]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_3]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_4]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_5]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_6]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_7]   = {.type = NLA_U8},
        [DCB_BCN_ATTR_RP_ALL] = {.type = NLA_FLAG},
        // from here is somewhat different
        [DCB_BCN_ATTR_BCNA_0] = {.type = NLA_U32},
        ...
        [DCB_BCN_ATTR_ALL]    = {.type = NLA_FLAG},
    };

Therefore, the current code is buggy: nla_parse_nested_deprecated() could overflow dcbnl_pfc_up_nest and use the adjacent nla_policy to parse attributes from DCB_BCN_ATTR_BCNA_0.

Hence use the correct policy dcbnl_bcn_nest to parse the nested tb[DCB_ATTR_BCN] TLV.

Fixes: 859ee3c43812 ("DCB: Add support for DCB BCN")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230801013248.87240-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
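The mismatch described above can be illustrated in userspace with a minimal sketch. It assumes two scaled-down, hypothetical tables (`short_policy` standing in for dcbnl_pfc_up_nest, `long_policy` for dcbnl_bcn_nest) and a hypothetical helper `policy_kind()`; none of these names exist in the kernel. The point is only that a policy table and the highest attribute type it covers belong together.

```c
#include <stddef.h>

enum attr_kind { KIND_NONE, KIND_U8, KIND_U32, KIND_FLAG };

/* Scaled-down, hypothetical tables: the short one plays the role of
 * dcbnl_pfc_up_nest, the long one the role of dcbnl_bcn_nest. */
static const enum attr_kind short_policy[] = {
    KIND_NONE, KIND_U8, KIND_U8, KIND_FLAG,
};
static const enum attr_kind long_policy[] = {
    KIND_NONE, KIND_U8, KIND_U8, KIND_FLAG, KIND_U32, KIND_U32,
};

#define POLICY_LEN(p) (sizeof(p) / sizeof((p)[0]))

/* Look up the kind expected for attribute `type`. Because the table length
 * travels with the table, an out-of-range type is rejected instead of being
 * read from whatever happens to follow the table in memory. */
static enum attr_kind policy_kind(const enum attr_kind *policy, size_t len,
                                  int type)
{
    if (type <= 0 || (size_t)type >= len)
        return KIND_NONE;
    return policy[type];
}
```

With the short table, attribute type 4 has no entry at all; with the long table it is a 32-bit value. The leading entries agree, which is presumably why the wrong table went unnoticed for the first group of attributes.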
2023-08-01  net: make sure we never create ifindex = 0  Jakub Kicinski
Instead of allocating from 1 use proper xa_init flag, to protect ourselves from IDs wrapping back to 0. Fixes: 759ab1edb56c ("net: store netdevs in an xarray") Reported-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://lore.kernel.org/all/20230728162350.2a6d4979@hermes.local/ Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230731171159.988962-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
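As a rough userspace illustration of the wrap-around concern, the sketch below models a cyclic ID counter whose valid range starts at 1. The struct `id_alloc` and the function `id_alloc_next()` are hypothetical; the kernel relies on the xarray allocator instead, and this sketch does not track which IDs are already in use.

```c
#include <stdint.h>

/* Hypothetical cyclic ID source. IDs live in [1, max]; 0 is never handed
 * out, even after the counter wraps. */
struct id_alloc {
    uint32_t next;  /* next candidate ID; 0 means "wrapped or uninitialised" */
    uint32_t max;   /* highest ID that may be returned */
};

static uint32_t id_alloc_next(struct id_alloc *a)
{
    /* Restart at 1 instead of 0 when the counter wrapped or ran past max. */
    if (a->next == 0 || a->next > a->max)
        a->next = 1;
    return a->next++;
}
```

Starting the counter at 1 alone is not sufficient: once `next` overflows it becomes 0 again, so the lower bound has to be enforced on every allocation.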
2023-08-01  xfrm: don't skip free of empty state in acquire policy  Leon Romanovsky
In the destruction flow, assigning NULL to xso->dev caused the xfrm_dev_state_free() call, which is made from the xfrm_state_put(to_put) routine, to be skipped. Instead of using open-coded variants of xfrm_dev_state_delete() and xfrm_dev_state_free(), call them directly. Fixes: f8a70afafc17 ("xfrm: add TX datapath support for IPsec packet offload mode") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-08-01  xfrm: delete offloaded policy  Leon Romanovsky
The policy memory was released, but the HW driver data was not. Add a call to xfrm_dev_policy_delete() so that drivers have a chance to release their resources. Fixes: 919e43fad516 ("xfrm: add an interface to offload policy") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-08-01  net: dsa: tag_qca: return early if dev is not found  Christian Marangi
Currently checksum is recalculated and dsa tag stripped even if we later don't find the dev. To improve code, exit early if we don't find the dev and skip additional operation on the skb since it will be freed anyway. Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Link: https://lore.kernel.org/r/20230730074113.21889-1-ansuelsmth@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01  net/sched: sch_qfq: warn about class in use while deleting  Pedro Tammela
Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01  net/sched: sch_htb: warn about class in use while deleting  Pedro Tammela
Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01  net/sched: sch_hfsc: warn about class in use while deleting  Pedro Tammela
Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01  net/sched: sch_drr: warn about class in use while deleting  Pedro Tammela
Add extack to warn that delete was rejected because the class is still in use Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-08-01  net/sched: wrap open coded Qdisc class filter counter  Pedro Tammela
The 'filter_cnt' counter is used to control a Qdisc class lifetime. Each filter referencing this class by its id will eventually increment/decrement this counter in its respective 'add/update/delete' routines. As these operations are always serialized under the rtnl lock, we don't need an atomic type like 'refcount_t'. It also means that we lose the overflow/underflow checks already present in refcount_t, which are valuable for hunting down bugs where the unsigned counter wraps around, since they let automated tools like syzkaller scream in such situations. Wrap the open coded increment/decrement into helper functions and add overflow checks to the operations. Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
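The idea of wrapping a plain counter in checked helpers can be sketched as follows. The names `class_common`, `class_get()`, `class_put()` and `class_in_use()` are hypothetical and chosen for illustration; the sketch reports a failed check through the return value, whereas kernel code would more likely emit a warning.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-class bookkeeping. */
struct class_common {
    unsigned int filter_cnt;  /* filters currently bound to the class */
};

/* Take a reference; refuse (and leave the counter untouched) on overflow. */
static bool class_get(struct class_common *cl)
{
    if (cl->filter_cnt == UINT_MAX)
        return false;
    cl->filter_cnt++;
    return true;
}

/* Drop a reference; refuse on underflow so the counter never wraps. */
static bool class_put(struct class_common *cl)
{
    if (cl->filter_cnt == 0)
        return false;
    cl->filter_cnt--;
    return true;
}

/* A class may only be deleted when no filter references it. */
static bool class_in_use(const struct class_common *cl)
{
    return cl->filter_cnt > 0;
}
```

No atomics are used here, matching the commit's observation that the operations are serialized by an outer lock.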
2023-08-01  bpf: sockmap: Remove preempt_disable in sock_map_sk_acquire  Tomas Glozar
Disabling preemption in sock_map_sk_acquire conflicts with GFP_ATOMIC allocation later in sk_psock_init_link on PREEMPT_RT kernels, since GFP_ATOMIC might sleep on RT (see bpf: Make BPF and PREEMPT_RT co-exist patchset notes for details). This causes calling bpf_map_update_elem on BPF_MAP_TYPE_SOCKMAP maps to BUG (sleeping function called from invalid context) on RT kernels. preempt_disable was introduced together with lock_sk and rcu_read_lock in commit 99ba2b5aba24e ("bpf: sockhash, disallow bpf_tcp_close and update in parallel"), probably to match disabled migration of BPF programs, and is no longer necessary. Remove preempt_disable to fix BUG in sock_map_update_common on RT. Signed-off-by: Tomas Glozar <tglozar@redhat.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/all/20200224140131.461979697@linutronix.de/ Fixes: 99ba2b5aba24 ("bpf: sockhash, disallow bpf_tcp_close and update in parallel") Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230728064411.305576-1-tglozar@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-31  net/hsr: Remove unused function declarations  Yue Haibing
Commit f421436a591d ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)") introduced these declarations, but they were never implemented. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230729123456.36340-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31  net/sched: cls_route: No longer copy tcf_result on update to avoid use-after-free  valis
When route4_change() is called on an existing filter, the whole tcf_result struct is always copied into the new instance of the filter. This causes a problem when updating a filter bound to a class, as tcf_unbind_filter() is always called on the old instance in the success path, decreasing filter_cnt of the still referenced class and allowing it to be deleted, leading to a use-after-free. Fix this by no longer copying the tcf_result struct from the old filter. Fixes: 1109c00547fc ("net: sched: RCU cls_route") Reported-by: valis <sec@valis.email> Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-4-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31  net/sched: cls_fw: No longer copy tcf_result on update to avoid use-after-free  valis
When fw_change() is called on an existing filter, the whole tcf_result struct is always copied into the new instance of the filter. This causes a problem when updating a filter bound to a class, as tcf_unbind_filter() is always called on the old instance in the success path, decreasing filter_cnt of the still referenced class and allowing it to be deleted, leading to a use-after-free. Fix this by no longer copying the tcf_result struct from the old filter. Fixes: e35a8ee5993b ("net: sched: fw use RCU") Reported-by: valis <sec@valis.email> Reported-by: Bing-Jhong Billy Jheng <billy@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31  net/sched: cls_u32: No longer copy tcf_result on update to avoid use-after-free  valis
When u32_change() is called on an existing filter, the whole tcf_result struct is always copied into the new instance of the filter. This causes a problem when updating a filter bound to a class, as tcf_unbind_filter() is always called on the old instance in the success path, decreasing filter_cnt of the still referenced class and allowing it to be deleted, leading to a use-after-free. Fix this by no longer copying the tcf_result struct from the old filter. Fixes: de5df63228fc ("net: sched: cls_u32 changes to knode must appear atomic to readers") Reported-by: valis <sec@valis.email> Reported-by: M A Ramdhan <ramdhan@starlabs.sg> Signed-off-by: valis <sec@valis.email> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: M A Ramdhan <ramdhan@starlabs.sg> Link: https://lore.kernel.org/r/20230729123202.72406-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
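The counting problem shared by the cls_route, cls_fw and cls_u32 fixes can be modelled in a few lines. This is a simplified sketch with hypothetical types (`cls`, `filter`) and helpers; it only tracks the counter and never frees anything, so it shows how the count reaches zero while a reference still exists rather than reproducing the use-after-free itself. In particular, `update_fixed()` rebinds the new filter itself, which is one way to keep the count right and not necessarily how the kernel fix is structured.

```c
#include <stddef.h>

struct cls { unsigned int filter_cnt; };     /* a class and its bound-filter count */
struct filter { struct cls *res_class; };    /* a filter and the class it points at */

static void bind_filter(struct filter *f, struct cls *c)
{
    f->res_class = c;
    c->filter_cnt++;
}

static void unbind_filter(struct filter *f)
{
    if (f->res_class) {
        f->res_class->filter_cnt--;
        f->res_class = NULL;
    }
}

/* Buggy update: the new filter inherits the pointer without taking its own
 * count, then the old filter drops the only count that was ever taken. */
static void update_buggy(struct filter *old, struct filter *fresh)
{
    fresh->res_class = old->res_class;
    unbind_filter(old);
}

/* Fixed update (sketch): nothing is copied; the new filter binds for itself
 * before the old one is unbound. */
static void update_fixed(struct filter *old, struct filter *fresh)
{
    struct cls *c = old->res_class;

    if (c)
        bind_filter(fresh, c);
    unbind_filter(old);
}
```

After `update_buggy()` the class shows a count of zero and looks deletable, although the new filter still points at it.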
2023-07-31  netfilter: bpf: Only define get_proto_defrag_hook() if necessary  Daniel Xu
Before, we were getting this warning: net/netfilter/nf_bpf_link.c:32:1: warning: 'get_proto_defrag_hook' defined but not used [-Wunused-function] Guard the definition with CONFIG_NF_DEFRAG_IPV[4|6]. Fixes: 91721c2d02d3 ("netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter link") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202307291213.fZ0zDmoG-lkp@intel.com/ Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/b128b6489f0066db32c4772ae4aaee1480495929.1690840454.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-31  vsock: Remove unused function declarations  Yue Haibing
These are never implemented since introduction in commit d021c344051a ("VSOCK: Introduce VM Sockets") Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20230729122036.32988-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31  net/smc: Remove unused function declarations  Yue Haibing
commit f9aab6f2ce57 ("net/smc: immediate freeing in smc_lgr_cleanup_early()") left behind smc_lgr_schedule_free_work_fast() declaration. And since commit 349d43127dac ("net/smc: fix kernel panic caused by race of smc_sock") smc_ib_modify_qp_reset() is not used anymore. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://lore.kernel.org/r/20230729121929.17180-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-31  net: remove duplicate INDIRECT_CALLABLE_DECLARE of udp[6]_ehashfn  Lorenz Bauer
There are already INDIRECT_CALLABLE_DECLARE in the hashtable headers, no need to declare them again. Fixes: 0f495f761722 ("net: remove duplicate reuseport_lookup functions") Suggested-by: Martin Lau <martin.lau@linux.dev> Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230731-indir-call-v1-1-4cd0aeaee64f@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-07-31  Bluetooth: rfcomm: remove casts from tty->driver_data  Jiri Slaby
tty->driver_data is 'void *', so there is no need to cast from that. Therefore remove the casts and assign the pointer directly. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Cc: linux-bluetooth@vger.kernel.org Link: https://lore.kernel.org/r/20230731080244.2698-3-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
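For readers less familiar with this C rule, here is a tiny sketch. The types `tty_like` and `dev_like` are hypothetical stand-ins, not the kernel's tty or rfcomm structures.

```c
/* Hypothetical stand-ins for a tty and the driver's private device struct. */
struct dev_like { int id; };
struct tty_like { void *driver_data; };

static int dev_id_from_tty(const struct tty_like *tty)
{
    /* In C, `void *` converts implicitly to any object pointer type,
     * so no cast is needed on this assignment. */
    struct dev_like *dev = tty->driver_data;

    return dev->id;
}
```

The same assignment would need a cast in C++, which is one reason such casts tend to appear in C code by habit.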
2023-07-31  net: Use sockaddr_storage for getsockopt(SO_PEERNAME).  Kuniyuki Iwashima
Commit df8fc4e934c1 ("kbuild: Enable -fstrict-flex-arrays=3") started applying strict rules to standard string functions. It does not work well with conventional socket code around each protocol-specific sockaddr_XXX struct, which is cast from sockaddr_storage and has a bigger size than fortified functions expect. See these commits: 06d4c8a80836 ("af_unix: Fix fortify_panic() in unix_bind_bsd()."), ecb4534b6a1c ("af_unix: Terminate sun_path when bind()ing pathname socket."), and a0ade8404c3b ("af_packet: Fix warning of fortified memcpy() in packet_getname()."). We must cast the protocol-specific address back to sockaddr_storage to call such functions. However, in the case of getsockopt(SO_PEERNAME), the rationale is a bit unclear as the buffer is defined by char[128], which is the same size as sockaddr_storage. Let's use sockaddr_storage explicitly. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
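A userspace sketch of the same idea, assuming a POSIX system: declare the destination as `struct sockaddr_storage` rather than as an anonymous byte array, so the object's real size and type are visible to the compiler and to fortified string functions. The helper `copy_peer_v4()` is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Copy an IPv4 peer address into a generic storage object and return the
 * number of meaningful bytes. The destination is typed, so the copy length
 * can be compared against the true size of the object being written. */
static size_t copy_peer_v4(struct sockaddr_storage *dst,
                           const struct sockaddr_in *src)
{
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, src, sizeof(*src));
    return sizeof(*src);
}
```

`struct sockaddr_storage` is specified to be large enough for any supported address family, which is what makes it a natural replacement for a hand-sized buffer.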
2023-07-31  net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.  Kuniyuki Iwashima
syzkaller found zero division error [0] in div_s64_rem() called from get_cycle_time_elapsed(), where sched->cycle_time is the divisor. We have tests in parse_taprio_schedule() so that cycle_time will never be 0, and actually cycle_time is not 0 in get_cycle_time_elapsed(). The problem is that the types of divisor are different; cycle_time is s64, but the argument of div_s64_rem() is s32. syzkaller fed this input and 0x100000000 is cast to s32 to be 0. @TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME={0xc, 0x8, 0x100000000} We use s64 for cycle_time to cast it to ktime_t, so let's keep it and set max for cycle_time. While at it, we prevent overflow in setup_txtime() and add another test in parse_taprio_schedule() to check if cycle_time overflows. Also, we add a new tdc test case for this issue. [0]: divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 103 Comm: kworker/1:3 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:div_s64_rem include/linux/math64.h:42 [inline] RIP: 0010:get_cycle_time_elapsed net/sched/sch_taprio.c:223 [inline] RIP: 0010:find_entry_to_transmit+0x252/0x7e0 net/sched/sch_taprio.c:344 Code: 3c 02 00 0f 85 5e 05 00 00 48 8b 4c 24 08 4d 8b bd 40 01 00 00 48 8b 7c 24 48 48 89 c8 4c 29 f8 48 63 f7 48 99 48 89 74 24 70 <48> f7 fe 48 29 d1 48 8d 04 0f 49 89 cc 48 89 44 24 20 49 8d 85 10 RSP: 0018:ffffc90000acf260 EFLAGS: 00010206 RAX: 177450e0347560cf RBX: 0000000000000000 RCX: 177450e0347560cf RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000100000000 RBP: 0000000000000056 R08: 0000000000000000 R09: ffffed10020a0934 R10: ffff8880105049a7 R11: ffff88806cf3a520 R12: ffff888010504800 R13: ffff88800c00d800 R14: ffff8880105049a0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0edf84f0e8 
CR3: 000000000d73c002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> get_packet_txtime net/sched/sch_taprio.c:508 [inline] taprio_enqueue_one+0x900/0xff0 net/sched/sch_taprio.c:577 taprio_enqueue+0x378/0xae0 net/sched/sch_taprio.c:658 dev_qdisc_enqueue+0x46/0x170 net/core/dev.c:3732 __dev_xmit_skb net/core/dev.c:3821 [inline] __dev_queue_xmit+0x1b2f/0x3000 net/core/dev.c:4169 dev_queue_xmit include/linux/netdevice.h:3088 [inline] neigh_resolve_output net/core/neighbour.c:1552 [inline] neigh_resolve_output+0x4a7/0x780 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:544 [inline] ip6_finish_output2+0x924/0x17d0 net/ipv6/ip6_output.c:135 __ip6_finish_output+0x620/0xaa0 net/ipv6/ip6_output.c:196 ip6_finish_output net/ipv6/ip6_output.c:207 [inline] NF_HOOK_COND include/linux/netfilter.h:292 [inline] ip6_output+0x206/0x410 net/ipv6/ip6_output.c:228 dst_output include/net/dst.h:458 [inline] NF_HOOK.constprop.0+0xea/0x260 include/linux/netfilter.h:303 ndisc_send_skb+0x872/0xe80 net/ipv6/ndisc.c:508 ndisc_send_ns+0xb5/0x130 net/ipv6/ndisc.c:666 addrconf_dad_work+0xc14/0x13f0 net/ipv6/addrconf.c:4175 process_one_work+0x92c/0x13a0 kernel/workqueue.c:2597 worker_thread+0x60f/0x1240 kernel/workqueue.c:2748 kthread+0x2fe/0x3f0 kernel/kthread.c:389 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308 </TASK> Modules linked in: Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Co-developed-by: Eric Dumazet <edumazet@google.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
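The truncation behind this divide error is easy to reproduce in userspace. The sketch below uses two hypothetical helpers: `low_32_bits_as_s32()` mimics what happens when a 64-bit cycle time is passed where a 32-bit divisor is expected, and `cycle_time_valid()` expresses the bound the commit title describes. It is an illustration, not the kernel's code.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Keep only the low 32 bits, as a narrowing conversion to a 32-bit
 * parameter would. The input used below is chosen so the result is 0. */
static int32_t low_32_bits_as_s32(int64_t value)
{
    return (int32_t)(uint32_t)value;
}

/* A cycle time is acceptable only if it is positive and still fits in a
 * signed 32-bit divisor. */
static bool cycle_time_valid(int64_t cycle_time)
{
    return cycle_time > 0 && cycle_time <= INT_MAX;
}
```

The value from the report, 0x100000000, is non-zero as a 64-bit number and therefore passes a plain "not zero" test, yet its low 32 bits are all zero, so it becomes a zero divisor after narrowing.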
2023-07-31  net: flow_dissector: Use 64bits for used_keys  Ratheesh Kannoth
As 32bits of dissector->used_keys are exhausted, increase the size to 64bits. This is base change for ESP/AH flow dissector patch. Please find patch and discussions at https://lore.kernel.org/netdev/ZMDNjD46BvZ5zp5I@corigine.com/T/#t Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Reviewed-by: Petr Machata <petrm@nvidia.com> # for mlxsw Tested-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Martin Habets <habetsm.xilinx@gmail.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
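Why a 32-bit mask runs out, and what a 64-bit one looks like, can be sketched as below. The helpers `key_bit()` and `key_used()` are hypothetical; the only assumption is that each dissector key is identified by a small integer that selects one bit of the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per key. The shift is done on a 64-bit constant, so key IDs from
 * 32 to 63 are representable. */
static uint64_t key_bit(unsigned int key_id)
{
    assert(key_id < 64);  /* shifting by 64 or more would be undefined */
    return UINT64_C(1) << key_id;
}

static bool key_used(uint64_t used_keys, unsigned int key_id)
{
    return (used_keys & key_bit(key_id)) != 0;
}
```

A pitfall when widening such a mask is leaving a `1 << id` expression behind: the constant is an `int`, so the shift is still performed in 32 bits even when the result is stored in a 64-bit field.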
2023-07-31  xfrm: add forgotten nla_policy for XFRMA_MTIMER_THRESH  Lin Ma
The previous commit 4e484b3e969b ("xfrm: rate limit SA mapping change message to user space") added one additional attribute named XFRMA_MTIMER_THRESH and described its type at compat_policy (net/xfrm/xfrm_compat.c). However, the author forgot to also describe the nla_policy at xfrma_policy (net/xfrm/xfrm_user.c). Hence, this supposedly NLA_U32 (4 bytes) value can be faked as empty (0 bytes) by a malicious user, which leads to a 4-byte overflow read and a heap information leak when parsing nlattrs. To exploit this, a malicious user can spray the SLUB objects and then leverage this 4-byte OOB read to leak the heap data into x->mapping_maxage (see xfrm_update_ae_params(...)), and leak it to userspace via copy_to_user_state_extra(...). The above bug is assigned CVE-2023-3773. To fix it, this commit just completes the nla_policy description for XFRMA_MTIMER_THRESH, which enforces the length check and avoids such an OOB read. Fixes: 4e484b3e969b ("xfrm: rate limit SA mapping change message to user space") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
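What a policy entry buys can be sketched as a length check performed before the payload is read. The helper `attr_get_u32()` is hypothetical and takes a raw payload pointer and length; it is not the netlink API, where the check is driven by the policy table at parse time.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an attribute payload, but only if the payload
 * really carries at least four bytes. On failure `*out` is left untouched
 * and nothing beyond the payload is read. */
static bool attr_get_u32(const unsigned char *payload, size_t payload_len,
                         uint32_t *out)
{
    if (payload_len < sizeof(*out))
        return false;
    memcpy(out, payload, sizeof(*out));
    return true;
}
```

Without such a check, an attribute declared with a zero-length payload would make the reader pick up the four bytes that happen to follow it, which is the out-of-bounds read the commit message describes.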
2023-07-31  xfrm: add NULL check in xfrm_update_ae_params  Lin Ma
Normally, x->replay_esn and x->preplay_esn should be allocated at xfrm_alloc_replay_state_esn(...) in xfrm_state_construct(...), hence the xfrm_update_ae_params(...) is okay to update them. However, the current implementation of xfrm_new_ae(...) allows a malicious user to directly dereference a NULL pointer and crash the kernel like below. BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 8253067 P4D 8253067 PUD 8e0e067 PMD 0 Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 PID: 98 Comm: poc.npd Not tainted 6.4.0-rc7-00072-gdad9774deaf1 #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.o4 RIP: 0010:memcpy_orig+0xad/0x140 Code: e8 4c 89 5f e0 48 8d 7f e0 73 d2 83 c2 20 48 29 d6 48 29 d7 83 fa 10 72 34 4c 8b 06 4c 8b 4e 08 c RSP: 0018:ffff888008f57658 EFLAGS: 00000202 RAX: 0000000000000000 RBX: ffff888008bd0000 RCX: ffffffff8238e571 RDX: 0000000000000018 RSI: ffff888007f64844 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff888008f57818 R13: ffff888007f64aa4 R14: 0000000000000000 R15: 0000000000000000 FS: 00000000014013c0(0000) GS:ffff88806d600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000054d8000 CR4: 00000000000006f0 Call Trace: <TASK> ? __die+0x1f/0x70 ? page_fault_oops+0x1e8/0x500 ? __pfx_is_prefetch.constprop.0+0x10/0x10 ? __pfx_page_fault_oops+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x11/0x40 ? fixup_exception+0x36/0x460 ? _raw_spin_unlock_irqrestore+0x11/0x40 ? exc_page_fault+0x5e/0xc0 ? asm_exc_page_fault+0x26/0x30 ? xfrm_update_ae_params+0xd1/0x260 ? memcpy_orig+0xad/0x140 ? __pfx__raw_spin_lock_bh+0x10/0x10 xfrm_update_ae_params+0xe7/0x260 xfrm_new_ae+0x298/0x4e0 ? __pfx_xfrm_new_ae+0x10/0x10 ? __pfx_xfrm_new_ae+0x10/0x10 xfrm_user_rcv_msg+0x25a/0x410 ? __pfx_xfrm_user_rcv_msg+0x10/0x10 ? __alloc_skb+0xcf/0x210 ? 
stack_trace_save+0x90/0xd0 ? filter_irq_stacks+0x1c/0x70 ? __stack_depot_save+0x39/0x4e0 ? __kasan_slab_free+0x10a/0x190 ? kmem_cache_free+0x9c/0x340 ? netlink_recvmsg+0x23c/0x660 ? sock_recvmsg+0xeb/0xf0 ? __sys_recvfrom+0x13c/0x1f0 ? __x64_sys_recvfrom+0x71/0x90 ? do_syscall_64+0x3f/0x90 ? entry_SYSCALL_64_after_hwframe+0x72/0xdc ? copyout+0x3e/0x50 netlink_rcv_skb+0xd6/0x210 ? __pfx_xfrm_user_rcv_msg+0x10/0x10 ? __pfx_netlink_rcv_skb+0x10/0x10 ? __pfx_sock_has_perm+0x10/0x10 ? mutex_lock+0x8d/0xe0 ? __pfx_mutex_lock+0x10/0x10 xfrm_netlink_rcv+0x44/0x50 netlink_unicast+0x36f/0x4c0 ? __pfx_netlink_unicast+0x10/0x10 ? netlink_recvmsg+0x500/0x660 netlink_sendmsg+0x3b7/0x700 This Null-ptr-deref bug is assigned CVE-2023-3772. And this commit adds additional NULL check in xfrm_update_ae_params to fix the NPD. Fixes: d8647b79c3b7 ("xfrm: Add user interface for esn and big anti-replay windows") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
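The shape of such a guard can be sketched in userspace as follows. The types `replay_state` and `sa_state` and the function `update_replay()` are hypothetical simplifications; in particular, returning an error for a missing buffer is a choice made for this sketch and may differ from how the kernel fix handles the case.

```c
#include <stddef.h>
#include <string.h>

struct replay_state { unsigned int seq; unsigned int oseq; };

/* Hypothetical state object: the replay buffer is optional and may never
 * have been allocated. */
struct sa_state { struct replay_state *replay; };

/* Returns 0 on success (including "no update requested") and -1 when an
 * update is requested for a state that has no replay buffer. */
static int update_replay(struct sa_state *x, const struct replay_state *update)
{
    if (!update)
        return 0;
    if (!x->replay)
        return -1;  /* without this check the memcpy would write to NULL */
    memcpy(x->replay, update, sizeof(*update));
    return 0;
}
```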
2023-07-29  net: annotate data-races around sk->sk_priority  Eric Dumazet
sk_getsockopt() runs locklessly. This means sk->sk_priority can be read while other threads are changing its value. Other reads also happen without socket lock being held. Add missing annotations where needed. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
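This and the following annotation commits all address the same pattern: a field that one thread writes while another reads it without a lock. A userspace analogue, written with C11 relaxed atomics, is sketched below. This is an assumption-laden stand-in: the kernel uses its own READ_ONCE()/WRITE_ONCE() macros rather than C11 atomics, and `sock_like` with its accessors is hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical socket-like object with one locklessly read field. */
struct sock_like {
    _Atomic unsigned int priority;
};

/* Writer side: a single, untorn store, with no ordering guarantees beyond
 * the access itself. */
static void sock_set_priority(struct sock_like *sk, unsigned int value)
{
    atomic_store_explicit(&sk->priority, value, memory_order_relaxed);
}

/* Reader side: a single load that the compiler may not split, repeat or
 * fuse with neighbouring accesses. */
static unsigned int sock_get_priority(struct sock_like *sk)
{
    return atomic_load_explicit(&sk->priority, memory_order_relaxed);
}
```

The annotations do not add locking or ordering; they mark the access as intentionally racy so that both the compiler and data-race detectors treat it accordingly.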
2023-07-29  net: add missing data-race annotation for sk_ll_usec  Eric Dumazet
In a prior commit I forgot that sk_getsockopt() reads sk->sk_ll_usec without holding a lock. Fixes: 0dbffbb5335a ("net: annotate data race around sk_ll_usec") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: add missing data-race annotations around sk->sk_peek_off  Eric Dumazet
sk_getsockopt() runs locklessly, thus we need to annotate the read of sk->sk_peek_off. While we are at it, add corresponding annotations to sk_set_peek_off() and unix_set_peek_off(). Fixes: b9bb53f3836f ("sock: convert sk_peek_offset functions to WRITE_ONCE") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: annotate data-races around sk->sk_mark  Eric Dumazet
sk->sk_mark is often read while another thread could change the value. Fixes: 4a19ec5800fc ("[NET]: Introducing socket mark socket option.") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: add missing READ_ONCE(sk->sk_rcvbuf) annotation  Eric Dumazet
In a prior commit, I forgot to change sk_getsockopt() when reading sk->sk_rcvbuf locklessly. Fixes: ebb3b78db7bf ("tcp: annotate sk->sk_rcvbuf lockless reads") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: add missing READ_ONCE(sk->sk_sndbuf) annotation  Eric Dumazet
In a prior commit, I forgot to change sk_getsockopt() when reading sk->sk_sndbuf locklessly. Fixes: e292f05e0df7 ("tcp: annotate sk->sk_sndbuf lockless reads") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: annotate data-races around sk->sk_{rcv|snd}timeo  Eric Dumazet
sk_getsockopt() runs without locks, so we must add annotations to sk->sk_rcvtimeo and sk->sk_sndtimeo. In the future we might allow fetching these fields before we lock the socket in the TCP fast path. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: add missing READ_ONCE(sk->sk_rcvlowat) annotation  Eric Dumazet
In a prior commit, I forgot to change sk_getsockopt() when reading sk->sk_rcvlowat locklessly. Fixes: eac66402d1c3 ("net: annotate sk->sk_rcvlowat lockless reads") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: annotate data-races around sk->sk_max_pacing_rate  Eric Dumazet
sk_getsockopt() runs locklessly. This means sk->sk_max_pacing_rate can be read while other threads are changing its value. Fixes: 62748f32d501 ("net: introduce SO_MAX_PACING_RATE") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: annotate data-race around sk->sk_txrehash  Eric Dumazet
sk_getsockopt() runs locklessly. This means sk->sk_txrehash can be read while other threads are changing its value. Other locations were handled in commit cb6cd2cec799 ("tcp: Change SYN ACK retransmit behaviour to account for rehash") Fixes: 26859240e4ee ("txhash: Add socket option to control TX hash rethink behavior") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Akhmat Karakotov <hmukos@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: annotate data-races around sk->sk_reserved_mem  Eric Dumazet
sk_getsockopt() runs locklessly. This means sk->sk_reserved_mem can be read while other threads are changing its value. Add missing annotations where they are needed. Fixes: 2bb2f5fb21b0 ("net: add new socket option SO_RESERVE_MEM") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-29  net: gro: fix misuse of CB in udp socket lookup  Richard Gobert
This patch fixes a misuse of IP{6}CB(skb) in GRO when calling `udp6_lib_lookup2` while handling UDP tunnels. `udp6_lib_lookup2` fetches the device from CB. The fix changes it to fetch the device from `skb->dev`. The l3mdev case requires special attention since it has a master and a slave device. Fixes: a6024562ffd7 ("udp: Add GRO functions to UDP socket") Reported-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-28net: sched: cls_u32: Fix match key mis-addressingJamal Hadi Salim
A match entry is uniquely identified with an "address" or "path" in the form of: hashtable ID(12b):bucketid(8b):nodeid(12b). When creating table match entries all of hash table id, bucket id and node (match entry id) are needed to be either specified by the user or reasonable in-kernel defaults are used. The in-kernel default for a table id is 0x800(omnipresent root table); for bucketid it is 0x0. Prior to this fix there was none for a nodeid i.e. the code assumed that the user passed the correct nodeid and if the user passes a nodeid of 0 (as Mingi Cho did) then that is what was used. But nodeid of 0 is reserved for identifying the table. This is not a problem until we dump. The dump code notices that the nodeid is zero and assumes it is referencing a table and therefore references table struct tc_u_hnode instead of what was created i.e match entry struct tc_u_knode. Ming does an equivalent of: tc filter add dev dummy0 parent 10: prio 1 handle 0x1000 \ protocol ip u32 match ip src 10.0.0.1/32 classid 10:1 action ok Essentially specifying a table id 0, bucketid 1 and nodeid of zero Tableid 0 is remapped to the default of 0x800. Bucketid 1 is ignored and defaults to 0x00. Nodeid was assumed to be what Ming passed - 0x000 dumping before fix shows: ~$ tc filter ls dev dummy0 parent 10: filter protocol ip pref 1 u32 chain 0 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1 filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor -30591 Note that the last line reports a table instead of a match entry (you can tell this because it says "ht divisor..."). As a result of reporting the wrong data type (misinterpretting of struct tc_u_knode as being struct tc_u_hnode) the divisor is reported with value of -30591. Ming identified this as part of the heap address (physmap_base is 0xffff8880 (-30591 - 1)). The fix is to ensure that when table entry matches are added and no nodeid is specified (i.e nodeid == 0) then we get the next available nodeid from the table's pool. 
After the fix, this is what the dump shows:

  $ tc filter ls dev dummy0 parent 10:
  filter protocol ip pref 1 u32 chain 0
  filter protocol ip pref 1 u32 chain 0 fh 800: ht divisor 1
  filter protocol ip pref 1 u32 chain 0 fh 800::800 order 2048 key ht 800 bkt 0 flowid 10:1 not_in_hw
    match 0a000001/ffffffff at 12
          action order 1: gact action pass
           random type none pass val 0
           index 1 ref 1 bind 1

Reported-by: Mingi Cho <mgcho.minic@gmail.com> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20230726135151.416917-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
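The 12b:8b:12b handle layout described in this commit can be illustrated with a small userspace sketch. The struct and function names below are hypothetical and are not the kernel's own helpers; the sketch only shows how a 32-bit handle splits into table id, bucket id and node id under the stated bit widths.

```c
#include <stdint.h>

/* Hypothetical container for the three fields of a u32 filter handle. */
struct u32_handle_parts {
    uint32_t htid;   /* hash table id, upper 12 bits */
    uint32_t bucket; /* bucket id, middle 8 bits */
    uint32_t node;   /* node (match entry) id, lower 12 bits */
};

/* Split a 32-bit handle laid out as htid(12b):bucket(8b):node(12b). */
static struct u32_handle_parts u32_handle_split(uint32_t handle)
{
    struct u32_handle_parts p;

    p.htid = handle >> 20;
    p.bucket = (handle >> 12) & 0xff;
    p.node = handle & 0xfff;
    return p;
}

/* Pack the three fields back into one handle (inverse of the split). */
static uint32_t u32_handle_pack(uint32_t htid, uint32_t bucket, uint32_t node)
{
    return ((htid & 0xfff) << 20) | ((bucket & 0xff) << 12) | (node & 0xfff);
}
```

Under this layout, the handle 0x1000 from the reproducer splits into table id 0, bucket id 1 and node id 0, which matches the description in the commit message: the node id of zero is the value that collides with the one reserved for tables.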
2023-07-28netfilter: bpf: Support BPF_F_NETFILTER_IP_DEFRAG in netfilter linkDaniel Xu
This commit adds support for enabling IP defrag using pre-existing netfilter defrag support. Basically all the flag does is bump a refcnt while the link is active. Checks are also added to ensure the prog requesting defrag support is run _after_ netfilter defrag hooks. We also take care to avoid any issues w.r.t. module unloading -- while defrag is active on a link, the module is prevented from unloading. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/5cff26f97e55161b7d56b09ddcf5f8888a5add1d.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-28netfilter: defrag: Add glue hooks for enabling/disabling defragDaniel Xu
We want to be able to enable/disable IP packet defrag from core bpf/netfilter code. In other words, execute code from core that could possibly be built as a module. To help avoid symbol resolution errors, use glue hooks that the modules will register callbacks with during module init. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Reviewed-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/f6a8824052441b72afe5285acedbd634bd3384c1.1689970773.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
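The glue-hook idea described above can be sketched in plain userspace C. Everything here is an assumption made for illustration: the struct, the pointer and the function names are hypothetical, and the real kernel code also has to deal with concurrency and module reference counting, which this sketch leaves out. The point is only that core code calls through a pointer that stays NULL until the optional module registers its callbacks.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback table that a defrag module would provide. */
struct defrag_glue_hook {
    int (*enable)(void);
    void (*disable)(void);
};

/* Owned by "core" code; NULL until a module registers itself. */
static const struct defrag_glue_hook *defrag_glue;

/* Core side: no direct symbol reference to the module's functions. */
static int core_enable_defrag(void)
{
    if (defrag_glue == NULL)
        return -ENOENT; /* module not loaded, nothing to call */
    return defrag_glue->enable();
}

static void core_disable_defrag(void)
{
    if (defrag_glue != NULL)
        defrag_glue->disable();
}

/* Module side: a user count stands in for the real defrag state. */
static int defrag_users;

static int mod_enable(void)
{
    defrag_users++;
    return 0;
}

static void mod_disable(void)
{
    defrag_users--;
}

static const struct defrag_glue_hook mod_hook = {
    .enable = mod_enable,
    .disable = mod_disable,
};

/* Stand-ins for module init/exit: register and unregister the hook. */
static void mod_init_sketch(void)
{
    defrag_glue = &mod_hook;
}

static void mod_exit_sketch(void)
{
    defrag_glue = NULL;
}
```

Because the core only dereferences the pointer, it links cleanly whether or not the module is built in, which is the symbol-resolution problem the commit message refers to.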
2023-07-28Merge branch 'in-kernel-support-for-the-tls-alert-protocol'Jakub Kicinski
Chuck Lever says: ==================== In-kernel support for the TLS Alert protocol IMO the kernel doesn't need user space (ie, tlshd) to handle the TLS Alert protocol. Instead, a set of small helper functions can be used to handle sending and receiving TLS Alerts for in-kernel TLS consumers. ==================== Merged on top of a tag in case it's needed in the NFS tree. Link: https://lore.kernel.org/r/169047923706.5241.1181144206068116926.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28net/handshake: Trace events for TLS Alert helpersChuck Lever
Add observability for the new TLS Alert infrastructure. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047947409.5241.14548832149596892717.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28SUNRPC: Use new helpers to handle TLS AlertsChuck Lever
Use the helpers to parse the level and description fields in incoming alerts. "Warning" alerts are discarded, and "fatal" alerts mean the session is no longer valid. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047944747.5241.1974889594004407123.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
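As a rough illustration of the handling described above, a TLS alert payload carries two bytes, a level followed by a description, with level 1 meaning "warning" and level 2 meaning "fatal" in the TLS specifications. The sketch below is not the kernel helper API; all names are hypothetical, and treating an unknown level as malformed is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Alert levels as defined by the TLS specifications. */
enum {
    SKETCH_ALERT_LEVEL_WARNING = 1,
    SKETCH_ALERT_LEVEL_FATAL = 2,
};

/* Hypothetical outcomes for a consumer such as an RPC transport. */
enum sketch_alert_action {
    SKETCH_ALERT_MALFORMED,    /* payload too short or unknown level */
    SKETCH_ALERT_DISCARD,      /* warning: ignore and carry on */
    SKETCH_ALERT_SESSION_DEAD, /* fatal: session is no longer valid */
};

/*
 * Parse a two-byte alert payload (level, description) and decide what
 * the consumer should do. The description is returned through *desc
 * whenever the payload is long enough to contain it.
 */
static enum sketch_alert_action
sketch_classify_alert(const uint8_t *payload, size_t len, uint8_t *desc)
{
    if (payload == NULL || len < 2)
        return SKETCH_ALERT_MALFORMED;

    *desc = payload[1];

    switch (payload[0]) {
    case SKETCH_ALERT_LEVEL_WARNING:
        return SKETCH_ALERT_DISCARD;
    case SKETCH_ALERT_LEVEL_FATAL:
        return SKETCH_ALERT_SESSION_DEAD;
    default:
        return SKETCH_ALERT_MALFORMED;
    }
}
```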
2023-07-28net/handshake: Add helpers for parsing incoming TLS AlertsChuck Lever
Kernel TLS consumers can replace common TLS Alert parsing code with these helpers. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047942074.5241.13791647439480672048.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28SUNRPC: Send TLS Closure alerts before closing a TCP socketChuck Lever
Before closing a TCP connection, the TLS protocol wants peers to send session close Alert notifications. Add those in both the RPC client and server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047939404.5241.14392506226409865832.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28net/handshake: Add API for sending TLS Closure alertsChuck Lever
This helper sends an alert only if a TLS session was established. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047936730.5241.618595693821012638.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28net/tls: Move TLS protocol elements to a separate headerChuck Lever
Kernel TLS consumers will need definitions of various parts of the TLS protocol, but often do not need the function declarations and other infrastructure provided in <net/tls.h>. Break out existing standardized protocol elements into a separate header, and make room for a few more elements in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047931374.5241.7713175865185969309.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-28net: change accept_ra_min_rtr_lft to affect all RA lifetimesPatrick Rohr
accept_ra_min_rtr_lft only considered the lifetime of the default route and discarded entire RAs accordingly. This change renames accept_ra_min_rtr_lft to accept_ra_min_lft, and applies the value to individual RA sections; in particular, router lifetime, PIO preferred lifetime, and RIO lifetime. If any of those lifetimes are lower than the configured value, the specific RA section is ignored. In order for the sysctl to be useful to Android, it should really apply to all lifetimes in the RA, since that is what determines the minimum frequency at which RAs must be processed by the kernel. Android uses hardware offloads to drop RAs for a fraction of the minimum of all lifetimes present in the RA (some networks have very frequent RAs (5s) with high lifetimes (2h)). Despite this, we have encountered networks that set the router lifetime to 30s which results in very frequent CPU wakeups. Instead of disabling IPv6 (and dropping IPv6 ethertype in the WiFi firmware) entirely on such networks, it seems better to ignore the misconfigured routers while still processing RAs from other IPv6 routers on the same network (i.e. to support IoT applications). The previous implementation dropped the entire RA based on router lifetime. This turned out to be hard to expand to the other lifetimes present in the RA in a consistent manner; dropping the entire RA based on RIO/PIO lifetimes would essentially require parsing the whole thing twice. Fixes: 1671bcfd76fd ("net: add sysctl accept_ra_min_rtr_lft") Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Patrick Rohr <prohr@google.com> Reviewed-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230726230701.919212-1-prohr@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
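The per-section behaviour described in this commit can be sketched as follows. This is a simplified userspace model, not the kernel's RA parser: the struct and function names are hypothetical, and any special handling the kernel may apply to particular lifetime values (for example zero) is deliberately not modelled. It only shows that each section is compared against the minimum on its own, so one short lifetime no longer discards the whole RA.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the lifetimes carried by one RA, in seconds. */
struct ra_lifetimes {
    uint32_t router;        /* router (default route) lifetime */
    uint32_t pio_preferred; /* PIO preferred lifetime */
    uint32_t rio;           /* RIO lifetime */
};

/* Which sections of the RA survive the minimum-lifetime check. */
struct ra_verdict {
    bool use_router;
    bool use_pio;
    bool use_rio;
};

/* A section is kept only if its lifetime is not below the minimum. */
static bool ra_section_ok(uint32_t lifetime, uint32_t min_lft)
{
    return lifetime >= min_lft;
}

/* Evaluate each section independently instead of dropping the whole RA. */
static struct ra_verdict ra_apply_min_lft(const struct ra_lifetimes *ra,
                                          uint32_t min_lft)
{
    struct ra_verdict v;

    v.use_router = ra_section_ok(ra->router, min_lft);
    v.use_pio = ra_section_ok(ra->pio_preferred, min_lft);
    v.use_rio = ra_section_ok(ra->rio, min_lft);
    return v;
}
```

With a router lifetime of 30s (the misconfiguration mentioned above), 2h lifetimes elsewhere, and an illustrative minimum of 180s, only the router section is ignored while the PIO and RIO sections are still processed.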
2023-07-28net: convert some netlink netdev iterators to depend on the xarrayJakub Kicinski
Reap the benefits of easier iteration thanks to the xarray. Convert just the genetlink ones, those are easier to test. Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20230726185530.2247698-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>