summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-07-18tcp: get rid of sysctl_tcp_adv_win_scaleEric Dumazet
With modern NIC drivers shifting to full page allocations per received frame, we face the following issue: TCP has one per-netns sysctl used to tweak how to translate a memory use into an expected payload (RWIN), in RX path. tcp_win_from_space() implementation is limited to few cases. For hosts dealing with various MSS, we either under estimate or over estimate the RWIN we send to the remote peers. For instance with the default sysctl_tcp_adv_win_scale value, we expect to store 50% of payload per allocated chunk of memory. For the typical use of MTU=1500 traffic, and order-0 pages allocations by NIC drivers, we are sending too big RWIN, leading to potential tcp collapse operations, which are extremely expensive and source of latency spikes. This patch makes sysctl_tcp_adv_win_scale obsolete, and instead uses a per socket scaling factor, so that we can precisely adjust the RWIN based on effective skb->len/skb->truesize ratio. This patch alone can double TCP receive performance when receivers are too slow to drain their receive queue, or by allowing a bigger RWIN when MSS is close to PAGE_SIZE. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Link: https://lore.kernel.org/r/20230717152917.751987-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18Merge tag 'linux-can-fixes-for-6.5-20230717' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2023-07-17 The 1st patch is by Ziyang Xuan and fixes a possible memory leak in the receiver handling in the CAN RAW protocol. YueHaibing contributes a use after free in bcm_proc_show() of the Broad Cast Manager (BCM) CAN protocol. The next 2 patches are by me and fix a possible null pointer dereference in the RX path of the gs_usb driver with activated hardware timestamps and the candlelight firmware. The last patch is by Fedor Ross, Marek Vasut and me and targets the mcp251xfd driver. The polling timeout of __mcp251xfd_chip_set_mode() is increased to fix bus joining on busy CAN buses and very low bit rate. * tag 'linux-can-fixes-for-6.5-20230717' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: mcp251xfd: __mcp251xfd_chip_set_mode(): increase poll timeout can: gs_usb: fix time stamp counter initialization can: gs_usb: gs_can_open(): improve error handling can: bcm: Fix UAF in bcm_proc_show() can: raw: fix receiver memory leak ==================== Link: https://lore.kernel.org/r/20230717180938.230816-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-18bpf: Drop useless btf_vmlinux in bpf_tcp_caGeliang Tang
The code using btf_vmlinux in bpf_tcp_ca is removed by the commit 9f0265e921de ("bpf: Require only one of cong_avoid() and cong_control() from a TCP CC") so drop this useless btf_vmlinux declaration. Signed-off-by: Geliang Tang <geliang.tang@suse.com> Link: https://lore.kernel.org/r/4d38da4eadaba476bd92ffcd7a5a03a5e28745c0.1689582557.git.geliang.tang@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18rtnetlink: Move nesting cancellation rollback to proper functionGal Pressman
Make rtnl_fill_vf() cancel the vfinfo attribute on error instead of the inner rtnl_fill_vfinfo(), as it is the function that starts it. Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://lore.kernel.org/r/20230716072440.2372567-1-gal@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-18net: dsa: remove legacy_pre_march2020 detectionRussell King (Oracle)
All drivers are now updated for the March 2020 changes, and no longer make use of the mac_pcs_get_state() or mac_an_restart() operations, which are now NULL across all DSA drivers. All DSA drivers don't look at speed, duplex, pause or advertisement in their phylink_mac_config() method either. Remove support for these operations from DSA, and stop marking DSA as a legacy driver by default. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-17net: qrtr: Handle IPCR control port format of older targetsVignesh Viswanathan
The destination port value in the IPCR control buffer on older targets is 0xFFFF. Handle the same by updating the dst_port to QRTR_PORT_CTRL. Signed-off-by: Vignesh Viswanathan <quic_viswanat@quicinc.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-17net: qrtr: ns: Change nodes radix tree to xarrayVignesh Viswanathan
There is a use after free scenario while iterating through the nodes radix tree despite the ns being a single threaded process. This can happen when the radix tree APIs are not synchronized with the rcu_read_lock() APIs. Convert the radix tree for nodes to xarray to take advantage of the built in rcu lock usage provided by xarray. Signed-off-by: Chris Lew <quic_clew@quicinc.com> Signed-off-by: Vignesh Viswanathan <quic_viswanat@quicinc.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-17net: qrtr: ns: Change servers radix tree to xarrayVignesh Viswanathan
There is a use after free scenario while iterating through the servers radix tree despite the ns being a single threaded process. This can happen when the radix tree APIs are not synchronized with the rcu_read_lock() APIs. Convert the radix tree for servers to xarray to take advantage of the built in rcu lock usage provided by xarray. Signed-off-by: Chris Lew <quic_clew@quicinc.com> Signed-off-by: Vignesh Viswanathan <quic_viswanat@quicinc.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-17can: bcm: Fix UAF in bcm_proc_show()YueHaibing
BUG: KASAN: slab-use-after-free in bcm_proc_show+0x969/0xa80 Read of size 8 at addr ffff888155846230 by task cat/7862 CPU: 1 PID: 7862 Comm: cat Not tainted 6.5.0-rc1-00153-gc8746099c197 #230 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xd5/0x150 print_report+0xc1/0x5e0 kasan_report+0xba/0xf0 bcm_proc_show+0x969/0xa80 seq_read_iter+0x4f6/0x1260 seq_read+0x165/0x210 proc_reg_read+0x227/0x300 vfs_read+0x1d5/0x8d0 ksys_read+0x11e/0x240 do_syscall_64+0x35/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Allocated by task 7846: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x9e/0xa0 bcm_sendmsg+0x264b/0x44e0 sock_sendmsg+0xda/0x180 ____sys_sendmsg+0x735/0x920 ___sys_sendmsg+0x11d/0x1b0 __sys_sendmsg+0xfa/0x1d0 do_syscall_64+0x35/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 7846: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x27/0x40 ____kasan_slab_free+0x161/0x1c0 slab_free_freelist_hook+0x119/0x220 __kmem_cache_free+0xb4/0x2e0 rcu_core+0x809/0x1bd0 bcm_op is freed before procfs entry be removed in bcm_release(), this lead to bcm_proc_show() may read the freed bcm_op. Fixes: ffd980f976e7 ("[CAN]: Add broadcast manager (bcm) protocol") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/all/20230715092543.15548-1-yuehaibing@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-07-17can: raw: fix receiver memory leakZiyang Xuan
Got kmemleak errors with the following ltp can_filter testcase: for ((i=1; i<=100; i++)) do ./can_filter & sleep 0.1 done ============================================================== [<00000000db4a4943>] can_rx_register+0x147/0x360 [can] [<00000000a289549d>] raw_setsockopt+0x5ef/0x853 [can_raw] [<000000006d3d9ebd>] __sys_setsockopt+0x173/0x2c0 [<00000000407dbfec>] __x64_sys_setsockopt+0x61/0x70 [<00000000fd468496>] do_syscall_64+0x33/0x40 [<00000000b7e47d51>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 It's a bug in the concurrent scenario of unregister_netdevice_many() and raw_release() as following: cpu0 cpu1 unregister_netdevice_many(can_dev) unlist_netdevice(can_dev) // dev_get_by_index() return NULL after this net_set_todo(can_dev) raw_release(can_socket) dev = dev_get_by_index(, ro->ifindex); // dev == NULL if (dev) { // receivers in dev_rcv_lists not free because dev is NULL raw_disable_allfilters(, dev, ); dev_put(dev); } ... ro->bound = 0; ... call_netdevice_notifiers(NETDEV_UNREGISTER, ) raw_notify(, NETDEV_UNREGISTER, ) if (ro->bound) // invalid because ro->bound has been set 0 raw_disable_allfilters(, dev, ); // receivers in dev_rcv_lists will never be freed Add a net_device pointer member in struct raw_sock to record bound can_dev, and use rtnl_lock to serialize raw_socket members between raw_bind(), raw_release(), raw_setsockopt() and raw_notify(). Use ro->dev to decide whether to free receivers in dev_rcv_lists. Fixes: 8d0caedb7596 ("can: bcm/raw/isotp: use per module netdevice notifier") Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Link: https://lore.kernel.org/all/20230711011737.1969582-1-william.xuanziyang@huawei.com Cc: stable@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2023-07-17net: sched: cls_flower: Undo tcf_bind_filter in case of an errorVictor Nogueira
If TCA_FLOWER_CLASSID is specified in the netlink message, the code will call tcf_bind_filter. However, if any error occurs after that, the code should undo this by calling tcf_unbind_filter. Fixes: 77b9900ef53a ("tc: introduce Flower classifier") Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-17net: sched: cls_bpf: Undo tcf_bind_filter in case of an errorVictor Nogueira
If cls_bpf_offload errors out, we must also undo tcf_bind_filter that was done before the error. Fix that by calling tcf_unbind_filter in errout_parms. Fixes: eadb41489fd2 ("net: cls_bpf: add support for marking filters as hardware-only") Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-17net: sched: cls_u32: Undo refcount decrement in case update failedVictor Nogueira
In the case of an update, when TCA_U32_LINK is set, u32_set_parms will decrement the refcount of the ht_down (struct tc_u_hnode) pointer present in the older u32 filter which we are replacing. However, if u32_replace_hw_knode errors out, the update command fails and that ht_down pointer continues decremented. To fix that, when u32_replace_hw_knode fails, check if ht_down's refcount was decremented and undo the decrement. Fixes: d34e3e181395 ("net: cls_u32: Add support for skip-sw flag to tc u32 classifier.") Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-17net: sched: cls_u32: Undo tcf_bind_filter if u32_replace_hw_knodeVictor Nogueira
When u32_replace_hw_knode fails, we need to undo the tcf_bind_filter operation done at u32_set_parms. Fixes: d34e3e181395 ("net: cls_u32: Add support for skip-sw flag to tc u32 classifier.") Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-17net: sched: cls_matchall: Undo tcf_bind_filter in case of failure after ↵Victor Nogueira
mall_set_parms In case an error occurred after mall_set_parms executed successfully, we must undo the tcf_bind_filter call it issues. Fix that by calling tcf_unbind_filter in err_replace_hw_filter label. Fixes: ec2507d2a306 ("net/sched: cls_matchall: Fix error path") Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-14Merge tag 'ceph-for-6.5-rc2' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fix from Ilya Dryomov: "A fix to prevent a potential buffer overrun in the messenger, marked for stable" * tag 'ceph-for-6.5-rc2' of https://github.com/ceph/ceph-client: libceph: harden msgr2.1 frame segment length checks
2023-07-14gso: fix dodgy bit handling for GSO_UDP_L4Yan Zhai
Commit 1fd54773c267 ("udp: allow header check for dodgy GSO_UDP_L4 packets.") checks DODGY bit for UDP, but for packets that can be fed directly to the device after gso_segs reset, it actually falls through to fragmentation: https://lore.kernel.org/all/CAJPywTKDdjtwkLVUW6LRA2FU912qcDmQOQGt2WaDo28KzYDg+A@mail.gmail.com/ This change restores the expected behavior of GSO_UDP_L4 packets. Fixes: 1fd54773c267 ("udp: allow header check for dodgy GSO_UDP_L4 packets.") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Yan Zhai <yan@cloudflare.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-14devlink: remove reload failed checks in params get/set callbacksJiri Pirko
The checks in question were introduced by: commit 6b4db2e528f6 ("devlink: Fix use-after-free after a failed reload"). That fixed an issue of reload with mlxsw driver. Back then, that was a valid fix, because there was a limitation in place that prevented drivers from registering/unregistering params when devlink instance was registered. It was possible to do the fix differently by changing drivers to register/unregister params in appropriate places making sure the ops operate only on memory which is allocated and initialized. But that, as a dependency, would require to remove the limitation mentioned above. Eventually, this limitation was lifted by: commit 1d18bb1a4ddd ("devlink: allow registering parameters after the instance") Also, the alternative fix (which also fixed another issue) was done by: commit 74cbc3c03c82 ("mlxsw: spectrum_acl_tcam: Move devlink param to TCAM code"). Therefore, the checks are no longer relevant. Each driver should make sure to have the params registered only when the memory the ops are working with is allocated and initialized. So remove the checks. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-14bridge: Add extack warning when enabling STP in netns.Kuniyuki Iwashima
When we create an L2 loop on a bridge in netns, we will see packets storm even if STP is enabled. # unshare -n # ip link add br0 type bridge # ip link add veth0 type veth peer name veth1 # ip link set veth0 master br0 up # ip link set veth1 master br0 up # ip link set br0 type bridge stp_state 1 # ip link set br0 up # sleep 30 # ip -s link show br0 2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether b6:61:98:1c:1c:b5 brd ff:ff:ff:ff:ff:ff RX: bytes packets errors dropped missed mcast 956553768 12861249 0 0 0 12861249 <-. Keep TX: bytes packets errors dropped carrier collsns | increasing 1027834 11951 0 0 0 0 <-' rapidly This is because llc_rcv() drops all packets in non-root netns and BPDU is dropped. Let's add extack warning when enabling STP in netns. # unshare -n # ip link add br0 type bridge # ip link set br0 type bridge stp_state 1 Warning: bridge: STP does not work in non-root netns. Note this commit will be reverted later when we namespacify the whole LLC infra. Fixes: e730c15519d0 ("[NET]: Make packet reception network namespace safe") Suggested-by: Harry Coin <hcoin@quietfountain.com> Link: https://lore.kernel.org/netdev/0f531295-e289-022d-5add-5ceffa0df9bc@quietfountain.com/ Suggested-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-14ipv6: Constify the sk parameter of several helper functions.Guillaume Nault
icmpv6_flow_init(), ip6_datagram_flow_key_init() and ip6_mc_hdr() don't need to modify their sk argument. Make that explicit using const. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-13Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-07-13 We've added 67 non-merge commits during the last 15 day(s) which contain a total of 106 files changed, 4444 insertions(+), 619 deletions(-). The main changes are: 1) Fix bpftool build in presence of stale vmlinux.h, from Alexander Lobakin. 2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress, from Alexei Starovoitov. 3) Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling, from Andrii Nakryiko. 4) Introduce bpf map element count, from Anton Protopopov. 5) Check skb ownership against full socket, from Kui-Feng Lee. 6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong. 7) Export rcu_request_urgent_qs_task, from Paul E. McKenney. 8) Fix BTF walking of unions, from Yafang Shao. 9) Extend link_info for kprobe_multi and perf_event links, from Yafang Shao. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits) selftests/bpf: Add selftest for PTR_UNTRUSTED bpf: Fix an error in verifying a field in a union selftests/bpf: Add selftests for nested_trust bpf: Fix an error around PTR_UNTRUSTED selftests/bpf: add testcase for TRACING with 6+ arguments bpf, x86: allow function arguments up to 12 for TRACING bpf, x86: save/restore regs with BPF_DW size bpftool: Use "fallthrough;" keyword instead of comments bpf: Add object leak check. bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu. bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu(). selftests/bpf: Improve test coverage of bpf_mem_alloc. rcu: Export rcu_request_urgent_qs_task() bpf: Allow reuse from waiting_for_gp_ttrace list. bpf: Add a hint to allocated objects. bpf: Change bpf_mem_cache draining process. bpf: Further refactor alloc_bulk(). bpf: Factor out inc/dec of active flag into helpers. bpf: Refactor alloc_bulk(). bpf: Let free_all() return the number of freed elements. ... ==================== Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-13selftests/bpf: add testcase for TRACING with 6+ argumentsMenglong Dong
Add fentry_many_args.c and fexit_many_args.c to test the fentry/fexit with 7/11 arguments. As this feature is not supported by arm64 yet, we disable these testcases for arm64 in DENYLIST.aarch64. We can combine them with fentry_test.c/fexit_test.c when arm64 is supported too. Correspondingly, add bpf_testmod_fentry_test7() and bpf_testmod_fentry_test11() to bpf_testmod.c Meanwhile, add bpf_modify_return_test2() to test_run.c to test the MODIFY_RETURN with 7 arguments. Add bpf_testmod_test_struct_arg_7/bpf_testmod_test_struct_arg_7 in bpf_testmod.c to test the struct in the arguments. And the testcases passed on x86_64: ./test_progs -t fexit Summary: 5/14 PASSED, 0 SKIPPED, 0 FAILED ./test_progs -t fentry Summary: 3/2 PASSED, 0 SKIPPED, 0 FAILED ./test_progs -t modify_return Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED ./test_progs -t tracing_struct Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Menglong Dong <imagedong@tencent.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230713040738.1789742-4-imagedong@tencent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13libceph: harden msgr2.1 frame segment length checksIlya Dryomov
ceph_frame_desc::fd_lens is an int array. decode_preamble() thus effectively casts u32 -> int but the checks for segment lengths are written as if on unsigned values. While reading in HELLO or one of the AUTH frames (before authentication is completed), arithmetic in head_onwire_len() can get duped by negative ctrl_len and produce head_len which is less than CEPH_PREAMBLE_LEN but still positive. This would lead to a buffer overrun in prepare_read_control() as the preamble gets copied to the newly allocated buffer of size head_len. Cc: stable@vger.kernel.org Fixes: cd1a677cad99 ("libceph, ceph: implement msgr2.1 protocol (crc and secure modes)") Reported-by: Thelford Williams <thelford@google.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Xiubo Li <xiubli@redhat.com>
2023-07-13net/sched: sch_qfq: account for stab overhead in qfq_enqueuePedro Tammela
Lion says: ------- In the QFQ scheduler a similar issue to CVE-2023-31436 persists. Consider the following code in net/sched/sch_qfq.c: static int qfq_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) { unsigned int len = qdisc_pkt_len(skb), gso_segs; // ... if (unlikely(cl->agg->lmax < len)) { pr_debug("qfq: increasing maxpkt from %u to %u for class %u", cl->agg->lmax, len, cl->common.classid); err = qfq_change_agg(sch, cl, cl->agg->class_weight, len); if (err) { cl->qstats.drops++; return qdisc_drop(skb, sch, to_free); } // ... } Similarly to CVE-2023-31436, "lmax" is increased without any bounds checks according to the packet length "len". Usually this would not impose a problem because packet sizes are naturally limited. This is however not the actual packet length, rather the "qdisc_pkt_len(skb)" which might apply size transformations according to "struct qdisc_size_table" as created by "qdisc_get_stab()" in net/sched/sch_api.c if the TCA_STAB option was set when modifying the qdisc. A user may choose virtually any size using such a table. As a result the same issue as in CVE-2023-31436 can occur, allowing heap out-of-bounds read / writes in the kmalloc-8192 cache. ------- We can create the issue with the following commands: tc qdisc add dev $DEV root handle 1: stab mtu 2048 tsize 512 mpu 0 \ overhead 999999999 linklayer ethernet qfq tc class add dev $DEV parent 1: classid 1:1 htb rate 6mbit burst 15k tc filter add dev $DEV parent 1: matchall classid 1:1 ping -I $DEV 1.1.1.2 This is caused by incorrectly assuming that qdisc_pkt_len() returns a length within the QFQ_MIN_LMAX < len < QFQ_MAX_LMAX. Fixes: 462dbc9101ac ("pkt_sched: QFQ Plus: fair-queueing service at DRR cost") Reported-by: Lion <nnamrec@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-13net/sched: sch_qfq: reintroduce lmax bound check for MTUPedro Tammela
25369891fcef deletes a check for the case where no 'lmax' is specified which 3037933448f6 previously fixed as 'lmax' could be set to the device's MTU without any bound checking for QFQ_LMAX_MIN and QFQ_LMAX_MAX. Therefore, reintroduce the check. Fixes: 25369891fcef ("net/sched: sch_qfq: refactor parsing of netlink parameters") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-12Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Alexei Starovoitov says: ==================== pull-request: bpf 2023-07-12 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 7 files changed, 93 insertions(+), 28 deletions(-). The main changes are: 1) Fix max stack depth check for async callbacks, from Kumar. 2) Fix inconsistent JIT image generation, from Björn. 3) Use trusted arguments in XDP hints kfuncs, from Larysa. 4) Fix memory leak in cpu_map_update_elem, from Pu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xdp: use trusted arguments in XDP hints kfuncs bpf: cpumap: Fix memory leak in cpu_map_update_elem riscv, bpf: Fix inconsistent JIT image generation selftests/bpf: Add selftest for check_stack_max_depth bug bpf: Fix max stack depth check for async callbacks ==================== Link: https://lore.kernel.org/r/20230712223045.40182-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12wifi: cfg80211: fix receiving mesh packets without RFC1042 headerFelix Fietkau
Fix ethernet header length field after stripping the mesh header Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/CT5GNZSK28AI.2K6M69OXM9RW5@syracuse/ Fixes: 986e43b19ae9 ("wifi: mac80211: fix receiving A-MSDU frames on mesh interfaces") Reported-and-tested-by: Nicolas Escande <nico.escande@gmail.com> Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20230711115052.68430-1-nbd@nbd.name Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12ipv6: rpl: Remove redundant skb_dst_drop().Kuniyuki Iwashima
RPL code has a pattern where skb_dst_drop() is called before ip6_route_input(). However, ip6_route_input() calls skb_dst_drop() internally, so we need not call skb_dst_drop() before ip6_route_input(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230710213511.5364-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12tcp: add a scheduling point in established_get_first()Jian Wen
Kubernetes[1] is going to stick with /proc/net/tcp for a while. This commit reduces the scheduling latency introduced by established_get_first(), similar to commit acffb584cda7 ("net: diag: add a scheduling point in inet_diag_dump_icsk()"). In our environment, the scheduling latency affects the performance of latency-sensitive services like Redis. Changes in V2 : - call cond_resched() before checking if a bucket is empty as suggested by Eric Dumazet - removed the delay of synchronize_net() from the commit message [1] https://github.com/google/cadvisor/blob/v0.47.2/container/libcontainer/handler.go#L130 Signed-off-by: Jian Wen <wenjian1@xiaomi.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230711032405.3253025-1-wenjian1@xiaomi.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-12net/sched: flower: Ensure both minimum and maximum ports are specifiedIdo Schimmel
The kernel does not currently validate that both the minimum and maximum ports of a port range are specified. This can lead user space to think that a filter matching on a port range was successfully added, when in fact it was not. For example, with a patched (buggy) iproute2 that only sends the minimum port, the following commands do not return an error: # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass # tc filter show dev swp1 ingress filter protocol ip pref 1 flower chain 0 filter protocol ip pref 1 flower chain 0 handle 0x1 eth_type ipv4 ip_proto udp not_in_hw action order 1: gact action pass random type none pass val 0 index 1 ref 1 bind 1 filter protocol ip pref 1 flower chain 0 handle 0x2 eth_type ipv4 ip_proto udp not_in_hw action order 1: gact action pass random type none pass val 0 index 2 ref 1 bind 1 Fix by returning an error unless both ports are specified: # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp src_port 100-200 action pass Error: Both min and max source ports must be specified. We have an error talking to the kernel # tc filter add dev swp1 ingress pref 1 proto ip flower ip_proto udp dst_port 100-200 action pass Error: Both min and max destination ports must be specified. We have an error talking to the kernel Fixes: 5c72299fba9d ("net: sched: cls_flower: Classify packets using port ranges") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-11xdp: use trusted arguments in XDP hints kfuncsLarysa Zaremba
Currently, verifier does not reject XDP programs that pass NULL pointer to hints functions. At the same time, this case is not handled in any driver implementation (including veth). For example, changing bpf_xdp_metadata_rx_timestamp(ctx, &timestamp); to bpf_xdp_metadata_rx_timestamp(ctx, NULL); in xdp_metadata test successfully crashes the system. Add KF_TRUSTED_ARGS flag to hints kfunc definitions, so driver code does not have to worry about getting invalid pointers. Fixes: 3d76a4d3d4e5 ("bpf: XDP metadata RX kfuncs") Reported-by: Stanislav Fomichev <sdf@google.com> Closes: https://lore.kernel.org/bpf/ZKWo0BbpLfkZHbyE@google.com/ Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230711105930.29170-1-larysa.zaremba@intel.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11netlink: Make use of __assign_bit() APIAndy Shevchenko
We have for some time the __assign_bit() API to replace open coded if (foo) __set_bit(n, bar); else __clear_bit(n, bar); Use this API in the code. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Message-ID: <20230710100830.89936-2-andriy.shevchenko@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-11net/core: Make use of assign_bit() APIAndy Shevchenko
We have for some time the assign_bit() API to replace open coded if (foo) set_bit(n, bar); else clear_bit(n, bar); Use this API in the code. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Message-ID: <20230710100830.89936-1-andriy.shevchenko@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-11ip_vti: fix potential slab-use-after-free in decode_session6Zhengchao Shao
When ip_vti device is set to the qdisc of the sfb type, the cb field of the sent skb may be modified during enqueuing. Then, slab-use-after-free may occur when ip_vti device sends IPv6 packets. As commit f855691975bb ("xfrm6: Fix the nexthdr offset in _decode_session6.") showed, xfrm_decode_session was originally intended only for the receive path. IP6CB(skb)->nhoff is not set during transmission. Therefore, set the cb field in the skb to 0 before sending packets. Fixes: f855691975bb ("xfrm6: Fix the nexthdr offset in _decode_session6.") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-07-11ip6_vti: fix slab-use-after-free in decode_session6Zhengchao Shao
When ipv6_vti device is set to the qdisc of the sfb type, the cb field of the sent skb may be modified during enqueuing. Then, slab-use-after-free may occur when ipv6_vti device sends IPv6 packets. The stack information is as follows: BUG: KASAN: slab-use-after-free in decode_session6+0x103f/0x1890 Read of size 1 at addr ffff88802e08edc2 by task swapper/0/0 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-next-20230707-00001-g84e2cad7f979 #410 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0xd9/0x150 print_address_description.constprop.0+0x2c/0x3c0 kasan_report+0x11d/0x130 decode_session6+0x103f/0x1890 __xfrm_decode_session+0x54/0xb0 vti6_tnl_xmit+0x3e6/0x1ee0 dev_hard_start_xmit+0x187/0x700 sch_direct_xmit+0x1a3/0xc30 __qdisc_run+0x510/0x17a0 __dev_queue_xmit+0x2215/0x3b10 neigh_connected_output+0x3c2/0x550 ip6_finish_output2+0x55a/0x1550 ip6_finish_output+0x6b9/0x1270 ip6_output+0x1f1/0x540 ndisc_send_skb+0xa63/0x1890 ndisc_send_rs+0x132/0x6f0 addrconf_rs_timer+0x3f1/0x870 call_timer_fn+0x1a0/0x580 expire_timers+0x29b/0x4b0 run_timer_softirq+0x326/0x910 __do_softirq+0x1d4/0x905 irq_exit_rcu+0xb7/0x120 sysvec_apic_timer_interrupt+0x97/0xc0 </IRQ> Allocated by task 9176: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 __kasan_slab_alloc+0x7f/0x90 kmem_cache_alloc_node+0x1cd/0x410 kmalloc_reserve+0x165/0x270 __alloc_skb+0x129/0x330 netlink_sendmsg+0x9b1/0xe30 sock_sendmsg+0xde/0x190 ____sys_sendmsg+0x739/0x920 ___sys_sendmsg+0x110/0x1b0 __sys_sendmsg+0xf7/0x1c0 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd Freed by task 9176: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2b/0x40 ____kasan_slab_free+0x160/0x1c0 slab_free_freelist_hook+0x11b/0x220 kmem_cache_free+0xf0/0x490 skb_free_head+0x17f/0x1b0 skb_release_data+0x59c/0x850 consume_skb+0xd2/0x170 netlink_unicast+0x54f/0x7f0 netlink_sendmsg+0x926/0xe30 sock_sendmsg+0xde/0x190 ____sys_sendmsg+0x739/0x920 ___sys_sendmsg+0x110/0x1b0 __sys_sendmsg+0xf7/0x1c0 do_syscall_64+0x39/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The buggy address belongs to the object at ffff88802e08ed00 which belongs to the cache skbuff_small_head of size 640 The buggy address is located 194 bytes inside of freed 640-byte region [ffff88802e08ed00, ffff88802e08ef80) As commit f855691975bb ("xfrm6: Fix the nexthdr offset in _decode_session6.") showed, xfrm_decode_session was originally intended only for the receive path. IP6CB(skb)->nhoff is not set during transmission. Therefore, set the cb field in the skb to 0 before sending packets. Fixes: f855691975bb ("xfrm6: Fix the nexthdr offset in _decode_session6.") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-07-11xfrm: fix slab-use-after-free in decode_session6Zhengchao Shao
When the xfrm device is set to the qdisc of the sfb type, the cb field of the sent skb may be modified during enqueuing. Then, slab-use-after-free may occur when the xfrm device sends IPv6 packets. The stack information is as follows: BUG: KASAN: slab-use-after-free in decode_session6+0x103f/0x1890 Read of size 1 at addr ffff8881111458ef by task swapper/3/0 CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.4.0-next-20230707 #409 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <IRQ> dump_stack_lvl+0xd9/0x150 print_address_description.constprop.0+0x2c/0x3c0 kasan_report+0x11d/0x130 decode_session6+0x103f/0x1890 __xfrm_decode_session+0x54/0xb0 xfrmi_xmit+0x173/0x1ca0 dev_hard_start_xmit+0x187/0x700 sch_direct_xmit+0x1a3/0xc30 __qdisc_run+0x510/0x17a0 __dev_queue_xmit+0x2215/0x3b10 neigh_connected_output+0x3c2/0x550 ip6_finish_output2+0x55a/0x1550 ip6_finish_output+0x6b9/0x1270 ip6_output+0x1f1/0x540 ndisc_send_skb+0xa63/0x1890 ndisc_send_rs+0x132/0x6f0 addrconf_rs_timer+0x3f1/0x870 call_timer_fn+0x1a0/0x580 expire_timers+0x29b/0x4b0 run_timer_softirq+0x326/0x910 __do_softirq+0x1d4/0x905 irq_exit_rcu+0xb7/0x120 sysvec_apic_timer_interrupt+0x97/0xc0 </IRQ> <TASK> asm_sysvec_apic_timer_interrupt+0x1a/0x20 RIP: 0010:intel_idle_hlt+0x23/0x30 Code: 1f 84 00 00 00 00 00 f3 0f 1e fa 41 54 41 89 d4 0f 1f 44 00 00 66 90 0f 1f 44 00 00 0f 00 2d c4 9f ab 00 0f 1f 44 00 00 fb f4 <fa> 44 89 e0 41 5c c3 66 0f 1f 44 00 00 f3 0f 1e fa 41 54 41 89 d4 RSP: 0018:ffffc90000197d78 EFLAGS: 00000246 RAX: 00000000000a83c3 RBX: ffffe8ffffd09c50 RCX: ffffffff8a22d8e5 RDX: 0000000000000001 RSI: ffffffff8d3f8080 RDI: ffffe8ffffd09c50 RBP: ffffffff8d3f8080 R08: 0000000000000001 R09: ffffed1026ba6d9d R10: ffff888135d36ceb R11: 0000000000000001 R12: 0000000000000001 R13: ffffffff8d3f8100 R14: 0000000000000001 R15: 0000000000000000 cpuidle_enter_state+0xd3/0x6f0 cpuidle_enter+0x4e/0xa0 do_idle+0x2fe/0x3c0 cpu_startup_entry+0x18/0x20 start_secondary+0x200/0x290 secondary_startup_64_no_verify+0x167/0x16b </TASK> Allocated by task 939: kasan_save_stack+0x22/0x40 kasan_set_track+0x25/0x30 __kasan_slab_alloc+0x7f/0x90 kmem_cache_alloc_node+0x1cd/0x410 kmalloc_reserve+0x165/0x270 __alloc_skb+0x129/0x330 inet6_ifa_notify+0x118/0x230 __ipv6_ifa_notify+0x177/0xbe0 addrconf_dad_completed+0x133/0xe00 addrconf_dad_work+0x764/0x1390 process_one_work+0xa32/0x16f0 worker_thread+0x67d/0x10c0 kthread+0x344/0x440 ret_from_fork+0x1f/0x30 The buggy address belongs to the object at ffff888111145800 which belongs to the cache skbuff_small_head of size 640 The buggy address is located 239 bytes inside of freed 640-byte region [ffff888111145800, ffff888111145a80) As commit f855691975bb ("xfrm6: Fix the nexthdr offset in _decode_session6.") showed, xfrm_decode_session was originally intended only for the receive path. IP6CB(skb)->nhoff is not set during transmission. Therefore, set the cb field in the skb to 0 before sending packets. Fixes: f855691975bb ("xfrm6: Fix the nexthdr offset in _decode_session6.") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-07-10xfrm: Silence warnings triggerable by bad packetsHerbert Xu
After the elimination of inner modes, a couple of warnings that were previously unreachable can now be triggered by malformed inbound packets. Fix this by: 1. Moving the setting of skb->protocol into the decap functions. 2. Returning -EINVAL when unexpected protocol is seen. Reported-by: Maciej Żenczykowski<maze@google.com> Fixes: 5f24f41e8ea6 ("xfrm: Remove inner/outer modes from input path") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2023-07-10net: sched: Replace strlcpy with strscpyAzeem Shaikh
strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). Direct replacement is safe here since return value of -errno is used to check for truncation instead of sizeof(dest). [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy [2] https://github.com/KSPP/linux/issues/89 Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-09ipv6/addrconf: fix a potential refcount underflow for idevZiyang Xuan
Now in addrconf_mod_rs_timer(), reference idev depends on whether rs_timer is not pending. Then modify rs_timer timeout. There is a time gap in [1], during which if the pending rs_timer becomes not pending. It will miss to hold idev, but the rs_timer is activated. Thus rs_timer callback function addrconf_rs_timer() will be executed and put idev later without holding idev. A refcount underflow issue for idev can be caused by this. if (!timer_pending(&idev->rs_timer)) in6_dev_hold(idev); <--------------[1] mod_timer(&idev->rs_timer, jiffies + when); To fix the issue, hold idev if mod_timer() return 0. Fixes: b7b1bfce0bb6 ("ipv6: split duplicate address detection and router solicitation timer") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-08udp6: fix udp6_ehashfn() typoEric Dumazet
Amit Klein reported that udp6_ehash_secret was initialized but never used. Fixes: 1bbdceef1e53 ("inet: convert inet_ehash_secret and ipv6_hash_secret to net_get_random_once") Reported-by: Amit Klein <aksecurity@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: David Ahern <dsahern@kernel.org> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-08icmp6: Fix null-ptr-deref of ip6_null_entry->rt6i_idev in icmp6_dev().Kuniyuki Iwashima
With some IPv6 Ext Hdr (RPL, SRv6, etc.), we can send a packet that has the link-local address as src and dst IP and will be forwarded to an external IP in the IPv6 Ext Hdr. For example, the script below generates a packet whose src IP is the link-local address and dst is updated to 11::. # for f in $(find /proc/sys/net/ -name *seg6_enabled*); do echo 1 > $f; done # python3 >>> from socket import * >>> from scapy.all import * >>> >>> SRC_ADDR = DST_ADDR = "fe80::5054:ff:fe12:3456" >>> >>> pkt = IPv6(src=SRC_ADDR, dst=DST_ADDR) >>> pkt /= IPv6ExtHdrSegmentRouting(type=4, addresses=["11::", "22::"], segleft=1) >>> >>> sk = socket(AF_INET6, SOCK_RAW, IPPROTO_RAW) >>> sk.sendto(bytes(pkt), (DST_ADDR, 0)) For such a packet, we call ip6_route_input() to look up a route for the next destination in these three functions depending on the header type. * ipv6_rthdr_rcv() * ipv6_rpl_srh_rcv() * ipv6_srh_rcv() If no route is found, ip6_null_entry is set to skb, and the following dst_input(skb) calls ip6_pkt_drop(). Finally, in icmp6_dev(), we dereference skb_rt6_info(skb)->rt6i_idev->dev as the input device is the loopback interface. Then, we have to check if skb_rt6_info(skb)->rt6i_idev is NULL or not to avoid NULL pointer deref for ip6_null_entry. BUG: kernel NULL pointer dereference, address: 0000000000000000 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 157 Comm: python3 Not tainted 6.4.0-11996-gb121d614371c #35 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:icmp6_send (net/ipv6/icmp.c:436 net/ipv6/icmp.c:503) Code: fe ff ff 48 c7 40 30 c0 86 5d 83 e8 c6 44 1c 00 e9 c8 fc ff ff 49 8b 46 58 48 83 e0 fe 0f 84 4a fb ff ff 48 8b 80 d0 00 00 00 <48> 8b 00 44 8b 88 e0 00 00 00 e9 34 fb ff ff 4d 85 ed 0f 85 69 01 RSP: 0018:ffffc90000003c70 EFLAGS: 00000286 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000000000e0 RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffff888006d72a18 RBP: ffffc90000003d80 R08: 0000000000000000 R09: 0000000000000001 R10: ffffc90000003d98 R11: 0000000000000040 R12: ffff888006d72a10 R13: 0000000000000000 R14: ffff8880057fb800 R15: ffffffff835d86c0 FS: 00007f9dc72ee740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000057b2000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <IRQ> ip6_pkt_drop (net/ipv6/route.c:4513) ipv6_rthdr_rcv (net/ipv6/exthdrs.c:640 net/ipv6/exthdrs.c:686) ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:437 (discriminator 5)) ip6_input_finish (./include/linux/rcupdate.h:781 net/ipv6/ip6_input.c:483) __netif_receive_skb_one_core (net/core/dev.c:5455) process_backlog (./include/linux/rcupdate.h:781 net/core/dev.c:5895) __napi_poll (net/core/dev.c:6460) net_rx_action (net/core/dev.c:6529 net/core/dev.c:6660) __do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:554) do_softirq (kernel/softirq.c:454 kernel/softirq.c:441) </IRQ> <TASK> __local_bh_enable_ip (kernel/softirq.c:381) __dev_queue_xmit (net/core/dev.c:4231) ip6_finish_output2 (./include/net/neighbour.h:544 net/ipv6/ip6_output.c:135) rawv6_sendmsg (./include/net/dst.h:458 ./include/linux/netfilter.h:303 net/ipv6/raw.c:656 net/ipv6/raw.c:914) sock_sendmsg (net/socket.c:725 net/socket.c:748) __sys_sendto (net/socket.c:2134) __x64_sys_sendto (net/socket.c:2146 net/socket.c:2142 net/socket.c:2142) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7f9dc751baea Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 RSP: 002b:00007ffe98712c38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007ffe98712cf8 RCX: 00007f9dc751baea RDX: 0000000000000060 RSI: 00007f9dc6460b90 RDI: 0000000000000003 RBP: 00007f9dc56e8be0 R08: 00007ffe98712d70 R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f9dc6af5d1b </TASK> Modules linked in: CR2: 0000000000000000 ---[ end trace 0000000000000000 ]--- RIP: 0010:icmp6_send (net/ipv6/icmp.c:436 net/ipv6/icmp.c:503) Code: fe ff ff 48 c7 40 30 c0 86 5d 83 e8 c6 44 1c 00 e9 c8 fc ff ff 49 8b 46 58 48 83 e0 fe 0f 84 4a fb ff ff 48 8b 80 d0 00 00 00 <48> 8b 00 44 8b 88 e0 00 00 00 e9 34 fb ff ff 4d 85 ed 0f 85 69 01 RSP: 0018:ffffc90000003c70 EFLAGS: 00000286 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00000000000000e0 RDX: 0000000000000021 RSI: 0000000000000000 RDI: ffff888006d72a18 RBP: ffffc90000003d80 R08: 0000000000000000 R09: 0000000000000001 R10: ffffc90000003d98 R11: 0000000000000040 R12: ffff888006d72a10 R13: 0000000000000000 R14: ffff8880057fb800 R15: ffffffff835d86c0 FS: 00007f9dc72ee740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000057b2000 CR4: 00000000007506f0 PKRU: 55555554 Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled Fixes: 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address") Reported-by: Wang Yufen <wangyufen@huawei.com> Closes: https://lore.kernel.org/netdev/c41403a9-c2f6-3b7e-0c96-e1901e605cd0@huawei.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-08net: prevent skb corruption on frag list segmentationPaolo Abeni
Ian reported several skb corruptions triggered by rx-gro-list, collecting different oops alike: [ 62.624003] BUG: kernel NULL pointer dereference, address: 00000000000000c0 [ 62.631083] #PF: supervisor read access in kernel mode [ 62.636312] #PF: error_code(0x0000) - not-present page [ 62.641541] PGD 0 P4D 0 [ 62.644174] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 62.648629] CPU: 1 PID: 913 Comm: napi/eno2-79 Not tainted 6.4.0 #364 [ 62.655162] Hardware name: Supermicro Super Server/A2SDi-12C-HLN4F, BIOS 1.7a 10/13/2022 [ 62.663344] RIP: 0010:__udp_gso_segment (./include/linux/skbuff.h:2858 ./include/linux/udp.h:23 net/ipv4/udp_offload.c:228 net/ipv4/udp_offload.c:261 net/ipv4/udp_offload.c:277) [ 62.687193] RSP: 0018:ffffbd3a83b4f868 EFLAGS: 00010246 [ 62.692515] RAX: 00000000000000ce RBX: 0000000000000000 RCX: 0000000000000000 [ 62.699743] RDX: ffffa124def8a000 RSI: 0000000000000079 RDI: ffffa125952a14d4 [ 62.706970] RBP: ffffa124def8a000 R08: 0000000000000022 R09: 00002000001558c9 [ 62.714199] R10: 0000000000000000 R11: 00000000be554639 R12: 00000000000000e2 [ 62.721426] R13: ffffa125952a1400 R14: ffffa125952a1400 R15: 00002000001558c9 [ 62.728654] FS: 0000000000000000(0000) GS:ffffa127efa40000(0000) knlGS:0000000000000000 [ 62.736852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 62.742702] CR2: 00000000000000c0 CR3: 00000001034b0000 CR4: 00000000003526e0 [ 62.749948] Call Trace: [ 62.752498] <TASK> [ 62.779267] inet_gso_segment (net/ipv4/af_inet.c:1398) [ 62.787605] skb_mac_gso_segment (net/core/gro.c:141) [ 62.791906] __skb_gso_segment (net/core/dev.c:3403 (discriminator 2)) [ 62.800492] validate_xmit_skb (./include/linux/netdevice.h:4862 net/core/dev.c:3659) [ 62.804695] validate_xmit_skb_list (net/core/dev.c:3710) [ 62.809158] sch_direct_xmit (net/sched/sch_generic.c:330) [ 62.813198] __dev_queue_xmit (net/core/dev.c:3805 net/core/dev.c:4210) net/netfilter/core.c:626) [ 62.821093] br_dev_queue_push_xmit (net/bridge/br_forward.c:55) [ 62.825652] maybe_deliver (net/bridge/br_forward.c:193) [ 62.829420] br_flood (net/bridge/br_forward.c:233) [ 62.832758] br_handle_frame_finish (net/bridge/br_input.c:215) [ 62.837403] br_handle_frame (net/bridge/br_input.c:298 net/bridge/br_input.c:416) [ 62.851417] __netif_receive_skb_core.constprop.0 (net/core/dev.c:5387) [ 62.866114] __netif_receive_skb_list_core (net/core/dev.c:5570) [ 62.871367] netif_receive_skb_list_internal (net/core/dev.c:5638 net/core/dev.c:5727) [ 62.876795] napi_complete_done (./include/linux/list.h:37 ./include/net/gro.h:434 ./include/net/gro.h:429 net/core/dev.c:6067) [ 62.881004] ixgbe_poll (drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3191) [ 62.893534] __napi_poll (net/core/dev.c:6498) [ 62.897133] napi_threaded_poll (./include/linux/netpoll.h:89 net/core/dev.c:6640) [ 62.905276] kthread (kernel/kthread.c:379) [ 62.913435] ret_from_fork (arch/x86/entry/entry_64.S:314) [ 62.917119] </TASK> In the critical scenario, rx-gro-list GRO-ed packets are fed, via a bridge, both to the local input path and to an egress device (tun). The segmentation of such packets unsafely writes to the cloned skbs with shared heads. This change addresses the issue by uncloning as needed the to-be-segmented skbs. Reported-by: Ian Kumlien <ian.kumlien@gmail.com> Tested-by: Ian Kumlien <ian.kumlien@gmail.com> Fixes: 3a1296a38d0c ("net: Support GRO/GSO fraglist chaining.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-07udp6: add a missing call into udp_fail_queue_rcv_skb tracepointIvan Babrou
The tracepoint has existed for 12 years, but it only covered udp over the legacy IPv4 protocol. Having it enabled for udp6 removes the unnecessary difference in error visibility. Signed-off-by: Ivan Babrou <ivan@cloudflare.com> Fixes: 296f7ea75b45 ("udp: add tracepoints for queueing skb to rcvbuf") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-07-06net/sched: cls_fw: Fix improper refcount update leads to use-after-freeM A Ramdhan
In the event of a failure in tcf_change_indev(), fw_set_parms() will immediately return an error after incrementing or decrementing reference counter in tcf_bind_filter(). If attacker can control reference counter to zero and make reference freed, leading to use after free. In order to prevent this, move the point of possible failure above the point where the TC_FW_CLASSID is handled. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: M A Ramdhan <ramdhan@starlabs.sg> Signed-off-by: M A Ramdhan <ramdhan@starlabs.sg> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Message-ID: <20230705161530.52003-1-ramdhan@starlabs.sg> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-06Merge tag 'nf-23-07-06' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix missing overflow use refcount checks in nf_tables. 2) Do not set IPS_ASSURED for IPS_NAT_CLASH entries in GRE tracker, from Florian Westphal. 3) Bail out if nf_ct_helper_hash is NULL before registering helper, from Florent Revest. 4) Use siphash() instead siphash_4u64() to fix performance regression, also from Florian. 5) Do not allow to add rules to removed chains via ID, from Thadeu Lima de Souza Cascardo. 6) Fix oob read access in byteorder expression, also from Thadeu. netfilter pull request 23-07-06 ==================== Link: https://lore.kernel.org/r/20230705230406.52201-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-07-06netfilter: nf_tables: prevent OOB access in nft_byteorder_evalThadeu Lima de Souza Cascardo
When evaluating byteorder expressions with size 2, a union with 32-bit and 16-bit members is used. Since the 16-bit members are aligned to 32-bit, the array accesses will be out-of-bounds. It may lead to a stack-out-of-bounds access like the one below: [ 23.095215] ================================================================== [ 23.095625] BUG: KASAN: stack-out-of-bounds in nft_byteorder_eval+0x13c/0x320 [ 23.096020] Read of size 2 at addr ffffc90000007948 by task ping/115 [ 23.096358] [ 23.096456] CPU: 0 PID: 115 Comm: ping Not tainted 6.4.0+ #413 [ 23.096770] Call Trace: [ 23.096910] <IRQ> [ 23.097030] dump_stack_lvl+0x60/0xc0 [ 23.097218] print_report+0xcf/0x630 [ 23.097388] ? nft_byteorder_eval+0x13c/0x320 [ 23.097577] ? kasan_addr_to_slab+0xd/0xc0 [ 23.097760] ? nft_byteorder_eval+0x13c/0x320 [ 23.097949] kasan_report+0xc9/0x110 [ 23.098106] ? nft_byteorder_eval+0x13c/0x320 [ 23.098298] __asan_load2+0x83/0xd0 [ 23.098453] nft_byteorder_eval+0x13c/0x320 [ 23.098659] nft_do_chain+0x1c8/0xc50 [ 23.098852] ? __pfx_nft_do_chain+0x10/0x10 [ 23.099078] ? __kasan_check_read+0x11/0x20 [ 23.099295] ? __pfx___lock_acquire+0x10/0x10 [ 23.099535] ? __pfx___lock_acquire+0x10/0x10 [ 23.099745] ? __kasan_check_read+0x11/0x20 [ 23.099929] nft_do_chain_ipv4+0xfe/0x140 [ 23.100105] ? __pfx_nft_do_chain_ipv4+0x10/0x10 [ 23.100327] ? lock_release+0x204/0x400 [ 23.100515] ? nf_hook.constprop.0+0x340/0x550 [ 23.100779] nf_hook_slow+0x6c/0x100 [ 23.100977] ? __pfx_nft_do_chain_ipv4+0x10/0x10 [ 23.101223] nf_hook.constprop.0+0x334/0x550 [ 23.101443] ? __pfx_ip_local_deliver_finish+0x10/0x10 [ 23.101677] ? __pfx_nf_hook.constprop.0+0x10/0x10 [ 23.101882] ? __pfx_ip_rcv_finish+0x10/0x10 [ 23.102071] ? __pfx_ip_local_deliver_finish+0x10/0x10 [ 23.102291] ? rcu_read_lock_held+0x4b/0x70 [ 23.102481] ip_local_deliver+0xbb/0x110 [ 23.102665] ? __pfx_ip_rcv+0x10/0x10 [ 23.102839] ip_rcv+0x199/0x2a0 [ 23.102980] ? __pfx_ip_rcv+0x10/0x10 [ 23.103140] __netif_receive_skb_one_core+0x13e/0x150 [ 23.103362] ? __pfx___netif_receive_skb_one_core+0x10/0x10 [ 23.103647] ? mark_held_locks+0x48/0xa0 [ 23.103819] ? process_backlog+0x36c/0x380 [ 23.103999] __netif_receive_skb+0x23/0xc0 [ 23.104179] process_backlog+0x91/0x380 [ 23.104350] __napi_poll.constprop.0+0x66/0x360 [ 23.104589] ? net_rx_action+0x1cb/0x610 [ 23.104811] net_rx_action+0x33e/0x610 [ 23.105024] ? _raw_spin_unlock+0x23/0x50 [ 23.105257] ? __pfx_net_rx_action+0x10/0x10 [ 23.105485] ? mark_held_locks+0x48/0xa0 [ 23.105741] __do_softirq+0xfa/0x5ab [ 23.105956] ? __dev_queue_xmit+0x765/0x1c00 [ 23.106193] do_softirq.part.0+0x49/0xc0 [ 23.106423] </IRQ> [ 23.106547] <TASK> [ 23.106670] __local_bh_enable_ip+0xf5/0x120 [ 23.106903] __dev_queue_xmit+0x789/0x1c00 [ 23.107131] ? __pfx___dev_queue_xmit+0x10/0x10 [ 23.107381] ? find_held_lock+0x8e/0xb0 [ 23.107585] ? lock_release+0x204/0x400 [ 23.107798] ? neigh_resolve_output+0x185/0x350 [ 23.108049] ? mark_held_locks+0x48/0xa0 [ 23.108265] ? neigh_resolve_output+0x185/0x350 [ 23.108514] neigh_resolve_output+0x246/0x350 [ 23.108753] ? neigh_resolve_output+0x246/0x350 [ 23.109003] ip_finish_output2+0x3c3/0x10b0 [ 23.109250] ? __pfx_ip_finish_output2+0x10/0x10 [ 23.109510] ? __pfx_nf_hook+0x10/0x10 [ 23.109732] __ip_finish_output+0x217/0x390 [ 23.109978] ip_finish_output+0x2f/0x130 [ 23.110207] ip_output+0xc9/0x170 [ 23.110404] ip_push_pending_frames+0x1a0/0x240 [ 23.110652] raw_sendmsg+0x102e/0x19e0 [ 23.110871] ? __pfx_raw_sendmsg+0x10/0x10 [ 23.111093] ? lock_release+0x204/0x400 [ 23.111304] ? __mod_lruvec_page_state+0x148/0x330 [ 23.111567] ? find_held_lock+0x8e/0xb0 [ 23.111777] ? find_held_lock+0x8e/0xb0 [ 23.111993] ? __rcu_read_unlock+0x7c/0x2f0 [ 23.112225] ? aa_sk_perm+0x18a/0x550 [ 23.112431] ? filemap_map_pages+0x4f1/0x900 [ 23.112665] ? __pfx_aa_sk_perm+0x10/0x10 [ 23.112880] ? find_held_lock+0x8e/0xb0 [ 23.113098] inet_sendmsg+0xa0/0xb0 [ 23.113297] ? inet_sendmsg+0xa0/0xb0 [ 23.113500] ? __pfx_inet_sendmsg+0x10/0x10 [ 23.113727] sock_sendmsg+0xf4/0x100 [ 23.113924] ? move_addr_to_kernel.part.0+0x4f/0xa0 [ 23.114190] __sys_sendto+0x1d4/0x290 [ 23.114391] ? __pfx___sys_sendto+0x10/0x10 [ 23.114621] ? __pfx_mark_lock.part.0+0x10/0x10 [ 23.114869] ? lock_release+0x204/0x400 [ 23.115076] ? find_held_lock+0x8e/0xb0 [ 23.115287] ? rcu_is_watching+0x23/0x60 [ 23.115503] ? __rseq_handle_notify_resume+0x6e2/0x860 [ 23.115778] ? __kasan_check_write+0x14/0x30 [ 23.116008] ? blkcg_maybe_throttle_current+0x8d/0x770 [ 23.116285] ? mark_held_locks+0x28/0xa0 [ 23.116503] ? do_syscall_64+0x37/0x90 [ 23.116713] __x64_sys_sendto+0x7f/0xb0 [ 23.116924] do_syscall_64+0x59/0x90 [ 23.117123] ? irqentry_exit_to_user_mode+0x25/0x30 [ 23.117387] ? irqentry_exit+0x77/0xb0 [ 23.117593] ? exc_page_fault+0x92/0x140 [ 23.117806] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 23.118081] RIP: 0033:0x7f744aee2bba [ 23.118282] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89 [ 23.119237] RSP: 002b:00007ffd04a7c9f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c [ 23.119644] RAX: ffffffffffffffda RBX: 00007ffd04a7e0a0 RCX: 00007f744aee2bba [ 23.120023] RDX: 0000000000000040 RSI: 000056488e9e6300 RDI: 0000000000000003 [ 23.120413] RBP: 000056488e9e6300 R08: 00007ffd04a80320 R09: 0000000000000010 [ 23.120809] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000040 [ 23.121219] R13: 00007ffd04a7dc38 R14: 00007ffd04a7ca00 R15: 00007ffd04a7e0a0 [ 23.121617] </TASK> [ 23.121749] [ 23.121845] The buggy address belongs to the virtual mapping at [ 23.121845] [ffffc90000000000, ffffc90000009000) created by: [ 23.121845] irq_init_percpu_irqstack+0x1cf/0x270 [ 23.122707] [ 23.122803] The buggy address belongs to the physical page: [ 23.123104] page:0000000072ac19f0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x24a09 [ 23.123609] flags: 0xfffffc0001000(reserved|node=0|zone=1|lastcpupid=0x1fffff) [ 23.123998] page_type: 0xffffffff() [ 23.124194] raw: 000fffffc0001000 ffffea0000928248 ffffea0000928248 0000000000000000 [ 23.124610] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 [ 23.125023] page dumped because: kasan: bad access detected [ 23.125326] [ 23.125421] Memory state around the buggy address: [ 23.125682] ffffc90000007800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 23.126072] ffffc90000007880: 00 00 00 00 00 f1 f1 f1 f1 f1 f1 00 00 f2 f2 00 [ 23.126455] >ffffc90000007900: 00 00 00 00 00 00 00 00 00 f2 f2 f2 f2 00 00 00 [ 23.126840] ^ [ 23.127138] ffffc90000007980: 00 00 00 00 00 00 00 00 00 00 00 00 00 f3 f3 f3 [ 23.127522] ffffc90000007a00: f3 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 [ 23.127906] ================================================================== [ 23.128324] Disabling lock debugging due to kernel taint Using simple s16 pointers for the 16-bit accesses fixes the problem. For the 32-bit accesses, src and dst can be used directly. Fixes: 96518518cc41 ("netfilter: add nftables") Cc: stable@vger.kernel.org Reported-by: Tanguy DUBROCA (@SidewayRE) from @Synacktiv working with ZDI Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-07-05Merge tag 'net-6.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth, bpf and wireguard. Current release - regressions: - nvme-tcp: fix comma-related oops after sendpage changes Current release - new code bugs: - ptp: make max_phase_adjustment sysfs device attribute invisible when not supported Previous releases - regressions: - sctp: fix potential deadlock on &net->sctp.addr_wq_lock - mptcp: - ensure subflow is unhashed before cleaning the backlog - do not rely on implicit state check in mptcp_listen() Previous releases - always broken: - net: fix net_dev_start_xmit trace event vs skb_transport_offset() - Bluetooth: - fix use-bdaddr-property quirk - L2CAP: fix multiple UaFs - ISO: use hci_sync for setting CIG parameters - hci_event: fix Set CIG Parameters error status handling - hci_event: fix parsing of CIS Established Event - MGMT: fix marking SCAN_RSP as not connectable - wireguard: queuing: use saner cpu selection wrapping - sched: act_ipt: various bug fixes for iptables <> TC interactions - sched: act_pedit: add size check for TCA_PEDIT_PARMS_EX - dsa: fixes for receiving PTP packets with 8021q and sja1105 tagging - eth: sfc: fix null-deref in devlink port without MAE access - eth: ibmvnic: do not reset dql stats on NON_FATAL err Misc: - xsk: honor SO_BINDTODEVICE on bind" * tag 'net-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits) nfp: clean mc addresses in application firmware when closing port selftests: mptcp: pm_nl_ctl: fix 32-bit support selftests: mptcp: depend on SYN_COOKIES selftests: mptcp: userspace_pm: report errors with 'remove' tests selftests: mptcp: userspace_pm: use correct server port selftests: mptcp: sockopt: return error if wrong mark selftests: mptcp: sockopt: use 'iptables-legacy' if available selftests: mptcp: connect: fail if nft supposed to work mptcp: do not rely on implicit state check in mptcp_listen() mptcp: ensure subflow is unhashed before cleaning the backlog s390/qeth: Fix vipa deletion octeontx-af: fix hardware timestamp configuration net: dsa: sja1105: always enable the send_meta options net: dsa: tag_sja1105: fix MAC DA patching from meta frames net: Replace strlcpy with strscpy pptp: Fix fib lookup calls. mlxsw: spectrum_router: Fix an IS_ERR() vs NULL check net/sched: act_pedit: Add size check for TCA_PEDIT_PARMS_EX xsk: Honor SO_BINDTODEVICE on bind ptp: Make max_phase_adjustment sysfs device attribute invisible when not supported ...
2023-07-05Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-07-05 We've added 2 non-merge commits during the last 1 day(s) which contain a total of 3 files changed, 16 insertions(+), 4 deletions(-). The main changes are: 1) Fix BTF to warn but not returning an error for a NULL BTF to still be able to load modules under CONFIG_DEBUG_INFO_BTF, from SeongJae Park. 2) Fix xsk sockets to honor SO_BINDTODEVICE in bind(), from Ilya Maximets. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xsk: Honor SO_BINDTODEVICE on bind bpf, btf: Warn but return no error for NULL btf from __register_btf_kfunc_id_set() ==================== Link: https://lore.kernel.org/r/20230705171716.6494-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-05netfilter: nf_tables: do not ignore genmask when looking up chain by idThadeu Lima de Souza Cascardo
When adding a rule to a chain referring to its ID, if that chain had been deleted on the same batch, the rule might end up referring to a deleted chain. This will lead to a WARNING like following: [ 33.098431] ------------[ cut here ]------------ [ 33.098678] WARNING: CPU: 5 PID: 69 at net/netfilter/nf_tables_api.c:2037 nf_tables_chain_destroy+0x23d/0x260 [ 33.099217] Modules linked in: [ 33.099388] CPU: 5 PID: 69 Comm: kworker/5:1 Not tainted 6.4.0+ #409 [ 33.099726] Workqueue: events nf_tables_trans_destroy_work [ 33.100018] RIP: 0010:nf_tables_chain_destroy+0x23d/0x260 [ 33.100306] Code: 8b 7c 24 68 e8 64 9c ed fe 4c 89 e7 e8 5c 9c ed fe 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d 31 c0 89 c6 89 c7 c3 cc cc cc cc <0f> 0b 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d 31 c0 89 c6 89 c7 [ 33.101271] RSP: 0018:ffffc900004ffc48 EFLAGS: 00010202 [ 33.101546] RAX: 0000000000000001 RBX: ffff888006fc0a28 RCX: 0000000000000000 [ 33.101920] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 33.102649] RBP: ffffc900004ffc78 R08: 0000000000000000 R09: 0000000000000000 [ 33.103018] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880135ef500 [ 33.103385] R13: 0000000000000000 R14: dead000000000122 R15: ffff888006fc0a10 [ 33.103762] FS: 0000000000000000(0000) GS:ffff888024c80000(0000) knlGS:0000000000000000 [ 33.104184] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 33.104493] CR2: 00007fe863b56a50 CR3: 00000000124b0001 CR4: 0000000000770ee0 [ 33.104872] PKRU: 55555554 [ 33.104999] Call Trace: [ 33.105113] <TASK> [ 33.105214] ? show_regs+0x72/0x90 [ 33.105371] ? __warn+0xa5/0x210 [ 33.105520] ? nf_tables_chain_destroy+0x23d/0x260 [ 33.105732] ? report_bug+0x1f2/0x200 [ 33.105902] ? handle_bug+0x46/0x90 [ 33.106546] ? exc_invalid_op+0x19/0x50 [ 33.106762] ? asm_exc_invalid_op+0x1b/0x20 [ 33.106995] ? nf_tables_chain_destroy+0x23d/0x260 [ 33.107249] ? nf_tables_chain_destroy+0x30/0x260 [ 33.107506] nf_tables_trans_destroy_work+0x669/0x680 [ 33.107782] ? mark_held_locks+0x28/0xa0 [ 33.107996] ? __pfx_nf_tables_trans_destroy_work+0x10/0x10 [ 33.108294] ? _raw_spin_unlock_irq+0x28/0x70 [ 33.108538] process_one_work+0x68c/0xb70 [ 33.108755] ? lock_acquire+0x17f/0x420 [ 33.108977] ? __pfx_process_one_work+0x10/0x10 [ 33.109218] ? do_raw_spin_lock+0x128/0x1d0 [ 33.109435] ? _raw_spin_lock_irq+0x71/0x80 [ 33.109634] worker_thread+0x2bd/0x700 [ 33.109817] ? __pfx_worker_thread+0x10/0x10 [ 33.110254] kthread+0x18b/0x1d0 [ 33.110410] ? __pfx_kthread+0x10/0x10 [ 33.110581] ret_from_fork+0x29/0x50 [ 33.110757] </TASK> [ 33.110866] irq event stamp: 1651 [ 33.111017] hardirqs last enabled at (1659): [<ffffffffa206a209>] __up_console_sem+0x79/0xa0 [ 33.111379] hardirqs last disabled at (1666): [<ffffffffa206a1ee>] __up_console_sem+0x5e/0xa0 [ 33.111740] softirqs last enabled at (1616): [<ffffffffa1f5d40e>] __irq_exit_rcu+0x9e/0xe0 [ 33.112094] softirqs last disabled at (1367): [<ffffffffa1f5d40e>] __irq_exit_rcu+0x9e/0xe0 [ 33.112453] ---[ end trace 0000000000000000 ]--- This is due to the nft_chain_lookup_byid ignoring the genmask. After this change, adding the new rule will fail as it will not find the chain. Fixes: 837830a4b439 ("netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute") Cc: stable@vger.kernel.org Reported-by: Mingi Cho of Theori working with ZDI Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>