summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-10-25Merge tag 'nf-23-10-25' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net This patch contains two late Netfilter's flowtable fixes for net: 1) Flowtable GC pushes back packets to classic path in every GC run, ie. every second. This is because NF_FLOW_HW_ESTABLISHED is only used by sched/act_ct (never set) and IPS_SEEN_REPLY might be unset by the time the flow is offloaded (this status bit is only reliable in the sched/act_ct datapath). 2) sched/act_ct logic to push back packets to classic path to reevaluate if UDP flow is unidirectional only applies if IPS_HW_OFFLOAD_BIT is set on and no hardware offload request is pending to be handled. From Vlad Buslov. These two patches fixes two problems that were introduced in the previous 6.5 development cycle. * tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: net/sched: act_ct: additional checks for outdated flows netfilter: flowtable: GC pushes back packets to classic path ==================== Link: https://lore.kernel.org/r/20231025100819.2664-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25vsock/virtio: initialize the_virtio_vsock before using VQsAlexandru Matei
Once VQs are filled with empty buffers and we kick the host, it can send connection requests. If the_virtio_vsock is not initialized before, replies are silently dropped and do not reach the host. virtio_transport_send_pkt() can queue packets once the_virtio_vsock is set, but they won't be processed until vsock->tx_run is set to true. We queue vsock->send_pkt_work when initialization finishes to send those packets queued earlier. Fixes: 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock") Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://lore.kernel.org/r/20231024191742.14259-1-alexandru.matei@uipath.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-25net: ipv6: fix typo in commentsDeming Wang
The word "advertize" should be replaced by "advertise". Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-25net: ipv4: fix typo in commentsDeming Wang
The word "advertize" should be replaced by "advertise". Signed-off-by: Deming Wang <wangdeming@inspur.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-25net/sched: act_ct: additional checks for outdated flowsVlad Buslov
Current nf_flow_is_outdated() implementation considers any flow table flow which state diverged from its underlying CT connection status for teardown which can be problematic in the following cases: - Flow has never been offloaded to hardware in the first place either because flow table has hardware offload disabled (flag NF_FLOWTABLE_HW_OFFLOAD is not set) or because it is still pending on 'add' workqueue to be offloaded for the first time. The former is incorrect, the later generates excessive deletions and additions of flows. - Flow is already pending to be updated on the workqueue. Tearing down such flows will also generate excessive removals from the flow table, especially on highly loaded system where the latency to re-offload a flow via 'add' workqueue can be quite high. When considering a flow for teardown as outdated verify that it is both offloaded to hardware and doesn't have any pending updates. Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple") Reviewed-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-25netfilter: flowtable: GC pushes back packets to classic pathPablo Neira Ayuso
Since 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple"), flowtable GC pushes back flows with IPS_SEEN_REPLY back to classic path in every run, ie. every second. This is because of a new check for NF_FLOW_HW_ESTABLISHED which is specific of sched/act_ct. In Netfilter's flowtable case, NF_FLOW_HW_ESTABLISHED never gets set on and IPS_SEEN_REPLY is unreliable since users decide when to offload the flow before, such bit might be set on at a later stage. Fix it by adding a custom .gc handler that sched/act_ct can use to deal with its NF_FLOW_HW_ESTABLISHED bit. Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple") Reported-by: Vladimir Smelhaus <vl.sm@email.cz> Reviewed-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-10-24Merge tag 'wireless-2023-10-24' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Three more fixes: - don't drop all unprotected public action frames since some don't have a protected dual - fix pointer confusion in scanning code - fix warning in some connections with multiple links * tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: don't drop all unprotected public action frames wifi: cfg80211: fix assoc response warning on failed links wifi: cfg80211: pass correct pointer to rdev_inform_bss() ==================== Link: https://lore.kernel.org/r/20231024103540.19198-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23net/handshake: fix file ref count in handshake_nl_accept_doit()Moritz Wanzenböck
If req->hr_proto->hp_accept() fail, we call fput() twice: Once in the error path, but also a second time because sock->file is at that point already associated with the file descriptor. Once the task exits, as it would probably do after receiving an error reading from netlink, the fd is closed, calling fput() a second time. To fix, we move installing the file after the error path for the hp_accept() call. In the case of errors we simply put the unused fd. In case of success we can use fd_install() to link the sock->file to the reserved fd. Fixes: 7ea9c1ec66bc ("net/handshake: Fix handshake_dup() ref counting") Signed-off-by: Moritz Wanzenböck <moritz.wanzenboeck@linbit.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/20231019125847.276443-1-moritz.wanzenboeck@linbit.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-23wifi: mac80211: don't drop all unprotected public action framesAvraham Stern
Not all public action frames have a protected variant. When MFP is enabled drop only public action frames that have a dual protected variant. Fixes: 76a3059cf124 ("wifi: mac80211: drop some unprotected action frames") Signed-off-by: Avraham Stern <avraham.stern@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20231016145213.2973e3c8d3bb.I6198b8d3b04cf4a97b06660d346caec3032f232a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-10-23wifi: cfg80211: fix assoc response warning on failed linksJohannes Berg
The warning here shouldn't be done before we even set the bss field (or should've used the input data). Move the assignment before the warning to fix it. We noticed this now because of Wen's bugfix, where the bug fixed there had previously hidden this other bug. Fixes: 53ad07e9823b ("wifi: cfg80211: support reporting failed links") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-10-23wifi: cfg80211: pass correct pointer to rdev_inform_bss()Ben Greear
Confusing struct member names here resulted in passing the wrong pointer, causing crashes. Pass the correct one. Fixes: eb142608e2c4 ("wifi: cfg80211: use a struct for inform_single_bss data") Signed-off-by: Ben Greear <greearb@candelatech.com> Link: https://lore.kernel.org/r/20231021154827.1142734-1-greearb@candelatech.com [rewrite commit message, add fixes] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-10-22tcp: fix wrong RTO timeout when received SACK renegingFred Chen
This commit fix wrong RTO timeout when received SACK reneging. When an ACK arrived pointing to a SACK reneging, tcp_check_sack_reneging() will rearm the RTO timer for min(1/2*srtt, 10ms) into to the future. But since the commit 62d9f1a6945b ("tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN") merged, the tcp_set_xmit_timer() is moved after tcp_fastretrans_alert()(which do the SACK reneging check), so the RTO timeout will be overwrited by tcp_set_xmit_timer() with icsk_rto instead of 1/2*srtt. Here is a packetdrill script to check this bug: 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 // simulate srtt to 100ms +0 < S 0:0(0) win 32792 <mss 1000, sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> +.1 < . 1:1(0) ack 1 win 1024 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 10000) = 10000 +0 > P. 1:10001(10000) ack 1 // inject sack +.1 < . 1:1(0) ack 1 win 257 <sack 1001:10001,nop,nop> +0 > . 1:1001(1000) ack 1 // inject sack reneging +.1 < . 1:1(0) ack 1001 win 257 <sack 9001:10001,nop,nop> // we expect rto fired in 1/2*srtt (50ms) +.05 > . 1001:2001(1000) ack 1 This fix remove the FLAG_SET_XMIT_TIMER from ack_flag when tcp_check_sack_reneging() set RTO timer with 1/2*srtt to avoid being overwrited later. Fixes: 62d9f1a6945b ("tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN") Signed-off-by: Fred Chen <fred.chenchen03@gmail.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-20neighbour: fix various data-racesEric Dumazet
1) tbl->gc_thresh1, tbl->gc_thresh2, tbl->gc_thresh3 and tbl->gc_interval can be written from sysfs. 2) tbl->last_flush is read locklessly from neigh_alloc() 3) tbl->proxy_queue.qlen is read locklessly from neightbl_fill_info() 4) neightbl_fill_info() reads cpu stats that can be changed concurrently. Fixes: c7fb64db001f ("[NETLINK]: Neighbour table configuration and statistics via rtnetlink") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231019122104.1448310-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-20net: do not leave an empty skb in write queueEric Dumazet
Under memory stress conditions, tcp_sendmsg_locked() might call sk_stream_wait_memory(), thus releasing the socket lock. If a fresh skb has been allocated prior to this, we should not leave it in the write queue otherwise tcp_write_xmit() could panic. This apparently does not happen often, but a future change in __sk_mem_raise_allocated() that Shakeel and others are considering would increase chances of being hurt. Under discussion is to remove this controversial part: /* Fail only if socket is _under_ its sndbuf. * In this case we cannot block, so that we have to fail. */ if (sk->sk_wmem_queued + size >= sk->sk_sndbuf) { /* Force charge with __GFP_NOFAIL */ if (memcg_charge && !charged) { mem_cgroup_charge_skmem(sk->sk_memcg, amt, gfp_memcg_charge() | __GFP_NOFAIL); } return 1; } Fixes: fdfc5c8594c2 ("tcp: remove empty skb from write queue in error cases") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20231019112457.1190114-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19Merge tag 'net-6.6-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth, netfilter, WiFi. Feels like an up-tick in regression fixes, mostly for older releases. The hfsc fix, tcp_disconnect() and Intel WWAN fixes stand out as fairly clear-cut user reported regressions. The mlx5 DMA bug was causing strife for 390x folks. The fixes themselves are not particularly scary, tho. No open investigations / outstanding reports at the time of writing. Current release - regressions: - eth: mlx5: perform DMA operations in the right locations, make devices usable on s390x, again - sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve, previous fix of rejecting invalid config broke some scripts - rfkill: reduce data->mtx scope in rfkill_fop_open, avoid deadlock - revert "ethtool: Fix mod state of verbose no_mask bitset", needs more work Current release - new code bugs: - tcp: fix listen() warning with v4-mapped-v6 address Previous releases - regressions: - tcp: allow tcp_disconnect() again when threads are waiting, it was denied to plug a constant source of bugs but turns out .NET depends on it - eth: mlx5: fix double-free if buffer refill fails under OOM - revert "net: wwan: iosm: enable runtime pm support for 7560", it's causing regressions and the WWAN team at Intel disappeared - tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb, fix single-stream perf regression on some devices Previous releases - always broken: - Bluetooth: - fix issues in legacy BR/EDR PIN code pairing - correctly bounds check and pad HCI_MON_NEW_INDEX name - netfilter: - more fixes / follow ups for the large "commit protocol" rework, which went in as a fix to 6.5 - fix null-derefs on netlink attrs which user may not pass in - tcp: fix excessive TLP and RACK timeouts from HZ rounding (bless Debian for keeping HZ=250 alive) - net: more strict VIRTIO_NET_HDR_GSO_UDP_L4 validation, prevent letting frankenstein UDP super-frames from getting into the stack - net: fix interface altnames when ifc moves to a new namespace - eth: qed: fix the size of the RX buffers - mptcp: avoid sending RST when closing the initial subflow" * tag 'net-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits) Revert "ethtool: Fix mod state of verbose no_mask bitset" selftests: mptcp: join: no RST when rm subflow/addr mptcp: avoid sending RST when closing the initial subflow mptcp: more conservative check for zero probes tcp: check mptcp-level constraints for backlog coalescing selftests: mptcp: join: correctly check for no RST net: ti: icssg-prueth: Fix r30 CMDs bitmasks selftests: net: add very basic test for netdev names and namespaces net: move altnames together with the netdevice net: avoid UAF on deleted altname net: check for altname conflicts when changing netdev's netns net: fix ifname in netlink ntf during netns move net: ethernet: ti: Fix mixed module-builtin object net: phy: bcm7xxx: Add missing 16nm EPHY statistics ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddr tcp_bpf: properly release resources on error paths net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curve net: mdio-mux: fix C45 access returning -EIO after API change tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skb octeon_ep: update BQL sent bytes before ringing doorbell ...
2023-10-19Revert "ethtool: Fix mod state of verbose no_mask bitset"Kory Maincent
This reverts commit 108a36d07c01edbc5942d27c92494d1c6e4d45a0. It was reported that this fix breaks the possibility to remove existing WoL flags. For example: ~$ ethtool lan2 ... Supports Wake-on: pg Wake-on: d ... ~$ ethtool -s lan2 wol gp ~$ ethtool lan2 ... Wake-on: pg ... ~$ ethtool -s lan2 wol d ~$ ethtool lan2 ... Wake-on: pg ... This worked correctly before this commit because we were always updating a zero bitmap (since commit 6699170376ab ("ethtool: fix application of verbose no_mask bitset"), that is) so that the rest was left zero naturally. But now the 1->0 change (old_val is true, bit not present in netlink nest) no longer works. Reported-by: Oleksij Rempel <o.rempel@pengutronix.de> Reported-by: Michal Kubecek <mkubecek@suse.cz> Closes: https://lore.kernel.org/netdev/20231019095140.l6fffnszraeb6iiw@lion.mk-sys.cz/ Cc: stable@vger.kernel.org Fixes: 108a36d07c01 ("ethtool: Fix mod state of verbose no_mask bitset") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Reviewed-by: Michal Kubecek <mkubecek@suse.cz> Link: https://lore.kernel.org/r/20231019-feature_ptp_bitset_fix-v1-1-70f3c429a221@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19mptcp: avoid sending RST when closing the initial subflowGeliang Tang
When closing the first subflow, the MPTCP protocol unconditionally calls tcp_disconnect(), which in turn generates a reset if the subflow is established. That is unexpected and different from what MPTCP does with MPJ subflows, where resets are generated only on FASTCLOSE and other edge scenarios. We can't reuse for the first subflow the same code in place for MPJ subflows, as MPTCP clean them up completely via a tcp_close() call, while must keep the first subflow socket alive for later re-usage, due to implementation constraints. This patch adds a new helper __mptcp_subflow_disconnect() that encapsulates, a logic similar to tcp_close, issuing a reset only when the MPTCP_CF_FASTCLOSE flag is set, and performing a clean shutdown otherwise. Fixes: c2b2ae3925b6 ("mptcp: handle correctly disconnect() failures") Cc: stable@vger.kernel.org Reviewed-by: Matthieu Baerts <matttbe@kernel.org> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-4-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19mptcp: more conservative check for zero probesPaolo Abeni
Christoph reported that the MPTCP protocol can find the subflow-level write queue unexpectedly not empty while crafting a zero-window probe, hitting a warning: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 188 at net/mptcp/protocol.c:1312 mptcp_sendmsg_frag+0xc06/0xe70 Modules linked in: CPU: 0 PID: 188 Comm: kworker/0:2 Not tainted 6.6.0-rc2-g1176aa719d7a #47 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:mptcp_sendmsg_frag+0xc06/0xe70 net/mptcp/protocol.c:1312 RAX: 47d0530de347ff6a RBX: 47d0530de347ff6b RCX: ffff8881015d3c00 RDX: ffff8881015d3c00 RSI: 47d0530de347ff6b RDI: 47d0530de347ff6b RBP: 47d0530de347ff6b R08: ffffffff8243c6a8 R09: ffffffff82042d9c R10: 0000000000000002 R11: ffffffff82056850 R12: ffff88812a13d580 R13: 0000000000000001 R14: ffff88812b375e50 R15: ffff88812bbf3200 FS: 0000000000000000(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000695118 CR3: 0000000115dfc001 CR4: 0000000000170ef0 Call Trace: <TASK> __subflow_push_pending+0xa4/0x420 net/mptcp/protocol.c:1545 __mptcp_push_pending+0x128/0x3b0 net/mptcp/protocol.c:1614 mptcp_release_cb+0x218/0x5b0 net/mptcp/protocol.c:3391 release_sock+0xf6/0x100 net/core/sock.c:3521 mptcp_worker+0x6e8/0x8f0 net/mptcp/protocol.c:2746 process_scheduled_works+0x341/0x690 kernel/workqueue.c:2630 worker_thread+0x3a7/0x610 kernel/workqueue.c:2784 kthread+0x143/0x180 kernel/kthread.c:388 ret_from_fork+0x4d/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:304 </TASK> The root cause of the issue is that expectations are wrong: e.g. due to MPTCP-level re-injection we can hit the critical condition. Explicitly avoid the zero-window probe when the subflow write queue is not empty and drop the related warnings. Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/444 Fixes: f70cad1085d1 ("mptcp: stop relying on tcp_tx_skb_cache") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-3-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19tcp: check mptcp-level constraints for backlog coalescingPaolo Abeni
The MPTCP protocol can acquire the subflow-level socket lock and cause the tcp backlog usage. When inserting new skbs into the backlog, the stack will try to coalesce them. Currently, we have no check in place to ensure that such coalescing will respect the MPTCP-level DSS, and that may cause data stream corruption, as reported by Christoph. Address the issue by adding the relevant admission check for coalescing in tcp_add_backlog(). Note the issue is not easy to reproduce, as the MPTCP protocol tries hard to avoid acquiring the subflow-level socket lock. Fixes: 648ef4b88673 ("mptcp: Implement MPTCP receive path") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/420 Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <martineau@kernel.org> Link: https://lore.kernel.org/r/20231018-send-net-20231018-v1-2-17ecb002e41d@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-19net: move altnames together with the netdeviceJakub Kicinski
The altname nodes are currently not moved to the new netns when netdevice itself moves: [ ~]# ip netns add test [ ~]# ip -netns test link add name eth0 type dummy [ ~]# ip -netns test link property add dev eth0 altname some-name [ ~]# ip -netns test link show dev some-name 2: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 1e:67:ed:19:3d:24 brd ff:ff:ff:ff:ff:ff altname some-name [ ~]# ip -netns test link set dev eth0 netns 1 [ ~]# ip link ... 3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff altname some-name [ ~]# ip li show dev some-name Device "some-name" does not exist. Remove them from the hash table when device is unlisted and add back when listed again. Fixes: 36fbf1e52bd3 ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19net: avoid UAF on deleted altnameJakub Kicinski
Altnames are accessed under RCU (dev_get_by_name_rcu()) but freed by kfree() with no synchronization point. Each node has one or two allocations (node and a variable-size name, sometimes the name is netdev->name). Adding rcu_heads here is a bit tedious. Besides most code which unlists the names already has rcu barriers - so take the simpler approach of adding synchronize_rcu(). Note that the one on the unregistration path (which matters more) is removed by the next fix. Fixes: ff92741270bf ("net: introduce name_node struct to be used in hashlist") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19net: check for altname conflicts when changing netdev's netnsJakub Kicinski
It's currently possible to create an altname conflicting with an altname or real name of another device by creating it in another netns and moving it over: [ ~]$ ip link add dev eth0 type dummy [ ~]$ ip netns add test [ ~]$ ip -netns test link add dev ethX netns test type dummy [ ~]$ ip -netns test link property add dev ethX altname eth0 [ ~]$ ip -netns test link set dev ethX netns 1 [ ~]$ ip link ... 3: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 02:40:88:62:ec:b8 brd ff:ff:ff:ff:ff:ff ... 5: ethX: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 26:b7:28:78:38:0f brd ff:ff:ff:ff:ff:ff altname eth0 Create a macro for walking the altnames, this hopefully makes it clearer that the list we walk contains only altnames. Which is otherwise not entirely intuitive. Fixes: 36fbf1e52bd3 ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-19net: fix ifname in netlink ntf during netns moveJakub Kicinski
dev_get_valid_name() overwrites the netdev's name on success. This makes it hard to use in prepare-commit-like fashion, where we do validation first, and "commit" to the change later. Factor out a helper which lets us save the new name to a buffer. Use it to fix the problem of notification on netns move having incorrect name: 5: eth0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff 6: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether 1e:4a:34:36:e3:cd brd ff:ff:ff:ff:ff:ff [ ~]# ip link set dev eth0 netns 1 name eth1 ip monitor inside netns: Deleted inet eth0 Deleted inet6 eth0 Deleted 5: eth1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether be:4d:58:f9:d5:40 brd ff:ff:ff:ff:ff:ff new-netnsid 0 new-ifindex 7 Name is reported as eth1 in old netns for ifindex 5, already renamed. Fixes: d90310243fd7 ("net: device name allocation cleanups") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-18Merge tag 'nf-23-10-18' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter: updates for net First patch, from Phil Sutter, reduces number of audit notifications when userspace requests to re-set stateful objects. This change also comes with a selftest update. Second patch, also from Phil, moves the nftables audit selftest to its own netns to avoid interference with the init netns. Third patch, from Pablo Neira, fixes an inconsistency with the "rbtree" set backend: When set element X has expired, a request to delete element X should fail (like with all other backends). Finally, patch four, also from Pablo, reverts a recent attempt to speed up abort of a large pending update with the "pipapo" set backend. It could cause stray references to remain in the set, which then results in a double-free. * tag 'nf-23-10-18' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: revert do not remove elements if set backend implements .abort netfilter: nft_set_rbtree: .deactivate fails if element has expired selftests: netfilter: Run nft_audit.sh in its own netns netfilter: nf_tables: audit log object reset once per table ==================== Link: https://lore.kernel.org/r/20231018125605.27299-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18Merge tag 'wireless-2023-10-18' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== A few more fixes: * prevent value bounce/glitch in rfkill GPIO probe * fix lockdep report in rfkill * fix error path leak in mac80211 key handling * use system_unbound_wq for wiphy work since it can take longer * tag 'wireless-2023-10-18' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: net: rfkill: reduce data->mtx scope in rfkill_fop_open net: rfkill: gpio: prevent value glitch during probe wifi: mac80211: fix error path key leak wifi: cfg80211: use system_unbound_wq for wiphy work ==================== Link: https://lore.kernel.org/r/20231018071041.8175-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18ipv4: fib: annotate races around nh->nh_saddr_genid and nh->nh_saddrEric Dumazet
syzbot reported a data-race while accessing nh->nh_saddr_genid [1] Add annotations, but leave the code lazy as intended. [1] BUG: KCSAN: data-race in fib_select_path / fib_select_path write to 0xffff8881387166f0 of 4 bytes by task 6778 on cpu 1: fib_info_update_nhc_saddr net/ipv4/fib_semantics.c:1334 [inline] fib_result_prefsrc net/ipv4/fib_semantics.c:1354 [inline] fib_select_path+0x292/0x330 net/ipv4/fib_semantics.c:2269 ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810 ip_route_output_key_hash net/ipv4/route.c:2644 [inline] __ip_route_output_key include/net/route.h:134 [inline] ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872 send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61 wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 read to 0xffff8881387166f0 of 4 bytes by task 6759 on cpu 0: fib_result_prefsrc net/ipv4/fib_semantics.c:1350 [inline] fib_select_path+0x1cb/0x330 net/ipv4/fib_semantics.c:2269 ip_route_output_key_hash_rcu+0x659/0x12c0 net/ipv4/route.c:2810 ip_route_output_key_hash net/ipv4/route.c:2644 [inline] __ip_route_output_key include/net/route.h:134 [inline] ip_route_output_flow+0xa6/0x150 net/ipv4/route.c:2872 send4+0x1f5/0x520 drivers/net/wireguard/socket.c:61 wg_socket_send_skb_to_peer+0x94/0x130 drivers/net/wireguard/socket.c:175 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 value changed: 0x959d3217 -> 0x959d3218 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6759 Comm: kworker/u4:15 Not tainted 6.6.0-rc4-syzkaller-00029-gcbf3a2cb156a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023 Workqueue: wg-kex-wg1 wg_packet_handshake_send_worker Fixes: 436c3b66ec98 ("ipv4: Invalidate nexthop cache nh_saddr more correctly.") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231017192304.82626-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18tcp_bpf: properly release resources on error pathsPaolo Abeni
In the blamed commit below, I completely forgot to release the acquired resources before erroring out in the TCP BPF code, as reported by Dan. Address the issues by replacing the bogus return with a jump to the relevant cleanup code. Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/8f99194c698bcef12666f0a9a999c58f8b1cb52c.1697557782.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18net/sched: sch_hfsc: upgrade 'rt' to 'sc' when it becomes a inner curvePedro Tammela
Christian Theune says: I upgraded from 6.1.38 to 6.1.55 this morning and it broke my traffic shaping script, leaving me with a non-functional uplink on a remote router. A 'rt' curve cannot be used as a inner curve (parent class), but we were allowing such configurations since the qdisc was introduced. Such configurations would trigger a UAF as Budimir explains: The parent will have vttree_insert() called on it in init_vf(), but will not have vttree_remove() called on it in update_vf() because it does not have the HFSC_FSC flag set. The qdisc always assumes that inner classes have the HFSC_FSC flag set. This is by design as it doesn't make sense 'qdisc wise' for an 'rt' curve to be an inner curve. Budimir's original patch disallows users to add classes with a 'rt' parent, but this is too strict as it breaks users that have been using 'rt' as a inner class. Another approach, taken by this patch, is to upgrade the inner 'rt' into a 'sc', warning the user in the process. It avoids the UAF reported by Budimir while also being more permissive to bad scripts/users/code using 'rt' as a inner class. Users checking the `tc class ls [...]` or `tc class get [...]` dumps would observe the curve change and are potentially breaking with this change. v1->v2: https://lore.kernel.org/all/20231013151057.2611860-1-pctammela@mojatatu.com/ - Correct 'Fixes' tag and merge with revert (Jakub) Cc: Christian Theune <ct@flyingcircus.io> Cc: Budimir Markovic <markovicbudimir@gmail.com> Fixes: b3d26c5702c7 ("net/sched: sch_hfsc: Ensure inner classes have fsc curve") Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20231017143602.3191556-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18tcp: tsq: relax tcp_small_queue_check() when rtx queue contains a single skbEric Dumazet
In commit 75eefc6c59fd ("tcp: tsq: add a shortcut in tcp_small_queue_check()") we allowed to send an skb regardless of TSQ limits being hit if rtx queue was empty or had a single skb, in order to better fill the pipe when/if TX completions were slow. Then later, commit 75c119afe14f ("tcp: implement rb-tree based retransmit queue") accidentally removed the special case for one skb in rtx queue. Stefan Wahren reported a regression in single TCP flow throughput using a 100Mbit fec link, starting from commit 65466904b015 ("tcp: adjust TSO packet sizes based on min_rtt"). This last commit only made the regression more visible, because it locked the TCP flow on a particular behavior where TSQ prevented two skbs being pushed downstream, adding silences on the wire between each TSO packet. Many thanks to Stefan for his invaluable help ! Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue") Link: https://lore.kernel.org/netdev/7f31ddc8-9971-495e-a1f6-819df542e0af@gmx.net/ Reported-by: Stefan Wahren <wahrenst@gmx.net> Tested-by: Stefan Wahren <wahrenst@gmx.net> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20231017124526.4060202-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-18netfilter: nf_tables: revert do not remove elements if set backend ↵Pablo Neira Ayuso
implements .abort nf_tables_abort_release() path calls nft_set_elem_destroy() for NFT_MSG_NEWSETELEM which releases the element, however, a reference to the element still remains in the working copy. Fixes: ebd032fa8818 ("netfilter: nf_tables: do not remove elements if set backend implements .abort") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18netfilter: nft_set_rbtree: .deactivate fails if element has expiredPablo Neira Ayuso
This allows to remove an expired element which is not possible in other existing set backends, this is more noticeable if gc-interval is high so expired elements remain in the tree. On-demand gc also does not help in this case, because this is delete element path. Return NULL if element has expired. Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18netfilter: nf_tables: audit log object reset once per tablePhil Sutter
When resetting multiple objects at once (via dump request), emit a log message per table (or filled skb) and resurrect the 'entries' parameter to contain the number of objects being logged for. To test the skb exhaustion path, perform some bulk counter and quota adds in the kselftest. Signed-off-by: Phil Sutter <phil@nwl.cc> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> (Audit) Signed-off-by: Florian Westphal <fw@strlen.de>
2023-10-18net: pktgen: Fix interface flags printingGavrilov Ilia
Device flags are displayed incorrectly: 1) The comparison (i == F_FLOW_SEQ) is always false, because F_FLOW_SEQ is equal to (1 << FLOW_SEQ_SHIFT) == 2048, and the maximum value of the 'i' variable is (NR_PKT_FLAG - 1) == 17. It should be compared with FLOW_SEQ_SHIFT. 2) Similarly to the F_IPSEC flag. 3) Also add spaces to the print end of the string literal "spi:%u" to prevent the output from merging with the flag that follows. Found by InfoTeCS on behalf of Linux Verification Center (linuxtesting.org) with SVACE. Fixes: 99c6d3d20d62 ("pktgen: Remove brute-force printing of flags") Signed-off-by: Gavrilov Ilia <Ilia.Gavrilov@infotecs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-17Merge tag 'ipsec-2023-10-17' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2023-10-17 1) Fix a slab-use-after-free in xfrm_policy_inexact_list_reinsert. From Dong Chenchen. 2) Fix data-races in the xfrm interfaces dev->stats fields. From Eric Dumazet. 3) Fix a data-race in xfrm_gen_index. From Eric Dumazet. 4) Fix an inet6_dev refcount underflow. From Zhang Changzhong. 5) Check the return value of pskb_trim in esp_remove_trailer for esp4 and esp6. From Ma Ke. 6) Fix a data-race in xfrm_lookup_with_ifid. From Eric Dumazet. * tag 'ipsec-2023-10-17' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: fix a data-race in xfrm_lookup_with_ifid() net: ipv4: fix return value check in esp_remove_trailer net: ipv6: fix return value check in esp_remove_trailer xfrm6: fix inet6_dev refcount underflow problem xfrm: fix a data-race in xfrm_gen_index() xfrm: interface: use DEV_STATS_INC() net: xfrm: skip policies marked as dead while reinserting policies ==================== Link: https://lore.kernel.org/r/20231017083723.1364940-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-17tcp: fix excessive TLP and RACK timeouts from HZ roundingNeal Cardwell
We discovered from packet traces of slow loss recovery on kernels with the default HZ=250 setting (and min_rtt < 1ms) that after reordering, when receiving a SACKed sequence range, the RACK reordering timer was firing after about 16ms rather than the desired value of roughly min_rtt/4 + 2ms. The problem is largely due to the RACK reorder timer calculation adding in TCP_TIMEOUT_MIN, which is 2 jiffies. On kernels with HZ=250, this is 2*4ms = 8ms. The TLP timer calculation has the exact same issue. This commit fixes the TLP transmit timer and RACK reordering timer floor calculation to more closely match the intended 2ms floor even on kernels with HZ=250. It does this by adding in a new TCP_TIMEOUT_MIN_US floor of 2000 us and then converting to jiffies, instead of the current approach of converting to jiffies and then adding th TCP_TIMEOUT_MIN value of 2 jiffies. Our testing has verified that on kernels with HZ=1000, as expected, this does not produce significant changes in behavior, but on kernels with the default HZ=250 the latency improvement can be large. For example, our tests show that for HZ=250 kernels at low RTTs this fix roughly halves the latency for the RACK reorder timer: instead of mostly firing at 16ms it mostly fires at 8ms. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Fixes: bb4d991a28cc ("tcp: adjust tail loss probe timeout") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231015174700.2206872-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16Merge tag 'for-net-2023-10-13' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - Fix race when opening vhci device - Avoid memcmp() out of bounds warning - Correctly bounds check and pad HCI_MON_NEW_INDEX name - Fix using memcmp when comparing keys - Ignore error return for hci_devcd_register() in btrtl - Always check if connection is alive before deleting - Fix a refcnt underflow problem for hci_conn * tag 'for-net-2023-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_sock: Correctly bounds check and pad HCI_MON_NEW_INDEX name Bluetooth: avoid memcmp() out of bounds warning Bluetooth: hci_sock: fix slab oob read in create_monitor_event Bluetooth: btrtl: Ignore error return for hci_devcd_register() Bluetooth: hci_event: Fix coding style Bluetooth: hci_event: Fix using memcmp when comparing keys Bluetooth: Fix a refcnt underflow problem for hci_conn Bluetooth: hci_sync: always check if connection is alive before deleting Bluetooth: Reject connection with the device which has same BD_ADDR Bluetooth: hci_event: Ignore NULL link key Bluetooth: ISO: Fix invalid context error Bluetooth: vhci: Fix race when opening vhci device ==================== Link: https://lore.kernel.org/r/20231014031336.1664558-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16nfc: nci: fix possible NULL pointer dereference in send_acknowledge()Krzysztof Kozlowski
Handle memory allocation failure from nci_skb_alloc() (calling alloc_skb()) to avoid possible NULL pointer dereference. Reported-by: 黄思聪 <huangsicong@iie.ac.cn> Fixes: 391d8a2da787 ("NFC: Add NCI over SPI receive") Cc: <stable@vger.kernel.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20231013184129.18738-1-krzysztof.kozlowski@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16netlink: Correct offload_xstats sizeChristoph Paasch
rtnl_offload_xstats_get_size_hw_s_info_one() conditionalizes the size-computation for IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED based on whether or not the device has offload_xstats enabled. However, rtnl_offload_xstats_fill_hw_s_info_one() is adding the u8 for that field uncondtionally. syzkaller triggered a WARNING in rtnl_stats_get due to this: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 754 at net/core/rtnetlink.c:5982 rtnl_stats_get+0x2f4/0x300 Modules linked in: CPU: 0 PID: 754 Comm: syz-executor148 Not tainted 6.6.0-rc2-g331b78eb12af #45 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:rtnl_stats_get+0x2f4/0x300 net/core/rtnetlink.c:5982 Code: ff ff 89 ee e8 7d 72 50 ff 83 fd a6 74 17 e8 33 6e 50 ff 4c 89 ef be 02 00 00 00 e8 86 00 fa ff e9 7b fe ff ff e8 1c 6e 50 ff <0f> 0b eb e5 e8 73 79 7b 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 RSP: 0018:ffffc900006837c0 EFLAGS: 00010293 RAX: ffffffff81cf7f24 RBX: ffff8881015d9000 RCX: ffff888101815a00 RDX: 0000000000000000 RSI: 00000000ffffffa6 RDI: 00000000ffffffa6 RBP: 00000000ffffffa6 R08: ffffffff81cf7f03 R09: 0000000000000001 R10: ffff888101ba47b9 R11: ffff888101815a00 R12: ffff8881017dae00 R13: ffff8881017dad00 R14: ffffc90000683ab8 R15: ffffffff83c1f740 FS: 00007fbc22dbc740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000046 CR3: 000000010264e003 CR4: 0000000000170ef0 Call Trace: <TASK> rtnetlink_rcv_msg+0x677/0x710 net/core/rtnetlink.c:6480 netlink_rcv_skb+0xea/0x1c0 net/netlink/af_netlink.c:2545 netlink_unicast+0x430/0x500 net/netlink/af_netlink.c:1342 netlink_sendmsg+0x4fc/0x620 net/netlink/af_netlink.c:1910 sock_sendmsg+0xa8/0xd0 net/socket.c:730 ____sys_sendmsg+0x22a/0x320 net/socket.c:2541 ___sys_sendmsg+0x143/0x190 net/socket.c:2595 __x64_sys_sendmsg+0xd8/0x150 net/socket.c:2624 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7fbc22e8d6a9 Code: 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f 37 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc4320e778 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004007d0 RCX: 00007fbc22e8d6a9 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000003 RBP: 0000000000000001 R08: 0000000000000000 R09: 00000000004007d0 R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc4320e898 R13: 00007ffc4320e8a8 R14: 00000000004004a0 R15: 00007fbc22fa5a80 </TASK> ---[ end trace 0000000000000000 ]--- Which didn't happen prior to commit bf9f1baa279f ("net: add dedicated kmem_cache for typical/small skb->head") as the skb always was large enough. Fixes: 0e7788fd7622 ("net: rtnetlink: Add UAPI for obtaining L3 offload xstats") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Link: https://lore.kernel.org/r/20231013041448.8229-1-cpaasch@apple.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16net/smc: return the right falback reason when prefix checks failDust Li
In the smc_listen_work(), if smc_listen_prfx_check() failed, the real reason: SMC_CLC_DECL_DIFFPREFIX was dropped, and SMC_CLC_DECL_NOSMCDEV was returned. Althrough this is also kind of SMC_CLC_DECL_NOSMCDEV, but return the real reason is much friendly for debugging. Fixes: e49300a6bf62 ("net/smc: add listen processing for SMC-Rv2") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://lore.kernel.org/r/20231012123729.29307-1-dust.li@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13Bluetooth: hci_sock: Correctly bounds check and pad HCI_MON_NEW_INDEX nameKees Cook
The code pattern of memcpy(dst, src, strlen(src)) is almost always wrong. In this case it is wrong because it leaves memory uninitialized if it is less than sizeof(ni->name), and overflows ni->name when longer. Normally strtomem_pad() could be used here, but since ni->name is a trailing array in struct hci_mon_new_index, compilers that don't support -fstrict-flex-arrays=3 can't tell how large this array is via __builtin_object_size(). Instead, open-code the helper and use sizeof() since it will work correctly. Additionally mark ni->name as __nonstring since it appears to not be a %NUL terminated C string. Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Edward AD <twuufnxlz@gmail.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: Johan Hedberg <johan.hedberg@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: linux-bluetooth@vger.kernel.org Cc: netdev@vger.kernel.org Fixes: 18f547f3fc07 ("Bluetooth: hci_sock: fix slab oob read in create_monitor_event") Link: https://lore.kernel.org/lkml/202310110908.F2639D3276@keescook/ Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13Bluetooth: avoid memcmp() out of bounds warningArnd Bergmann
bacmp() is a wrapper around memcpy(), which contain compile-time checks for buffer overflow. Since the hci_conn_request_evt() also calls bt_dev_dbg() with an implicit NULL pointer check, the compiler is now aware of a case where 'hdev' is NULL and treats this as meaning that zero bytes are available: In file included from net/bluetooth/hci_event.c:32: In function 'bacmp', inlined from 'hci_conn_request_evt' at net/bluetooth/hci_event.c:3276:7: include/net/bluetooth/bluetooth.h:364:16: error: 'memcmp' specified bound 6 exceeds source size 0 [-Werror=stringop-overread] 364 | return memcmp(ba1, ba2, sizeof(bdaddr_t)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Add another NULL pointer check before the bacmp() to ensure the compiler understands the code flow enough to not warn about it. Since the patch that introduced the warning is marked for stable backports, this one should also go that way to avoid introducing build regressions. Fixes: 1ffc6f8cc332 ("Bluetooth: Reject connection with the device which has same BD_ADDR") Cc: Kees Cook <keescook@chromium.org> Cc: "Lee, Chun-Yi" <jlee@suse.com> Cc: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Cc: Marcel Holtmann <marcel@holtmann.org> Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13Bluetooth: hci_sock: fix slab oob read in create_monitor_eventEdward AD
When accessing hdev->name, the actual string length should prevail Reported-by: syzbot+c90849c50ed209d77689@syzkaller.appspotmail.com Fixes: dcda165706b9 ("Bluetooth: hci_core: Fix build warnings") Signed-off-by: Edward AD <twuufnxlz@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13Bluetooth: hci_event: Fix coding styleLuiz Augusto von Dentz
This fixes the following code style problem: ERROR: that open brace { should be on the previous line + if (!bacmp(&hdev->bdaddr, &ev->bdaddr)) + { Fixes: 1ffc6f8cc332 ("Bluetooth: Reject connection with the device which has same BD_ADDR") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13Bluetooth: hci_event: Fix using memcmp when comparing keysLuiz Augusto von Dentz
memcmp is not consider safe to use with cryptographic secrets: 'Do not use memcmp() to compare security critical data, such as cryptographic secrets, because the required CPU time depends on the number of equal bytes.' While usage of memcmp for ZERO_KEY may not be considered a security critical data, it can lead to more usage of memcmp with pairing keys which could introduce more security problems. Fixes: 455c2ff0a558 ("Bluetooth: Fix BR/EDR out-of-band pairing with only initiator data") Fixes: 33155c4aae52 ("Bluetooth: hci_event: Ignore NULL link key") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2023-10-13Merge tag 'nf-23-10-12' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florian Westphal says: ==================== netfilter updates for net Patch 1, from Pablo Neira Ayuso, fixes a performance regression (since 6.4) when a large pending set update has to be canceled towards the end of the transaction. Patch 2 from myself, silences an incorrect compiler warning reported with a few (older) compiler toolchains. Patch 3, from Kees Cook, adds __counted_by annotation to nft_pipapo set backend type. I took this for net instead of -next given infra is already in place and no actual code change is made. Patch 4, from Pablo Neira Ayso, disables timeout resets on stateful element reset. The rest should only affect internal object state, e.g. reset a quota or counter, but not affect a pending timeout. Patches 5 and 6 fix NULL dereferences in 'inner header' match, control plane doesn't test for netlink attribute presence before accessing them. Broken since feature was added in 6.2, fixes from Xingyuan Mo. Last patch, from myself, fixes a bogus rule match when skb has a 0-length mac header, in this case we'd fetch data from network header instead of canceling rule evaluation. This is a day 0 bug, present since nftables was merged in 3.13. * tag 'nf-23-10-12' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_payload: fix wrong mac header matching nf_tables: fix NULL pointer dereference in nft_expr_inner_parse() nf_tables: fix NULL pointer dereference in nft_inner_init() netfilter: nf_tables: do not refresh timeout when resetting element netfilter: nf_tables: Annotate struct nft_pipapo_match with __counted_by netfilter: nfnetlink_log: silence bogus compiler warning netfilter: nf_tables: do not remove elements if set backend implements .abort ==================== Link: https://lore.kernel.org/r/20231012085724.15155-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13net/smc: fix smc clc failed issue when netdevice not in init_netAlbert Huang
If the netdevice is within a container and communicates externally through network technologies such as VxLAN, we won't be able to find routing information in the init_net namespace. To address this issue, we need to add a struct net parameter to the smc_ib_find_route function. This allow us to locate the routing information within the corresponding net namespace, ensuring the correct completion of the SMC CLC interaction. Fixes: e5c4744cfb59 ("net/smc: add SMC-Rv2 connection establishment") Signed-off-by: Albert Huang <huangjie.albert@bytedance.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://lore.kernel.org/r/20231011074851.95280-1-huangjie.albert@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13tcp: allow again tcp_disconnect() when threads are waitingPaolo Abeni
As reported by Tom, .NET and applications build on top of it rely on connect(AF_UNSPEC) to async cancel pending I/O operations on TCP socket. The blamed commit below caused a regression, as such cancellation can now fail. As suggested by Eric, this change addresses the problem explicitly causing blocking I/O operation to terminate immediately (with an error) when a concurrent disconnect() is executed. Instead of tracking the number of threads blocked on a given socket, track the number of disconnect() issued on such socket. If such counter changes after a blocking operation releasing and re-acquiring the socket lock, error out the current operation. Fixes: 4faeee0cf8a5 ("tcp: deny tcp_disconnect() when threads are waiting") Reported-by: Tom Deseyn <tdeseyn@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=1886305 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/f3b95e47e3dbed840960548aebaa8d954372db41.1697008693.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13Merge tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Fixes for an overreaching WARN_ON, two error paths and a switch to kernel_connect() which recently grown protection against someone using BPF to rewrite the address. All but one marked for stable" * tag 'ceph-for-6.6-rc6' of https://github.com/ceph/ceph-client: ceph: fix type promotion bug on 32bit systems libceph: use kernel_connect() ceph: remove unnecessary IS_ERR() check in ceph_fname_to_usr() ceph: fix incorrect revoked caps assert in ceph_fill_file_size()
2023-10-13tcp: Fix listen() warning with v4-mapped-v6 address.Kuniyuki Iwashima
syzbot reported a warning [0] introduced by commit c48ef9c4aed3 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address."). After the cited commit, a v4 socket's address matches the corresponding v4-mapped-v6 tb2 in inet_bind2_bucket_match_addr(), not vice versa. During X.X.X.X -> ::ffff:X.X.X.X order bind()s, the second bind() uses bhash and conflicts properly without checking bhash2 so that we need not check if a v4-mapped-v6 sk matches the corresponding v4 address tb2 in inet_bind2_bucket_match_addr(). However, the repro shows that we need to check that in a no-conflict case. The repro bind()s two sockets to the 2-tuples using SO_REUSEPORT and calls listen() for the first socket: from socket import * s1 = socket() s1.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1) s1.bind(('127.0.0.1', 0)) s2 = socket(AF_INET6) s2.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1) s2.bind(('::ffff:127.0.0.1', s1.getsockname()[1])) s1.listen() The second socket should belong to the first socket's tb2, but the second bind() creates another tb2 bucket because inet_bind2_bucket_find() returns NULL in inet_csk_get_port() as the v4-mapped-v6 sk does not match the corresponding v4 address tb2. bhash2[] -> tb2(::ffff:X.X.X.X) -> tb2(X.X.X.X) Then, listen() for the first socket calls inet_csk_get_port(), where the v4 address matches the v4-mapped-v6 tb2 and WARN_ON() is triggered. To avoid that, we need to check if v4-mapped-v6 sk address matches with the corresponding v4 address tb2 in inet_bind2_bucket_match(). The same checks are needed in inet_bind2_bucket_addr_match() too, so we can move all checks there and call it from inet_bind2_bucket_match(). Note that now tb->family is just an address family of tb->(v6_)?rcv_saddr and not of sockets in the bucket. This could be refactored later by defining tb->rcv_saddr as tb->v6_rcv_saddr.s6_addr32[3] and prepending ::ffff: when creating v4 tb2. [0]: WARNING: CPU: 0 PID: 5049 at net/ipv4/inet_connection_sock.c:587 inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587 Modules linked in: CPU: 0 PID: 5049 Comm: syz-executor288 Not tainted 6.6.0-rc2-syzkaller-00018-g2cf0f7156238 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023 RIP: 0010:inet_csk_get_port+0xf96/0x2350 net/ipv4/inet_connection_sock.c:587 Code: 7c 24 08 e8 4c b6 8a 01 31 d2 be 88 01 00 00 48 c7 c7 e0 94 ae 8b e8 59 2e a3 f8 2e 2e 2e 31 c0 e9 04 fe ff ff e8 ca 88 d0 f8 <0f> 0b e9 0f f9 ff ff e8 be 88 d0 f8 49 8d 7e 48 e8 65 ca 5a 00 31 RSP: 0018:ffffc90003abfbf0 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888026429100 RCX: 0000000000000000 RDX: ffff88807edcbb80 RSI: ffffffff88b73d66 RDI: ffff888026c49f38 RBP: ffff888026c49f30 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff9260f200 R13: ffff888026c49880 R14: 0000000000000000 R15: ffff888026429100 FS: 00005555557d5380(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000045ad50 CR3: 0000000025754000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> inet_csk_listen_start+0x155/0x360 net/ipv4/inet_connection_sock.c:1256 __inet_listen_sk+0x1b8/0x5c0 net/ipv4/af_inet.c:217 inet_listen+0x93/0xd0 net/ipv4/af_inet.c:239 __sys_listen+0x194/0x270 net/socket.c:1866 __do_sys_listen net/socket.c:1875 [inline] __se_sys_listen net/socket.c:1873 [inline] __x64_sys_listen+0x53/0x80 net/socket.c:1873 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f3a5bce3af9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffc1a1c79e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000032 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a5bce3af9 RDX: 00007f3a5bce3af9 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007f3a5bd565f0 R08: 0000000000000006 R09: 0000000000000006 R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000001 R13: 431bde82d7b634db R14: 0000000000000001 R15: 0000000000000001 </TASK> Fixes: c48ef9c4aed3 ("tcp: Fix bind() regression for v4-mapped-v6 non-wildcard address.") Reported-by: syzbot+71e724675ba3958edb31@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=71e724675ba3958edb31 Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20231010013814.70571-1-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-13xfrm: fix a data-race in xfrm_lookup_with_ifid()Eric Dumazet
syzbot complains about a race in xfrm_lookup_with_ifid() [1] When preparing commit 0a9e5794b21e ("xfrm: annotate data-race around use_time") I thought xfrm_lookup_with_ifid() was modifying a still private structure. [1] BUG: KCSAN: data-race in xfrm_lookup_with_ifid / xfrm_lookup_with_ifid write to 0xffff88813ea41108 of 8 bytes by task 8150 on cpu 1: xfrm_lookup_with_ifid+0xce7/0x12d0 net/xfrm/xfrm_policy.c:3218 xfrm_lookup net/xfrm/xfrm_policy.c:3270 [inline] xfrm_lookup_route+0x3b/0x100 net/xfrm/xfrm_policy.c:3281 ip6_dst_lookup_flow+0x98/0xc0 net/ipv6/ip6_output.c:1246 send6+0x241/0x3c0 drivers/net/wireguard/socket.c:139 wg_socket_send_skb_to_peer+0xbd/0x130 drivers/net/wireguard/socket.c:178 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 write to 0xffff88813ea41108 of 8 bytes by task 15867 on cpu 0: xfrm_lookup_with_ifid+0xce7/0x12d0 net/xfrm/xfrm_policy.c:3218 xfrm_lookup net/xfrm/xfrm_policy.c:3270 [inline] xfrm_lookup_route+0x3b/0x100 net/xfrm/xfrm_policy.c:3281 ip6_dst_lookup_flow+0x98/0xc0 net/ipv6/ip6_output.c:1246 send6+0x241/0x3c0 drivers/net/wireguard/socket.c:139 wg_socket_send_skb_to_peer+0xbd/0x130 drivers/net/wireguard/socket.c:178 wg_socket_send_buffer_to_peer+0xd6/0x100 drivers/net/wireguard/socket.c:200 wg_packet_send_handshake_initiation drivers/net/wireguard/send.c:40 [inline] wg_packet_handshake_send_worker+0x10c/0x150 drivers/net/wireguard/send.c:51 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2703 worker_thread+0x525/0x730 kernel/workqueue.c:2784 kthread+0x1d7/0x210 kernel/kthread.c:388 ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 value changed: 0x00000000651cd9d1 -> 0x00000000651cd9d2 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 15867 Comm: kworker/u4:58 Not tainted 6.6.0-rc4-syzkaller-00016-g5e62ed3b1c8a #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/06/2023 Workqueue: wg-kex-wg2 wg_packet_handshake_send_worker Fixes: 0a9e5794b21e ("xfrm: annotate data-race around use_time") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>