summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-05-23net: Add additional bit to support clockid_t timestamp typeAbhishek Chauhan
tstamp_type is now set based on actual clockid_t compressed into 2 bits. To make the design scalable for future needs this commit bring in the change to extend the tstamp_type:1 to tstamp_type:2 to support other clockid_t timestamp. We now support CLOCK_TAI as part of tstamp_type as part of this commit with existing support CLOCK_MONOTONIC and CLOCK_REALTIME. Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240509211834.3235191-3-quic_abchauha@quicinc.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-23net: Rename mono_delivery_time to tstamp_type for scalabiltyAbhishek Chauhan
mono_delivery_time was added to check if skb->tstamp has delivery time in mono clock base (i.e. EDT) otherwise skb->tstamp has timestamp in ingress and delivery_time at egress. Renaming the bitfield from mono_delivery_time to tstamp_type is for extensibilty for other timestamps such as userspace timestamp (i.e. SO_TXTIME) set via sock opts. As we are renaming the mono_delivery_time to tstamp_type, it makes sense to start assigning tstamp_type based on enum defined in this commit. Earlier we used bool arg flag to check if the tstamp is mono in function skb_set_delivery_time, Now the signature of the functions accepts tstamp_type to distinguish between mono and real time. Also skb_set_delivery_type_by_clockid is a new function which accepts clockid to determine the tstamp_type. In future tstamp_type:1 can be extended to support userspace timestamp by increasing the bitfield. Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240509211834.3235191-2-quic_abchauha@quicinc.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-23Merge tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Stable fixes: - nfs: fix undefined behavior in nfs_block_bits() - NFSv4.2: Fix READ_PLUS when server doesn't support OP_READ_PLUS Bugfixes: - Fix mixing of the lock/nolock and local_lock mount options - NFSv4: Fixup smatch warning for ambiguous return - NFSv3: Fix remount when using the legacy binary mount api - SUNRPC: Fix the handling of expired RPCSEC_GSS contexts - SUNRPC: fix the NFSACL RPC retries when soft mounts are enabled - rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL Features and cleanups: - NFSv3: Use the atomic_open API to fix open(O_CREAT|O_TRUNC) - pNFS/filelayout: S layout segment range in LAYOUTGET - pNFS: rework pnfs_generic_pg_check_layout to check IO range - NFSv2: Turn off enabling of NFS v2 by default" * tag 'nfs-for-6.10-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: fix undefined behavior in nfs_block_bits() pNFS: rework pnfs_generic_pg_check_layout to check IO range pNFS/filelayout: check layout segment range pNFS/filelayout: fixup pNfs allocation modes rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVAL NFS: Don't enable NFS v2 by default NFS: Fix READ_PLUS when server doesn't support OP_READ_PLUS sunrpc: fix NFSACL RPC retry on soft mount SUNRPC: fix handling expired GSS context nfs: keep server info for remounts NFSv4: Fixup smatch warning for ambiguous return NFS: make sure lock/nolock overriding local_lock mount option NFS: add atomic_open for NFSv3 to handle O_TRUNC correctly. pNFS/filelayout: Specify the layout segment range in LAYOUTGET pNFS/filelayout: Remove the whole file layout requirement
2024-05-23Merge tag 'net-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ...
2024-05-23Merge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
2024-05-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2024-05-23nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()Ryosuke Yasuoka
When nci_rx_work() receives a zero-length payload packet, it should not discard the packet and exit the loop. Instead, it should continue processing subsequent packets. Fixes: d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240521153444.535399-1-ryasuoka@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23net: relax socket state check at accept time.Paolo Abeni
Christoph reported the following splat: WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0 Modules linked in: CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759 Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80 RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293 RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64 R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000 R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800 FS: 000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786 do_accept+0x435/0x620 net/socket.c:1929 __sys_accept4_file net/socket.c:1969 [inline] __sys_accept4+0x9b/0x110 net/socket.c:1999 __do_sys_accept net/socket.c:2016 [inline] __se_sys_accept net/socket.c:2013 [inline] __x64_sys_accept+0x7d/0x90 net/socket.c:2013 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x4315f9 Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300 R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055 </TASK> The reproducer invokes shutdown() before entering the listener status. After commit 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets"), the above causes the child to reach the accept syscall in FIN_WAIT1 status. Eric noted we can relax the existing assertion in __inet_accept() Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490 Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23tcp: remove 64 KByte limit for initial tp->rcv_wnd valueJason Xing
Recently, we had some servers upgraded to the latest kernel and noticed the indicator from the user side showed worse results than before. It is caused by the limitation of tp->rcv_wnd. In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most CDN teams would not benefit from this change because they cannot have a large window to receive a big packet, which will be slowed down especially in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's the side effect for the latency/time-sensitive users. To avoid future confusion, current change doesn't affect the initial receive window on the wire in a SYN or SYN+ACK packet which are set within 65535 bytes according to RFC 7323 also due to the limit in __tcp_transmit_skb(): th->window = htons(min(tp->rcv_wnd, 65535U)); In one word, __tcp_transmit_skb() already ensures that constraint is respected, no matter how large tp->rcv_wnd is. The change doesn't violate RFC. Let me provide one example if with or without the patch: Before: client --- SYN: rwindow=65535 ---> server client <--- SYN+ACK: rwindow=65535 ---- server client --- ACK: rwindow=65536 ---> server Note: for the last ACK, the calculation is 512 << 7. After: client --- SYN: rwindow=65535 ---> server client <--- SYN+ACK: rwindow=65535 ---- server client --- ACK: rwindow=175232 ---> server Note: I use the following command to make it work: ip route change default via [ip] dev eth0 metric 100 initrwnd 120 For the last ACK, the calculation is 1369 << 7. When we apply such a patch, having a large rcv_wnd if the user tweak this knob can help transfer data more rapidly and save some rtts. Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23tls: fix missing memory barrier in tls_initDae R. Jeong
In tls_init(), a write memory barrier is missing, and store-store reordering may cause NULL dereference in tls_{setsockopt,getsockopt}. CPU0 CPU1 ----- ----- // In tls_init() // In tls_ctx_create() ctx = kzalloc() ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1) // In update_sk_prot() WRITE_ONCE(sk->sk_prot, tls_prots) -(2) // In sock_common_setsockopt() READ_ONCE(sk->sk_prot)->setsockopt() // In tls_{setsockopt,getsockopt}() ctx->sk_proto->setsockopt() -(3) In the above scenario, when (1) and (2) are reordered, (3) can observe the NULL value of ctx->sk_proto, causing NULL dereference. To fix it, we rely on rcu_assign_pointer() which implies the release barrier semantic. By moving rcu_assign_pointer() after ctx->sk_proto is initialized, we can ensure that ctx->sk_proto are visible when changing sk->sk_prot. Fixes: d5bee7374b68 ("net/tls: Annotate access to sk_prot with READ_ONCE/WRITE_ONCE") Signed-off-by: Yewon Choi <woni9911@gmail.com> Signed-off-by: Dae R. Jeong <threeearcat@gmail.com> Link: https://lore.kernel.org/netdev/ZU4OJG56g2V9z_H7@dragonet/T/ Link: https://lore.kernel.org/r/Zkx4vjSFp0mfpjQ2@libra05 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23wifi: mac80211: send DelBA with correct BSSIDJohannes Berg
In MLO, the deflink BSSID is clearly invalid. Since we fill the addresses as MLD addresses and translate later, use the AP address here instead. This fixes an issue that happens with HW restart, where the DelBA frame is transmitted, but not processed correctly due to the wrong BSSID (or even just discarded entirely). As a result, the BA sessions are kept alive; however, as other state is reset during HW restart, this then fails (reorder, etc.) and data doesn't go through until new BA sessions are established. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240510112601.f4e1effdea29.I98e81f22166b68d4b6211191bcaaf8531b324a77@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: reset negotiated TTLM on disconnectJohannes Berg
The negotiated TTLM data must be reset on disconnect, otherwise it may end up getting reused on another connection. Fix that. Fixes: 8f500fbc6c65 ("wifi: mac80211: process and save negotiated TID to Link mapping request") Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211858.04142e8fe01c.Ia144457e086ebd8ddcfa31bdf5ff210b4b351c22@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: don't stop TTLM works againJohannes Berg
There's no need to stop works that have already been stopped during disconnect, so don't. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211034.f8434be19f56.I021afadc538508da3bc8f95c89f424ca62b94bef@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: cancel TTLM teardown work earlierJohannes Berg
It shouldn't be possible to run this after disconnecting, so cancel the work earlier. Fixes: a17a58ad2ff2 ("wifi: mac80211: add support for tearing down negotiated TTLM") Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211034.096a10ccebec.I5584a21c27eb9b3e87b9e26380b627114b32ccba@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: cancel multi-link reconf work on disconnectJohannes Berg
This work shouldn't run after we're disconnecting. Cancel it earlier (and then don't cancel it in stop later.) Fixes: 8eb8dd2ffbbb ("wifi: mac80211: Support link removal using Reconfiguration ML element") Reviewed-by: Ilan Peer <ilan.peer@intel.com> Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211034.ac754794279f.Ib9fbb1dab50c6b67f6de9be09a6c452ce89bbd50@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: fix TTLM teardown workJohannes Berg
The worker calculates the wrong sdata pointer, so if it ever runs, it'll crash. Fix that. Fixes: a17a58ad2ff2 ("wifi: mac80211: add support for tearing down negotiated TTLM") Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211853.e6471800c76d.I8b7c2d6984c89a11cd33d1a610e9645fa965f6e1@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: Add EHT UL MU-MIMO flag in ieee80211_bss_confPradeep Kumar Chitrapu
Add flag for Full Bandwidth UL MU-MIMO for EHT. This is utilized to pass EHT MU-MIMO configurations from user space to driver in AP mode. Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.0.1-00029-QCAHKSWPL_SILICONZ-1 Signed-off-by: Pradeep Kumar Chitrapu <quic_pradeepc@quicinc.com> Link: https://msgid.link/20240515181327.12855-2-quic_pradeepc@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: refactor chanreq.ap settingJohannes Berg
There are now three places setting up chanreq.ap which always depends on the mode (EHT being used or not) and override flag. Refactor that code into a common function with a comment, to make that clearer. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506215543.5cd6a209e58a.I3be318959d9e2df5dccd2d0938c3d2fcc6688030@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: handle TPE element during CSAJohannes Berg
Handle the transmit power envelope (TPE) element during channel switch, applying it when the channel switch is done. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506215543.486c33157d18.Idf971ad801b6961c177bdf42cc323fd1a4ca8165@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: handle wider bandwidth OFDMA during CSAJohannes Berg
During channel switch, track the AP configuration in the chanreq, so that wider bandwidth OFDMA is taken into account correctly, since multiple channel contexts may be needed due to sharing not being possible due to wider bandwidth OFDMA. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506215543.b2c5a72dac1b.I69f65cb2e75d4a49a174b1aede68bf8ff0a3cab3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: collect some CSA data into sub-structsJohannes Berg
Collect the CSA data in ieee80211_link_data_managed and ieee80211_link_data into a csa sub-struct to clean up a bit and make adding new things more obvious. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506215543.29f954b1f576.I9a683a9647c33d4dd3011aade6677982428c1082@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: mlme: handle cross-link CSAJohannes Berg
If we see a channel switch announcement on one link for another, handle that case and start the CSA. The driver can react to this in whatever way it needs. The stack will have the ability to track it via the RNR/MLE in the reporting link's beacon if it sees it for inactive links and adjust everything accordingly. Note that currently the timings for the CSA aren't set, the values are only used by the Intel drivers, and they don't need this for newer devices that support MLO, so I've left it out for now. Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240415112355.4d34b6a31be7.Ie8453979f5805873a8411c99346bcc3810cd6476@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: cfg80211: restrict operation during radar detectionJohannes Berg
Just like it's not currently possible to start radar detection while already operating, it shouldn't be possible to start operating while radar detection is running. Fix that. Also, improve the check whether operating (carrier might not be up if e.g. attempting to join IBSS). Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211158.ae8dca3d0d6c.I7c70a66a5fbdbc63a78fee8a34f31d1995491bc3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: move radar detect work to sdataJohannes Berg
At some point we thought perhaps this could be per link, but really that didn't happen, and it's confusing. Radar detection still uses the deflink to allocate the channel, but the work need not be there. Move it back. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211311.43bd82c6da04.Ib39bec3aa198d137385f081e7e1910dcbde3aa1b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: track changes in AP's TPEJohannes Berg
If the TPE (transmit power envelope) is changed, detect and report that to the driver. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506214536.103dda923f45.I990877e409ab8eade9ed7c172272e0cae57256cf@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: pass parsed TPE data to driversJohannes Berg
Instead of passing the full TPE elements, in all their glory and mixed up data formats for HE backward compatibility, parse them fully into the right values, and pass that to the drivers. Also introduce proper validation already in mac80211, so that drivers don't need to do it, and parse the EHT portions. The code now passes the values in the right order according to the channel used by an interface, which could also be a subset of the data advertised by the AP, if we couldn't connect with the full bandwidth (for whatever reason.) Also add kunit tests for the more complicated bits of it. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Acked-by: Kalle Valo <kvalo@kernel.org> Link: https://msgid.link/20240506214536.2aa839969b60.I265b28209e0b29772b2f125f7f83de44a4da877b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: update 6 GHz AP power type before associationMukesh Sisodiya
6 GHz AP power type details are required to set proper tx power used to send frames. Update AP power type received in beacon while preparing for connection instead of after association so the frames before association can use the correct tx power. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Mukesh Sisodiya <mukesh.sisodiya@intel.com> Link: https://msgid.link/20240506214536.310434f55f76.I6aca291ee06265e3f63e0f9024ba19a850b53a33@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: remove extra link STA functionsJohannes Berg
There's no need to have a lockdep assert and then call another function, just move everything into one place. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211934.9759564a25f4.I88d43aa459d15c1d6230152e76b7757c2cdd6085@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: mac80211: remove outdated commentsJohannes Berg
These comments are no longer correct, it's a wiphy work now so it will go away immediately when canceled. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506211422.68bc10efbd8a.If80f43f4c8b9db1f5266f70d93a805f8c7463fe2@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: cfg80211: add tracing for wiphy workJohannes Berg
Add trace events to trace when wiphy works are queued (or delayed ones scheduled), and other APIs are called. Also add an event when the worker starts, before acquiring the mutex, to be able to see potential delays due to locking. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Reviewed-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://msgid.link/20240506210002.bf1840a1d22d.I4abba048c1c4017345640219cf1384a0b2288dd3@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: cfg80211: sort trace events againJohannes Berg
They were meant to be split into ops and APIs, but some ops were added in the wrong place. Fix that. Reviewed-by: Miriam Rachel Korenblit <miriam.rachel.korenblit@intel.com> Link: https://msgid.link/20240506210002.0b3a86a5d8d7.I5591c03223bdb95597e181de63a2eded424de34c@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23wifi: nl80211: disallow setting special AP channel widthsJohannes Berg
Setting the AP channel width is meant for use with the normal 20/40/... MHz channel width progression, and switching around in S1G or narrow channels isn't supported. Disallow that. Reported-by: syzbot+bc0f5b92cc7091f45fb6@syzkaller.appspotmail.com Link: https://msgid.link/20240515141600.d4a9590bfe32.I19a32d60097e81b527eafe6b0924f6c5fbb2dc45@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-05-23net: esp: cleanup esp_output_tail_tcp() in case of unsupported ESPINTCPHagar Hemdan
xmit() functions should consume skb or return error codes in error paths. When the configuration "CONFIG_INET_ESPINTCP" is not set, the implementation of the function "esp_output_tail_tcp" violates this rule. The function frees the skb and returns the error code. This change removes the kfree_skb from both functions, for both esp4 and esp6. WARN_ON is added because esp_output_tail_tcp() should never be called if CONFIG_INET_ESPINTCP is not set. This bug was discovered and resolved using Coverity Static Analysis Security Testing (SAST) by Synopsys, Inc. Fixes: e27cca96cd68 ("xfrm: add espintcp (RFC 8229)") Signed-off-by: Hagar Hemdan <hagarhem@amazon.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2024-05-22tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22net: netfilter: Make ct zone opts configurable for bpf ct helpersBrad Cowie
Add ct zone id and direction to bpf_ct_opts so that arbitrary ct zones can be used for xdp/tc bpf ct helper functions bpf_{xdp,skb}_ct_alloc and bpf_{xdp,skb}_ct_lookup. Signed-off-by: Brad Cowie <brad@faucet.nz> Link: https://lore.kernel.org/r/20240522050712.732558-1-brad@faucet.nz Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-05-22vsock/virtio: drop owner assignmentKrzysztof Kozlowski
virtio core already sets the .owner, so driver does not need to. Acked-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-19-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2024-05-22net: 9p: virtio: drop owner assignmentKrzysztof Kozlowski
virtio core already sets the .owner, so driver does not need to. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Message-Id: <20240331-module-owner-virtio-v2-18-98f04bfaf46a@linaro.org> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2024-05-21Merge tag 's390-6.10-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull more s390 updates from Alexander Gordeev: - Switch read and write software bits for PUDs - Add missing hardware bits for PUDs and PMDs - Generate unwind information for C modules to fix GDB unwind error for vDSO functions - Create .build-id links for unstripped vDSO files to enable vDSO debugging with symbols - Use standard stack frame layout for vDSO generated stack frames to manually walk stack frames without DWARF information - Rework perf_callchain_user() and arch_stack_walk_user() functions to reduce code duplication - Skip first stack frame when walking user stack - Add basic checks to identify invalid instruction pointers when walking stack frames - Introduce and use struct stack_frame_vdso_wrapper within vDSO user wrapper code to automatically generate an asm-offset define. Also use STACK_FRAME_USER_OVERHEAD instead of STACK_FRAME_OVERHEAD to document that the code works with user space stack - Clear the backchain of the extra stack frame added by the vDSO user wrapper code. This allows the user stack walker to detect and skip the non-standard stack frame. Without this an incorrect instruction pointer would be added to stack traces. - Rewrite psw_idle() function in C to ease maintenance and further enhancements - Remove get_vtimer() function and use get_cpu_timer() instead - Mark psw variable in __load_psw_mask() as __unitialized to avoid superfluous clearing of PSW - Remove obsolete and superfluous comment about removed TIF_FPU flag - Replace memzero_explicit() and kfree() with kfree_sensitive() to fix warnings reported by Coccinelle - Wipe sensitive data and all copies of protected- or secure-keys from stack when an IOCTL fails - Both do_airq_interrupt() and do_io_interrupt() functions set CIF_NOHZ_DELAY flag. Move it in do_io_irq() to simplify the code - Provide iucv_alloc_device() and iucv_release_device() helpers, which can be used to deduplicate more or less identical IUCV device allocation and release code in four different drivers - Make use of iucv_alloc_device() and iucv_release_device() helpers to get rid of quite some code and also remove a cast to an incompatible function (clang W=1) - There is no user of iucv_root outside of the core IUCV code left. Therefore remove the EXPORT_SYMBOL - __apply_alternatives() contains a runtime check which verifies that the size of the to be patched code area is even. Convert this to a compile time check - Increase size of buffers for sending z/VM CP DIAGNOSE X'008' commands from 128 to 240 - Do not accept z/VM CP DIAGNOSE X'008' commands longer than maximally allowed - Use correct defines IPL_BP_NVME_LEN and IPL_BP0_NVME_LEN instead of IPL_BP_FCP_LEN and IPL_BP0_FCP_LEN ones to initialize NVMe reIPL block on 'scp_data' sysfs attribute update - Initialize the correct fields of the NVMe dump block, which were confused with FCP fields - Refactor macros for 'scp_data' (re-)IPL sysfs attribute to reduce code duplication - Introduce 'scp_data' sysfs attribute for dump IPL to allow tools such as dumpconf passing additional kernel command line parameters to a stand-alone dumper - Rework the CPACF query functions to use the correct RRE or RRF instruction formats and set instruction register fields correctly - Instead of calling BUG() at runtime force a link error during compile when a unsupported opcode is used with __cpacf_query() or __cpacf_check_opcode() functions - Fix a crash in ap_parse_bitmap_str() function on /sys/bus/ap/apmask or /sys/bus/ap/aqmask sysfs file update with a relative mask value - Fix "bindings complete" udev event which should be sent once all AP devices have been bound to device drivers and again when unbind/bind actions take place and all AP devices are bound again - Facility list alt_stfle_fac_list is nowhere used in the decompressor, therefore remove it there - Remove custom kprobes insn slot allocator in favour of the standard module_alloc() one, since kernel image and module areas are located within 4GB - Use kvcalloc() instead of kvmalloc_array() in zcrypt driver to avoid calling memset() with a large byte count and get rid of the sparse warning as result * tag 's390-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits) s390/zcrypt: Use kvcalloc() instead of kvmalloc_array() s390/kprobes: Remove custom insn slot allocator s390/boot: Remove alt_stfle_fac_list from decompressor s390/ap: Fix bind complete udev event sent after each AP bus scan s390/ap: Fix crash in AP internal function modify_bitmap() s390/cpacf: Make use of invalid opcode produce a link error s390/cpacf: Split and rework cpacf query functions s390/ipl: Introduce sysfs attribute 'scp_data' for dump ipl s390/ipl: Introduce macros for (re)ipl sysfs attribute 'scp_data' s390/ipl: Fix incorrect initialization of nvme dump block s390/ipl: Fix incorrect initialization of len fields in nvme reipl block s390/ipl: Do not accept z/VM CP diag X'008' cmds longer than max length s390/ipl: Fix size of vmcmd buffers for sending z/VM CP diag X'008' cmds s390/alternatives: Convert runtime sanity check into compile time check s390/iucv: Unexport iucv_root tty: hvc-iucv: Make use of iucv_alloc_device() s390/smsgiucv_app: Make use of iucv_alloc_device() s390/netiucv: Make use of iucv_alloc_device() s390/vmlogrdr: Make use of iucv_alloc_device() s390/iucv: Provide iucv_alloc_device() / iucv_release_device() ...
2024-05-21openvswitch: Set the skbuff pkt_type for proper pmtud support.Aaron Conole
Open vSwitch is originally intended to switch at layer 2, only dealing with Ethernet frames. With the introduction of l3 tunnels support, it crossed into the realm of needing to care a bit about some routing details when making forwarding decisions. If an oversized packet would need to be fragmented during this forwarding decision, there is a chance for pmtu to get involved and generate a routing exception. This is gated by the skbuff->pkt_type field. When a flow is already loaded into the openvswitch module this field is set up and transitioned properly as a packet moves from one port to another. In the case that a packet execute is invoked after a flow is newly installed this field is not properly initialized. This causes the pmtud mechanism to omit sending the required exception messages across the tunnel boundary and a second attempt needs to be made to make sure that the routing exception is properly setup. To fix this, we set the outgoing packet's pkt_type to PACKET_OUTGOING, since it can only get to the openvswitch module via a port device or packet command. Even for bridge ports as users, the pkt_type needs to be reset when doing the transmit as the packet is truly outgoing and routing needs to get involved post packet transformations, in the case of VXLAN/GENEVE/udp-tunnel packets. In general, the pkt_type on output gets ignored, since we go straight to the driver, but in the case of tunnel ports they go through IP routing layer. This issue is periodically encountered in complex setups, such as large openshift deployments, where multiple sets of tunnel traversal occurs. A way to recreate this is with the ovn-heater project that can setup a networking environment which mimics such large deployments. We need larger environments for this because we need to ensure that flow misses occur. In these environment, without this patch, we can see: ./ovn_cluster.sh start podman exec ovn-chassis-1 ip r a 170.168.0.5/32 dev eth1 mtu 1200 podman exec ovn-chassis-1 ip netns exec sw01p1 ip r flush cache podman exec ovn-chassis-1 ip netns exec sw01p1 \ ping 21.0.0.3 -M do -s 1300 -c2 PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data. From 21.0.0.3 icmp_seq=2 Frag needed and DF set (mtu = 1142) --- 21.0.0.3 ping statistics --- ... Using tcpdump, we can also see the expected ICMP FRAG_NEEDED message is not sent into the server. With this patch, setting the pkt_type, we see the following: podman exec ovn-chassis-1 ip netns exec sw01p1 \ ping 21.0.0.3 -M do -s 1300 -c2 PING 21.0.0.3 (21.0.0.3) 1300(1328) bytes of data. From 21.0.0.3 icmp_seq=1 Frag needed and DF set (mtu = 1222) ping: local error: message too long, mtu=1222 --- 21.0.0.3 ping statistics --- ... In this case, the first ping request receives the FRAG_NEEDED message and a local routing exception is created. Tested-by: Jaime Caamano <jcaamano@redhat.com> Reported-at: https://issues.redhat.com/browse/FDP-164 Fixes: 58264848a5a7 ("openvswitch: Add vxlan tunneling support.") Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20240516200941.16152-1-aconole@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-21net/9p: fix uninit-value in p9_client_rpc()Nikita Zhandarovich
Syzbot with the help of KMSAN reported the following error: BUG: KMSAN: uninit-value in trace_9p_client_res include/trace/events/9p.h:146 [inline] BUG: KMSAN: uninit-value in p9_client_rpc+0x1314/0x1340 net/9p/client.c:754 trace_9p_client_res include/trace/events/9p.h:146 [inline] p9_client_rpc+0x1314/0x1340 net/9p/client.c:754 p9_client_create+0x1551/0x1ff0 net/9p/client.c:1031 v9fs_session_init+0x1b9/0x28e0 fs/9p/v9fs.c:410 v9fs_mount+0xe2/0x12b0 fs/9p/vfs_super.c:122 legacy_get_tree+0x114/0x290 fs/fs_context.c:662 vfs_get_tree+0xa7/0x570 fs/super.c:1797 do_new_mount+0x71f/0x15e0 fs/namespace.c:3352 path_mount+0x742/0x1f20 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount+0x725/0x810 fs/namespace.c:3875 __x64_sys_mount+0xe4/0x150 fs/namespace.c:3875 do_syscall_64+0xd5/0x1f0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was created at: __alloc_pages+0x9d6/0xe70 mm/page_alloc.c:4598 __alloc_pages_node include/linux/gfp.h:238 [inline] alloc_pages_node include/linux/gfp.h:261 [inline] alloc_slab_page mm/slub.c:2175 [inline] allocate_slab mm/slub.c:2338 [inline] new_slab+0x2de/0x1400 mm/slub.c:2391 ___slab_alloc+0x1184/0x33d0 mm/slub.c:3525 __slab_alloc mm/slub.c:3610 [inline] __slab_alloc_node mm/slub.c:3663 [inline] slab_alloc_node mm/slub.c:3835 [inline] kmem_cache_alloc+0x6d3/0xbe0 mm/slub.c:3852 p9_tag_alloc net/9p/client.c:278 [inline] p9_client_prepare_req+0x20a/0x1770 net/9p/client.c:641 p9_client_rpc+0x27e/0x1340 net/9p/client.c:688 p9_client_create+0x1551/0x1ff0 net/9p/client.c:1031 v9fs_session_init+0x1b9/0x28e0 fs/9p/v9fs.c:410 v9fs_mount+0xe2/0x12b0 fs/9p/vfs_super.c:122 legacy_get_tree+0x114/0x290 fs/fs_context.c:662 vfs_get_tree+0xa7/0x570 fs/super.c:1797 do_new_mount+0x71f/0x15e0 fs/namespace.c:3352 path_mount+0x742/0x1f20 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount+0x725/0x810 fs/namespace.c:3875 __x64_sys_mount+0xe4/0x150 fs/namespace.c:3875 do_syscall_64+0xd5/0x1f0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 If p9_check_errors() fails early in p9_client_rpc(), req->rc.tag will not be properly initialized. However, trace_9p_client_res() ends up trying to print it out anyway before p9_client_rpc() finishes. Fix this issue by assigning default values to p9_fcall fields such as 'tag' and (just in case KMSAN unearths something new) 'id' during the tag allocation stage. Reported-and-tested-by: syzbot+ff14db38f56329ef68df@syzkaller.appspotmail.com Fixes: 348b59012e5c ("net/9p: Convert net/9p protocol dumps to tracepoints") Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Cc: stable@vger.kernel.org Message-ID: <20240408141039.30428-1-n.zhandarovich@fintech.ru> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2024-05-21af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTSMichal Luczaj
GC attempts to explicitly drop oob_skb's reference before purging the hit list. The problem is with embryos: kfree_skb(u->oob_skb) is never called on an embryo socket. The python script below [0] sends a listener's fd to its embryo as OOB data. While GC does collect the embryo's queue, it fails to drop the OOB skb's refcount. The skb which was in embryo's receive queue stays as unix_sk(sk)->oob_skb and keeps the listener's refcount [1]. Tell GC to dispose embryo's oob_skb. [0]: from array import array from socket import * addr = '\x00unix-oob' lis = socket(AF_UNIX, SOCK_STREAM) lis.bind(addr) lis.listen(1) s = socket(AF_UNIX, SOCK_STREAM) s.connect(addr) scm = (SOL_SOCKET, SCM_RIGHTS, array('i', [lis.fileno()])) s.sendmsg([b'x'], [scm], MSG_OOB) lis.close() [1] $ grep unix-oob /proc/net/unix $ ./unix-oob.py $ grep unix-oob /proc/net/unix 0000000000000000: 00000002 00000000 00000000 0001 02 0 @unix-oob 0000000000000000: 00000002 00000000 00010000 0001 01 6072 @unix-oob Fixes: 4090fa373f0e ("af_unix: Replace garbage collection algorithm.") Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-21tcp: Fix shift-out-of-bounds in dctcp_update_alpha().Kuniyuki Iwashima
In dctcp_update_alpha(), we use a module parameter dctcp_shift_g as follows: alpha -= min_not_zero(alpha, alpha >> dctcp_shift_g); ... delivered_ce <<= (10 - dctcp_shift_g); It seems syzkaller started fuzzing module parameters and triggered shift-out-of-bounds [0] by setting 100 to dctcp_shift_g: memcpy((void*)0x20000080, "/sys/module/tcp_dctcp/parameters/dctcp_shift_g\000", 47); res = syscall(__NR_openat, /*fd=*/0xffffffffffffff9cul, /*file=*/0x20000080ul, /*flags=*/2ul, /*mode=*/0ul); memcpy((void*)0x20000000, "100\000", 4); syscall(__NR_write, /*fd=*/r[0], /*val=*/0x20000000ul, /*len=*/4ul); Let's limit the max value of dctcp_shift_g by param_set_uint_minmax(). With this patch: # echo 10 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g # cat /sys/module/tcp_dctcp/parameters/dctcp_shift_g 10 # echo 11 > /sys/module/tcp_dctcp/parameters/dctcp_shift_g -bash: echo: write error: Invalid argument [0]: UBSAN: shift-out-of-bounds in net/ipv4/tcp_dctcp.c:143:12 shift exponent 100 is too large for 32-bit type 'u32' (aka 'unsigned int') CPU: 0 PID: 8083 Comm: syz-executor345 Not tainted 6.9.0-05151-g1b294a1f3561 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x201/0x300 lib/dump_stack.c:114 ubsan_epilogue lib/ubsan.c:231 [inline] __ubsan_handle_shift_out_of_bounds+0x346/0x3a0 lib/ubsan.c:468 dctcp_update_alpha+0x540/0x570 net/ipv4/tcp_dctcp.c:143 tcp_in_ack_event net/ipv4/tcp_input.c:3802 [inline] tcp_ack+0x17b1/0x3bc0 net/ipv4/tcp_input.c:3948 tcp_rcv_state_process+0x57a/0x2290 net/ipv4/tcp_input.c:6711 tcp_v4_do_rcv+0x764/0xc40 net/ipv4/tcp_ipv4.c:1937 sk_backlog_rcv include/net/sock.h:1106 [inline] __release_sock+0x20f/0x350 net/core/sock.c:2983 release_sock+0x61/0x1f0 net/core/sock.c:3549 mptcp_subflow_shutdown+0x3d0/0x620 net/mptcp/protocol.c:2907 mptcp_check_send_data_fin+0x225/0x410 net/mptcp/protocol.c:2976 __mptcp_close+0x238/0xad0 net/mptcp/protocol.c:3072 mptcp_close+0x2a/0x1a0 net/mptcp/protocol.c:3127 inet_release+0x190/0x1f0 net/ipv4/af_inet.c:437 __sock_release net/socket.c:659 [inline] sock_close+0xc0/0x240 net/socket.c:1421 __fput+0x41b/0x890 fs/file_table.c:422 task_work_run+0x23b/0x300 kernel/task_work.c:180 exit_task_work include/linux/task_work.h:38 [inline] do_exit+0x9c8/0x2540 kernel/exit.c:878 do_group_exit+0x201/0x2b0 kernel/exit.c:1027 __do_sys_exit_group kernel/exit.c:1038 [inline] __se_sys_exit_group kernel/exit.c:1036 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xe4/0x240 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x67/0x6f RIP: 0033:0x7f6c2b5005b6 Code: Unable to access opcode bytes at 0x7f6c2b50058c. RSP: 002b:00007ffe883eb948 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7 RAX: ffffffffffffffda RBX: 00007f6c2b5862f0 RCX: 00007f6c2b5005b6 RDX: 0000000000000001 RSI: 000000000000003c RDI: 0000000000000001 RBP: 0000000000000001 R08: 00000000000000e7 R09: ffffffffffffffc0 R10: 0000000000000006 R11: 0000000000000246 R12: 00007f6c2b5862f0 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000001 </TASK> Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: Yue Sun <samsun1006219@gmail.com> Reported-by: xingwei lee <xrivendell7@gmail.com> Closes: https://lore.kernel.org/netdev/CAEkJfYNJM=cw-8x7_Vmj1J6uYVCWMbbvD=EFmDPVBGpTsqOxEA@mail.gmail.com/ Fixes: e3118e8359bb ("net: tcp: add DCTCP congestion control algorithm") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240517091626.32772-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-21ipv6: sr: fix memleak in seg6_hmac_init_algoHangbin Liu
seg6_hmac_init_algo returns without cleaning up the previous allocations if one fails, so it's going to leak all that memory and the crypto tfms. Update seg6_hmac_exit to only free the memory when allocated, so we can reuse the code directly. Fixes: bf355b8d2c30 ("ipv6: sr: add core files for SR HMAC support") Reported-by: Sabrina Dubroca <sd@queasysnail.net> Closes: https://lore.kernel.org/netdev/Zj3bh-gE7eT6V6aH@hog/ Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://lore.kernel.org/r/20240517005435.2600277-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-21af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock.Kuniyuki Iwashima
Billy Jheng Bing-Jhong reported a race between __unix_gc() and queue_oob(). __unix_gc() tries to garbage-collect close()d inflight sockets, and then if the socket has MSG_OOB in unix_sk(sk)->oob_skb, GC will drop the reference and set NULL to it locklessly. However, the peer socket still can send MSG_OOB message and queue_oob() can update unix_sk(sk)->oob_skb concurrently, leading NULL pointer dereference. [0] To fix the issue, let's update unix_sk(sk)->oob_skb under the sk_receive_queue's lock and take it everywhere we touch oob_skb. Note that we defer kfree_skb() in manage_oob() to silence lockdep false-positive (See [1]). [0]: BUG: kernel NULL pointer dereference, address: 0000000000000008 PF: supervisor write access in kernel mode PF: error_code(0x0002) - not-present page PGD 8000000009f5e067 P4D 8000000009f5e067 PUD 9f5d067 PMD 0 Oops: 0002 [#1] PREEMPT SMP PTI CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc5-00191-gd091e579b864 #110 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: events delayed_fput RIP: 0010:skb_dequeue (./include/linux/skbuff.h:2386 ./include/linux/skbuff.h:2402 net/core/skbuff.c:3847) Code: 39 e3 74 3e 8b 43 10 48 89 ef 83 e8 01 89 43 10 49 8b 44 24 08 49 c7 44 24 08 00 00 00 00 49 8b 14 24 49 c7 04 24 00 00 00 00 <48> 89 42 08 48 89 10 e8 e7 c5 42 00 4c 89 e0 5b 5d 41 5c c3 cc cc RSP: 0018:ffffc900001bfd48 EFLAGS: 00000002 RAX: 0000000000000000 RBX: ffff8880088f5ae8 RCX: 00000000361289f9 RDX: 0000000000000000 RSI: 0000000000000206 RDI: ffff8880088f5b00 RBP: ffff8880088f5b00 R08: 0000000000080000 R09: 0000000000000001 R10: 0000000000000003 R11: 0000000000000001 R12: ffff8880056b6a00 R13: ffff8880088f5280 R14: 0000000000000001 R15: ffff8880088f5a80 FS: 0000000000000000(0000) GS:ffff88807dd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000006314000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <TASK> unix_release_sock (net/unix/af_unix.c:654) unix_release (net/unix/af_unix.c:1050) __sock_release (net/socket.c:660) sock_close (net/socket.c:1423) __fput (fs/file_table.c:423) delayed_fput (fs/file_table.c:444 (discriminator 3)) process_one_work (kernel/workqueue.c:3259) worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416) kthread (kernel/kthread.c:388) ret_from_fork (arch/x86/kernel/process.c:153) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> Modules linked in: CR2: 0000000000000008 Link: https://lore.kernel.org/netdev/a00d3993-c461-43f2-be6d-07259c98509a@rbox.co/ [1] Fixes: 1279f9d9dec2 ("af_unix: Call kfree_skb() for dead unix_(sk)->oob_skb in GC.") Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240516134835.8332-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-20netfilter: nft_payload: restore vlan q-in-q match supportPablo Neira Ayuso
Revert f6ae9f120dad ("netfilter: nft_payload: add C-VLAN support"). f41f72d09ee1 ("netfilter: nft_payload: simplify vlan header handling") already allows to match on inner vlan tags by subtract the vlan header size to the payload offset which has been popped and stored in skbuff metadata fields. Fixes: f6ae9f120dad ("netfilter: nft_payload: add C-VLAN support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-20netfilter: ipset: Add list flush to cancel_gcAlexander Maltsev
Flushing list in cancel_gc drops references to other lists right away, without waiting for RCU to destroy list. Fixes race when referenced ipsets can't be destroyed while referring list is scheduled for destroy. Fixes: 97f7cf1cd80e ("netfilter: ipset: fix performance regression in swap operation") Signed-off-by: Alexander Maltsev <keltar.gw@gmail.com> Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-05-20Merge tag 'dma-mapping-6.10-2024-05-20' of ↵Linus Torvalds
git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - optimize DMA sync calls when they are no-ops (Alexander Lobakin) - fix swiotlb padding for untrusted devices (Michael Kelley) - add documentation for swiotb (Michael Kelley) * tag 'dma-mapping-6.10-2024-05-20' of git://git.infradead.org/users/hch/dma-mapping: dma: fix DMA sync for drivers not calling dma_set_mask*() xsk: use generic DMA sync shortcut instead of a custom one page_pool: check for DMA sync shortcut earlier page_pool: don't use driver-set flags field directly page_pool: make sure frag API fields don't span between cachelines iommu/dma: avoid expensive indirect calls for sync operations dma: avoid redundant calls for sync operations dma: compile-out DMA sync op calls when not used iommu/dma: fix zeroing of bounce buffer padding used by untrusted devices swiotlb: remove alloc_size argument to swiotlb_tbl_map_single() Documentation/core-api: add swiotlb documentation
2024-05-20rpcrdma: fix handling for RDMA_CM_EVENT_DEVICE_REMOVALDan Aloni
Under the scenario of IB device bonding, when bringing down one of the ports, or all ports, we saw xprtrdma entering a non-recoverable state where it is not even possible to complete the disconnect and shut it down the mount, requiring a reboot. Following debug, we saw that transport connect never ended after receiving the RDMA_CM_EVENT_DEVICE_REMOVAL callback. The DEVICE_REMOVAL callback is irrespective of whether the CM_ID is connected, and ESTABLISHED may not have happened. So need to work with each of these states accordingly. Fixes: 2acc5cae2923 ('xprtrdma: Prevent dereferencing r_xprt->rx_ep after it is freed') Cc: Sagi Grimberg <sagi.grimberg@vastdata.com> Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20sunrpc: fix NFSACL RPC retry on soft mountDan Aloni
It used to be quite awhile ago since 1b63a75180c6 ('SUNRPC: Refactor rpc_clone_client()'), in 2012, that `cl_timeout` was copied in so that all mount parameters propagate to NFSACL clients. However since that change, if mount options as follows are given: soft,timeo=50,retrans=16,vers=3 The resultant NFSACL client receives: cl_softrtry: 1 cl_timeout: to_initval=60000, to_maxval=60000, to_increment=0, to_retries=2, to_exponential=0 These values lead to NFSACL operations not being retried under the condition of transient network outages with soft mount. Instead, getacl call fails after 60 seconds with EIO. The simple fix is to pass the existing client's `cl_timeout` as the new client timeout. Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Benjamin Coddington <bcodding@redhat.com> Link: https://lore.kernel.org/all/20231105154857.ryakhmgaptq3hb6b@gmail.com/T/ Fixes: 1b63a75180c6 ('SUNRPC: Refactor rpc_clone_client()') Signed-off-by: Dan Aloni <dan.aloni@vastdata.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2024-05-20SUNRPC: fix handling expired GSS contextOlga Kornievskaia
In the case where we have received a successful reply to an RPC request, but while processing the reply the client in rpc_decode_header() finds an expired context, the code ends up propagating the error to the caller instead of getting a new context and retrying the request. To give more details, in rpc_decode_header() we call rpcauth_checkverf() will call into the gss and internally will at some point call gss_validate() which has a check if the current’s context lifetime expired, and it would fail. The reason for the failure gets ‘scrubbed’ and translated to EACCES so when we get back to rpc_decode_header() we just go to “out_verifier” which for that error would get converted to “out_garbage” (ie it’s treated as garballed reply) and the next action is call_encode. Which (1) doesn’t reencode or re-send (not to mention no upcall happens because context expires as that reason just not known) and it again fails in the same decoding process. After re-trying it 3 times the error is propagated back to the caller (ie nfs4_write_done_cb() in the case a failing write). To fix this, instead we need to look to the case where the server decides that context has expired and replies with an RPC auth error. In that case, the rpc_decode_header() goes to "out_msg_denied" in that we return EKEYREJECTED which in call_decode() is sent to “call_reserve” which triggers an upcalls and a re-try of the operation. The proposed fix is in case of a failed rpc_decode_header() to check if credentials were set to be invalid and use that as a proxy for deciding that context has expired and then treat is same way as receiving an auth error. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>