summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2024-07-30mptcp: distinguish rcv vs sent backup flag in requestsMatthieu Baerts (NGI0)
When sending an MP_JOIN + SYN + ACK, it is possible to mark the subflow as 'backup' by setting the flag with the same name. Before this patch, the backup was set if the other peer set it in its MP_JOIN + SYN request. It is not correct: the backup flag should be set in the MPJ+SYN+ACK only if the host asks for it, and not mirroring what was done by the other peer. It is then required to have a dedicated bit for each direction, similar to what is done in the subflow context. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-30mptcp: sched: check both directions for backupMatthieu Baerts (NGI0)
The 'mptcp_subflow_context' structure has two items related to the backup flags: - 'backup': the subflow has been marked as backup by the other peer - 'request_bkup': the backup flag has been set by the host Before this patch, the scheduler was only looking at the 'backup' flag. That can make sense in some cases, but it looks like that's not what we wanted for the general use, because either the path-manager was setting both of them when sending an MP_PRIO, or the receiver was duplicating the 'backup' flag in the subflow request. Note that the use of these two flags in the path-manager are going to be fixed in the next commits, but this change here is needed not to modify the behaviour. Fixes: f296234c98a8 ("mptcp: Add handling of incoming MP_JOIN requests") Cc: stable@vger.kernel.org Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-29bpf: Fail verification for sign-extension of packet data/data_end/data_metaYonghong Song
syzbot reported a kernel crash due to commit 1f1e864b6555 ("bpf: Handle sign-extenstin ctx member accesses"). The reason is due to sign-extension of 32-bit load for packet data/data_end/data_meta uapi field. The original code looks like: r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */ r3 = *(u32 *)(r1 + 80) /* load __sk_buff->data_end */ r0 = r2 r0 += 8 if r3 > r0 goto +1 ... Note that __sk_buff->data load has 32-bit sign extension. After verification and convert_ctx_accesses(), the final asm code looks like: r2 = *(u64 *)(r1 +208) r2 = (s32)r2 r3 = *(u64 *)(r1 +80) r0 = r2 r0 += 8 if r3 > r0 goto pc+1 ... Note that 'r2 = (s32)r2' may make the kernel __sk_buff->data address invalid which may cause runtime failure. Currently, in C code, typically we have void *data = (void *)(long)skb->data; void *data_end = (void *)(long)skb->data_end; ... and it will generate r2 = *(u64 *)(r1 +208) r3 = *(u64 *)(r1 +80) r0 = r2 r0 += 8 if r3 > r0 goto pc+1 If we allow sign-extension, void *data = (void *)(long)(int)skb->data; void *data_end = (void *)(long)skb->data_end; ... the generated code looks like r2 = *(u64 *)(r1 +208) r2 <<= 32 r2 s>>= 32 r3 = *(u64 *)(r1 +80) r0 = r2 r0 += 8 if r3 > r0 goto pc+1 and this will cause verification failure since "r2 <<= 32" is not allowed as "r2" is a packet pointer. To fix this issue for case r2 = *(s32 *)(r1 + 76) /* load __sk_buff->data */ this patch added additional checking in is_valid_access() callback function for packet data/data_end/data_meta access. If those accesses are with sign-extenstion, the verification will fail. [1] https://lore.kernel.org/bpf/000000000000c90eee061d236d37@google.com/ Reported-by: syzbot+ad9ec60c8eaf69e6f99c@syzkaller.appspotmail.com Fixes: 1f1e864b6555 ("bpf: Handle sign-extenstin ctx member accesses") Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20240723153439.2429035-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-07-29bpf: Check unsupported ops from the bpf_struct_ops's cfi_stubsMartin KaFai Lau
The bpf_tcp_ca struct_ops currently uses a "u32 unsupported_ops[]" array to track which ops is not supported. After cfi_stubs had been added, the function pointer in cfi_stubs is also NULL for the unsupported ops. Thus, the "u32 unsupported_ops[]" becomes redundant. This observation was originally brought up in the bpf/cfi discussion: https://lore.kernel.org/bpf/CAADnVQJoEkdjyCEJRPASjBw1QGsKYrF33QdMGc1RZa9b88bAEA@mail.gmail.com/ The recent bpf qdisc patch (https://lore.kernel.org/bpf/20240714175130.4051012-6-amery.hung@bytedance.com/) also needs to specify quite many unsupported ops. It is a good time to clean it up. This patch removes the need of "u32 unsupported_ops[]" and tests for null-ness in the cfi_stubs instead. Testing the cfi_stubs is done in a new function bpf_struct_ops_supported(). The verifier will call bpf_struct_ops_supported() when loading the struct_ops program. The ".check_member" is removed from the bpf_tcp_ca in this patch. ".check_member" could still be useful for other subsytems to enforce other restrictions (e.g. sched_ext checks for prog->sleepable). To keep the same error return, ENOTSUPP is used. Cc: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240722183049.2254692-2-martin.lau@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-07-29mptcp: fix NL PM announced address accountingPaolo Abeni
Currently the per connection announced address counter is never decreased. As a consequence, after connection establishment, if the NL PM deletes an endpoint and adds a new/different one, no additional subflow is created for the new endpoint even if the current limits allow that. Address the issue properly updating the signaled address counter every time the NL PM removes such addresses. Fixes: 01cacb00b35c ("mptcp: add netlink-based PM") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29mptcp: fix user-space PM announced address accountingPaolo Abeni
Currently the per-connection announced address counter is never decreased. When the user-space PM is in use, this just affect the information exposed via diag/sockopt, but it could still foul the PM to wrong decision. Add the missing accounting for the user-space PM's sake. Fixes: 8b1c94da1e48 ("mptcp: only send RM_ADDR in nl_cmd_remove") Cc: stable@vger.kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29rtnetlink: Don't ignore IFLA_TARGET_NETNSID when ifname is specified in ↵Kuniyuki Iwashima
rtnl_dellink(). The cited commit accidentally replaced tgt_net with net in rtnl_dellink(). As a result, IFLA_TARGET_NETNSID is ignored if the interface is specified with IFLA_IFNAME or IFLA_ALT_IFNAME. Let's pass tgt_net to rtnl_dev_get(). Fixes: cc6090e985d7 ("net: rtnetlink: introduce helper to get net_device instance by ifname") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29tcp: Adjust clamping window for applications specifying SO_RCVBUFSubash Abhinov Kasiviswanathan
tp->scaling_ratio is not updated based on skb->len/skb->truesize once SO_RCVBUF is set leading to the maximum window scaling to be 25% of rcvbuf after commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") and 50% of rcvbuf after commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio"). 50% tries to emulate the behavior of older kernels using sysctl_tcp_adv_win_scale with default value. Systems which were using a different values of sysctl_tcp_adv_win_scale in older kernels ended up seeing reduced download speeds in certain cases as covered in https://lists.openwall.net/netdev/2024/05/15/13 While the sysctl scheme is no longer acceptable, the value of 50% is a bit conservative when the skb->len/skb->truesize ratio is later determined to be ~0.66. Applications not specifying SO_RCVBUF update the window scaling and the receiver buffer every time data is copied to userspace. This computation is now used for applications setting SO_RCVBUF to update the maximum window scaling while ensuring that the receive buffer is within the application specified limit. Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale") Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com> Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29ethtool: fix the state of additional contexts with old APIJakub Kicinski
We expect drivers implementing the new create/modify/destroy API to populate the defaults in struct ethtool_rxfh_context. In legacy API ctx isn't even passed, and rxfh.indir / rxfh.key are NULL so drivers can't give us defaults even if they want to. Call get_rxfh() to fetch the values. We can reuse rxfh_dev for the get_rxfh(), rxfh stores the input from the user. This fixes IOCTL reporting 0s instead of the default key / indir table for drivers using legacy API. Add a check to try to catch drivers using the new API but not populating the key. Fixes: 7964e7884643 ("net: ethtool: use the tracking array for get_rxfh on custom RSS contexts") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29ethtool: fix setting key and resetting indir at onceJakub Kicinski
The indirection table and the key follow struct ethtool_rxfh in user memory. To reset the indirection table user space calls SET_RXFH with table of size 0 (OTOH to say "no change" it should use -1 / ~0). The logic for calculating the offset where they key sits is incorrect in this case, as kernel would still offset by the full table length, while for the reset there is no indir table and key is immediately after the struct. $ ethtool -X eth0 default hkey 01:02:03... $ ethtool -x eth0 [...] RSS hash key: 00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00 [...] Fixes: 3de0b592394d ("ethtool: Support for configurable RSS hash key") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-29tun: Add missing bpf_net_ctx_clear() in do_xdp_generic()Jeongjun Park
There are cases where do_xdp_generic returns bpf_net_context without clearing it. This causes various memory corruptions, so the missing bpf_net_ctx_clear must be added. Reported-by: syzbot+44623300f057a28baf1e@syzkaller.appspotmail.com Fixes: fecef4cd42c6 ("tun: Assign missing bpf_net_context.") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reported-by: syzbot+3c2b6d5d4bec3b904933@syzkaller.appspotmail.com Reported-by: syzbot+707d98c8649695eaf329@syzkaller.appspotmail.com Reported-by: syzbot+c226757eb784a9da3e8b@syzkaller.appspotmail.com Reported-by: syzbot+61a1cfc2b6632363d319@syzkaller.appspotmail.com Reported-by: syzbot+709e4c85c904bcd62735@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-28minmax: add a few more MIN_T/MAX_T usersLinus Torvalds
Commit 3a7e02c040b1 ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes) (b) the type sanity checking and MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/ Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-07-26Merge tag 'for-net-2024-07-26' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - btmtk: Fix kernel crash when entering btmtk_usb_suspend - btmtk: Fix btmtk.c undefined reference build error - btintel: Fail setup on error - hci_sync: Fix suspending with wrong filter policy - hci_event: Fix setting DISCOVERY_FINDING for passive scanning * tag 'for-net-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: hci_event: Fix setting DISCOVERY_FINDING for passive scanning Bluetooth: btmtk: remove #ifdef around declarations Bluetooth: btmtk: Fix btmtk.c undefined reference build error harder Bluetooth: btmtk: Fix btmtk.c undefined reference build error Bluetooth: hci_sync: Fix suspending with wrong filter policy Bluetooth: btmtk: Fix kernel crash when entering btmtk_usb_suspend Bluetooth: btintel: Fail setup on error ==================== Link: https://patch.msgid.link/20240726150502.3300832-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-26Merge tag 'wireless-2024-07-26' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Couple of more urgent fixes: * ath12k: wowlan loop iteration issue * ath12k: fix soft lockup on suspend in certain scenarios * mt76: fix crash when removing an interface * mac80211: fix injection crash with some drivers that don't want monitor vif * cfg80211: fix S1G beacon parsing in scan * cfg80211: fix MLO link status reporting on connect * tag 'wireless-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: ath12k: fix soft lockup on suspend wifi: mt76: mt7921: fix null pointer access in mt792x_mac_link_bss_remove wifi: ath12k: fix reusing outside iterator in ath12k_wow_vif_set_wakeups() wifi: cfg80211: correct S1G beacon length calculation wifi: cfg80211: fix reporting failed MLO links status with cfg80211_connect_done wifi: mac80211: use monitor sdata with driver only if desired ==================== Link: https://patch.msgid.link/20240726122638.942420-3-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-26Bluetooth: hci_event: Fix setting DISCOVERY_FINDING for passive scanningLuiz Augusto von Dentz
DISCOVERY_FINDING shall only be set for active scanning as passive scanning is not meant to generate MGMT Device Found events causing discovering state to go out of sync since userspace would believe it is discovering when in fact it is just passive scanning. Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=219088 Fixes: 2e2515c1ba38 ("Bluetooth: hci_event: Set DISCOVERY_FINDING on SCAN_ENABLED") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-07-26Bluetooth: hci_sync: Fix suspending with wrong filter policyLuiz Augusto von Dentz
When suspending the scan filter policy cannot be 0x00 (no acceptlist) since that means the host has to process every advertisement report waking up the system, so this attempts to check if hdev is marked as suspended and if the resulting filter policy would be 0x00 (no acceptlist) then skip passive scanning if thre no devices in the acceptlist otherwise reset the filter policy to 0x01 so the acceptlist is used since the devices programmed there can still wakeup be system. Fixes: 182ee45da083 ("Bluetooth: hci_sync: Rework hci_suspend_notifier") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-07-26wifi: cfg80211: correct S1G beacon length calculationJohannes Berg
The minimum header length calculation (equivalent to the start of the elements) for the S1G long beacon erroneously required only up to the start of u.s1g_beacon rather than the start of u.s1g_beacon.variable. Fix that, and also shuffle the branches around a bit to not assign useless values that are overwritten later. Reported-by: syzbot+0f3afa93b91202f21939@syzkaller.appspotmail.com Fixes: 9eaffe5078ca ("cfg80211: convert S1G beacon to scan results") Link: https://patch.msgid.link/20240724132912.9662972db7c1.I8779675b5bbda4994cc66f876b6b87a2361c3c0b@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-26wifi: cfg80211: fix reporting failed MLO links status with cfg80211_connect_doneVeerendranath Jakkam
Individual MLO links connection status is not copied to EVENT_CONNECT_RESULT data while processing the connect response information in cfg80211_connect_done(). Due to this failed links are wrongly indicated with success status in EVENT_CONNECT_RESULT. To fix this, copy the individual MLO links status to the EVENT_CONNECT_RESULT data. Fixes: 53ad07e9823b ("wifi: cfg80211: support reporting failed links") Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com> Reviewed-by: Carlos Llamas <cmllamas@google.com> Link: https://patch.msgid.link/20240724125327.3495874-1-quic_vjakkam@quicinc.com [commit message editorial changes] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-26wifi: mac80211: use monitor sdata with driver only if desiredJohannes Berg
In commit 0d9c2beed116 ("wifi: mac80211: fix monitor channel with chanctx emulation") I changed mac80211 to always have an internal monitor_sdata to have something to have the chanctx bound to. However, if the driver didn't also have the WANT_MONITOR flag this would cause mac80211 to allocate it without telling the driver (which was intentional) but also use it for later APIs to the driver without it ever having known about it which was _not_ intentional. Check through the code and only use the monitor_sdata in the relevant places (TX, MU-MIMO follow settings, TX power, and interface iteration) when the WANT_MONITOR flag is set. Cc: stable@vger.kernel.org Fixes: 0d9c2beed116 ("wifi: mac80211: fix monitor channel with chanctx emulation") Reported-by: ZeroBeat <ZeroBeat@gmx.de> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219086 Tested-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20240725184836.25d334157a8e.I02574086da2c5cf0e18264ce5807db6f14ffd9c0@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-07-26sched: act_ct: take care of padding in struct zones_ht_keyEric Dumazet
Blamed commit increased lookup key size from 2 bytes to 16 bytes, because zones_ht_key got a struct net pointer. Make sure rhashtable_lookup() is not using the padding bytes which are not initialized. BUG: KMSAN: uninit-value in rht_ptr_rcu include/linux/rhashtable.h:376 [inline] BUG: KMSAN: uninit-value in __rhashtable_lookup include/linux/rhashtable.h:607 [inline] BUG: KMSAN: uninit-value in rhashtable_lookup include/linux/rhashtable.h:646 [inline] BUG: KMSAN: uninit-value in rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline] BUG: KMSAN: uninit-value in tcf_ct_flow_table_get+0x611/0x2260 net/sched/act_ct.c:329 rht_ptr_rcu include/linux/rhashtable.h:376 [inline] __rhashtable_lookup include/linux/rhashtable.h:607 [inline] rhashtable_lookup include/linux/rhashtable.h:646 [inline] rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline] tcf_ct_flow_table_get+0x611/0x2260 net/sched/act_ct.c:329 tcf_ct_init+0xa67/0x2890 net/sched/act_ct.c:1408 tcf_action_init_1+0x6cc/0xb30 net/sched/act_api.c:1425 tcf_action_init+0x458/0xf00 net/sched/act_api.c:1488 tcf_action_add net/sched/act_api.c:2061 [inline] tc_ctl_action+0x4be/0x19d0 net/sched/act_api.c:2118 rtnetlink_rcv_msg+0x12fc/0x1410 net/core/rtnetlink.c:6647 netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2550 rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6665 netlink_unicast_kernel net/netlink/af_netlink.c:1331 [inline] netlink_unicast+0xf52/0x1260 net/netlink/af_netlink.c:1357 netlink_sendmsg+0x10da/0x11e0 net/netlink/af_netlink.c:1901 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:745 ____sys_sendmsg+0x877/0xb60 net/socket.c:2597 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2651 __sys_sendmsg net/socket.c:2680 [inline] __do_sys_sendmsg net/socket.c:2689 [inline] __se_sys_sendmsg net/socket.c:2687 [inline] __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2687 x64_sys_call+0x2dd6/0x3c10 arch/x86/include/generated/asm/syscalls_64.h:47 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Local variable key created at: tcf_ct_flow_table_get+0x4a/0x2260 net/sched/act_ct.c:324 tcf_ct_init+0xa67/0x2890 net/sched/act_ct.c:1408 Fixes: 88c67aeb1407 ("sched: act_ct: add netns into the key of tcf_ct_flow_table") Reported-by: syzbot+1b5e4e187cc586d05ea0@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-25ethtool: rss: echo the context number backJakub Kicinski
The response to a GET request in Netlink should fully identify the queried object. RSS_GET accepts context id as an input, so it must echo that attribute back to the response. After (assuming context 1 has been created): $ ./cli.py --spec netlink/specs/ethtool.yaml \ --do rss-get \ --json '{"header": {"dev-index": 2}, "context": 1}' {'context': 1, 'header': {'dev-index': 2, 'dev-name': 'eth0'}, [...] Fixes: 7112a04664bf ("ethtool: add netlink based get rss support") Acked-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Joe Damato <jdamato@fastly.com> Link: https://patch.msgid.link/20240724234249.2621109-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-25Merge tag 'net-6.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and netfilter. A lot of networking people were at a conference last week, busy catching COVID, so relatively short PR. Current release - regressions: - tcp: process the 3rd ACK with sk_socket for TFO and MPTCP Current release - new code bugs: - l2tp: protect session IDR and tunnel session list with one lock, make sure the state is coherent to avoid a warning - eth: bnxt_en: update xdp_rxq_info in queue restart logic - eth: airoha: fix location of the MBI_RX_AGE_SEL_MASK field Previous releases - regressions: - xsk: require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len, the field reuses previously un-validated pad Previous releases - always broken: - tap/tun: drop short frames to prevent crashes later in the stack - eth: ice: add a per-VF limit on number of FDIR filters - af_unix: disable MSG_OOB handling for sockets in sockmap/sockhash" * tag 'net-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (34 commits) tun: add missing verification for short frame tap: add missing verification for short frame mISDN: Fix a use after free in hfcmulti_tx() gve: Fix an edge case for TSO skb validity check bnxt_en: update xdp_rxq_info in queue restart logic tcp: process the 3rd ACK with sk_socket for TFO/MPTCP selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len bpf: Fix a segment issue when downgrading gso_size net: mediatek: Fix potential NULL pointer dereference in dummy net_device handling MAINTAINERS: make Breno the netconsole maintainer MAINTAINERS: Update bonding entry net: nexthop: Initialize all fields in dumped nexthops net: stmmac: Correct byte order of perfect_match selftests: forwarding: skip if kernel not support setting bridge fdb learning limit tipc: Return non-zero value from tipc_udp_addr2str() on error netfilter: nft_set_pipapo_avx2: disable softinterrupts ice: Fix recipe read procedure ice: Add a per-VF limit on number of FDIR filters net: bonding: correctly annotate RCU in bond_should_notify_peers() ...
2024-07-25Merge tag 'constfy-sysctl-6.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl constification from Joel Granados: "Treewide constification of the ctl_table argument of proc_handlers using a coccinelle script and some manual code formatting fixups. This is a prerequisite to moving the static ctl_table structs into read-only data section which will ensure that proc_handler function pointers cannot be modified" * tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: treewide: constify the ctl_table argument of proc_handlers
2024-07-25Merge tag 'driver-core-6.11-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the big set of driver core changes for 6.11-rc1. Lots of stuff in here, with not a huge diffstat, but apis are evolving which required lots of files to be touched. Highlights of the changes in here are: - platform remove callback api final fixups (Uwe took many releases to get here, finally!) - Rust bindings for basic firmware apis and initial driver-core interactions. It's not all that useful for a "write a whole driver in rust" type of thing, but the firmware bindings do help out the phy rust drivers, and the driver core bindings give a solid base on which others can start their work. There is still a long way to go here before we have a multitude of rust drivers being added, but it's a great first step. - driver core const api changes. This reached across all bus types, and there are some fix-ups for some not-common bus types that linux-next and 0-day testing shook out. This work is being done to help make the rust bindings more safe, as well as the C code, moving toward the end-goal of allowing us to put driver structures into read-only memory. We aren't there yet, but are getting closer. - minor devres cleanups and fixes found by code inspection - arch_topology minor changes - other minor driver core cleanups All of these have been in linux-next for a very long time with no reported problems" * tag 'driver-core-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (55 commits) ARM: sa1100: make match function take a const pointer sysfs/cpu: Make crash_hotplug attribute world-readable dio: Have dio_bus_match() callback take a const * zorro: make match function take a const pointer driver core: module: make module_[add|remove]_driver take a const * driver core: make driver_find_device() take a const * driver core: make driver_[create|remove]_file take a const * firmware_loader: fix soundness issue in `request_internal` firmware_loader: annotate doctests as `no_run` devres: Correct code style for functions that return a pointer type devres: Initialize an uninitialized struct member devres: Fix memory leakage caused by driver API devm_free_percpu() devres: Fix devm_krealloc() wasting memory driver core: platform: Switch to use kmemdup_array() driver core: have match() callback in struct bus_type take a const * MAINTAINERS: add Rust device abstractions to DRIVER CORE device: rust: improve safety comments MAINTAINERS: add Danilo as FIRMWARE LOADER maintainer MAINTAINERS: add Rust FW abstractions to FIRMWARE LOADER firmware: rust: improve safety comments ...
2024-07-25Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-07-25 We've added 14 non-merge commits during the last 8 day(s) which contain a total of 19 files changed, 177 insertions(+), 70 deletions(-). The main changes are: 1) Fix af_unix to disable MSG_OOB handling for sockets in BPF sockmap and BPF sockhash. Also add test coverage for this case, from Michal Luczaj. 2) Fix a segmentation issue when downgrading gso_size in the BPF helper bpf_skb_adjust_room(), from Fred Li. 3) Fix a compiler warning in resolve_btfids due to a missing type cast, from Liwei Song. 4) Fix stack allocation for arm64 to align the stack pointer at a 16 byte boundary in the fexit_sleep BPF selftest, from Puranjay Mohan. 5) Fix a xsk regression to require a flag when actuating tx_metadata_len, from Stanislav Fomichev. 6) Fix function prototype BTF dumping in libbpf for prototypes that have no input arguments, from Andrii Nakryiko. 7) Fix stacktrace symbol resolution in perf script for BPF programs containing subprograms, from Hou Tao. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add XDP_UMEM_TX_METADATA_LEN to XSK TX metadata test xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_len bpf: Fix a segment issue when downgrading gso_size tools/resolve_btfids: Fix comparison of distinct pointer types warning in resolve_btfids bpf, events: Use prog to emit ksymbol event for main program selftests/bpf: Test sockmap redirect for AF_UNIX MSG_OOB selftests/bpf: Parametrize AF_UNIX redir functions to accept send() flags selftests/bpf: Support SOCK_STREAM in unix_inet_redir_to_connected() af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhash bpftool: Fix typo in usage help libbpf: Fix no-args func prototype BTF dumping syntax MAINTAINERS: Update powerpc BPF JIT maintainers MAINTAINERS: Update email address of Naveen selftests/bpf: fexit_sleep: Fix stack allocation for arm64 ==================== Link: https://patch.msgid.link/20240725114312.32197-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-25tcp: process the 3rd ACK with sk_socket for TFO/MPTCPMatthieu Baerts (NGI0)
The 'Fixes' commit recently changed the behaviour of TCP by skipping the processing of the 3rd ACK when a sk->sk_socket is set. The goal was to skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an unnecessary ACK in case of simultaneous connect(). Unfortunately, that had an impact on TFO and MPTCP. I started to look at the impact on MPTCP, because the MPTCP CI found some issues with the MPTCP Packetdrill tests [1]. Then Paolo Abeni suggested me to look at the impact on TFO with "plain" TCP. For MPTCP, when receiving the 3rd ACK of a request adding a new path (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that has been created when the MPTCP connection got established before with the first path. The newly added 'goto' will then skip the processing of the segment text (step 7) and not go through tcp_data_queue() where the MPTCP options are validated, and some actions are triggered, e.g. sending the MPJ 4th ACK [2] as demonstrated by the new errors when running a packetdrill test [3] establishing a second subflow. This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be delayed. Still, we don't want to have this behaviour as it delays the switch to the fully established mode, and invalid MPTCP options in this 3rd ACK will not be caught any more. This modification also affects the MPTCP + TFO feature as well, and being the reason why the selftests started to be unstable the last few days [4]. For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer passing: if the 3rd ACK contains data, and the connection is accept()ed before receiving them, these data would no longer be processed, and thus not ACKed. One last thing about MPTCP, in case of simultaneous connect(), a fallback to TCP will be done, which seems fine: `../common/defaults.sh` 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1> Simultaneous SYN-data crossing is also not supported by TFO, see [6]. Kuniyuki Iwashima suggested to restrict the processing to SYN+ACK only: that's a more generic solution than the one initially proposed, and also enough to fix the issues described above. Later on, Eric Dumazet mentioned that an ACK should still be sent in reaction to the second SYN+ACK that is received: not sending a DUPACK here seems wrong and could hurt: 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3 +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0 > S 0:0(0) <mss 1460, sackOK, TS val 1000 ecr 0,nop,wscale 8> +0 < S 0:0(0) win 1000 <mss 1000, sackOK, nop, nop> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 3308134035 ecr 0,nop,wscale 8> +0 < S. 0:0(0) ack 1 win 1000 <mss 1000, sackOK, nop, nop> +0 > . 1:1(0) ack 1 <nop, nop, sack 0:1> // <== Here So in this version, the 'goto consume' is dropped, to always send an ACK when switching from TCP_SYN_RECV to TCP_ESTABLISHED. This ACK will be seen as a DUPACK -- with DSACK if SACK has been negotiated -- in case of simultaneous SYN crossing: that's what is expected here. Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1] Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2] Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3] Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4] Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5] Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6] Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().") Suggested-by: Paolo Abeni <pabeni@redhat.com> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20240724-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v3-1-d48339764ce9@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-25xsk: Require XDP_UMEM_TX_METADATA_LEN to actuate tx_metadata_lenStanislav Fomichev
Julian reports that commit 341ac980eab9 ("xsk: Support tx_metadata_len") can break existing use cases which don't zero-initialize xdp_umem_reg padding. Introduce new XDP_UMEM_TX_METADATA_LEN to make sure we interpret the padding as tx_metadata_len only when being explicitly asked. Fixes: 341ac980eab9 ("xsk: Support tx_metadata_len") Reported-by: Julian Schindel <mail@arctic-alpaca.de> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20240713015253.121248-2-sdf@fomichev.me
2024-07-25bpf: Fix a segment issue when downgrading gso_sizeFred Li
Linearize the skb when downgrading gso_size because it may trigger a BUG_ON() later when the skb is segmented as described in [1,2]. Fixes: 2be7e212d5419 ("bpf: add bpf_skb_adjust_room helper") Signed-off-by: Fred Li <dracodingfly@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/all/20240626065555.35460-2-dracodingfly@gmail.com [1] Link: https://lore.kernel.org/all/668d5cf1ec330_1c18c32947@willemb.c.googlers.com.notmuch [2] Link: https://lore.kernel.org/bpf/20240719024653.77006-1-dracodingfly@gmail.com
2024-07-25Merge tag 'nf-24-07-24' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a Netfilter fix for net: Patch #1 if FPU is busy, then pipapo set backend falls back to standard set element lookup. Moreover, disable bh while at this. From Florian Westphal. netfilter pull request 24-07-24 * tag 'nf-24-07-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_set_pipapo_avx2: disable softinterrupts ==================== Link: https://patch.msgid.link/20240724081305.3152-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-24sysctl: treewide: constify the ctl_table argument of proc_handlersJoel Granados
const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer|interval)_handler|sched_(rt|rr)_handler|rds_tcp_skbuf_handler|proc_sctp_do_(hmac_alg|rto_min|rto_max|udp_port|alpha_beta|auth|probe_interval)"; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int write, void *buffer, size_t *lenp, loff_t *ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void *, size_t *, loff_t *); @r4@ identifier func, ctl; @@ int func( - struct ctl_table *ctl + const struct ctl_table *ctl ,int , void *, size_t *, loff_t *); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void *buffer, size_t *lenp, loff_t *ppos); ``` * Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>
2024-07-24net: nexthop: Initialize all fields in dumped nexthopsPetr Machata
struct nexthop_grp contains two reserved fields that are not initialized by nla_put_nh_group(), and carry garbage. This can be observed e.g. with strace (edited for clarity): # ip nexthop add id 1 dev lo # ip nexthop add id 101 group 1 # strace -e recvmsg ip nexthop get id 101 ... recvmsg(... [{nla_len=12, nla_type=NHA_GROUP}, [{id=1, weight=0, resvd1=0x69, resvd2=0x67}]] ...) = 52 The fields are reserved and therefore not currently used. But as they are, they leak kernel memory, and the fact they are not just zero complicates repurposing of the fields for new ends. Initialize the full structure. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-24tipc: Return non-zero value from tipc_udp_addr2str() on errorShigeru Yoshida
tipc_udp_addr2str() should return non-zero value if the UDP media address is invalid. Otherwise, a buffer overflow access can occur in tipc_media_addr_printf(). Fix this by returning 1 on an invalid UDP media address. Fixes: d0f91938bede ("tipc: add ip/udp media type") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Tung Nguyen <tung.q.nguyen@endava.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-24netfilter: nft_set_pipapo_avx2: disable softinterruptsFlorian Westphal
We need to disable softinterrupts, else we get following problem: 1. pipapo_avx2 called from process context; fpu usable 2. preempt_disable() called, pcpu scratchmap in use 3. softirq handles rx or tx, we re-enter pipapo_avx2 4. fpu busy, fallback to generic non-avx version 5. fallback reuses scratch map and index, which are in use by the preempted process Handle this same way as generic version by first disabling softinterrupts while the scratchmap is in use. Fixes: f0b3d338064e ("netfilter: nft_set_pipapo_avx2: Add irq_fpu_usable() check, fallback to non-AVX2 version") Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-07-23l2tp: make session IDR and tunnel session list coherentJames Chapman
Modify l2tp_session_register and l2tp_session_unhash so that the session IDR and tunnel session lists remain coherent. To do so, hold the session IDR lock and the tunnel's session list lock when making any changes to either list. Without this change, a rare race condition could hit the WARN_ON_ONCE in l2tp_session_unhash if a thread replaced the IDR entry while another thread was registering the same ID. [ 7126.151795][T17511] WARNING: CPU: 3 PID: 17511 at net/l2tp/l2tp_core.c:1282 l2tp_session_delete.part.0+0x87e/0xbc0 [ 7126.163754][T17511] ? show_regs+0x93/0xa0 [ 7126.164157][T17511] ? __warn+0xe5/0x3c0 [ 7126.164536][T17511] ? l2tp_session_delete.part.0+0x87e/0xbc0 [ 7126.165070][T17511] ? report_bug+0x2e1/0x500 [ 7126.165486][T17511] ? l2tp_session_delete.part.0+0x87e/0xbc0 [ 7126.166013][T17511] ? handle_bug+0x99/0x130 [ 7126.166428][T17511] ? exc_invalid_op+0x35/0x80 [ 7126.166890][T17511] ? asm_exc_invalid_op+0x1a/0x20 [ 7126.167372][T17511] ? l2tp_session_delete.part.0+0x87d/0xbc0 [ 7126.167900][T17511] ? l2tp_session_delete.part.0+0x87e/0xbc0 [ 7126.168429][T17511] ? __local_bh_enable_ip+0xa4/0x120 [ 7126.168917][T17511] l2tp_session_delete+0x40/0x50 [ 7126.169369][T17511] pppol2tp_release+0x1a1/0x3f0 [ 7126.169817][T17511] __sock_release+0xb3/0x270 [ 7126.170247][T17511] ? __pfx_sock_close+0x10/0x10 [ 7126.170697][T17511] sock_close+0x1c/0x30 [ 7126.171087][T17511] __fput+0x40b/0xb90 [ 7126.171470][T17511] task_work_run+0x16c/0x260 [ 7126.171897][T17511] ? __pfx_task_work_run+0x10/0x10 [ 7126.172362][T17511] ? srso_alias_return_thunk+0x5/0xfbef5 [ 7126.172863][T17511] ? do_raw_spin_unlock+0x174/0x230 [ 7126.173348][T17511] do_exit+0xaae/0x2b40 [ 7126.173730][T17511] ? srso_alias_return_thunk+0x5/0xfbef5 [ 7126.174235][T17511] ? __pfx_lock_release+0x10/0x10 [ 7126.174690][T17511] ? srso_alias_return_thunk+0x5/0xfbef5 [ 7126.175190][T17511] ? do_raw_spin_lock+0x12c/0x2b0 [ 7126.175650][T17511] ? __pfx_do_exit+0x10/0x10 [ 7126.176072][T17511] ? _raw_spin_unlock_irq+0x23/0x50 [ 7126.176543][T17511] do_group_exit+0xd3/0x2a0 [ 7126.176990][T17511] __x64_sys_exit_group+0x3e/0x50 [ 7126.177456][T17511] x64_sys_call+0x1821/0x1830 [ 7126.177895][T17511] do_syscall_64+0xcb/0x250 [ 7126.178317][T17511] entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: aa5e17e1f5ec ("l2tp: store l2tpv3 sessions in per-net IDR") Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: Tom Parkin <tparkin@katalix.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240718134348.289865-1-jchapman@katalix.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-23ipv4: Fix incorrect source address in Record Route optionIdo Schimmel
The Record Route IP option records the addresses of the routers that routed the packet. In the case of forwarded packets, the kernel performs a route lookup via fib_lookup() and fills in the preferred source address of the matched route. The lookup is performed with the DS field of the forwarded packet, but using the RT_TOS() macro which only masks one of the two ECN bits. If the packet is ECT(0) or CE, the matched route might be different than the route via which the packet was forwarded as the input path masks both of the ECN bits, resulting in the wrong address being filled in the Record Route option. Fix by masking both of the ECN bits. Fixes: 8e36360ae876 ("ipv4: Remove route key identity dependencies in ip_rt_get_source().") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20240718123407.434778-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-21Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - In the series "treewide: Refactor heap related implementation", Kuan-Wei Chiu has significantly reworked the min_heap library code and has taught bcachefs to use the new more generic implementation. - Yury Norov's series "Cleanup cpumask.h inclusion in core headers" reworks the cpumask and nodemask headers to make things generally more rational. - Kuan-Wei Chiu has sent along some maintenance work against our sorting library code in the series "lib/sort: Optimizations and cleanups". - More library maintainance work from Christophe Jaillet in the series "Remove usage of the deprecated ida_simple_xx() API". - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the series "nilfs2: eliminate the call to inode_attach_wb()". - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB command error". - Plus the usual shower of singleton patches all over the place. Please see the relevant changelogs for details. * tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits) ia64: scrub ia64 from poison.h watchdog/perf: properly initialize the turbo mode timestamp and rearm counter tsacct: replace strncpy() with strscpy() lib/bch.c: use swap() to improve code test_bpf: convert comma to semicolon init/modpost: conditionally check section mismatch to __meminit* init: remove unused __MEMINIT* macros nilfs2: Constify struct kobj_type nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro math: rational: add missing MODULE_DESCRIPTION() macro lib/zlib: add missing MODULE_DESCRIPTION() macro fs: ufs: add MODULE_DESCRIPTION() lib/rbtree.c: fix the example typo ocfs2: add bounds checking to ocfs2_check_dir_entry() fs: add kernel-doc comments to ocfs2_prepare_orphan_dir() coredump: simplify zap_process() selftests/fpu: add missing MODULE_DESCRIPTION() macro compiler.h: simplify data_race() macro build-id: require program headers to be right after ELF header resource: add missing MODULE_DESCRIPTION() ...
2024-07-19Merge tag 'net-6.11-rc0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfilter. Notably this includes fixes for a s390 build breakage. Current release - new code bugs: - eth: fbnic: fix s390 build - eth: airoha: fix NULL pointer dereference in airoha_qdma_cleanup_rx_queue() Previous releases - regressions: - flow_dissector: use DEBUG_NET_WARN_ON_ONCE - ipv4: fix incorrect TOS in route get reply - dsa: fix chip-wide frame size config in some drivers Previous releases - always broken: - netfilter: nf_set_pipapo: fix initial map fill - eth: gve: fix XDP TX completion handling when counters overflow" * tag 'net-6.11-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: eth: fbnic: don't build the driver when skb has more than 21 frags net: dsa: b53: Limit chip-wide jumbo frame config to CPU ports net: dsa: mv88e6xxx: Limit chip-wide frame size config to CPU ports net: airoha: Fix NULL pointer dereference in airoha_qdma_cleanup_rx_queue() net: wwan: t7xx: add support for Dell DW5933e ipv4: Fix incorrect TOS in fibmatch route get reply ipv4: Fix incorrect TOS in route get reply net: flow_dissector: use DEBUG_NET_WARN_ON_ONCE driver core: auxiliary bus: Fix documentation of auxiliary_device net: airoha: fix error branch in airoha_dev_xmit and airoha_set_gdm_ports gve: Fix XDP TX completion handling when counters overflow ipvs: properly dereference pe in ip_vs_add_service selftests: netfilter: add test case for recent mismatch bug netfilter: nf_set_pipapo: fix initial map fill netfilter: ctnetlink: use helper function to calculate expect ID eth: fbnic: fix s390 build.
2024-07-19Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "Several new features here: - Virtio find vqs API has been reworked (required to fix the scalability issue we have with adminq, which I hope to merge later in the cycle) - vDPA driver for Marvell OCTEON - virtio fs performance improvement - mlx5 migration speedups Fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (56 commits) virtio: rename virtio_find_vqs_info() to virtio_find_vqs() virtio: remove unused virtio_find_vqs() and virtio_find_vqs_ctx() helpers virtio: convert the rest virtio_find_vqs() users to virtio_find_vqs_info() virtio_balloon: convert to use virtio_find_vqs_info() virtiofs: convert to use virtio_find_vqs_info() scsi: virtio_scsi: convert to use virtio_find_vqs_info() virtio_net: convert to use virtio_find_vqs_info() virtio_crypto: convert to use virtio_find_vqs_info() virtio_console: convert to use virtio_find_vqs_info() virtio_blk: convert to use virtio_find_vqs_info() virtio: rename find_vqs_info() op to find_vqs() virtio: remove the original find_vqs() op virtio: call virtio_find_vqs_info() from virtio_find_single_vq() directly virtio: convert find_vqs() op implementations to find_vqs_info() virtio_pci: convert vp_*find_vqs() ops to find_vqs_info() virtio: introduce virtio_queue_info struct and find_vqs_info() config op virtio: make virtio_find_single_vq() call virtio_find_vqs() virtio: make virtio_find_vqs() call virtio_find_vqs_ctx() caif_virtio: use virtio_find_single_vq() for single virtqueue finding vdpa/mlx5: Don't enable non-active VQs in .set_vq_ready() ...
2024-07-19sunrpc: avoid -Wformat-security warningArnd Bergmann
Using a non-constant string as an sprintf-style is potentially dangerous: net/sunrpc/svc.c: In function 'param_get_pool_mode': net/sunrpc/svc.c:164:32: error: format not a string literal and no format arguments [-Werror=format-security] Use a literal "%s" format instead. Fixes: 5f71f3c32553 ("sunrpc: refactor pool_mode setting code") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-18Merge tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Pull NFS client updates from Anna Schumaker: "New Features: - Add support for large folios - Implement rpcrdma generic device removal notification - Add client support for attribute delegations - Use a LAYOUTRETURN during reboot recovery to report layoutstats and errors - Improve throughput for random buffered writes - Add NVMe support to pnfs/blocklayout Bugfixes: - Fix rpcrdma_reqs_reset() - Avoid soft lockups when using UDP - Fix an nfs/blocklayout premature PR key unregestration - Another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server - Do not extend writes to the entire folio - Pass explicit offset and count values to tracepoints - Fix a race to wake up sleeping SUNRPC sync tasks - Fix gss_status tracepoint output Cleanups: - Add missing MODULE_DESCRIPTION() macros - Add blocklayout / SCSI layout tracepoints - Remove asm-generic headers from xprtrdma verbs.c - Remove unused 'struct mnt_fhstatus' - Other delegation related cleanups - Other folio related cleanups - Other pNFS related cleanups - Other xprtrdma cleanups" * tag 'nfs-for-6.11-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits) SUNRPC: Fixup gss_status tracepoint error output SUNRPC: Fix a race to wake a sync task nfs: split nfs_read_folio nfs: pass explicit offset/count to trace events nfs: do not extend writes to the entire folio nfs/blocklayout: add support for NVMe nfs: remove nfs_page_length nfs: remove the unused max_deviceinfo_size field from struct pnfs_layoutdriver_type nfs: don't reuse partially completed requests in nfs_lock_and_join_requests nfs: move nfs_wait_on_request to write.c nfs: fold nfs_page_group_lock_subrequests into nfs_lock_and_join_requests nfs: fold nfs_folio_find_and_lock_request into nfs_lock_and_join_requests nfs: simplify nfs_folio_find_and_lock_request nfs: remove nfs_folio_private_request nfs: remove dead code for the old swap over NFS implementation NFSv4.1 another fix for EXCHGID4_FLAG_USE_PNFS_DS for DS server nfs: Block on write congestion nfs: Properly initialize server->writeback nfs: Drop pointless check from nfs_commit_release_pages() nfs/blocklayout: SCSI layout trace points for reservation key reg/unreg ...
2024-07-18SUNRPC: Fix a race to wake a sync taskBenjamin Coddington
We've observed NFS clients with sync tasks sleeping in __rpc_execute waiting on RPC_TASK_QUEUED that have not responded to a wake-up from rpc_make_runnable(). I suspect this problem usually goes unnoticed, because on a busy client the task will eventually be re-awoken by another task completion or xprt event. However, if the state manager is draining the slot table, a sync task missing a wake-up can result in a hung client. We've been able to prove that the waker in rpc_make_runnable() successfully calls wake_up_bit() (ie- there's no race to tk_runstate), but the wake_up_bit() call fails to wake the waiter. I suspect the waker is missing the load of the bit's wait_queue_head, so waitqueue_active() is false. There are some very helpful comments about this problem above wake_up_bit(), prepare_to_wait(), and waitqueue_active(). Fix this by inserting smp_mb__after_atomic() before the wake_up_bit(), which pairs with prepare_to_wait() calling set_current_state(). Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2024-07-18Merge tag 'nf-24-07-17' of ↵Paolo Abeni
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Call nf_expect_get_id() to delete expectation by ID. By trial and error it is possible to leak the LSB of the expectation address on x86_64. This bug is a leftover when converting the existing code to use nf_expect_get_id(). 2) Incorrect initialization in pipapo set backend leads to packet mismatches. From Florian Westphal. 3) Extend netfilter's selftests to cover for the pipapo set backend, also from Florian. 4) Fix sparse warning in IPVS when adding service, from Chen Hanxiao. netfilter pull request 24-07-17 * tag 'nf-24-07-17' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: ipvs: properly dereference pe in ip_vs_add_service selftests: netfilter: add test case for recent mismatch bug netfilter: nf_set_pipapo: fix initial map fill netfilter: ctnetlink: use helper function to calculate expect ID ==================== Link: https://patch.msgid.link/20240717215214.225394-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18ipv4: Fix incorrect TOS in fibmatch route get replyIdo Schimmel
The TOS value that is returned to user space in the route get reply is the one with which the lookup was performed ('fl4->flowi4_tos'). This is fine when the matched route is configured with a TOS as it would not match if its TOS value did not match the one with which the lookup was performed. However, matching on TOS is only performed when the route's TOS is not zero. It is therefore possible to have the kernel incorrectly return a non-zero TOS: # ip link add name dummy1 up type dummy # ip address add 192.0.2.1/24 dev dummy1 # ip route get fibmatch 192.0.2.2 tos 0xfc 192.0.2.0/24 tos 0x1c dev dummy1 proto kernel scope link src 192.0.2.1 Fix by instead returning the DSCP field from the FIB result structure which was populated during the route lookup. Output after the patch: # ip link add name dummy1 up type dummy # ip address add 192.0.2.1/24 dev dummy1 # ip route get fibmatch 192.0.2.2 tos 0xfc 192.0.2.0/24 dev dummy1 proto kernel scope link src 192.0.2.1 Extend the existing selftests to not only verify that the correct route is returned, but that it is also returned with correct "tos" value (or without it). Fixes: b61798130f1b ("net: ipv4: RTM_GETROUTE: return matched fib result when requested") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18ipv4: Fix incorrect TOS in route get replyIdo Schimmel
The TOS value that is returned to user space in the route get reply is the one with which the lookup was performed ('fl4->flowi4_tos'). This is fine when the matched route is configured with a TOS as it would not match if its TOS value did not match the one with which the lookup was performed. However, matching on TOS is only performed when the route's TOS is not zero. It is therefore possible to have the kernel incorrectly return a non-zero TOS: # ip link add name dummy1 up type dummy # ip address add 192.0.2.1/24 dev dummy1 # ip route get 192.0.2.2 tos 0xfc 192.0.2.2 tos 0x1c dev dummy1 src 192.0.2.1 uid 0 cache Fix by adding a DSCP field to the FIB result structure (inside an existing 4 bytes hole), populating it in the route lookup and using it when filling the route get reply. Output after the patch: # ip link add name dummy1 up type dummy # ip address add 192.0.2.1/24 dev dummy1 # ip route get 192.0.2.2 tos 0xfc 192.0.2.2 dev dummy1 src 192.0.2.1 uid 0 cache Fixes: 1a00fee4ffb2 ("ipv4: Remove rt_key_{src,dst,tos} from struct rtable.") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-18net: flow_dissector: use DEBUG_NET_WARN_ON_ONCEPablo Neira Ayuso
The following splat is easy to reproduce upstream as well as in -stable kernels. Florian Westphal provided the following commit: d1dab4f71d37 ("net: add and use __skb_get_hash_symmetric_net") but this complementary fix has been also suggested by Willem de Bruijn and it can be easily backported to -stable kernel which consists in using DEBUG_NET_WARN_ON_ONCE instead to silence the following splat given __skb_get_hash() is used by the nftables tracing infrastructure to to identify packets in traces. [69133.561393] ------------[ cut here ]------------ [69133.561404] WARNING: CPU: 0 PID: 43576 at net/core/flow_dissector.c:1104 __skb_flow_dissect+0x134f/ [...] [69133.561944] CPU: 0 PID: 43576 Comm: socat Not tainted 6.10.0-rc7+ #379 [69133.561959] RIP: 0010:__skb_flow_dissect+0x134f/0x2ad0 [69133.561970] Code: 83 f9 04 0f 84 b3 00 00 00 45 85 c9 0f 84 aa 00 00 00 41 83 f9 02 0f 84 81 fc ff ff 44 0f b7 b4 24 80 00 00 00 e9 8b f9 ff ff <0f> 0b e9 20 f3 ff ff 41 f6 c6 20 0f 84 e4 ef ff ff 48 8d 7b 12 e8 [69133.561979] RSP: 0018:ffffc90000006fc0 EFLAGS: 00010246 [69133.561988] RAX: 0000000000000000 RBX: ffffffff82f33e20 RCX: ffffffff81ab7e19 [69133.561994] RDX: dffffc0000000000 RSI: ffffc90000007388 RDI: ffff888103a1b418 [69133.562001] RBP: ffffc90000007310 R08: 0000000000000000 R09: 0000000000000000 [69133.562007] R10: ffffc90000007388 R11: ffffffff810cface R12: ffff888103a1b400 [69133.562013] R13: 0000000000000000 R14: ffffffff82f33e2a R15: ffffffff82f33e28 [69133.562020] FS: 00007f40f7131740(0000) GS:ffff888390800000(0000) knlGS:0000000000000000 [69133.562027] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [69133.562033] CR2: 00007f40f7346ee0 CR3: 000000015d200001 CR4: 00000000001706f0 [69133.562040] Call Trace: [69133.562044] <IRQ> [69133.562049] ? __warn+0x9f/0x1a0 [ 1211.841384] ? __skb_flow_dissect+0x107e/0x2860 [...] [ 1211.841496] ? bpf_flow_dissect+0x160/0x160 [ 1211.841753] __skb_get_hash+0x97/0x280 [ 1211.841765] ? __skb_get_hash_symmetric+0x230/0x230 [ 1211.841776] ? mod_find+0xbf/0xe0 [ 1211.841786] ? get_stack_info_noinstr+0x12/0xe0 [ 1211.841798] ? bpf_ksym_find+0x56/0xe0 [ 1211.841807] ? __rcu_read_unlock+0x2a/0x70 [ 1211.841819] nft_trace_init+0x1b9/0x1c0 [nf_tables] [ 1211.841895] ? nft_trace_notify+0x830/0x830 [nf_tables] [ 1211.841964] ? get_stack_info+0x2b/0x80 [ 1211.841975] ? nft_do_chain_arp+0x80/0x80 [nf_tables] [ 1211.842044] nft_do_chain+0x79c/0x850 [nf_tables] Fixes: 9b52e3f267a6 ("flow_dissector: handle no-skb use case") Suggested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240715141442.43775-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-17ipvs: properly dereference pe in ip_vs_add_serviceChen Hanxiao
Use pe directly to resolve sparse warning: net/netfilter/ipvs/ip_vs_ctl.c:1471:27: warning: dereference of noderef expression Fixes: 39b972231536 ("ipvs: handle connections started by real-servers") Signed-off-by: Chen Hanxiao <chenhx.fnst@fujitsu.com> Acked-by: Julian Anastasov <ja@ssi.bg> Acked-by: Simon Horman <horms@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-07-17af_unix: Disable MSG_OOB handling for sockets in sockmap/sockhashMichal Luczaj
AF_UNIX socket tracks the most recent OOB packet (in its receive queue) with an `oob_skb` pointer. BPF redirecting does not account for that: when an OOB packet is moved between sockets, `oob_skb` is left outdated. This results in a single skb that may be accessed from two different sockets. Take the easy way out: silently drop MSG_OOB data targeting any socket that is in a sockmap or a sockhash. Note that such silent drop is akin to the fate of redirected skb's scm_fp_list (SCM_RIGHTS, SCM_CREDENTIALS). For symmetry, forbid MSG_OOB in unix_bpf_recvmsg(). Fixes: 314001f0bf92 ("af_unix: Add OOB support") Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20240713200218.2140950-2-mhal@rbox.co
2024-07-17Merge tag 'nfsd-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "This is a light release containing optimizations, code clean-ups, and minor bug fixes. This development cycle focused on work outside of upstream kernel development: - Continuing to build upstream CI for NFSD based on kdevops - Continuing to focus on the quality of NFSD in LTS kernels - Participation in IETF nfsv4 WG discussions about NFSv4 ACLs, directory delegation, and NFSv4.2 COPY offload Notable features for v6.11 that do not come through the NFSD tree include NFS server-side support for the new pNFS NVMe layout type [RFC9561]. Functional testing for pNFS block layouts like this one has been introduced to our kdevops CI harness. Work on improving the resolution of file attribute time stamps in local filesystems is also ongoing tree-wide. As always I am grateful to NFSD contributors, reviewers, testers, and bug reporters who participated during this cycle" * tag 'nfsd-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: nfsd_file_lease_notifier_call gets a file_lease as an argument gss_krb5: Fix the error handling path for crypto_sync_skcipher_setkey MAINTAINERS: Add a bugzilla link for NFSD nfsd: new netlink ops to get/set server pool_mode sunrpc: refactor pool_mode setting code nfsd: allow passing in array of thread counts via netlink nfsd: make nfsd_svc take an array of thread counts sunrpc: fix up the special handling of sv_nrpools == 1 SUNRPC: Add a trace point in svc_xprt_deferred_close NFSD: Support write delegations in LAYOUTGET lockd: Use *-y instead of *-objs in Makefile NFSD: Fix nfsdcld warning svcrdma: Handle ADDR_CHANGE CM event properly svcrdma: Refactor the creation of listener CMA ID NFSD: remove unused structs 'nfsd3_voidargs' NFSD: harden svcxdr_dupstr() and svcxdr_tmpalloc() against integer overflows
2024-07-17netfilter: nf_set_pipapo: fix initial map fillFlorian Westphal
The initial buffer has to be inited to all-ones, but it must restrict it to the size of the first field, not the total field size. After each round in the map search step, the result and the fill map are swapped, so if we have a set where f->bsize of the first element is smaller than m->bsize_max, those one-bits are leaked into future rounds result map. This makes pipapo find an incorrect matching results for sets where first field size is not the largest. Followup patch adds a test case to nft_concat_range.sh selftest script. Thanks to Stefano Brivio for pointing out that we need to zero out the remainder explicitly, only correcting memset() argument isn't enough. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Reported-by: Yi Chen <yiche@redhat.com> Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-07-17netfilter: ctnetlink: use helper function to calculate expect IDPablo Neira Ayuso
Delete expectation path is missing a call to the nf_expect_get_id() helper function to calculate the expectation ID, otherwise LSB of the expectation object address is leaked to userspace. Fixes: 3c79107631db ("netfilter: ctnetlink: don't use conntrack/expect object addresses as id") Reported-by: zdi-disclosures@trendmicro.com Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>