summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-11-28net: page_pool: report when page pool was destroyedJakub Kicinski
Report when page pool was destroyed. Together with the inflight / memory use reporting this can serve as a replacement for the warning about leaked page pools we currently print to dmesg. Example output for a fake leaked page pool using some hacks in netdevsim (one "live" pool, and one "leaked" on the same dev): $ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \ --dump page-pool-get [{'id': 2, 'ifindex': 3}, {'id': 1, 'ifindex': 3, 'destroyed': 133, 'inflight': 1}] Tested-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28net: page_pool: report amount of memory held by page poolsJakub Kicinski
Advanced deployments need the ability to check memory use of various system components. It makes it possible to make informed decisions about memory allocation and to find regressions and leaks. Report memory use of page pools. Report both number of references and bytes held. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28net: page_pool: add netlink notifications for state changesJakub Kicinski
Generate netlink notifications about page pool state changes. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28net: page_pool: implement GET in the netlink APIJakub Kicinski
Expose the very basic page pool information via netlink. Example using ynl-py for a system with 9 queues: $ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \ --dump page-pool-get [{'id': 19, 'ifindex': 2, 'napi-id': 147}, {'id': 18, 'ifindex': 2, 'napi-id': 146}, {'id': 17, 'ifindex': 2, 'napi-id': 145}, {'id': 16, 'ifindex': 2, 'napi-id': 144}, {'id': 15, 'ifindex': 2, 'napi-id': 143}, {'id': 14, 'ifindex': 2, 'napi-id': 142}, {'id': 13, 'ifindex': 2, 'napi-id': 141}, {'id': 12, 'ifindex': 2, 'napi-id': 140}, {'id': 11, 'ifindex': 2, 'napi-id': 139}, {'id': 10, 'ifindex': 2, 'napi-id': 138}] Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28net: page_pool: stash the NAPI ID for easier accessJakub Kicinski
To avoid any issues with race conditions on accessing napi and having to think about the lifetime of NAPI objects in netlink GET - stash the napi_id to which page pool was linked at creation time. Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28net: page_pool: record pools per netdevJakub Kicinski
Link the page pools with netdevs. This needs to be netns compatible so we have two options. Either we record the pools per netns and have to worry about moving them as the netdev gets moved. Or we record them directly on the netdev so they move with the netdev without any extra work. Implement the latter option. Since pools may outlast netdev we need a place to store orphans. In time honored tradition use loopback for this purpose. Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28net: page_pool: id the page poolsJakub Kicinski
To give ourselves the flexibility of creating netlink commands and ability to refer to page pool instances in uAPIs create IDs for page pools. Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-28net: page_pool: factor out uninitJakub Kicinski
We'll soon (next change in the series) need a fuller unwind path in page_pool_create() so create the inverse of page_pool_init(). Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-27Merge tag 'wireless-next-2023-11-27' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v6.8 The first features pull request for v6.8. Not so big in number of commits but we removed quite a few ancient drivers: libertas 16-bit PCMCIA support, atmel, hostap, zd1201, orinoco, ray_cs, wl3501 and rndis_wlan. Major changes: cfg80211/mac80211 - extend support for scanning while Multi-Link Operation (MLO) connected * tag 'wireless-next-2023-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (68 commits) wifi: nl80211: Documentation update for NL80211_CMD_PORT_AUTHORIZED event wifi: mac80211: Extend support for scanning while MLO connected wifi: cfg80211: Extend support for scanning while MLO connected wifi: ieee80211: fix PV1 frame control field name rfkill: return ENOTTY on invalid ioctl MAINTAINERS: update iwlwifi maintainers wifi: rtw89: 8922a: read efuse content from physical map wifi: rtw89: 8922a: read efuse content via efuse map struct from logic map wifi: rtw89: 8852c: read RX gain offset from efuse for 6GHz channels wifi: rtw89: mac: add to access efuse for WiFi 7 chips wifi: rtw89: mac: use mac_gen pointer to access about efuse wifi: rtw89: 8922a: add 8922A basic chip info wifi: rtlwifi: drop unused const_amdpci_aspm wifi: mwifiex: mwifiex_process_sleep_confirm_resp(): remove unused priv variable wifi: rtw89: regd: update regulatory map to R65-R44 wifi: rtw89: regd: handle policy of 6 GHz according to BIOS wifi: rtw89: acpi: process 6 GHz band policy from DSM wifi: rtlwifi: simplify rtl_action_proc() and rtl_tx_agg_start() wifi: rtw89: pci: update interrupt mitigation register for 8922AE wifi: rtw89: pci: correct interrupt mitigation register for 8852CE ... ==================== Link: https://lore.kernel.org/r/20231127180056.0B48DC433C8@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-27bpf: Fix a few selftest failures due to llvm18 changeYonghong Song
With latest upstream llvm18, the following test cases failed: $ ./test_progs -j #13/2 bpf_cookie/multi_kprobe_link_api:FAIL #13/3 bpf_cookie/multi_kprobe_attach_api:FAIL #13 bpf_cookie:FAIL #77 fentry_fexit:FAIL #78/1 fentry_test/fentry:FAIL #78 fentry_test:FAIL #82/1 fexit_test/fexit:FAIL #82 fexit_test:FAIL #112/1 kprobe_multi_test/skel_api:FAIL #112/2 kprobe_multi_test/link_api_addrs:FAIL [...] #112 kprobe_multi_test:FAIL #356/17 test_global_funcs/global_func17:FAIL #356 test_global_funcs:FAIL Further analysis shows llvm upstream patch [1] is responsible for the above failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c, without [1], the asm code is: 0000000000000400 <bpf_fentry_test7>: 400: f3 0f 1e fa endbr64 404: e8 00 00 00 00 callq 0x409 <bpf_fentry_test7+0x9> 409: 48 89 f8 movq %rdi, %rax 40c: c3 retq 40d: 0f 1f 00 nopl (%rax) ... and with [1], the asm code is: 0000000000005d20 <bpf_fentry_test7.specialized.1>: 5d20: e8 00 00 00 00 callq 0x5d25 <bpf_fentry_test7.specialized.1+0x5> 5d25: c3 retq ... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7> and this caused test failures for #13/#77 etc. except #356. For test case #356/17, with [1] (progs/test_global_func17.c)), the main prog looks like: 0000000000000000 <global_func17>: 0: b4 00 00 00 2a 00 00 00 w0 = 0x2a 1: 95 00 00 00 00 00 00 00 exit ... which passed verification while the test itself expects a verification failure. Let us add 'barrier_var' style asm code in both places to prevent function specialization which caused selftests failure. [1] https://github.com/llvm/llvm-project/pull/72903 Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
2023-11-27wifi: mac80211: use wiphy locked debugfs for sdata/linkJohannes Berg
The debugfs files for netdevs (sdata) and links are removed with the wiphy mutex held, which may deadlock. Use the new wiphy locked debugfs to avoid that. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-27wifi: mac80211: use wiphy locked debugfs helpers for agg_statusJohannes Berg
The read is currently with RCU and the write can deadlock, convert both for the sake of illustration. Make mac80211 depend on cfg80211 debugfs to get the helpers, but mac80211 debugfs without it does nothing anyway. This also required some adjustments in ath9k. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-27wifi: cfg80211: add locked debugfs wrappersJohannes Berg
Add wrappers for debugfs files that should be called with the wiphy mutex held, while the file is also to be removed under the wiphy mutex. This could otherwise deadlock when a file is trying to acquire the wiphy mutex while the code removing it holds the mutex but waits for the removal. This actually works by pushing the execution of the read or write handler to a wiphy work that can be cancelled using the debugfs cancellation API. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24wifi: mac80211: Extend support for scanning while MLO connectedIlan Peer
- If the scan request includes a link ID, validate that it is one of the active links. Otherwise, if the scan request doesn't include a valid link ID, select one of the active links. - When reporting the TSF for a BSS entry, use the link ID information from the Rx status or the scan request to set the parent BSSID. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20231113112844.68564692c404.Iae9605cbb7f9d52e00ce98260b3559a34cf18341@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24wifi: cfg80211: Extend support for scanning while MLO connectedIlan Peer
To extend the support of TSF accounting in scan results for MLO connections, allow to indicate in the scan request the link ID corresponding to the BSS whose TSF should be used for the TSF accounting. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Signed-off-by: Gregory Greenman <gregory.greenman@intel.com> Link: https://lore.kernel.org/r/20231113112844.d4490bcdefb1.I8fcd158b810adddef4963727e9153096416b30ce@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24rfkill: return ENOTTY on invalid ioctlThomas Weißschuh
For unknown ioctls the correct error is ENOTTY "Inappropriate ioctl for device". ENOSYS as returned before should only be used to indicate that a syscall is not available at all. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://lore.kernel.org/r/20231101-rfkill-ioctl-enosys-v1-1-5bf374fabffe@weissschuh.net [in theory this breaks userspace API, but it was discussed and researched, and nothing found relying on the current behaviour] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24wifi: mac80211: handle 320 MHz in ieee80211_ht_cap_ie_to_sta_ht_capBen Greear
The new 320 MHz channel width wasn't handled, so connecting a station to a 320 MHz AP would limit the station to 20 MHz (on HT) after a warning, handle 320 MHz to fix that. Signed-off-by: Ben Greear <greearb@candelatech.com> Link: https://lore.kernel.org/r/20231109182201.495381-1-greearb@candelatech.com [write a proper commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24wifi: cfg80211: hold wiphy mutex for send_interfaceJohannes Berg
Given all the locking rework in mac80211, we pretty much need to get into the driver with the wiphy mutex held in all callbacks. This is already mostly the case, but as Johan reported, in the get_txpower it may not be true. Lock the wiphy mutex around nl80211_send_iface(), then is also around callers of nl80211_notify_iface(). This is easy to do, fixes the problem, and aligns the locking between various calls to it in different parts of the code of cfg80211. Fixes: 0e8185ce1dde ("wifi: mac80211: check wiphy mutex in ops") Reported-by: Johan Hovold <johan@kernel.org> Closes: https://lore.kernel.org/r/ZVOXX6qg4vXEx8dX@hovoldconsulting.com Tested-by: Johan Hovold <johan+linaro@kernel.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24wifi: cfg80211: lock wiphy mutex for rfkill pollJohannes Berg
We want to guarantee the mutex is held for pretty much all operations, so ensure that here as well. Reported-by: syzbot+7e59a5bfc7a897247e18@syzkaller.appspotmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24wifi: cfg80211: fix CQM for non-range useJohannes Berg
My prior race fix here broke CQM when ranges aren't used, as the reporting worker now requires the cqm_config to be set in the wdev, but isn't set when there's no range configured. Rather than continuing to special-case the range version, set the cqm_config always and configure accordingly, also tracking if range was used or not to be able to clear the configuration appropriately with the same API, which was actually not right if both were implemented by a driver for some reason, as is the case with mac80211 (though there the implementations are equivalent so it doesn't matter.) Also, the original multiple-RSSI commit lost checking for the callback, so might have potentially crashed if a driver had neither implementation, and userspace tried to use it despite not being advertised as supported. Cc: stable@vger.kernel.org Fixes: 4a4b8169501b ("cfg80211: Accept multiple RSSI thresholds for CQM") Fixes: 37c20b2effe9 ("wifi: cfg80211: fix cqm_config access race") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24wifi: mac80211: do not pass AP_VLAN vif pointer to drivers during flushOldřich Jedlička
This fixes WARN_ONs when using AP_VLANs after station removal. The flush call passed AP_VLAN vif to driver, but because these vifs are virtual and not registered with drivers, we need to translate to the correct AP vif first. Closes: https://github.com/openwrt/openwrt/issues/12420 Fixes: 0b75a1b1e42e ("wifi: mac80211: flush queues on STA removal") Fixes: d00800a289c9 ("wifi: mac80211: add flush_sta method") Tested-by: Konstantin Demin <rockdrilla@gmail.com> Tested-by: Koen Vandeputte <koen.vandeputte@citymesh.com> Signed-off-by: Oldřich Jedlička <oldium.pro@gmail.com> Link: https://lore.kernel.org/r/20231104141333.3710-1-oldium.pro@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-11-24ipv4: igmp: fix refcnt uaf issue when receiving igmp query packetZhengchao Shao
When I perform the following test operations: 1.ip link add br0 type bridge 2.brctl addif br0 eth0 3.ip addr add 239.0.0.1/32 dev eth0 4.ip addr add 239.0.0.1/32 dev br0 5.ip addr add 224.0.0.1/32 dev br0 6.while ((1)) do ifconfig br0 up ifconfig br0 down done 7.send IGMPv2 query packets to port eth0 continuously. For example, ./mausezahn ethX -c 0 "01 00 5e 00 00 01 00 72 19 88 aa 02 08 00 45 00 00 1c 00 01 00 00 01 02 0e 7f c0 a8 0a b7 e0 00 00 01 11 64 ee 9b 00 00 00 00" The preceding tests may trigger the refcnt uaf issue of the mc list. The stack is as follows: refcount_t: addition on 0; use-after-free. WARNING: CPU: 21 PID: 144 at lib/refcount.c:25 refcount_warn_saturate (lib/refcount.c:25) CPU: 21 PID: 144 Comm: ksoftirqd/21 Kdump: loaded Not tainted 6.7.0-rc1-next-20231117-dirty #80 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:refcount_warn_saturate (lib/refcount.c:25) RSP: 0018:ffffb68f00657910 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8a00c3bf96c0 RCX: ffff8a07b6160908 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8a07b6160900 RBP: ffff8a00cba36862 R08: 0000000000000000 R09: 00000000ffff7fff R10: ffffb68f006577c0 R11: ffffffffb0fdcdc8 R12: ffff8a00c3bf9680 R13: ffff8a00c3bf96f0 R14: 0000000000000000 R15: ffff8a00d8766e00 FS: 0000000000000000(0000) GS:ffff8a07b6140000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f10b520b28 CR3: 000000039741a000 CR4: 00000000000006f0 Call Trace: <TASK> igmp_heard_query (net/ipv4/igmp.c:1068) igmp_rcv (net/ipv4/igmp.c:1132) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205) ip_local_deliver_finish (net/ipv4/ip_input.c:234) __netif_receive_skb_one_core (net/core/dev.c:5529) netif_receive_skb_internal (net/core/dev.c:5729) netif_receive_skb (net/core/dev.c:5788) br_handle_frame_finish (net/bridge/br_input.c:216) nf_hook_bridge_pre (net/bridge/br_input.c:294) __netif_receive_skb_core (net/core/dev.c:5423) __netif_receive_skb_list_core (net/core/dev.c:5606) __netif_receive_skb_list (net/core/dev.c:5674) netif_receive_skb_list_internal (net/core/dev.c:5764) napi_gro_receive (net/core/gro.c:609) e1000_clean_rx_irq (drivers/net/ethernet/intel/e1000/e1000_main.c:4467) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3805) __napi_poll (net/core/dev.c:6533) net_rx_action (net/core/dev.c:6735) __do_softirq (kernel/softirq.c:554) run_ksoftirqd (kernel/softirq.c:913) smpboot_thread_fn (kernel/smpboot.c:164) kthread (kernel/kthread.c:388) ret_from_fork (arch/x86/kernel/process.c:153) ret_from_fork_asm (arch/x86/entry/entry_64.S:250) </TASK> The root causes are as follows: Thread A Thread B ... netif_receive_skb br_dev_stop ... br_multicast_leave_snoopers ... __ip_mc_dec_group ... __igmp_group_dropped igmp_rcv igmp_stop_timer igmp_heard_query //ref = 1 ip_ma_put igmp_mod_timer refcount_dec_and_test igmp_start_timer //ref = 0 ... refcount_inc //ref increases from 0 When the device receives an IGMPv2 Query message, it starts the timer immediately, regardless of whether the device is running. If the device is down and has left the multicast group, it will cause the mc list refcount uaf issue. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-24net/smc: remove unneeded atomic operations in smc_tx_sndbuf_nonemptyLi RongQing
The commit dcd2cf5f2fc0 ("net/smc: add autocorking support") adds an atomic variable tx_pushing in smc_connection to make sure only one can send to let it cork more and save CDC slot. since smc_tx_pending can be called in the soft IRQ without checking sock_owned_by_user() at that time, which would cause a race condition because bh_lock_sock() did not honor sock_lock() After commit 6b88af839d20 ("net/smc: don't send in the BH context if sock_owned_by_user"), the transmission is deferred to when sock_lock() is held by the user. Therefore, we no longer need tx_pending to hold message. So remove atomic variable tx_pushing and its operation, and smc_tx_sndbuf_nonempty becomes a wrapper of __smc_tx_sndbuf_nonempty, so rename __smc_tx_sndbuf_nonempty back to smc_tx_sndbuf_nonempty Suggested-by: Alexandra Winter <wintera@linux.ibm.com> Co-developed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> diff v4: remove atomic variable tx_pushing diff v3: improvements in the commit body and comments diff v2: fix a typo in commit body and add net-next subject-prefix net/smc/smc.h | 1 - net/smc/smc_tx.c | 30 +----------------------------- 2 files changed, 1 insertion(+), 30 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-24mptcp: fix uninit-value in mptcp_incoming_optionsEdward Adam Davis
Added initialization use_ack to mptcp_parse_option(). Reported-by: syzbot+b834a6b2decad004cfa1@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-24net/smc: add sysctl for max conns per lgr for SMC-R v2.1Guangguan Wang
Add a new sysctl: net.smc.smcr_max_conns_per_lgr, which is used to control the preferred max connections per lgr for SMC-R v2.1. The default value of this sysctl is 255, and the acceptable value ranges from 16 to 255. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-24net/smc: add sysctl for max links per lgr for SMC-R v2.1Guangguan Wang
Add a new sysctl: net.smc.smcr_max_links_per_lgr, which is used to control the preferred max links per lgr for SMC-R v2.1. The default value of this sysctl is 2, and the acceptable value ranges from 1 to 2. Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/ice/ice_main.c c9663f79cd82 ("ice: adjust switchdev rebuild path") 7758017911a4 ("ice: restore timestamp configuration after device reset") https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/ Adjacent changes: kernel/bpf/verifier.c bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations") 5f99f312bd3b ("bpf: add register bounds sanity checks and sanitization") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-23tls: fix NULL deref on tls_sw_splice_eof() with empty recordJann Horn
syzkaller discovered that if tls_sw_splice_eof() is executed as part of sendfile() when the plaintext/ciphertext sk_msg are empty, the send path gets confused because the empty ciphertext buffer does not have enough space for the encryption overhead. This causes tls_push_record() to go on the `split = true` path (which is only supposed to be used when interacting with an attached BPF program), and then get further confused and hit the tls_merge_open_record() path, which then assumes that there must be at least one populated buffer element, leading to a NULL deref. It is possible to have empty plaintext/ciphertext buffers if we previously bailed from tls_sw_sendmsg_locked() via the tls_trim_both_msgs() path. tls_sw_push_pending_record() already handles this case correctly; let's do the same check in tls_sw_splice_eof(). Fixes: df720d288dbb ("tls/sw: Use splice_eof() to flush") Cc: stable@vger.kernel.org Reported-by: syzbot+40d43509a099ea756317@syzkaller.appspotmail.com Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20231122214447.675768-1-jannh@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-22net/smc: avoid data corruption caused by declineD. Wythe
We found a data corruption issue during testing of SMC-R on Redis applications. The benchmark has a low probability of reporting a strange error as shown below. "Error: Protocol error, got "\xe2" as reply type byte" Finally, we found that the retrieved error data was as follows: 0xE2 0xD4 0xC3 0xD9 0x04 0x00 0x2C 0x20 0xA6 0x56 0x00 0x16 0x3E 0x0C 0xCB 0x04 0x02 0x01 0x00 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xE2 It is quite obvious that this is a SMC DECLINE message, which means that the applications received SMC protocol message. We found that this was caused by the following situations: client server ¦ clc proposal -------------> ¦ clc accept <------------- ¦ clc confirm -------------> wait llc confirm send llc confirm ¦failed llc confirm ¦ x------ (after 2s)timeout wait llc confirm rsp wait decline (after 1s) timeout (after 2s) timeout ¦ decline --------------> ¦ decline <-------------- As a result, a decline message was sent in the implementation, and this message was read from TCP by the already-fallback connection. This patch double the client timeout as 2x of the server value, With this simple change, the Decline messages should never cross or collide (during Confirm link timeout). This issue requires an immediate solution, since the protocol updates involve a more long-term solution. Fixes: 0fb0b02bd6fd ("net/smc: adapt SMC client code to use the LLC flow") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-22net: hsr: Add support for MC filtering at the slave deviceMurali Karicheri
When MC (multicast) list is updated by the networking layer due to a user command and as well as when allmulti flag is set, it needs to be passed to the enslaved Ethernet devices. This patch allows this to happen by implementing ndo_change_rx_flags() and ndo_set_rx_mode() API calls that in turns pass it to the slave devices using existing API calls. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com> Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-11-21net: page_pool: avoid touching slow on the fastpathJakub Kicinski
To fully benefit from previous commit add one byte of state in the first cache line recording if we need to look at the slow part. The packing isn't all that impressive right now, we create a 7B hole. I'm expecting Olek's rework will reshuffle this, anyway. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://lore.kernel.org/r/20231121000048.789613-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21net: page_pool: split the page_pool_params into fast and slowJakub Kicinski
struct page_pool is rather performance critical and we use 16B of the first cache line to store 2 pointers used only by test code. Future patches will add more informational (non-fast path) attributes. It's convenient for the user of the API to not have to worry which fields are fast and which are slow path. Use struct groups to split the params into the two categories internally. Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://lore.kernel.org/r/20231121000048.789613-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-11-21 We've added 19 non-merge commits during the last 4 day(s) which contain a total of 18 files changed, 1043 insertions(+), 416 deletions(-). The main changes are: 1) Fix BPF verifier to validate callbacks as if they are called an unknown number of times in order to fix not detecting some unsafe programs, from Eduard Zingerman. 2) Fix bpf_redirect_peer() handling which missed proper stats accounting for veth and netkit and also generally fix missing stats for the latter, from Peilin Ye, Daniel Borkmann et al. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: check if max number of bpf_loop iterations is tracked bpf: keep track of max number of bpf_loop callback iterations selftests/bpf: test widening for iterating callbacks bpf: widening for callback iterators selftests/bpf: tests for iterating callbacks bpf: verify callbacks as if they are called unknown number of times bpf: extract setup_func_entry() utility function bpf: extract __check_reg_arg() utility function selftests/bpf: fix bpf_loop_bench for new callback verification scheme selftests/bpf: track string payload offset as scalar in strobemeta selftests/bpf: track tcp payload offset as scalar in xdp_synproxy selftests/bpf: Add netkit to tc_redirect selftest selftests/bpf: De-veth-ize the tc_redirect test case bpf, netkit: Add indirect call wrapper for fetching peer dev bpf: Fix dev's rx stats for bpf_redirect_peer traffic veth: Use tstats per-CPU traffic counters netkit: Add tstats per-CPU traffic counters net: Move {l,t,d}stats allocation to core and convert veth & vrf net, vrf: Move dstats structure to core ==================== Link: https://lore.kernel.org/r/20231121193113.11796-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21net: do not send a MOVE event when netdev changes netnsJakub Kicinski
Networking supports changing netdevice's netns and name at the same time. This allows avoiding name conflicts and having to rename the interface in multiple steps. E.g. netns1={eth0, eth1}, netns2={eth1} - we want to move netns1:eth1 to netns2 and call it eth0 there. If we can't rename "in flight" we'd need to (1) rename eth1 -> $tmp, (2) change netns, (3) rename $tmp -> eth0. To rename the underlying struct device we have to call device_rename(). The rename()'s MOVE event, however, doesn't "belong" to either the old or the new namespace. If there are conflicts on both sides it's actually impossible to issue a real MOVE (old name -> new name) without confusing user space. And Daniel reports that such confusions do in fact happen for systemd, in real life. Since we already issue explicit REMOVE and ADD events manually - suppress the MOVE event completely. Move the ADD after the rename, so that the REMOVE uses the old name, and the ADD the new one. If there is no rename this changes the picture as follows: Before: old ns | KERNEL[213.399289] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[213.401302] add /devices/virtual/net/eth0 (net) new ns | KERNEL[213.401397] move /devices/virtual/net/eth0 (net) After: old ns | KERNEL[266.774257] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[266.774509] add /devices/virtual/net/eth0 (net) If there is a rename and a conflict (using the exact eth0/eth1 example explained above) we get this: Before: old ns | KERNEL[224.316833] remove /devices/virtual/net/eth1 (net) new ns | KERNEL[224.318551] add /devices/virtual/net/eth1 (net) new ns | KERNEL[224.319662] move /devices/virtual/net/eth0 (net) After: old ns | KERNEL[333.033166] remove /devices/virtual/net/eth1 (net) new ns | KERNEL[333.035098] add /devices/virtual/net/eth0 (net) Note that "in flight" rename is only performed when needed. If there is no conflict for old name in the target netns - the rename will be performed separately by dev_change_name(), as if the rename was a different command, and there will still be a MOVE event for the rename: Before: old ns | KERNEL[194.416429] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[194.418809] add /devices/virtual/net/eth0 (net) new ns | KERNEL[194.418869] move /devices/virtual/net/eth0 (net) new ns | KERNEL[194.420866] move /devices/virtual/net/eth1 (net) After: old ns | KERNEL[71.917520] remove /devices/virtual/net/eth0 (net) new ns | KERNEL[71.919155] add /devices/virtual/net/eth0 (net) new ns | KERNEL[71.920729] move /devices/virtual/net/eth1 (net) If deleting the MOVE event breaks some user space we should insert an explicit kobject_uevent(MOVE) after the ADD, like this: @@ -11192,6 +11192,12 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net, kobject_uevent(&dev->dev.kobj, KOBJ_ADD); netdev_adjacent_add_links(dev); + /* User space wants an explicit MOVE event, issue one unless + * dev_change_name() will get called later and issue one. + */ + if (!pat || new_name[0]) + kobject_uevent(&dev->dev.kobj, KOBJ_MOVE); + /* Adapt owner in case owning user namespace of target network * namespace is different from the original one. */ Reported-by: Daniel Gröber <dxld@darkboxed.org> Link: https://lore.kernel.org/all/20231010121003.x3yi6fihecewjy4e@House.clients.dxld.at/ Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/all/20231120184140.578375-1-kuba@kernel.org/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21ipv4: Correct/silence an endian warning in __ip_do_redirectKunwu Chan
net/ipv4/route.c:783:46: warning: incorrect type in argument 2 (different base types) net/ipv4/route.c:783:46: expected unsigned int [usertype] key net/ipv4/route.c:783:46: got restricted __be32 [usertype] new_gw Fixes: 969447f226b4 ("ipv4: use new_gw for redirect neigh lookup") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20231119141759.420477-1-chentao@kylinos.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-11-20bpf, netkit: Add indirect call wrapper for fetching peer devDaniel Borkmann
ndo_get_peer_dev is used in tcx BPF fast path, therefore make use of indirect call wrapper and therefore optimize the bpf_redirect_peer() internal handling a bit. Add a small skb_get_peer_dev() wrapper which utilizes the INDIRECT_CALL_1() macro instead of open coding. Future work could potentially add a peer pointer directly into struct net_device in future and convert veth and netkit over to use it so that eventually ndo_get_peer_dev can be removed. Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20bpf: Fix dev's rx stats for bpf_redirect_peer trafficPeilin Ye
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium) is not accounted for in the RX stats of supported devices (that is, veth and netkit), confusing user space metrics collectors such as cAdvisor [0], as reported by Youlun. Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the @tstats per-CPU counters (instead of @lstats, or @dstats). To make this more fool-proof, error out when ndo_get_peer_dev is set but @tstats are not selected. [0] Specifically, the "container_network_receive_{byte,packet}s_total" counters are affected. Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper") Reported-by: Youlun Zhang <zhangyoulun@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20net: Move {l,t,d}stats allocation to core and convert veth & vrfDaniel Borkmann
Move {l,t,d}stats allocation to the core and let netdevs pick the stats type they need. That way the driver doesn't have to bother with error handling (allocation failure checking, making sure free happens in the right spot, etc) - all happening in the core. Co-developed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Cc: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-11-20ieee802154: Give the user the association listMiquel Raynal
Upon request, we must be able to provide to the user the list of associations currently in place. Let's add a new netlink command and attribute for this purpose. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-12-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle disassociation notifications from peersMiquel Raynal
Peers may decided to disassociate from us, their coordinator, in this case they will send a disassociation notification which we must acknowledge. If we don't, the peer device considers itself disassociated anyway. We also need to drop the reference to this child from our internal structures. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-11-miquel.raynal@bootlin.com
2023-11-20mac802154: Follow the number of associated devicesMiquel Raynal
Track the count of associated devices. Limit the number of associations using the value provided by the user if any. If we reach the maximum number of associations, we tell the device we are at capacity. If the user do not want to accept any more associations, it may specify the value 0 to the maximum number of associations, which will lead to an access denied error status returned to the peers trying to associate. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-10-miquel.raynal@bootlin.com
2023-11-20ieee802154: Add support for limiting the number of associated devicesMiquel Raynal
Coordinators may refuse associations. We need a user input for that. Let's add a new netlink command which can provide a maximum number of devices we accept to associate with as a first step. Later, we could also forward the request to userspace and check whether the association should be accepted or not. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-9-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle association requests from peersMiquel Raynal
Coordinators may have to handle association requests from peers which want to join the PAN. The logic involves: - Acknowledging the request (done by hardware) - If requested, a random short address that is free on this PAN should be chosen for the device. - Sending an association response with the short address allocated for the peer and expecting it to be ack'ed. If anything fails during this procedure, the peer is considered not associated. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-8-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle disassociationsMiquel Raynal
Devices may decide to disassociate from their coordinator for different reasons (device turning off, coordinator signal strength too low, etc), the MAC layer just has to send a disassociation notification. If the ack of the disassociation notification is not received, the device may consider itself disassociated anyway. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-7-miquel.raynal@bootlin.com
2023-11-20ieee802154: Add support for user disassociation requestsMiquel Raynal
A device may decide at some point to disassociate from a PAN, let's introduce a netlink command for this purpose. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-6-miquel.raynal@bootlin.com
2023-11-20mac802154: Handle associatingMiquel Raynal
Joining a PAN officially goes by associating with a coordinator. This coordinator may have been discovered thanks to the beacons it sent in the past. Add support to the MAC layer for these associations, which require: - Sending an association request - Receiving an association response The association response contains the association status, eventually a reason if the association was unsuccessful, and finally a short address that we should use for intra-PAN communication from now on, if we required one (which is the default, and not yet configurable). Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-5-miquel.raynal@bootlin.com
2023-11-20ieee802154: Add support for user association requestsMiquel Raynal
Users may decide to associate with a peer, which becomes our parent coordinator. Let's add the necessary netlink support for this. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-4-miquel.raynal@bootlin.com
2023-11-20ieee802154: Internal PAN managementMiquel Raynal
Introduce structures to describe peer devices in a PAN as well as a few related helpers. We basically care about: - Our unique parent after associating with a coordinator. - Peer devices, children, which successfully associated with us. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-3-miquel.raynal@bootlin.com
2023-11-20ieee802154: Let PAN IDs be resetMiquel Raynal
Soon association and disassociation will be implemented, which will require to be able to either change the PAN ID from 0xFFFF to a real value when association succeeded, or to reset the PAN ID to 0xFFFF upon disassociation. Let's allow to do that manually for now. Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Acked-by: Alexander Aring <aahringo@redhat.com> Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-2-miquel.raynal@bootlin.com
2023-11-19net: fill in MODULE_DESCRIPTION()s for SOCK_DIAG modulesJakub Kicinski
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to all the sock diag modules in one fell swoop. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>