|
Report when page pool was destroyed. Together with the inflight
/ memory use reporting this can serve as a replacement for the
warning about leaked page pools we currently print to dmesg.
Example output for a fake leaked page pool using some hacks
in netdevsim (one "live" pool, and one "leaked" on the same dev):
$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-get
[{'id': 2, 'ifindex': 3},
{'id': 1, 'ifindex': 3, 'destroyed': 133, 'inflight': 1}]
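A minimal sketch of how a script might pick such pools out of the dump, using the example output above as input. Treating "has a 'destroyed' attribute and a non-zero 'inflight'" as "leaked" is an assumption made for illustration; a pool can legitimately have pages in flight for a while after it is destroyed. The helper name leaked_pools() is hypothetical.

```python
def leaked_pools(dump):
    """Return entries that report a destroy time but still hold pages."""
    return [p for p in dump if "destroyed" in p and p.get("inflight", 0) > 0]


# Dump copied from the example output in the commit message.
dump = [
    {"id": 2, "ifindex": 3},
    {"id": 1, "ifindex": 3, "destroyed": 133, "inflight": 1},
]

for pool in leaked_pools(dump):
    print(f"pool {pool['id']} on ifindex {pool['ifindex']}: "
          f"{pool['inflight']} page(s) still in flight")
```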
Tested-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Advanced deployments need the ability to check memory use
of various system components. It makes it possible to make informed
decisions about memory allocation and to find regressions and leaks.
Report memory use of page pools. Report both number of references
and bytes held.
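A rough sketch of how the two counters could be aggregated per device from a page-pool-get dump. The attribute name 'inflight-mem' for the byte count and all sample values below are assumptions for illustration; check the netdev spec for the actual attribute names.

```python
from collections import defaultdict


def memory_by_ifindex(dump):
    """Sum outstanding references and bytes held, grouped by ifindex."""
    totals = defaultdict(lambda: {"refs": 0, "bytes": 0})
    for pool in dump:
        entry = totals[pool.get("ifindex")]
        entry["refs"] += pool.get("inflight", 0)
        # 'inflight-mem' is an assumed attribute name for bytes held.
        entry["bytes"] += pool.get("inflight-mem", 0)
    return dict(totals)


# Hypothetical sample data, not taken from a real system.
sample = [
    {"id": 2, "ifindex": 3, "inflight": 4, "inflight-mem": 16384},
    {"id": 1, "ifindex": 3, "inflight": 1, "inflight-mem": 4096},
]
print(memory_by_ifindex(sample))
```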
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Generate netlink notifications about page pool state changes.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Expose the very basic page pool information via netlink.
Example using ynl-py for a system with 9 queues:
$ ./cli.py --no-schema --spec netlink/specs/netdev.yaml \
--dump page-pool-get
[{'id': 19, 'ifindex': 2, 'napi-id': 147},
{'id': 18, 'ifindex': 2, 'napi-id': 146},
{'id': 17, 'ifindex': 2, 'napi-id': 145},
{'id': 16, 'ifindex': 2, 'napi-id': 144},
{'id': 15, 'ifindex': 2, 'napi-id': 143},
{'id': 14, 'ifindex': 2, 'napi-id': 142},
{'id': 13, 'ifindex': 2, 'napi-id': 141},
{'id': 12, 'ifindex': 2, 'napi-id': 140},
{'id': 11, 'ifindex': 2, 'napi-id': 139},
{'id': 10, 'ifindex': 2, 'napi-id': 138}]
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
To avoid any issues with race conditions on accessing napi
and having to think about the lifetime of NAPI objects
in netlink GET - stash the napi_id to which page pool
was linked at creation time.
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Link the page pools with netdevs. This needs to be netns compatible
so we have two options: either we record the pools per netns and
have to worry about moving them as the netdev gets moved,
or we record them directly on the netdev so they move with the netdev
without any extra work.
Implement the latter option. Since pools may outlast netdev we need
a place to store orphans. In time honored tradition use loopback
for this purpose.
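A toy model of the chosen design, only to illustrate the ownership rule: pools hang off the netdev, and pools that outlive their netdev are handed to loopback. The class and function names are hypothetical and do not mirror the kernel structures.

```python
class Netdev:
    def __init__(self, name):
        self.name = name
        self.page_pools = []  # pools are recorded directly on the device


def unregister(dev, loopback):
    """Hand any pools that outlive `dev` over to the loopback device."""
    loopback.page_pools.extend(dev.page_pools)
    dev.page_pools = []


lo = Netdev("lo")
eth0 = Netdev("eth0")
eth0.page_pools.append(1)  # pool id 1 belongs to eth0

unregister(eth0, lo)       # eth0 goes away while pool 1 is still alive
print(lo.page_pools)
```

Because the list lives on the device object, moving the device between namespaces needs no extra bookkeeping in this model.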
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
To give ourselves the flexibility of creating netlink commands
and ability to refer to page pool instances in uAPIs create
IDs for page pools.
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
We'll soon (next change in the series) need a fuller unwind path
in page_pool_create() so create the inverse of page_pool_init().
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next
Kalle Valo says:
====================
wireless-next patches for v6.8
The first features pull request for v6.8. Not so big in number of
commits but we removed quite a few ancient drivers: libertas 16-bit
PCMCIA support, atmel, hostap, zd1201, orinoco, ray_cs, wl3501 and
rndis_wlan.
Major changes:
cfg80211/mac80211
- extend support for scanning while Multi-Link Operation (MLO) connected
* tag 'wireless-next-2023-11-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (68 commits)
wifi: nl80211: Documentation update for NL80211_CMD_PORT_AUTHORIZED event
wifi: mac80211: Extend support for scanning while MLO connected
wifi: cfg80211: Extend support for scanning while MLO connected
wifi: ieee80211: fix PV1 frame control field name
rfkill: return ENOTTY on invalid ioctl
MAINTAINERS: update iwlwifi maintainers
wifi: rtw89: 8922a: read efuse content from physical map
wifi: rtw89: 8922a: read efuse content via efuse map struct from logic map
wifi: rtw89: 8852c: read RX gain offset from efuse for 6GHz channels
wifi: rtw89: mac: add to access efuse for WiFi 7 chips
wifi: rtw89: mac: use mac_gen pointer to access about efuse
wifi: rtw89: 8922a: add 8922A basic chip info
wifi: rtlwifi: drop unused const_amdpci_aspm
wifi: mwifiex: mwifiex_process_sleep_confirm_resp(): remove unused priv variable
wifi: rtw89: regd: update regulatory map to R65-R44
wifi: rtw89: regd: handle policy of 6 GHz according to BIOS
wifi: rtw89: acpi: process 6 GHz band policy from DSM
wifi: rtlwifi: simplify rtl_action_proc() and rtl_tx_agg_start()
wifi: rtw89: pci: update interrupt mitigation register for 8922AE
wifi: rtw89: pci: correct interrupt mitigation register for 8852CE
...
====================
Link: https://lore.kernel.org/r/20231127180056.0B48DC433C8@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With latest upstream llvm18, the following test cases failed:
$ ./test_progs -j
#13/2 bpf_cookie/multi_kprobe_link_api:FAIL
#13/3 bpf_cookie/multi_kprobe_attach_api:FAIL
#13 bpf_cookie:FAIL
#77 fentry_fexit:FAIL
#78/1 fentry_test/fentry:FAIL
#78 fentry_test:FAIL
#82/1 fexit_test/fexit:FAIL
#82 fexit_test:FAIL
#112/1 kprobe_multi_test/skel_api:FAIL
#112/2 kprobe_multi_test/link_api_addrs:FAIL
[...]
#112 kprobe_multi_test:FAIL
#356/17 test_global_funcs/global_func17:FAIL
#356 test_global_funcs:FAIL
Further analysis shows llvm upstream patch [1] is responsible for the above
failures. For example, for function bpf_fentry_test7() in net/bpf/test_run.c,
without [1], the asm code is:
0000000000000400 <bpf_fentry_test7>:
400: f3 0f 1e fa endbr64
404: e8 00 00 00 00 callq 0x409 <bpf_fentry_test7+0x9>
409: 48 89 f8 movq %rdi, %rax
40c: c3 retq
40d: 0f 1f 00 nopl (%rax)
... and with [1], the asm code is:
0000000000005d20 <bpf_fentry_test7.specialized.1>:
5d20: e8 00 00 00 00 callq 0x5d25 <bpf_fentry_test7.specialized.1+0x5>
5d25: c3 retq
... and <bpf_fentry_test7.specialized.1> is called instead of <bpf_fentry_test7>
and this caused test failures for #13/#77 etc. except #356.
For test case #356/17, with [1] (progs/test_global_func17.c), the main prog
looks like:
0000000000000000 <global_func17>:
0: b4 00 00 00 2a 00 00 00 w0 = 0x2a
1: 95 00 00 00 00 00 00 00 exit
... which passed verification while the test itself expects a verification
failure.
Let us add 'barrier_var' style asm code in both places to prevent function
specialization which caused selftests failure.
[1] https://github.com/llvm/llvm-project/pull/72903
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231127050342.1945270-1-yonghong.song@linux.dev
|
|
The debugfs files for netdevs (sdata) and links are removed
with the wiphy mutex held, which may deadlock. Use the new
wiphy locked debugfs to avoid that.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The read is currently with RCU and the write can deadlock,
convert both for the sake of illustration.
Make mac80211 depend on cfg80211 debugfs to get the helpers,
but mac80211 debugfs without it does nothing anyway. This
also required some adjustments in ath9k.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Add wrappers for debugfs files that should be called with
the wiphy mutex held, while the file is also to be removed
under the wiphy mutex. This could otherwise deadlock when
a file is trying to acquire the wiphy mutex while the code
removing it holds the mutex but waits for the removal.
This actually works by pushing the execution of the read
or write handler to a wiphy work that can be cancelled
using the debugfs cancellation API.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
- If the scan request includes a link ID, validate that it is
one of the active links. Otherwise, if the scan request doesn't
include a valid link ID, select one of the active links.
- When reporting the TSF for a BSS entry, use the link ID information
from the Rx status or the scan request to set the parent BSSID.
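The link selection rule in the first item can be sketched as follows. Rejecting an inactive link with an error and picking the lowest active link ID as the default are both assumptions for illustration; the commit message only says that one of the active links is selected. The function name is hypothetical.

```python
def resolve_scan_link(requested, active_links):
    """Return the link ID to use for TSF accounting in a scan."""
    if not active_links:
        raise ValueError("no active links")
    if requested is not None:
        # A link ID was supplied: it must be one of the active links.
        if requested not in active_links:
            raise ValueError(f"link {requested} is not active")
        return requested
    # No link ID supplied: fall back to one of the active links
    # (lowest ID here, which is an arbitrary choice for this sketch).
    return min(active_links)


print(resolve_scan_link(1, {0, 1}))
print(resolve_scan_link(None, {2, 1}))
```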
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231113112844.68564692c404.Iae9605cbb7f9d52e00ce98260b3559a34cf18341@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
To extend the support of TSF accounting in scan results for MLO
connections, allow to indicate in the scan request the link ID
corresponding to the BSS whose TSF should be used for the TSF
accounting.
Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Gregory Greenman <gregory.greenman@intel.com>
Link: https://lore.kernel.org/r/20231113112844.d4490bcdefb1.I8fcd158b810adddef4963727e9153096416b30ce@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
For unknown ioctls the correct error is
ENOTTY "Inappropriate ioctl for device".
ENOSYS as returned before should only be used to
indicate that a syscall is not available at all.
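From user space the distinction looks roughly like this. The helper below is a hypothetical illustration of how a caller might tell "this device does not know that ioctl" apart from "the syscall does not exist"; the message strings printed come from the local C library and may differ between systems.

```python
import errno
import os


def is_unknown_ioctl(exc):
    """True if an OSError reports an ioctl the device does not implement."""
    return exc.errno == errno.ENOTTY


for code in (errno.ENOTTY, errno.ENOSYS):
    # Print the symbolic name and the platform's description of each error.
    print(errno.errorcode[code], os.strerror(code))
```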
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://lore.kernel.org/r/20231101-rfkill-ioctl-enosys-v1-1-5bf374fabffe@weissschuh.net
[in theory this breaks userspace API, but it was discussed and
researched, and nothing found relying on the current behaviour]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The new 320 MHz channel width wasn't handled, so connecting
a station to a 320 MHz AP would limit the station to 20 MHz
(on HT) after a warning. Handle 320 MHz to fix that.
Signed-off-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20231109182201.495381-1-greearb@candelatech.com
[write a proper commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Given all the locking rework in mac80211, we pretty much
need to get into the driver with the wiphy mutex held in
all callbacks. This is already mostly the case, but as
Johan reported, in the get_txpower it may not be true.
Lock the wiphy mutex around nl80211_send_iface(), so that it
is also held around callers of nl80211_notify_iface(). This
is easy to do, fixes the problem, and aligns the locking
between various calls to it in different parts of the
code of cfg80211.
Fixes: 0e8185ce1dde ("wifi: mac80211: check wiphy mutex in ops")
Reported-by: Johan Hovold <johan@kernel.org>
Closes: https://lore.kernel.org/r/ZVOXX6qg4vXEx8dX@hovoldconsulting.com
Tested-by: Johan Hovold <johan+linaro@kernel.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
We want to guarantee the mutex is held for pretty much
all operations, so ensure that here as well.
Reported-by: syzbot+7e59a5bfc7a897247e18@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
My prior race fix here broke CQM when ranges aren't used, as
the reporting worker now requires the cqm_config to be set in
the wdev, but isn't set when there's no range configured.
Rather than continuing to special-case the range version, set
the cqm_config always and configure accordingly, also tracking
if range was used or not to be able to clear the configuration
appropriately with the same API, which was actually not right
if both were implemented by a driver for some reason, as is
the case with mac80211 (though there the implementations are
equivalent so it doesn't matter.)
Also, the original multiple-RSSI commit lost checking for the
callback, so might have potentially crashed if a driver had
neither implementation, and userspace tried to use it despite
not being advertised as supported.
Cc: stable@vger.kernel.org
Fixes: 4a4b8169501b ("cfg80211: Accept multiple RSSI thresholds for CQM")
Fixes: 37c20b2effe9 ("wifi: cfg80211: fix cqm_config access race")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This fixes WARN_ONs when using AP_VLANs after station removal. The flush
call passed AP_VLAN vif to driver, but because these vifs are virtual and
not registered with drivers, we need to translate to the correct AP vif
first.
Closes: https://github.com/openwrt/openwrt/issues/12420
Fixes: 0b75a1b1e42e ("wifi: mac80211: flush queues on STA removal")
Fixes: d00800a289c9 ("wifi: mac80211: add flush_sta method")
Tested-by: Konstantin Demin <rockdrilla@gmail.com>
Tested-by: Koen Vandeputte <koen.vandeputte@citymesh.com>
Signed-off-by: Oldřich Jedlička <oldium.pro@gmail.com>
Link: https://lore.kernel.org/r/20231104141333.3710-1-oldium.pro@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When I perform the following test operations:
1.ip link add br0 type bridge
2.brctl addif br0 eth0
3.ip addr add 239.0.0.1/32 dev eth0
4.ip addr add 239.0.0.1/32 dev br0
5.ip addr add 224.0.0.1/32 dev br0
6.while ((1))
do
ifconfig br0 up
ifconfig br0 down
done
7.send IGMPv2 query packets to port eth0 continuously. For example,
./mausezahn ethX -c 0 "01 00 5e 00 00 01 00 72 19 88 aa 02 08 00 45 00 00
1c 00 01 00 00 01 02 0e 7f c0 a8 0a b7 e0 00 00 01 11 64 ee 9b 00 00 00 00"
The preceding tests may trigger the refcnt uaf issue of the mc list. The
stack is as follows:
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 21 PID: 144 at lib/refcount.c:25 refcount_warn_saturate (lib/refcount.c:25)
CPU: 21 PID: 144 Comm: ksoftirqd/21 Kdump: loaded Not tainted 6.7.0-rc1-next-20231117-dirty #80
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:refcount_warn_saturate (lib/refcount.c:25)
RSP: 0018:ffffb68f00657910 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8a00c3bf96c0 RCX: ffff8a07b6160908
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8a07b6160900
RBP: ffff8a00cba36862 R08: 0000000000000000 R09: 00000000ffff7fff
R10: ffffb68f006577c0 R11: ffffffffb0fdcdc8 R12: ffff8a00c3bf9680
R13: ffff8a00c3bf96f0 R14: 0000000000000000 R15: ffff8a00d8766e00
FS: 0000000000000000(0000) GS:ffff8a07b6140000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f10b520b28 CR3: 000000039741a000 CR4: 00000000000006f0
Call Trace:
<TASK>
igmp_heard_query (net/ipv4/igmp.c:1068)
igmp_rcv (net/ipv4/igmp.c:1132)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205)
ip_local_deliver_finish (net/ipv4/ip_input.c:234)
__netif_receive_skb_one_core (net/core/dev.c:5529)
netif_receive_skb_internal (net/core/dev.c:5729)
netif_receive_skb (net/core/dev.c:5788)
br_handle_frame_finish (net/bridge/br_input.c:216)
nf_hook_bridge_pre (net/bridge/br_input.c:294)
__netif_receive_skb_core (net/core/dev.c:5423)
__netif_receive_skb_list_core (net/core/dev.c:5606)
__netif_receive_skb_list (net/core/dev.c:5674)
netif_receive_skb_list_internal (net/core/dev.c:5764)
napi_gro_receive (net/core/gro.c:609)
e1000_clean_rx_irq (drivers/net/ethernet/intel/e1000/e1000_main.c:4467)
e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3805)
__napi_poll (net/core/dev.c:6533)
net_rx_action (net/core/dev.c:6735)
__do_softirq (kernel/softirq.c:554)
run_ksoftirqd (kernel/softirq.c:913)
smpboot_thread_fn (kernel/smpboot.c:164)
kthread (kernel/kthread.c:388)
ret_from_fork (arch/x86/kernel/process.c:153)
ret_from_fork_asm (arch/x86/entry/entry_64.S:250)
</TASK>
The root causes are as follows:
Thread A Thread B
... netif_receive_skb
br_dev_stop ...
br_multicast_leave_snoopers ...
__ip_mc_dec_group ...
__igmp_group_dropped igmp_rcv
igmp_stop_timer igmp_heard_query //ref = 1
ip_ma_put igmp_mod_timer
refcount_dec_and_test igmp_start_timer //ref = 0
... refcount_inc //ref increases from 0
When the device receives an IGMPv2 Query message, it starts the timer
immediately, regardless of whether the device is running. If the device is
down and has left the multicast group, it will cause the mc list refcount
uaf issue.
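The interleaving in the table above can be replayed with a toy reference counter. This is a simplified model, not kernel code: the class is hypothetical, and the inc_not_zero() guard is shown only as one generic way to refuse taking a reference on an object that is already being freed, not as a statement of what this patch changes.

```python
class Refcount:
    """Tiny stand-in for a kernel refcount, for illustration only."""

    def __init__(self, value=1):
        self.value = value
        self.warned = False

    def inc(self):
        if self.value == 0:
            # Mirrors "refcount_t: addition on 0; use-after-free."
            # The kernel saturates the counter; this toy only records it.
            self.warned = True
            return
        self.value += 1

    def inc_not_zero(self):
        if self.value == 0:
            return False  # object is already on its way out
        self.value += 1
        return True

    def dec_and_test(self):
        self.value -= 1
        return self.value == 0


ref = Refcount(1)
freed = ref.dec_and_test()  # Thread A: ip_ma_put() drops the last reference
ref.inc()                   # Thread B: igmp_start_timer() takes a reference
print(freed, ref.warned)
```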
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit dcd2cf5f2fc0 ("net/smc: add autocorking support") added an
atomic variable tx_pushing to smc_connection to make sure only one
sender transmits at a time, letting it cork more and save CDC slots.
This was needed because, at that time, smc_tx_pending could be called
in soft IRQ context without checking sock_owned_by_user(), which would
cause a race condition because bh_lock_sock() did not honor sock_lock().
After commit 6b88af839d20 ("net/smc: don't send in the BH context if
sock_owned_by_user"), the transmission is deferred to when sock_lock()
is held by the user. Therefore, we no longer need tx_pending to hold the
message.
So remove the atomic variable tx_pushing and its operations. With that,
smc_tx_sndbuf_nonempty becomes a mere wrapper of __smc_tx_sndbuf_nonempty,
so rename __smc_tx_sndbuf_nonempty back to smc_tx_sndbuf_nonempty.
Suggested-by: Alexandra Winter <wintera@linux.ibm.com>
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
diff v4: remove atomic variable tx_pushing
diff v3: improvements in the commit body and comments
diff v2: fix a typo in commit body and add net-next subject-prefix
net/smc/smc.h | 1 -
net/smc/smc_tx.c | 30 +-----------------------------
2 files changed, 1 insertion(+), 30 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add initialization of use_ack to mptcp_parse_option().
Reported-by: syzbot+b834a6b2decad004cfa1@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new sysctl: net.smc.smcr_max_conns_per_lgr, which is
used to control the preferred max connections per lgr for
SMC-R v2.1. The default value of this sysctl is 255, and
the acceptable value ranges from 16 to 255.
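A small sketch of how a provisioning script might validate a value before writing it. The bounds and default are the ones quoted above; the helper names are hypothetical, and the dotted-name to /proc/sys mapping is the usual sysctl convention applied naively (it assumes no dots inside a path component).

```python
SMCR_MAX_CONNS_RANGE = (16, 255)  # bounds quoted in the commit message
SMCR_MAX_CONNS_DEFAULT = 255      # default quoted in the commit message


def sysctl_path(name):
    """Map a dotted sysctl name to its /proc/sys path (naive conversion)."""
    return "/proc/sys/" + name.replace(".", "/")


def is_acceptable(value, bounds=SMCR_MAX_CONNS_RANGE):
    """Check a candidate value against the documented range."""
    low, high = bounds
    return low <= value <= high


print(sysctl_path("net.smc.smcr_max_conns_per_lgr"))
```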
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Add a new sysctl: net.smc.smcr_max_links_per_lgr, which is
used to control the preferred max links per lgr for SMC-R
v2.1. The default value of this sysctl is 2, and the acceptable
value ranges from 1 to 2.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/intel/ice/ice_main.c
c9663f79cd82 ("ice: adjust switchdev rebuild path")
7758017911a4 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/
Adjacent changes:
kernel/bpf/verifier.c
bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations")
5f99f312bd3b ("bpf: add register bounds sanity checks and sanitization")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
syzkaller discovered that if tls_sw_splice_eof() is executed as part of
sendfile() when the plaintext/ciphertext sk_msg are empty, the send path
gets confused because the empty ciphertext buffer does not have enough
space for the encryption overhead. This causes tls_push_record() to go on
the `split = true` path (which is only supposed to be used when interacting
with an attached BPF program), and then get further confused and hit the
tls_merge_open_record() path, which then assumes that there must be at
least one populated buffer element, leading to a NULL deref.
It is possible to have empty plaintext/ciphertext buffers if we previously
bailed from tls_sw_sendmsg_locked() via the tls_trim_both_msgs() path.
tls_sw_push_pending_record() already handles this case correctly; let's do
the same check in tls_sw_splice_eof().
Fixes: df720d288dbb ("tls/sw: Use splice_eof() to flush")
Cc: stable@vger.kernel.org
Reported-by: syzbot+40d43509a099ea756317@syzkaller.appspotmail.com
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20231122214447.675768-1-jannh@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
We found a data corruption issue during testing of SMC-R on Redis
applications.
The benchmark has a low probability of reporting a strange error as
shown below.
"Error: Protocol error, got "\xe2" as reply type byte"
Finally, we found that the retrieved error data was as follows:
0xE2 0xD4 0xC3 0xD9 0x04 0x00 0x2C 0x20 0xA6 0x56 0x00 0x16 0x3E 0x0C
0xCB 0x04 0x02 0x01 0x00 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xE2
It is quite obvious that this is an SMC DECLINE message, which means that
the application received an SMC protocol message.
We found that this was caused by the following situations:
client server
¦ clc proposal
------------->
¦ clc accept
<-------------
¦ clc confirm
------------->
wait llc confirm
send llc confirm
¦failed llc confirm
¦ x------
(after 2s)timeout
wait llc confirm rsp
wait decline
(after 1s) timeout
(after 2s) timeout
¦ decline
-------------->
¦ decline
<--------------
As a result, a decline message was sent in the implementation, and this
message was read from TCP by the already-fallback connection.
This patch doubles the client timeout to 2x the server value.
With this simple change, the Decline messages should never cross or
collide (during Confirm link timeout).
This issue requires an immediate solution, since the protocol updates
involve a more long-term solution.
Fixes: 0fb0b02bd6fd ("net/smc: adapt SMC client code to use the LLC flow")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When the MC (multicast) list is updated by the networking layer due to a
user command, as well as when the allmulti flag is set, it needs to be
passed to the enslaved Ethernet devices. This patch allows this
to happen by implementing the ndo_change_rx_flags() and ndo_set_rx_mode()
API calls, which in turn pass it to the slave devices using
existing API calls.
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
To fully benefit from the previous commit, add one byte of state
in the first cache line recording if we need to look at
the slow part.
The packing isn't all that impressive right now, we create
a 7B hole. I'm expecting Olek's rework will reshuffle this,
anyway.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://lore.kernel.org/r/20231121000048.789613-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
struct page_pool is rather performance critical and we use
16B of the first cache line to store 2 pointers used only
by test code. Future patches will add more informational
(non-fast path) attributes.
It's convenient for the user of the API to not have to worry
which fields are fast and which are slow path. Use struct
groups to split the params into the two categories internally.
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Link: https://lore.kernel.org/r/20231121000048.789613-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-11-21
We've added 19 non-merge commits during the last 4 day(s) which contain
a total of 18 files changed, 1043 insertions(+), 416 deletions(-).
The main changes are:
1) Fix BPF verifier to validate callbacks as if they are called an unknown
number of times in order to fix not detecting some unsafe programs,
from Eduard Zingerman.
2) Fix bpf_redirect_peer() handling which missed proper stats accounting
for veth and netkit and also generally fix missing stats for the latter,
from Peilin Ye, Daniel Borkmann et al.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: check if max number of bpf_loop iterations is tracked
bpf: keep track of max number of bpf_loop callback iterations
selftests/bpf: test widening for iterating callbacks
bpf: widening for callback iterators
selftests/bpf: tests for iterating callbacks
bpf: verify callbacks as if they are called unknown number of times
bpf: extract setup_func_entry() utility function
bpf: extract __check_reg_arg() utility function
selftests/bpf: fix bpf_loop_bench for new callback verification scheme
selftests/bpf: track string payload offset as scalar in strobemeta
selftests/bpf: track tcp payload offset as scalar in xdp_synproxy
selftests/bpf: Add netkit to tc_redirect selftest
selftests/bpf: De-veth-ize the tc_redirect test case
bpf, netkit: Add indirect call wrapper for fetching peer dev
bpf: Fix dev's rx stats for bpf_redirect_peer traffic
veth: Use tstats per-CPU traffic counters
netkit: Add tstats per-CPU traffic counters
net: Move {l,t,d}stats allocation to core and convert veth & vrf
net, vrf: Move dstats structure to core
====================
Link: https://lore.kernel.org/r/20231121193113.11796-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Networking supports changing netdevice's netns and name
at the same time. This allows avoiding name conflicts
and having to rename the interface in multiple steps.
E.g. netns1={eth0, eth1}, netns2={eth1} - we want
to move netns1:eth1 to netns2 and call it eth0 there.
If we can't rename "in flight" we'd need to (1) rename
eth1 -> $tmp, (2) change netns, (3) rename $tmp -> eth0.
To rename the underlying struct device we have to call
device_rename(). The rename()'s MOVE event, however, doesn't
"belong" to either the old or the new namespace.
If there are conflicts on both sides it's actually impossible
to issue a real MOVE (old name -> new name) without confusing
user space. And Daniel reports that such confusions do in fact
happen for systemd, in real life.
Since we already issue explicit REMOVE and ADD events
manually - suppress the MOVE event completely. Move
the ADD after the rename, so that the REMOVE uses
the old name, and the ADD the new one.
If there is no rename this changes the picture as follows:
Before:
old ns | KERNEL[213.399289] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401302] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[213.401397] move /devices/virtual/net/eth0 (net)
After:
old ns | KERNEL[266.774257] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[266.774509] add /devices/virtual/net/eth0 (net)
If there is a rename and a conflict (using the exact eth0/eth1
example explained above) we get this:
Before:
old ns | KERNEL[224.316833] remove /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.318551] add /devices/virtual/net/eth1 (net)
new ns | KERNEL[224.319662] move /devices/virtual/net/eth0 (net)
After:
old ns | KERNEL[333.033166] remove /devices/virtual/net/eth1 (net)
new ns | KERNEL[333.035098] add /devices/virtual/net/eth0 (net)
Note that "in flight" rename is only performed when needed.
If there is no conflict for old name in the target netns -
the rename will be performed separately by dev_change_name(),
as if the rename was a different command, and there will still
be a MOVE event for the rename:
Before:
old ns | KERNEL[194.416429] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418809] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.418869] move /devices/virtual/net/eth0 (net)
new ns | KERNEL[194.420866] move /devices/virtual/net/eth1 (net)
After:
old ns | KERNEL[71.917520] remove /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.919155] add /devices/virtual/net/eth0 (net)
new ns | KERNEL[71.920729] move /devices/virtual/net/eth1 (net)
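The three before/after listings above can be condensed into a small model. This is only a restatement of the event sequences shown in this message, with hypothetical function and parameter names; it does not model anything beyond those three scenarios.

```python
def uevents(old, new, in_flight, patched):
    """List (namespace, action, name) uevents for a netns move.

    old       -- device name in the old netns
    new       -- final device name in the new netns
    in_flight -- True if the rename happens as part of the netns move
    patched   -- True for the behaviour after this change
    """
    events = [("old ns", "remove", old)]
    name_after_move = new if in_flight else old
    if patched:
        # ADD is issued after the in-flight rename; MOVE is suppressed.
        events.append(("new ns", "add", name_after_move))
    else:
        events.append(("new ns", "add", old))
        events.append(("new ns", "move", name_after_move))
    if new != old and not in_flight:
        # A separate dev_change_name() still issues its own MOVE.
        events.append(("new ns", "move", new))
    return events


for event in uevents("eth1", "eth0", in_flight=True, patched=True):
    print(*event)
```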
If deleting the MOVE event breaks some user space we should insert
an explicit kobject_uevent(MOVE) after the ADD, like this:
@@ -11192,6 +11192,12 @@ int __dev_change_net_namespace(struct net_device *dev, struct net *net,
kobject_uevent(&dev->dev.kobj, KOBJ_ADD);
netdev_adjacent_add_links(dev);
+ /* User space wants an explicit MOVE event, issue one unless
+ * dev_change_name() will get called later and issue one.
+ */
+ if (!pat || new_name[0])
+ kobject_uevent(&dev->dev.kobj, KOBJ_MOVE);
+
/* Adapt owner in case owning user namespace of target network
* namespace is different from the original one.
*/
Reported-by: Daniel Gröber <dxld@darkboxed.org>
Link: https://lore.kernel.org/all/20231010121003.x3yi6fihecewjy4e@House.clients.dxld.at/
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/all/20231120184140.578375-1-kuba@kernel.org/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
net/ipv4/route.c:783:46: warning: incorrect type in argument 2 (different base types)
net/ipv4/route.c:783:46: expected unsigned int [usertype] key
net/ipv4/route.c:783:46: got restricted __be32 [usertype] new_gw
Fixes: 969447f226b4 ("ipv4: use new_gw for redirect neigh lookup")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Kunwu Chan <chentao@kylinos.cn>
Link: https://lore.kernel.org/r/20231119141759.420477-1-chentao@kylinos.cn
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
ndo_get_peer_dev is used in the tcx BPF fast path, therefore make use of
the indirect call wrapper to optimize the bpf_redirect_peer()
internal handling a bit. Add a small skb_get_peer_dev() wrapper which
utilizes the INDIRECT_CALL_1() macro instead of open coding.
Future work could potentially add a peer pointer directly into struct
net_device and convert veth and netkit over to use it so
that eventually ndo_get_peer_dev can be removed.
Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20231114004220.6495-7-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Traffic redirected by bpf_redirect_peer() (used by recent CNIs like Cilium)
is not accounted for in the RX stats of supported devices (that is, veth
and netkit), confusing user space metrics collectors such as cAdvisor [0],
as reported by Youlun.
Fix it by calling dev_sw_netstats_rx_add() in skb_do_redirect(), to update
RX traffic counters. Devices that support ndo_get_peer_dev _must_ use the
@tstats per-CPU counters (instead of @lstats, or @dstats).
To make this more fool-proof, error out when ndo_get_peer_dev is set but
@tstats are not selected.
[0] Specifically, the "container_network_receive_{byte,packet}s_total"
counters are affected.
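The accounting rule described above can be modelled in user space. This is only a minimal sketch in Python, not the kernel implementation: the `Device` class, `register()` and `redirect_to_peer()` are hypothetical names, and plain integers stand in for the per-CPU @tstats counters.

```python
from dataclasses import dataclass
from enum import Enum, auto


class StatType(Enum):
    """Hypothetical stand-in for the stats flavour a netdev selects."""
    LSTATS = auto()
    TSTATS = auto()
    DSTATS = auto()


@dataclass
class Device:
    name: str
    stat_type: StatType
    has_peer_dev_op: bool = False  # models ndo_get_peer_dev being set
    rx_packets: int = 0            # models the tstats RX packet counter
    rx_bytes: int = 0              # models the tstats RX byte counter


def register(dev: Device) -> Device:
    """Error out when a peer op is set but tstats were not selected."""
    if dev.has_peer_dev_op and dev.stat_type is not StatType.TSTATS:
        raise ValueError(f"{dev.name}: peer redirect requires tstats")
    return dev


def redirect_to_peer(peer: Device, skb_len: int) -> None:
    """Account one redirected packet on the peer's RX counters."""
    peer.rx_packets += 1
    peer.rx_bytes += skb_len
```

In this model, a collector reading `rx_packets` and `rx_bytes` on the peer sees redirected traffic, which is the behaviour the fix aims for.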
Fixes: 9aa1206e8f48 ("bpf: Add redirect_peer helper")
Reported-by: Youlun Zhang <zhangyoulun@bytedance.com>
Signed-off-by: Peilin Ye <peilin.ye@bytedance.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Link: https://lore.kernel.org/r/20231114004220.6495-6-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Move {l,t,d}stats allocation to the core and let netdevs pick the stats
type they need. That way the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc) - all happening in the core.
Co-developed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20231114004220.6495-3-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Upon request, we must be able to provide to the user the list of
associations currently in place. Let's add a new netlink command and
attribute for this purpose.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-12-miquel.raynal@bootlin.com
|
|
Peers may decide to disassociate from us, their coordinator. In this
case they will send a disassociation notification which we must
acknowledge. If we don't, the peer device considers itself disassociated
anyway. We also need to drop the reference to this child from our
internal structures.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-11-miquel.raynal@bootlin.com
|
|
Track the count of associated devices. Limit the number of associations
using the value provided by the user if any. If we reach the maximum
number of associations, we tell the device we are at capacity. If the
user does not want to accept any more associations, they may set the
maximum number of associations to 0, which will lead to an
access denied error status returned to the peers trying to associate.
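The decision described above can be sketched as a small function. This is an illustrative Python model, not the kernel code: `association_status()` is a hypothetical name, and the numeric status values are assumed to match the association status field of IEEE 802.15.4.

```python
from typing import Optional

# Assumed association status values (IEEE 802.15.4 association response).
ASSOC_SUCCESS = 0x00
ASSOC_PAN_AT_CAPACITY = 0x01
ASSOC_PAN_ACCESS_DENIED = 0x02


def association_status(current: int, max_assoc: Optional[int]) -> int:
    """Return the status to send to a peer that asks to associate.

    current   -- number of devices already associated with us
    max_assoc -- user-provided limit, or None if the user set none
    """
    if max_assoc is None:
        return ASSOC_SUCCESS            # no limit configured
    if max_assoc == 0:
        return ASSOC_PAN_ACCESS_DENIED  # user refuses all associations
    if current >= max_assoc:
        return ASSOC_PAN_AT_CAPACITY    # limit reached
    return ASSOC_SUCCESS
```

Checking for a limit of 0 before comparing against the current count keeps the "access denied" case distinct from the "at capacity" case.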
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-10-miquel.raynal@bootlin.com
|
|
Coordinators may refuse associations. We need a user input for
that. Let's add a new netlink command which can provide a maximum number
of devices we accept to associate with as a first step. Later, we could
also forward the request to userspace and check whether the association
should be accepted or not.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-9-miquel.raynal@bootlin.com
|
|
Coordinators may have to handle association requests from peers which
want to join the PAN. The logic involves:
- Acknowledging the request (done by hardware)
- If requested, a random short address that is free on this PAN should
be chosen for the device.
- Sending an association response with the short address allocated for
the peer and expecting it to be ack'ed.
If anything fails during this procedure, the peer is considered not
associated.
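The short address selection step can be sketched as follows. This is a simplified Python model under stated assumptions, not the kernel's algorithm: `pick_short_addr()` is a hypothetical helper, and 0xFFFF (broadcast) and 0xFFFE (no short address assigned) are assumed to be the only reserved values.

```python
import random
from typing import Optional, Set

SHORT_ADDR_BROADCAST = 0xFFFF  # assumed reserved: broadcast
SHORT_ADDR_UNSPEC = 0xFFFE     # assumed reserved: no short address


def pick_short_addr(in_use: Set[int],
                    rng: Optional[random.Random] = None) -> int:
    """Pick a random short address that is free on this PAN."""
    rng = rng or random.Random()
    # Candidates are 0x0000..0xFFFD, minus the addresses already taken.
    free = [a for a in range(0x0000, SHORT_ADDR_UNSPEC) if a not in in_use]
    if not free:
        raise RuntimeError("no free short address left on this PAN")
    return rng.choice(free)
```

Building the full candidate list is simple to reason about; an implementation with tight memory constraints might instead draw random values and retry on collision.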
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-8-miquel.raynal@bootlin.com
|
|
Devices may decide to disassociate from their coordinator for different
reasons (device turning off, coordinator signal strength too low, etc.).
The MAC layer just has to send a disassociation notification.
If the ack of the disassociation notification is not received, the
device may consider itself disassociated anyway.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-7-miquel.raynal@bootlin.com
|
|
A device may decide at some point to disassociate from a PAN. Let's
introduce a netlink command for this purpose.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-6-miquel.raynal@bootlin.com
|
|
Joining a PAN officially goes by associating with a coordinator. This
coordinator may have been discovered thanks to the beacons it sent in
the past. Add support to the MAC layer for these associations, which
require:
- Sending an association request
- Receiving an association response
The association response contains the association status, possibly a
reason if the association was unsuccessful, and finally a short address
that we should use for intra-PAN communication from now on, if we
requested one (which is the default, and not yet configurable).
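To illustrate what the receiving side extracts from an association response, here is a minimal Python sketch. It assumes the command payload layout from IEEE 802.15.4 (one byte command identifier 0x02, a two byte little-endian short address, one byte status); `parse_assoc_response()` and `AssocResponse` are hypothetical names, not kernel symbols.

```python
import struct
from typing import NamedTuple

ASSOC_RESP_CMD_ID = 0x02  # assumed command identifier for the response


class AssocResponse(NamedTuple):
    short_addr: int  # address to use for intra-PAN communication
    status: int      # 0x00 is assumed to mean success


def parse_assoc_response(payload: bytes) -> AssocResponse:
    """Decode the MAC command payload of an association response."""
    if len(payload) < 4:
        raise ValueError("association response payload too short")
    # "<BHB": command id, little-endian 16-bit short address, status.
    cmd_id, short_addr, status = struct.unpack_from("<BHB", payload)
    if cmd_id != ASSOC_RESP_CMD_ID:
        raise ValueError(f"unexpected command id {cmd_id:#04x}")
    return AssocResponse(short_addr, status)
```

For example, the payload bytes `02 34 12 00` would decode to short address 0x1234 with a success status under these assumptions.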
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-5-miquel.raynal@bootlin.com
|
|
Users may decide to associate with a peer, which becomes our parent
coordinator. Let's add the necessary netlink support for this.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-4-miquel.raynal@bootlin.com
|
|
Introduce structures to describe peer devices in a PAN as well as a few
related helpers. We basically care about:
- Our unique parent after associating with a coordinator.
- Peer devices, children, which successfully associated with us.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-3-miquel.raynal@bootlin.com
|
|
Soon association and disassociation will be implemented, which will
require being able to either change the PAN ID from 0xFFFF to a real
value when association succeeds, or to reset the PAN ID to 0xFFFF upon
disassociation. Let's allow doing that manually for now.
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Alexander Aring <aahringo@redhat.com>
Link: https://lore.kernel.org/linux-wpan/20230927181214.129346-2-miquel.raynal@bootlin.com
|
|
W=1 builds now warn if a module is built without a MODULE_DESCRIPTION().
Add descriptions to all the sock diag modules in one fell swoop.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|