Age | Commit message (Collapse) | Author |
|
Add three qdisc-specific drop reasons and use them in sch_cake:
1) SKB_DROP_REASON_QDISC_OVERLIMIT
Whenever the total queue limit for a qdisc instance is exceeded
and a packet is dropped to make room.
2) SKB_DROP_REASON_QDISC_CONGESTED
Whenever a packet is dropped by the qdisc AQM algorithm because
congestion is detected.
3) SKB_DROP_REASON_CAKE_FLOOD
Whenever a packet is dropped by the flood protection part of the
CAKE AQM algorithm (BLUE).
Also use the existing SKB_DROP_REASON_QUEUE_PURGE in cake_clear_tin().
Reasons show up as:
perf record -a -e skb:kfree_skb sleep 1; perf script
iperf3 665 [005] 848.656964: skb:kfree_skb: skbaddr=0xffff98168a333500 rx_sk=(nil) protocol=34525 location=__dev_queue_xmit+0x10f0 reason: QDISC_OVERLIMIT
swapper 0 [001] 909.166055: skb:kfree_skb: skbaddr=0xffff98168280cee0 rx_sk=(nil) protocol=34525 location=cake_dequeue+0x5ef reason: QDISC_CONGESTED
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241211-cake-drop-reason-v2-1-920afadf4d1b@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Cross-merge networking fixes after downstream PR (net-6.13-rc3).
No conflicts or adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, netfilter and wireless.
Current release - fix to a fix:
- rtnetlink: fix error code in rtnl_newlink()
- tipc: fix NULL deref in cleanup_bearer()
Current release - regressions:
- ip: fix warning about invalid return from in ip_route_input_rcu()
Current release - new code bugs:
- udp: fix L4 hash after reconnect
- eth: lan969x: fix cyclic dependency between modules
- eth: bnxt_en: fix potential crash when dumping FW log coredump
Previous releases - regressions:
- wifi: mac80211:
- fix a queue stall in certain cases of channel switch
- wake the queues in case of failure in resume
- splice: do not checksum AF_UNIX sockets
- virtio_net: fix BUG()s in BQL support due to incorrect accounting
of purged packets during interface stop
- eth:
- stmmac: fix TSO DMA API mis-usage causing oops
- bnxt_en: fixes for HW GRO: GSO type on 5750X chips and oops
due to incorrect aggregation ID mask on 5760X chips
Previous releases - always broken:
- Bluetooth: improve setsockopt() handling of malformed user input
- eth: ocelot: fix PTP timestamping in presence of packet loss
- ptp: kvm: x86: avoid "fail to initialize ptp_kvm" when simply not
supported"
* tag 'net-6.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
net: dsa: tag_ocelot_8021q: fix broken reception
net: dsa: microchip: KSZ9896 register regmap alignment to 32 bit boundaries
net: renesas: rswitch: fix initial MPIC register setting
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
team: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
team: Fix initial vlan_feature set in __team_compute_features
bonding: Fix feature propagation of NETIF_F_GSO_ENCAP_ALL
bonding: Fix initial {vlan,mpls}_feature set in bond_compute_features
net, team, bonding: Add netdev_base_features helper
net/sched: netem: account for backlog updates from child qdisc
net: dsa: felix: fix stuck CPU-injected packets with short taprio windows
splice: do not checksum AF_UNIX sockets
net: usb: qmi_wwan: add Telit FE910C04 compositions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- SCO: Fix transparent voice setting
- ISO: Locking fixes
- hci_core: Fix sleeping function called from invalid context
- hci_event: Fix using rcu_read_(un)lock while iterating
- btmtk: avoid UAF in btmtk_process_coredump
* tag 'for-net-2024-12-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: btmtk: avoid UAF in btmtk_process_coredump
Bluetooth: iso: Fix circular lock in iso_conn_big_sync
Bluetooth: iso: Fix circular lock in iso_listen_bis
Bluetooth: SCO: Add support for 16 bits transparent voice setting
Bluetooth: iso: Fix recursive locking warning
Bluetooth: iso: Always release hdev at the end of iso_listen_bis
Bluetooth: hci_event: Fix using rcu_read_(un)lock while iterating
Bluetooth: hci_core: Fix sleeping function called from invalid context
Bluetooth: Improve setsockopt() handling of malformed user input
====================
Link: https://patch.msgid.link/20241212142806.2046274-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
The blamed commit changed the dsa_8021q_rcv() calling convention to
accept pre-populated source_port and switch_id arguments. If those are
not available, as in the case of tag_ocelot_8021q, the arguments must be
pre-initialized with -1.
Due to the bug of passing uninitialized arguments in tag_ocelot_8021q,
dsa_8021q_rcv() does not detect that it needs to populate the
source_port and switch_id, and this makes dsa_conduit_find_user() fail,
which leads to packet loss on reception.
Fixes: dcfe7673787b ("net: dsa: tag_sja1105: absorb logic for not overwriting precise info into dsa_8021q_rcv()")
Signed-off-by: Robert Hodaszi <robert.hodaszi@digi.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241211144741.1415758-1-robert.hodaszi@digi.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.
[ 561.670344] ======================================================
[ 561.670346] WARNING: possible circular locking dependency detected
[ 561.670349] 6.12.0-rc6+ #26 Not tainted
[ 561.670351] ------------------------------------------------------
[ 561.670353] iso-tester/3289 is trying to acquire lock:
[ 561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[ 561.670405]
but task is already holding lock:
[ 561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[ 561.670450]
which lock already depends on the new lock.
[ 561.670452]
the existing dependency chain (in reverse order) is:
[ 561.670453]
-> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[ 561.670458] lock_acquire+0x7c/0xc0
[ 561.670463] lock_sock_nested+0x3b/0xf0
[ 561.670467] bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[ 561.670510] iso_sock_accept+0x271/0x830 [bluetooth]
[ 561.670547] do_accept+0x3dd/0x610
[ 561.670550] __sys_accept4+0xd8/0x170
[ 561.670553] __x64_sys_accept+0x74/0xc0
[ 561.670556] x64_sys_call+0x17d6/0x25f0
[ 561.670559] do_syscall_64+0x87/0x150
[ 561.670563] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 561.670567]
-> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[ 561.670571] lock_acquire+0x7c/0xc0
[ 561.670574] lock_sock_nested+0x3b/0xf0
[ 561.670577] iso_sock_listen+0x2de/0xf30 [bluetooth]
[ 561.670617] __sys_listen_socket+0xef/0x130
[ 561.670620] __x64_sys_listen+0xe1/0x190
[ 561.670623] x64_sys_call+0x2517/0x25f0
[ 561.670626] do_syscall_64+0x87/0x150
[ 561.670629] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 561.670632]
-> #0 (&hdev->lock){+.+.}-{3:3}:
[ 561.670636] __lock_acquire+0x32ad/0x6ab0
[ 561.670639] lock_acquire.part.0+0x118/0x360
[ 561.670642] lock_acquire+0x7c/0xc0
[ 561.670644] __mutex_lock+0x18d/0x12f0
[ 561.670647] mutex_lock_nested+0x1b/0x30
[ 561.670651] iso_conn_big_sync+0x73/0x260 [bluetooth]
[ 561.670687] iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[ 561.670722] sock_recvmsg+0x1d5/0x240
[ 561.670725] sock_read_iter+0x27d/0x470
[ 561.670727] vfs_read+0x9a0/0xd30
[ 561.670731] ksys_read+0x1a8/0x250
[ 561.670733] __x64_sys_read+0x72/0xc0
[ 561.670736] x64_sys_call+0x1b12/0x25f0
[ 561.670738] do_syscall_64+0x87/0x150
[ 561.670741] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 561.670744]
other info that might help us debug this:
[ 561.670745] Chain exists of:
&hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH
[ 561.670751] Possible unsafe locking scenario:
[ 561.670753] CPU0 CPU1
[ 561.670754] ---- ----
[ 561.670756] lock(sk_lock-AF_BLUETOOTH);
[ 561.670758] lock(sk_lock
AF_BLUETOOTH-BTPROTO_ISO);
[ 561.670761] lock(sk_lock-AF_BLUETOOTH);
[ 561.670764] lock(&hdev->lock);
[ 561.670767]
*** DEADLOCK ***
Fixes: 07a9342b94a9 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This fixes the circular locking dependency warning below, by
releasing the socket lock before enterning iso_listen_bis, to
avoid any potential deadlock with hdev lock.
[ 75.307983] ======================================================
[ 75.307984] WARNING: possible circular locking dependency detected
[ 75.307985] 6.12.0-rc6+ #22 Not tainted
[ 75.307987] ------------------------------------------------------
[ 75.307987] kworker/u81:2/2623 is trying to acquire lock:
[ 75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO)
at: iso_connect_cfm+0x253/0x840 [bluetooth]
[ 75.308021]
but task is already holding lock:
[ 75.308022] ffff8fdd61a10078 (&hdev->lock)
at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[ 75.308053]
which lock already depends on the new lock.
[ 75.308054]
the existing dependency chain (in reverse order) is:
[ 75.308055]
-> #1 (&hdev->lock){+.+.}-{3:3}:
[ 75.308057] __mutex_lock+0xad/0xc50
[ 75.308061] mutex_lock_nested+0x1b/0x30
[ 75.308063] iso_sock_listen+0x143/0x5c0 [bluetooth]
[ 75.308085] __sys_listen_socket+0x49/0x60
[ 75.308088] __x64_sys_listen+0x4c/0x90
[ 75.308090] x64_sys_call+0x2517/0x25f0
[ 75.308092] do_syscall_64+0x87/0x150
[ 75.308095] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 75.308098]
-> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[ 75.308100] __lock_acquire+0x155e/0x25f0
[ 75.308103] lock_acquire+0xc9/0x300
[ 75.308105] lock_sock_nested+0x32/0x90
[ 75.308107] iso_connect_cfm+0x253/0x840 [bluetooth]
[ 75.308128] hci_connect_cfm+0x6c/0x190 [bluetooth]
[ 75.308155] hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth]
[ 75.308180] hci_le_meta_evt+0xe7/0x200 [bluetooth]
[ 75.308206] hci_event_packet+0x21f/0x5c0 [bluetooth]
[ 75.308230] hci_rx_work+0x3ae/0xb10 [bluetooth]
[ 75.308254] process_one_work+0x212/0x740
[ 75.308256] worker_thread+0x1bd/0x3a0
[ 75.308258] kthread+0xe4/0x120
[ 75.308259] ret_from_fork+0x44/0x70
[ 75.308261] ret_from_fork_asm+0x1a/0x30
[ 75.308263]
other info that might help us debug this:
[ 75.308264] Possible unsafe locking scenario:
[ 75.308264] CPU0 CPU1
[ 75.308265] ---- ----
[ 75.308265] lock(&hdev->lock);
[ 75.308267] lock(sk_lock-
AF_BLUETOOTH-BTPROTO_ISO);
[ 75.308268] lock(&hdev->lock);
[ 75.308269] lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO);
[ 75.308270]
*** DEADLOCK ***
[ 75.308271] 4 locks held by kworker/u81:2/2623:
[ 75.308272] #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0},
at: process_one_work+0x443/0x740
[ 75.308276] #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)),
at: process_one_work+0x1ce/0x740
[ 75.308280] #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3}
at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[ 75.308304] #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2},
at: hci_connect_cfm+0x29/0x190 [bluetooth]
Fixes: 02171da6e86a ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
The voice setting is used by sco_connect() or sco_conn_defer_accept()
after being set by sco_sock_setsockopt().
The PCM part of the voice setting is used for offload mode through PCM
chipset port.
This commits add support for mSBC 16 bits offloading, i.e. audio data
not transported over HCI.
The BCM4349B1 supports 16 bits transparent data on its I2S port.
If BT_VOICE_TRANSPARENT is used when accepting a SCO connection, this
gives only garbage audio while using BT_VOICE_TRANSPARENT_16BIT gives
correct audio.
This has been tested with connection to iPhone 14 and Samsung S24.
Fixes: ad10b1a48754 ("Bluetooth: Add Bluetooth socket voice option")
Signed-off-by: Frédéric Danis <frederic.danis@collabora.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This updates iso_sock_accept to use nested locking for the parent
socket, to avoid lockdep warnings caused because the parent and
child sockets are locked by the same thread:
[ 41.585683] ============================================
[ 41.585688] WARNING: possible recursive locking detected
[ 41.585694] 6.12.0-rc6+ #22 Not tainted
[ 41.585701] --------------------------------------------
[ 41.585705] iso-tester/3139 is trying to acquire lock:
[ 41.585711] ffff988b29530a58 (sk_lock-AF_BLUETOOTH)
at: bt_accept_dequeue+0xe3/0x280 [bluetooth]
[ 41.585905]
but task is already holding lock:
[ 41.585909] ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
at: iso_sock_accept+0x61/0x2d0 [bluetooth]
[ 41.586064]
other info that might help us debug this:
[ 41.586069] Possible unsafe locking scenario:
[ 41.586072] CPU0
[ 41.586076] ----
[ 41.586079] lock(sk_lock-AF_BLUETOOTH);
[ 41.586086] lock(sk_lock-AF_BLUETOOTH);
[ 41.586093]
*** DEADLOCK ***
[ 41.586097] May be due to missing lock nesting notation
[ 41.586101] 1 lock held by iso-tester/3139:
[ 41.586107] #0: ffff988b29533a58 (sk_lock-AF_BLUETOOTH)
at: iso_sock_accept+0x61/0x2d0 [bluetooth]
Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
Since hci_get_route holds the device before returning, the hdev
should be released with hci_dev_put at the end of iso_listen_bis
even if the function returns with an error.
Fixes: 02171da6e86a ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <iulia.tanasescu@nxp.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
The usage of rcu_read_(un)lock while inside list_for_each_entry_rcu is
not safe since for the most part entries fetched this way shall be
treated as rcu_dereference:
Note that the value returned by rcu_dereference() is valid
only within the enclosing RCU read-side critical section [1]_.
For example, the following is **not** legal::
rcu_read_lock();
p = rcu_dereference(head.next);
rcu_read_unlock();
x = p->address; /* BUG!!! */
rcu_read_lock();
y = p->data; /* BUG!!! */
rcu_read_unlock();
Fixes: a0bfde167b50 ("Bluetooth: ISO: Add support for connecting multiple BISes")
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
This reworks hci_cb_list to not use mutex hci_cb_list_lock to avoid bugs
like the bellow:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5070, name: kworker/u9:2
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
4 locks held by kworker/u9:2/5070:
#0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline]
#0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x1770 kernel/workqueue.c:3335
#1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline]
#1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x1770 kernel/workqueue.c:3335
#2: ffff8880665d0078 (&hdev->lock){+.+.}-{3:3}, at: hci_le_create_big_complete_evt+0xcf/0xae0 net/bluetooth/hci_event.c:6914
#3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline]
#3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline]
#3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: hci_le_create_big_complete_evt+0xdb/0xae0 net/bluetooth/hci_event.c:6915
CPU: 0 PID: 5070 Comm: kworker/u9:2 Not tainted 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: hci0 hci_rx_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
__might_resched+0x5d4/0x780 kernel/sched/core.c:10187
__mutex_lock_common kernel/locking/mutex.c:585 [inline]
__mutex_lock+0xc1/0xd70 kernel/locking/mutex.c:752
hci_connect_cfm include/net/bluetooth/hci_core.h:2004 [inline]
hci_le_create_big_complete_evt+0x3d9/0xae0 net/bluetooth/hci_event.c:6939
hci_event_func net/bluetooth/hci_event.c:7514 [inline]
hci_event_packet+0xa53/0x1540 net/bluetooth/hci_event.c:7569
hci_rx_work+0x3e8/0xca0 net/bluetooth/hci_core.c:4171
process_one_work kernel/workqueue.c:3254 [inline]
process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
kthread+0x2f0/0x390 kernel/kthread.c:388
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
</TASK>
Reported-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Tested-by: syzbot+2fb0835e0c9cefc34614@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2fb0835e0c9cefc34614
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
AF_INET6 is not supported for smc-r v2 client before, even if the
ipv6 addr is ipv4 mapped. Thus, when using AF_INET6, smc-r connection
will fallback to tcp, especially for java applications running smc-r.
This patch support ipv4 mapped ipv6 addr client for smc-r v2. Clients
using real global ipv6 addr is still not supported yet.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
For SMC-R V2, llc msg can be larger than SMC_WR_BUF_SIZE, thus every
recv wr has 2 sges, the first sge with length SMC_WR_BUF_SIZE is for
V1/V2 compatible llc/cdc msg, and the second sge with length
SMC_WR_BUF_V2_SIZE-SMC_WR_TX_SIZE is for V2 specific llc msg, like
SMC_LLC_DELETE_RKEY and SMC_LLC_ADD_LINK for SMC-R V2. The memory
buffer in the second sge is shared by all recv wr in one link and
all link in one lgr for saving memory usage purpose.
But not all RDMA devices with max_recv_sge greater than 1. Thus SMC-R
V2 can not support on such RDMA devices and SMC_CLC_DECL_INTERR fallback
happens because of the failure of create qp.
This patch introduce the support for SMC-R V2 on RDMA devices with
max_recv_sge equals to 1. Every recv wr has only one sge with individual
buffer whose size is SMC_WR_BUF_V2_SIZE once the RDMA device's max_recv_sge
equals to 1. It may use more memory, but it is better than
SMC_CLC_DECL_INTERR fallback.
Co-developed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for net:
1) Fix bogus test reports in rpath.sh selftest by adding permanent
neighbor entries, from Phil Sutter.
2) Lockdep reports possible ABBA deadlock in xt_IDLETIMER, fix it by
removing sysfs out of the mutex section, also from Phil Sutter.
3) It is illegal to release basechain via RCU callback, for several
reasons. Keep it simple and safe by calling synchronize_rcu() instead.
This is a partially reverting a botched recent attempt of me to fix
this basechain release path on netdevice removal.
From Florian Westphal.
netfilter pull request 24-12-11
* tag 'nf-24-12-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: do not defer rule destruction via call_rcu
netfilter: IDLETIMER: Fix for possible ABBA deadlock
selftests: netfilter: Stabilize rpath.sh
====================
Link: https://patch.msgid.link/20241211230130.176937-1-pablo@netfilter.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Now that we have updated all drivers, switch DSA to require an
implementation of the .support_eee() method for EEE to be usable,
rather than defaulting to being permissive when not implemented.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL14e-006cZy-AT@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Provide a trivial implementation for the .support_eee() method which
switch drivers can use to simply indicate that they support EEE on
all their user ports.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL149-006cZJ-JJ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a hook to determine whether the switch supports EEE. This will
return false if the switch does not, or true if it does. If the
method is not implemented, we assume (currently) that the switch
supports EEE.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL144-006cZD-El@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When user ports are initialised, a phylink instance is always created,
and so dp->pl will always be non-NULL. The EEE methods are only used
for user ports, so checking for dp->pl to be NULL makes no sense. No
other phylink-calling method implements similar checks in DSA. Remove
this unnecessary check.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://patch.msgid.link/E1tL13z-006cZ7-BZ@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
In general, 'qlen' of any classful qdisc should keep track of the
number of packets that the qdisc itself and all of its children holds.
In case of netem, 'qlen' only accounts for the packets in its internal
tfifo. When netem is used with a child qdisc, the child qdisc can use
'qdisc_tree_reduce_backlog' to inform its parent, netem, about created
or dropped SKBs. This function updates 'qlen' and the backlog statistics
of netem, but netem does not account for changes made by a child qdisc.
'qlen' then indicates the wrong number of packets in the tfifo.
If a child qdisc creates new SKBs during enqueue and informs its parent
about this, netem's 'qlen' value is increased. When netem dequeues the
newly created SKBs from the child, the 'qlen' in netem is not updated.
If 'qlen' reaches the configured sch->limit, the enqueue function stops
working, even though the tfifo is not full.
Reproduce the bug:
Ensure that the sender machine has GSO enabled. Configure netem as root
qdisc and tbf as its child on the outgoing interface of the machine
as follows:
$ tc qdisc add dev <oif> root handle 1: netem delay 100ms limit 100
$ tc qdisc add dev <oif> parent 1:0 tbf rate 50Mbit burst 1542 latency 50ms
Send bulk TCP traffic out via this interface, e.g., by running an iPerf3
client on the machine. Check the qdisc statistics:
$ tc -s qdisc show dev <oif>
Statistics after 10s of iPerf3 TCP test before the fix (note that
netem's backlog > limit, netem stopped accepting packets):
qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
Sent 2767766 bytes 1848 pkt (dropped 652, overlimits 0 requeues 0)
backlog 4294528236b 1155p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
Sent 2767766 bytes 1848 pkt (dropped 327, overlimits 7601 requeues 0)
backlog 0b 0p requeues 0
Statistics after the fix:
qdisc netem 1: root refcnt 2 limit 1000 delay 100ms
Sent 37766372 bytes 24974 pkt (dropped 9, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
qdisc tbf 10: parent 1:1 rate 50Mbit burst 1537b lat 50ms
Sent 37766372 bytes 24974 pkt (dropped 327, overlimits 96017 requeues 0)
backlog 0b 0p requeues 0
tbf segments the GSO SKBs (tbf_segment) and updates the netem's 'qlen'.
The interface fully stops transferring packets and "locks". In this case,
the child qdisc and tfifo are empty, but 'qlen' indicates the tfifo is at
its limit and no more packets are accepted.
This patch adds a counter for the entries in the tfifo. Netem's 'qlen' is
only decreased when a packet is returned by its dequeue function, and not
during enqueuing into the child qdisc. External updates to 'qlen' are thus
accounted for and only the behavior of the backlog statistics changes. As
in other qdiscs, 'qlen' then keeps track of how many packets are held in
netem and all of its children. As before, sch->limit remains as the
maximum number of packets in the tfifo. The same applies to netem's
backlog statistics.
Fixes: 50612537e9ab ("netem: fix classful handling")
Signed-off-by: Martin Ottens <martin.ottens@fau.de>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20241210131412.1837202-1-martin.ottens@fau.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are some batman-adv bugfixes:
- fix TT unitialized data and size limit issues, by Remi Pommarel
(3 patches)
* tag 'batadv-net-pullrequest-20241210' of git://git.open-mesh.org/linux-merge:
batman-adv: Do not let TT changes list grows indefinitely
batman-adv: Remove uninitialized data in full table TT response
batman-adv: Do not send uninitialized TT changes
====================
Link: https://patch.msgid.link/20241210135024.39068-1-sw@simonwunderlich.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When `skb_splice_from_iter` was introduced, it inadvertently added
checksumming for AF_UNIX sockets. This resulted in significant
slowdowns, for example when using sendfile over unix sockets.
Using the test code in [1] in my test setup (2G single core qemu),
the client receives a 1000M file in:
- without the patch: 1482ms (+/- 36ms)
- with the patch: 652.5ms (+/- 22.9ms)
This commit addresses the issue by marking checksumming as unnecessary in
`unix_stream_sendmsg`
Cc: stable@vger.kernel.org
Signed-off-by: Frederik Deweerdt <deweerdt.lkml@gmail.com>
Fixes: 2e910b95329c ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES")
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Joe Damato <jdamato@fastly.com>
Link: https://patch.msgid.link/Z1fMaHkRf8cfubuE@xiberoa
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
reused by reopening a connection. This is a safe choice based on an
assumption that the other TCP timestamp clock frequency, which is unknown
to us, may be as low as 1 Hz (RFC 7323, section 5.4).
However, this means that in the presence of short lived connections with an
RTT of couple of milliseconds, the time during which a 4-tuple is blocked
from reuse can be orders of magnitude longer that the connection lifetime.
Combined with a reduced pool of ephemeral ports, when using
IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
long TIME-WAIT reuse delay can lead to port exhaustion, where all available
4-tuples are tied up in TIME-WAIT state.
Turn the reuse delay into a per-netns setting so that sysadmins can make
more aggressive assumptions about remote TCP timestamp clock frequency and
shorten the delay in order to allow connections to reincarnate faster.
Note that applications can completely bypass the TIME-WAIT delay protection
already today by locking the local port with bind() before connecting. Such
immediate connection reuse may result in PAWS failing to detect old
duplicate segments, leaving us with just the sequence number check as a
safety net.
This new configurable offers a trade off where the sysadmin can balance
between the risk of PAWS detection failing to act versus exhausting ports
by having sockets tied up in TIME-WAIT state for too long.
[1] https://lpc.events/event/16/contributions/1349/
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-2-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Prepare ground for TIME-WAIT socket reuse with subsecond delay.
Today the last TS.Recent update timestamp, recorded in seconds and stored
tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two purposes.
Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).
For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.
Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.
The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).
In this case using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.
As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay will actually be between (0, 1] seconds, and 0.5 seconds on
average. We delay TW reuse for one full second only when last TS.Recent
update coincides with our virtual 1 Hz clock tick.
Considering the above, introduce a dedicated field to store a millisecond
timestamp of transition into the TIME-WAIT state. Place it in an existing
4-byte hole inside inet_timewait_sock structure to avoid an additional
memory cost.
Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
and (ii) prepare for configurable subsecond reuse delay in the subsequent
change.
We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.
A more involved alternative would be to change the resolution of the last
TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.
[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20241209-jakub-krn-909-poc-msec-tw-tstamp-v2-1-66aca0eed03e@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
psf->sf_count[MCAST_XXX] fields are read locklessly from
ipv6_chk_mcast_addr() and igmp6_mcf_seq_show().
Add READ_ONCE() and WRITE_ONCE() annotations accordingly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
mc->mca_sfcount[MCAST_EXCLUDE] is read locklessly from
ipv6_chk_mcast_addr().
Add READ_ONCE() and WRITE_ONCE() annotations accordingly.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add a label and two gotos to shorten lines by two tabulations,
to ease code review of following patches.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20241210183352.86530-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
nf_tables_chain_destroy can sleep, it can't be used from call_rcu
callbacks.
Moreover, nf_tables_rule_release() is only safe for error unwinding,
while transaction mutex is held and the to-be-desroyed rule was not
exposed to either dataplane or dumps, as it deactives+frees without
the required synchronize_rcu() in-between.
nft_rule_expr_deactivate() callbacks will change ->use counters
of other chains/sets, see e.g. nft_lookup .deactivate callback, these
must be serialized via transaction mutex.
Also add a few lockdep asserts to make this more explicit.
Calling synchronize_rcu() isn't ideal, but fixing this without is hard
and way more intrusive. As-is, we can get:
WARNING: .. net/netfilter/nf_tables_api.c:5515 nft_set_destroy+0x..
Workqueue: events nf_tables_trans_destroy_work
RIP: 0010:nft_set_destroy+0x3fe/0x5c0
Call Trace:
<TASK>
nf_tables_trans_destroy_work+0x6b7/0xad0
process_one_work+0x64a/0xce0
worker_thread+0x613/0x10d0
In case the synchronize_rcu becomes an issue, we can explore alternatives.
One way would be to allocate nft_trans_rule objects + one nft_trans_chain
object, deactivate the rules + the chain and then defer the freeing to the
nft destroy workqueue. We'd still need to keep the synchronize_rcu path as
a fallback to handle -ENOMEM corner cases though.
Reported-by: syzbot+b26935466701e56cfdc2@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/67478d92.050a0220.253251.0062.GAE@google.com/T/
Fixes: c03d278fdf35 ("netfilter: nf_tables: wait for rcu grace period on net_device removal")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
Deletion of the last rule referencing a given idletimer may happen at
the same time as a read of its file in sysfs:
| ======================================================
| WARNING: possible circular locking dependency detected
| 6.12.0-rc7-01692-g5e9a28f41134-dirty #594 Not tainted
| ------------------------------------------------------
| iptables/3303 is trying to acquire lock:
| ffff8881057e04b8 (kn->active#48){++++}-{0:0}, at: __kernfs_remove+0x20
|
| but task is already holding lock:
| ffffffffa0249068 (list_mutex){+.+.}-{3:3}, at: idletimer_tg_destroy_v]
|
| which lock already depends on the new lock.
A simple reproducer is:
| #!/bin/bash
|
| while true; do
| iptables -A INPUT -i foo -j IDLETIMER --timeout 10 --label "testme"
| iptables -D INPUT -i foo -j IDLETIMER --timeout 10 --label "testme"
| done &
| while true; do
| cat /sys/class/xt_idletimer/timers/testme >/dev/null
| done
Avoid this by freeing list_mutex right after deleting the element from
the list, then continuing with the teardown.
Fixes: 0902b469bd25 ("netfilter: xtables: idletimer target implementation")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The bt_copy_from_sockptr() return value is being misinterpreted by most
users: a non-zero result is mistakenly assumed to represent an error code,
but actually indicates the number of bytes that could not be copied.
Remove bt_copy_from_sockptr() and adapt callers to use
copy_safe_from_sockptr().
For sco_sock_setsockopt() (case BT_CODEC) use copy_struct_from_sockptr() to
scrub parts of uninitialized buffer.
Opportunistically, rename `len` to `optlen` in hci_sock_setsockopt_old()
and hci_sock_setsockopt().
Fixes: 51eda36d33e4 ("Bluetooth: SCO: Fix not validating setsockopt user input")
Fixes: a97de7bff13b ("Bluetooth: RFCOMM: Fix not validating setsockopt user input")
Fixes: 4f3951242ace ("Bluetooth: L2CAP: Fix not validating setsockopt user input")
Fixes: 9e8742cdfc4b ("Bluetooth: ISO: Fix not validating setsockopt user input")
Fixes: b2186061d604 ("Bluetooth: hci_sock: Fix not validating setsockopt user input")
Reviewed-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
|
|
l2tp_eth uses the TSTATS infrastructure (dev_sw_netstats_*()) for RX
and TX packet counters and DEV_STATS_INC for dropped counters.
Consolidate that using the DSTATS infrastructure, which can
handle both packet counters and packet drops. Statistics that don't
fit DSTATS are still updated atomically with DEV_STATS_INC().
This change is inspired by the introduction of DSTATS helpers and
their use in other udp tunnel drivers:
Link: https://lore.kernel.org/all/cover.1733313925.git.gnault@redhat.com/
Signed-off-by: James Chapman <jchapman@katalix.com>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
A small set of fixes:
- avoid CSA warnings during link removal
(by changing link bitmap after remove)
- fix # of spatial streams initialisation
- fix queues getting stuck in some CSA cases
and resume failures
- fix interface address when switching monitor mode
- fix MBSS change flags 32-bit stack corruption
- more UBSAN __counted_by "fixes" ...
- fix link ID netlink validation
* tag 'wireless-2024-12-10' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: cfg80211: sme: init n_channels before channels[] access
wifi: mac80211: fix station NSS capability initialization order
wifi: mac80211: fix vif addr when switching from monitor to station
wifi: mac80211: fix a queue stall in certain cases of CSA
wifi: mac80211: wake the queues in case of failure in resume
wifi: cfg80211: clear link ID from bitmap during link delete after clean up
wifi: mac80211: init cnt before accessing elem in ieee80211_copy_mbssid_beacon
wifi: mac80211: fix mbss changed flags corruption on 32 bit systems
wifi: nl80211: fix NL80211_ATTR_MLO_LINK_ID off-by-one
====================
Link: https://patch.msgid.link/20241210130145.28618-3-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This is the last netdev iterator still using net->dev_index_head[].
Convert to modern for_each_netdev_dump() for better scalability,
and use common patterns in our stack.
Following patch in this series removes the pad field
in struct ndo_fdb_dump_context.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241209100747.2269613-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rtnl_fdb_dump() and various ndo_fdb_dump() helpers share
a hidden layout of cb->ctx.
Before switching rtnl_fdb_dump() to for_each_netdev_dump()
in the following patch, make this more explicit.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20241209100747.2269613-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Ensure there is enough space before adding MPTCP options in
tcp_syn_options().
Without this check, 'remaining' could underflow, and causes issues. If
there is not enough space, MPTCP should not be used.
Signed-off-by: MoYuanhao <moyuanhao3676@163.com>
Fixes: cec37a6e41aa ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Cc: stable@vger.kernel.org
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
[ Matt: Add Fixes, cc Stable, update Description ]
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241209-net-mptcp-check-space-syn-v1-1-2da992bb6f74@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Use the proper API instead of open coding it.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20241208234955.31910-1-frederic@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tail-called programs could execute any of the helpers that invalidate
packet pointers. Hence, conservatively assume that each tail call
invalidates packet pointers.
Making the change in bpf_helper_changes_pkt_data() automatically makes
use of check_cfg() logic that computes 'changes_pkt_data' effect for
global sub-programs, such that the following program could be
rejected:
int tail_call(struct __sk_buff *sk)
{
bpf_tail_call_static(sk, &jmp_table, 0);
return 0;
}
SEC("tc")
int not_safe(struct __sk_buff *sk)
{
int *p = (void *)(long)sk->data;
... make p valid ...
tail_call(sk);
*p = 42; /* this is unsafe */
...
}
The tc_bpf2bpf.c:subprog_tc() needs change: mark it as a function that
can invalidate packet pointers. Otherwise, it can't be freplaced with
tailcall_freplace.c:entry_freplace() that does a tail call.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-8-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Use BPF helper number instead of function pointer in
bpf_helper_changes_pkt_data(). This would simplify usage of this
function in verifier.c:check_cfg() (in a follow-up patch),
where only helper number is easily available and there is no real need
to lookup helper proto.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241210041100.1898468-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Element replace (with a socket different from the one stored) may race
with socket's close() link popping & unlinking. __sock_map_delete()
unconditionally unrefs the (wrong) element:
// set map[0] = s0
map_update_elem(map, 0, s0)
// drop fd of s0
close(s0)
sock_map_close()
lock_sock(sk) (s0!)
sock_map_remove_links(sk)
link = sk_psock_link_pop()
sock_map_unlink(sk, link)
sock_map_delete_from_link
// replace map[0] with s1
map_update_elem(map, 0, s1)
sock_map_update_elem
(s1!) lock_sock(sk)
sock_map_update_common
psock = sk_psock(sk)
spin_lock(&stab->lock)
osk = stab->sks[idx]
sock_map_add_link(..., &stab->sks[idx])
sock_map_unref(osk, &stab->sks[idx])
psock = sk_psock(osk)
sk_psock_put(sk, psock)
if (refcount_dec_and_test(&psock))
sk_psock_drop(sk, psock)
spin_unlock(&stab->lock)
unlock_sock(sk)
__sock_map_delete
spin_lock(&stab->lock)
sk = *psk // s1 replaced s0; sk == s1
if (!sk_test || sk_test == sk) // sk_test (s0) != sk (s1); no branch
sk = xchg(psk, NULL)
if (sk)
sock_map_unref(sk, psk) // unref s1; sks[idx] will dangle
psock = sk_psock(sk)
sk_psock_put(sk, psock)
if (refcount_dec_and_test())
sk_psock_drop(sk, psock)
spin_unlock(&stab->lock)
release_sock(sk)
Then close(map) enqueues bpf_map_free_deferred, which finally calls
sock_map_free(). This results in some refcount_t warnings along with
a KASAN splat [1].
Fix __sock_map_delete(), do not allow sock_map_unref() on elements that
may have been replaced.
[1]:
BUG: KASAN: slab-use-after-free in sock_map_free+0x10e/0x330
Write of size 4 at addr ffff88811f5b9100 by task kworker/u64:12/1063
CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Not tainted 6.12.0+ #125
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
Workqueue: events_unbound bpf_map_free_deferred
Call Trace:
<TASK>
dump_stack_lvl+0x68/0x90
print_report+0x174/0x4f6
kasan_report+0xb9/0x190
kasan_check_range+0x10f/0x1e0
sock_map_free+0x10e/0x330
bpf_map_free_deferred+0x173/0x320
process_one_work+0x846/0x1420
worker_thread+0x5b3/0xf80
kthread+0x29e/0x360
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
Allocated by task 1202:
kasan_save_stack+0x1e/0x40
kasan_save_track+0x10/0x30
__kasan_slab_alloc+0x85/0x90
kmem_cache_alloc_noprof+0x131/0x450
sk_prot_alloc+0x5b/0x220
sk_alloc+0x2c/0x870
unix_create1+0x88/0x8a0
unix_create+0xc5/0x180
__sock_create+0x241/0x650
__sys_socketpair+0x1ce/0x420
__x64_sys_socketpair+0x92/0x100
do_syscall_64+0x93/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Freed by task 46:
kasan_save_stack+0x1e/0x40
kasan_save_track+0x10/0x30
kasan_save_free_info+0x37/0x60
__kasan_slab_free+0x4b/0x70
kmem_cache_free+0x1a1/0x590
__sk_destruct+0x388/0x5a0
sk_psock_destroy+0x73e/0xa50
process_one_work+0x846/0x1420
worker_thread+0x5b3/0xf80
kthread+0x29e/0x360
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
The buggy address belongs to the object at ffff88811f5b9080
which belongs to the cache UNIX-STREAM of size 1984
The buggy address is located 128 bytes inside of
freed 1984-byte region [ffff88811f5b9080, ffff88811f5b9840)
The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11f5b8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff888127d49401
flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 0017ffffc0000040 ffff8881042e4500 dead000000000122 0000000000000000
raw: 0000000000000000 00000000800f000f 00000001f5000000 ffff888127d49401
head: 0017ffffc0000040 ffff8881042e4500 dead000000000122 0000000000000000
head: 0000000000000000 00000000800f000f 00000001f5000000 ffff888127d49401
head: 0017ffffc0000003 ffffea00047d6e01 ffffffffffffffff 0000000000000000
head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff88811f5b9000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff88811f5b9080: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88811f5b9180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88811f5b9200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Disabling lock debugging due to kernel taint
refcount_t: addition on 0; use-after-free.
WARNING: CPU: 14 PID: 1063 at lib/refcount.c:25 refcount_warn_saturate+0xce/0x150
CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Tainted: G B 6.12.0+ #125
Tainted: [B]=BAD_PAGE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
Workqueue: events_unbound bpf_map_free_deferred
RIP: 0010:refcount_warn_saturate+0xce/0x150
Code: 34 73 eb 03 01 e8 82 53 ad fe 0f 0b eb b1 80 3d 27 73 eb 03 00 75 a8 48 c7 c7 80 bd 95 84 c6 05 17 73 eb 03 01 e8 62 53 ad fe <0f> 0b eb 91 80 3d 06 73 eb 03 00 75 88 48 c7 c7 e0 bd 95 84 c6 05
RSP: 0018:ffff88815c49fc70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88811f5b9100 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed10bcde6349
R10: ffff8885e6f31a4b R11: 0000000000000000 R12: ffff88813be0b000
R13: ffff88811f5b9100 R14: ffff88811f5b9080 R15: ffff88813be0b024
FS: 0000000000000000(0000) GS:ffff8885e6f00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dda99b0250 CR3: 000000015dbac000 CR4: 0000000000752ef0
PKRU: 55555554
Call Trace:
<TASK>
? __warn.cold+0x5f/0x1ff
? refcount_warn_saturate+0xce/0x150
? report_bug+0x1ec/0x390
? handle_bug+0x58/0x90
? exc_invalid_op+0x13/0x40
? asm_exc_invalid_op+0x16/0x20
? refcount_warn_saturate+0xce/0x150
sock_map_free+0x2e5/0x330
bpf_map_free_deferred+0x173/0x320
process_one_work+0x846/0x1420
worker_thread+0x5b3/0xf80
kthread+0x29e/0x360
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
irq event stamp: 10741
hardirqs last enabled at (10741): [<ffffffff84400ec6>] asm_sysvec_apic_timer_interrupt+0x16/0x20
hardirqs last disabled at (10740): [<ffffffff811e532d>] handle_softirqs+0x60d/0x770
softirqs last enabled at (10506): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
softirqs last disabled at (10301): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
refcount_t: underflow; use-after-free.
WARNING: CPU: 14 PID: 1063 at lib/refcount.c:28 refcount_warn_saturate+0xee/0x150
CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Tainted: G B W 6.12.0+ #125
Tainted: [B]=BAD_PAGE, [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
Workqueue: events_unbound bpf_map_free_deferred
RIP: 0010:refcount_warn_saturate+0xee/0x150
Code: 17 73 eb 03 01 e8 62 53 ad fe 0f 0b eb 91 80 3d 06 73 eb 03 00 75 88 48 c7 c7 e0 bd 95 84 c6 05 f6 72 eb 03 01 e8 42 53 ad fe <0f> 0b e9 6e ff ff ff 80 3d e6 72 eb 03 00 0f 85 61 ff ff ff 48 c7
RSP: 0018:ffff88815c49fc70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88811f5b9100 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: 0000000000000003 R08: 0000000000000001 R09: ffffed10bcde6349
R10: ffff8885e6f31a4b R11: 0000000000000000 R12: ffff88813be0b000
R13: ffff88811f5b9100 R14: ffff88811f5b9080 R15: ffff88813be0b024
FS: 0000000000000000(0000) GS:ffff8885e6f00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dda99b0250 CR3: 000000015dbac000 CR4: 0000000000752ef0
PKRU: 55555554
Call Trace:
<TASK>
? __warn.cold+0x5f/0x1ff
? refcount_warn_saturate+0xee/0x150
? report_bug+0x1ec/0x390
? handle_bug+0x58/0x90
? exc_invalid_op+0x13/0x40
? asm_exc_invalid_op+0x16/0x20
? refcount_warn_saturate+0xee/0x150
sock_map_free+0x2d3/0x330
bpf_map_free_deferred+0x173/0x320
process_one_work+0x846/0x1420
worker_thread+0x5b3/0xf80
kthread+0x29e/0x360
ret_from_fork+0x2d/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
irq event stamp: 10741
hardirqs last enabled at (10741): [<ffffffff84400ec6>] asm_sysvec_apic_timer_interrupt+0x16/0x20
hardirqs last disabled at (10740): [<ffffffff811e532d>] handle_softirqs+0x60d/0x770
softirqs last enabled at (10506): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
softirqs last disabled at (10301): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241202-sockmap-replace-v1-3-1e88579e7bd5@rbox.co
|
|
Consider a sockmap entry being updated with the same socket:
osk = stab->sks[idx];
sock_map_add_link(psock, link, map, &stab->sks[idx]);
stab->sks[idx] = sk;
if (osk)
sock_map_unref(osk, &stab->sks[idx]);
Due to sock_map_unref(), which invokes sock_map_del_link(), all the
psock's links for stab->sks[idx] are torn:
list_for_each_entry_safe(link, tmp, &psock->link, list) {
if (link->link_raw == link_raw) {
...
list_del(&link->list);
sk_psock_free_link(link);
}
}
And that includes the new link sock_map_add_link() added just before
the unref.
This results in a sockmap holding a socket, but without the respective
link. This in turn means that close(sock) won't trigger the cleanup,
i.e. a closed socket will not be automatically removed from the sockmap.
Stop tearing the links when a matching link_raw is found.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20241202-sockmap-replace-v1-1-1e88579e7bd5@rbox.co
|
|
After the blamed commit below, udp_rehash() is supposed to be called
with both local and remote addresses set.
Currently that is already the case for IPv6 sockets, but for IPv4 the
destination address is updated after rehashing.
Address the issue moving the destination address and port initialization
before rehashing.
Fixes: 1b29a730ef8b ("ipv6/udp: Add 4-tuple hash for connected socket")
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/4761e466ab9f7542c68cdc95f248987d127044d2.1733499715.git.pabeni@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
mctp_dump_addrinfo() is one of the last users of
net->dev_index_head[] in the control path.
Switch to for_each_netdev_dump() for better scalability.
Use C99 for mctp_device_rtnl_msg_handlers[] to prepare
future RTNL removal from mctp_dump_addrinfo()
(mdev->addrs is not yet RCU protected)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Matt Johnston <matt@codeconstruct.com.au>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Jeremy Kerr <jk@codeconstruct.com.au>
Link: https://patch.msgid.link/20241206223811.1343076-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When an rxrpc call is in its transmission phase and is sending a lot of
packets, stalls occasionally occur that cause severe performance
degradation (eg. increasing the transmission time for a 256MiB payload from
0.7s to 2.5s over a 10G link).
rxrpc already implements TCP-style congestion control [RFC5681] and this
helps mitigate the effects, but occasionally we're missing a time event
that deals with a missing ACK, leading to a stall until the RTO expires.
Fix this by implementing RACK/TLP in rxrpc.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
rxrpc_prepare_data_subpacket() sets the REQUEST-ACK flag on the outgoing
DATA packet under a number of circumstances, including, theoretically, when
the cwnd is at minimum (or less). However, the minimum in this function is
hard-coded as 2, but the actual minimum is RXRPC_MIN_CWND (which is
currently 4) and so this never occurs.
Without this, we will miss the request of some ACKs, potentially leading to
a transmission stall until a timeout occurs on one side or the other that
leads to an ACK being generated.
Fix the function to use RXRPC_MIN_CWND rather than a hard-coded number.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Manage the determination of RTT on a per-call (ie. per-RPC op) basis rather
than on a per-peer basis, averaging across all calls going to that peer.
The problem is that the RTT measurements from the initial packets on a call
may be off because the server may do some setting up (such as getting a
lock on a file) before accepting the rest of the data in the RPC and,
further, the RTT may be affected by server-side file operations, for
instance if a large amount of data is being written or read.
Note: When handling the FS.StoreData-type RPCs, for example, the server
uses the userStatus field in the header of ACK packets as supplementary
flow control to aid in managing this. AF_RXRPC does not yet support this,
but it should be added.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Record the reason for the transmission of an ACK in the rxrpc_tx_ack
tracepoint, and not just in the rxrpc_propose_ack tracepoint.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Add an indicator to the rxrpc_tx_data tracepoint to indicate what triggered
the transmission of a particular packet. At this point, it's only normal
transmission and retransmission, plus the tracepoint is also used to record
loss injection, but in a future patch, TLP-induced (re-)transmission will
also be a thing.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Tidy up the ACK parsing in the following ways:
(1) Put the serial number of the ACK packet into the rxrpc_ack_summary
struct and access it from there whilst parsing an ACK.
(2) Be consistent about using "if (summary.acked_serial)" rather than "if
(summary.acked_serial != 0)".
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Where a spinlock is used by both the application thread and the I/O thread,
use irq-disabling locking so that an interrupt taken on the app thread
doesn't also slow down the I/O thread.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Don't allocate an rxrpc_txbuf struct for an ACK transmission. There's now
no need as the memory to hold the ACK content is allocated with a page frag
allocator. The allocation and freeing of a txbuf is just unnecessary
overhead.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|