summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)Author
2024-04-17net/sched: Fix mirred deadlock on device recursionEric Dumazet
When the mirred action is used on a classful egress qdisc and a packet is mirrored or redirected to self we hit a qdisc lock deadlock. See trace below. [..... other info removed for brevity....] [ 82.890906] [ 82.890906] ============================================ [ 82.890906] WARNING: possible recursive locking detected [ 82.890906] 6.8.0-05205-g77fadd89fe2d-dirty #213 Tainted: G W [ 82.890906] -------------------------------------------- [ 82.890906] ping/418 is trying to acquire lock: [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550 [ 82.890906] [ 82.890906] but task is already holding lock: [ 82.890906] ffff888006994110 (&sch->q.lock){+.-.}-{3:3}, at: __dev_queue_xmit+0x1778/0x3550 [ 82.890906] [ 82.890906] other info that might help us debug this: [ 82.890906] Possible unsafe locking scenario: [ 82.890906] [ 82.890906] CPU0 [ 82.890906] ---- [ 82.890906] lock(&sch->q.lock); [ 82.890906] lock(&sch->q.lock); [ 82.890906] [ 82.890906] *** DEADLOCK *** [ 82.890906] [..... other info removed for brevity....] Example setup (eth0->eth0) to recreate tc qdisc add dev eth0 root handle 1: htb default 30 tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \ action mirred egress redirect dev eth0 Another example(eth0->eth1->eth0) to recreate tc qdisc add dev eth0 root handle 1: htb default 30 tc filter add dev eth0 handle 1: protocol ip prio 2 matchall \ action mirred egress redirect dev eth1 tc qdisc add dev eth1 root handle 1: htb default 30 tc filter add dev eth1 handle 1: protocol ip prio 2 matchall \ action mirred egress redirect dev eth0 We fix this by adding an owner field (CPU id) to struct Qdisc set after root qdisc is entered. When the softirq enters it a second time, if the qdisc owner is the same CPU, the packet is dropped to break the loop. Reported-by: Mingshuai Ren <renmingshuai@huawei.com> Closes: https://lore.kernel.org/netdev/20240314111713.5979-1-renmingshuai@huawei.com/ Fixes: 3bcb846ca4cf ("net: get rid of spin_trylock() in net_tx_action()") Fixes: e578d9c02587 ("net: sched: use counter to break reclassify loops") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Tested-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240415210728.36949-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-04net/sched: act_skbmod: prevent kernel-infoleakEric Dumazet
syzbot found that tcf_skbmod_dump() was copying four bytes from kernel stack to user space [1]. The issue here is that 'struct tc_skbmod' has a four bytes hole. We need to clear the structure before filling fields. [1] BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline] BUG: KMSAN: kernel-infoleak in copy_to_user_iter lib/iov_iter.c:24 [inline] BUG: KMSAN: kernel-infoleak in iterate_ubuf include/linux/iov_iter.h:29 [inline] BUG: KMSAN: kernel-infoleak in iterate_and_advance2 include/linux/iov_iter.h:245 [inline] BUG: KMSAN: kernel-infoleak in iterate_and_advance include/linux/iov_iter.h:271 [inline] BUG: KMSAN: kernel-infoleak in _copy_to_iter+0x366/0x2520 lib/iov_iter.c:185 instrument_copy_to_user include/linux/instrumented.h:114 [inline] copy_to_user_iter lib/iov_iter.c:24 [inline] iterate_ubuf include/linux/iov_iter.h:29 [inline] iterate_and_advance2 include/linux/iov_iter.h:245 [inline] iterate_and_advance include/linux/iov_iter.h:271 [inline] _copy_to_iter+0x366/0x2520 lib/iov_iter.c:185 copy_to_iter include/linux/uio.h:196 [inline] simple_copy_to_iter net/core/datagram.c:532 [inline] __skb_datagram_iter+0x185/0x1000 net/core/datagram.c:420 skb_copy_datagram_iter+0x5c/0x200 net/core/datagram.c:546 skb_copy_datagram_msg include/linux/skbuff.h:4050 [inline] netlink_recvmsg+0x432/0x1610 net/netlink/af_netlink.c:1962 sock_recvmsg_nosec net/socket.c:1046 [inline] sock_recvmsg+0x2c4/0x340 net/socket.c:1068 __sys_recvfrom+0x35a/0x5f0 net/socket.c:2242 __do_sys_recvfrom net/socket.c:2260 [inline] __se_sys_recvfrom net/socket.c:2256 [inline] __x64_sys_recvfrom+0x126/0x1d0 net/socket.c:2256 do_syscall_64+0xd5/0x1f0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was stored to memory at: pskb_expand_head+0x30f/0x19d0 net/core/skbuff.c:2253 netlink_trim+0x2c2/0x330 net/netlink/af_netlink.c:1317 netlink_unicast+0x9f/0x1260 net/netlink/af_netlink.c:1351 nlmsg_unicast include/net/netlink.h:1144 [inline] nlmsg_notify+0x21d/0x2f0 net/netlink/af_netlink.c:2610 rtnetlink_send+0x73/0x90 net/core/rtnetlink.c:741 rtnetlink_maybe_send include/linux/rtnetlink.h:17 [inline] tcf_add_notify net/sched/act_api.c:2048 [inline] tcf_action_add net/sched/act_api.c:2071 [inline] tc_ctl_action+0x146e/0x19d0 net/sched/act_api.c:2119 rtnetlink_rcv_msg+0x1737/0x1900 net/core/rtnetlink.c:6595 netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2559 rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6613 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0xf4c/0x1260 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x10df/0x11f0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:745 ____sys_sendmsg+0x877/0xb60 net/socket.c:2584 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638 __sys_sendmsg net/socket.c:2667 [inline] __do_sys_sendmsg net/socket.c:2676 [inline] __se_sys_sendmsg net/socket.c:2674 [inline] __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2674 do_syscall_64+0xd5/0x1f0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Uninit was stored to memory at: __nla_put lib/nlattr.c:1041 [inline] nla_put+0x1c6/0x230 lib/nlattr.c:1099 tcf_skbmod_dump+0x23f/0xc20 net/sched/act_skbmod.c:256 tcf_action_dump_old net/sched/act_api.c:1191 [inline] tcf_action_dump_1+0x85e/0x970 net/sched/act_api.c:1227 tcf_action_dump+0x1fd/0x460 net/sched/act_api.c:1251 tca_get_fill+0x519/0x7a0 net/sched/act_api.c:1628 tcf_add_notify_msg net/sched/act_api.c:2023 [inline] tcf_add_notify net/sched/act_api.c:2042 [inline] tcf_action_add net/sched/act_api.c:2071 [inline] tc_ctl_action+0x1365/0x19d0 net/sched/act_api.c:2119 rtnetlink_rcv_msg+0x1737/0x1900 net/core/rtnetlink.c:6595 netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2559 rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6613 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline] netlink_unicast+0xf4c/0x1260 net/netlink/af_netlink.c:1361 netlink_sendmsg+0x10df/0x11f0 net/netlink/af_netlink.c:1905 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:745 ____sys_sendmsg+0x877/0xb60 net/socket.c:2584 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2638 __sys_sendmsg net/socket.c:2667 [inline] __do_sys_sendmsg net/socket.c:2676 [inline] __se_sys_sendmsg net/socket.c:2674 [inline] __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2674 do_syscall_64+0xd5/0x1f0 entry_SYSCALL_64_after_hwframe+0x6d/0x75 Local variable opt created at: tcf_skbmod_dump+0x9d/0xc20 net/sched/act_skbmod.c:244 tcf_action_dump_old net/sched/act_api.c:1191 [inline] tcf_action_dump_1+0x85e/0x970 net/sched/act_api.c:1227 Bytes 188-191 of 248 are uninitialized Memory access of size 248 starts at ffff888117697680 Data copied to user address 00007ffe56d855f0 Fixes: 86da71b57383 ("net_sched: Introduce skbmod action") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240403130908.93421-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-03net/sched: fix lockdep splat in qdisc_tree_reduce_backlog()Eric Dumazet
qdisc_tree_reduce_backlog() is called with the qdisc lock held, not RTNL. We must use qdisc_lookup_rcu() instead of qdisc_lookup() syzbot reported: WARNING: suspicious RCU usage 6.1.74-syzkaller #0 Not tainted ----------------------------- net/sched/sch_api.c:305 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 3 locks held by udevd/1142: #0: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:306 [inline] #0: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:747 [inline] #0: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: net_tx_action+0x64a/0x970 net/core/dev.c:5282 #1: ffff888171861108 (&sch->q.lock){+.-.}-{2:2}, at: spin_lock include/linux/spinlock.h:350 [inline] #1: ffff888171861108 (&sch->q.lock){+.-.}-{2:2}, at: net_tx_action+0x754/0x970 net/core/dev.c:5297 #2: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:306 [inline] #2: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:747 [inline] #2: ffffffff87c729a0 (rcu_read_lock){....}-{1:2}, at: qdisc_tree_reduce_backlog+0x84/0x580 net/sched/sch_api.c:792 stack backtrace: CPU: 1 PID: 1142 Comm: udevd Not tainted 6.1.74-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 Call Trace: <TASK> [<ffffffff85b85f14>] __dump_stack lib/dump_stack.c:88 [inline] [<ffffffff85b85f14>] dump_stack_lvl+0x1b1/0x28f lib/dump_stack.c:106 [<ffffffff85b86007>] dump_stack+0x15/0x1e lib/dump_stack.c:113 [<ffffffff81802299>] lockdep_rcu_suspicious+0x1b9/0x260 kernel/locking/lockdep.c:6592 [<ffffffff84f0054c>] qdisc_lookup+0xac/0x6f0 net/sched/sch_api.c:305 [<ffffffff84f037c3>] qdisc_tree_reduce_backlog+0x243/0x580 net/sched/sch_api.c:811 [<ffffffff84f5b78c>] pfifo_tail_enqueue+0x32c/0x4b0 net/sched/sch_fifo.c:51 [<ffffffff84fbcf63>] qdisc_enqueue include/net/sch_generic.h:833 [inline] [<ffffffff84fbcf63>] netem_dequeue+0xeb3/0x15d0 net/sched/sch_netem.c:723 [<ffffffff84eecab9>] dequeue_skb net/sched/sch_generic.c:292 [inline] [<ffffffff84eecab9>] qdisc_restart net/sched/sch_generic.c:397 [inline] [<ffffffff84eecab9>] __qdisc_run+0x249/0x1e60 net/sched/sch_generic.c:415 [<ffffffff84d7aa96>] qdisc_run+0xd6/0x260 include/net/pkt_sched.h:125 [<ffffffff84d85d29>] net_tx_action+0x7c9/0x970 net/core/dev.c:5313 [<ffffffff85e002bd>] __do_softirq+0x2bd/0x9bd kernel/softirq.c:616 [<ffffffff81568bca>] invoke_softirq kernel/softirq.c:447 [inline] [<ffffffff81568bca>] __irq_exit_rcu+0xca/0x230 kernel/softirq.c:700 [<ffffffff81568ae9>] irq_exit_rcu+0x9/0x20 kernel/softirq.c:712 [<ffffffff85b89f52>] sysvec_apic_timer_interrupt+0x42/0x90 arch/x86/kernel/apic/apic.c:1107 [<ffffffff85c00ccb>] asm_sysvec_apic_timer_interrupt+0x1b/0x20 arch/x86/include/asm/idtentry.h:656 Fixes: d636fc5dd692 ("net: sched: add rcu annotations around qdisc->qdisc_sleeping") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240402134133.2352776-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-19net/sched: Add module alias for sch_fq_pieMichal Koutný
The commit 2c15a5aee2f3 ("net/sched: Load modules via their alias") starts loading modules via aliases and not canonical names. The new aliases were added in commit 241a94abcf46 ("net/sched: Add module aliases for cls_,sch_,act_ modules") via a Coccinele script. sch_fq_pie.c is missing module.h header and thus Coccinele did not patch it. Add the include and module alias manually, so that autoloading works for sch_fq_pie too. (Note: commit message in commit 241a94abcf46 ("net/sched: Add module aliases for cls_,sch_,act_ modules") was mangled due to '#' misinterpretation. The predicate haskernel is: | @ haskernel @ | @@ | | #include <linux/module.h> | .) Fixes: 241a94abcf46 ("net/sched: Add module aliases for cls_,sch_,act_ modules") Signed-off-by: Michal Koutný <mkoutny@suse.com> Link: https://lore.kernel.org/r/20240315160210.8379-1-mkoutny@suse.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-13net/sched: taprio: proper TCA_TAPRIO_TC_ENTRY_INDEX checkEric Dumazet
taprio_parse_tc_entry() is not correctly checking TCA_TAPRIO_TC_ENTRY_INDEX attribute: int tc; // Signed value tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]); if (tc >= TC_QOPT_MAX_QUEUE) { NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range"); return -ERANGE; } syzbot reported that it could fed arbitary negative values: UBSAN: shift-out-of-bounds in net/sched/sch_taprio.c:1722:18 shift exponent -2147418108 is negative CPU: 0 PID: 5066 Comm: syz-executor367 Not tainted 6.8.0-rc7-syzkaller-00136-gc8a5c731fd12 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 ubsan_epilogue lib/ubsan.c:217 [inline] __ubsan_handle_shift_out_of_bounds+0x3c7/0x420 lib/ubsan.c:386 taprio_parse_tc_entry net/sched/sch_taprio.c:1722 [inline] taprio_parse_tc_entries net/sched/sch_taprio.c:1768 [inline] taprio_change+0xb87/0x57d0 net/sched/sch_taprio.c:1877 taprio_init+0x9da/0xc80 net/sched/sch_taprio.c:2134 qdisc_create+0x9d4/0x1190 net/sched/sch_api.c:1355 tc_modify_qdisc+0xa26/0x1e40 net/sched/sch_api.c:1776 rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6617 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543 netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367 netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584 ___sys_sendmsg net/socket.c:2638 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f1b2dea3759 Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd4de452f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f1b2def0390 RCX: 00007f1b2dea3759 RDX: 0000000000000000 RSI: 00000000200007c0 RDI: 0000000000000004 RBP: 0000000000000003 R08: 0000555500000000 R09: 0000555500000000 R10: 0000555500000000 R11: 0000000000000246 R12: 00007ffd4de45340 R13: 00007ffd4de45310 R14: 0000000000000001 R15: 00007ffd4de45340 Fixes: a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU") Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-12Merge tag 'net-next-6.9' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code: - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter: - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF: - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless: - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API: - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc: - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers: - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro" * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits) nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y nexthop: Fix out-of-bounds access during attribute validation nexthop: Only parse NHA_OP_FLAGS for dump messages that require it nexthop: Only parse NHA_OP_FLAGS for get messages that require it bpf: move sleepable flag from bpf_prog_aux to bpf_prog bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes() selftests/bpf: Add kprobe multi triggering benchmarks ptp: Move from simple ida to xarray vxlan: Remove generic .ndo_get_stats64 vxlan: Do not alloc tstats manually devlink: Add comments to use netlink gen tool nfp: flower: handle acti_netdevs allocation failure net/packet: Add getsockopt support for PACKET_COPY_THRESH net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID selftests/bpf: Add bpf_arena_htab test. selftests/bpf: Add bpf_arena_list test. selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages bpf: Add helper macro bpf_addr_space_cast() libbpf: Recognize __arena global variables. bpftool: Recognize arena map type ...
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-07net: move dev_tx_weight to net_hotdataEric Dumazet
dev_tx_weight is used in tx fast path. Move it to net_hotdata for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240306160031.874438-9-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/ipv4/udp.c f796feabb9f5 ("udp: add local "peek offset enabled" flag") 56667da7399e ("net: implement lockless setsockopt(SO_PEEK_OFF)") Adjacent changes: net/unix/garbage.c aa82ac51d633 ("af_unix: Drop oob_skb ref before purging queue in GC.") 11498715f266 ("af_unix: Remove io_uring code for GC.") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-21net/sched: flower: Add lock protection when remove filter handleJianbo Liu
As IDR can't protect itself from the concurrent modification, place idr_remove() under the protection of tp->lock. Fixes: 08a0063df3ae ("net/sched: flower: Move filter handle initialization earlier") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240220085928.9161-1-jianbol@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-19net: sched: Annotate struct tc_pedit with __counted_byKees Cook
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct tc_pedit. Additionally, since the element count member must be set before accessing the annotated flexible array member, move its initialization earlier. Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1] Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net/sched: act_mirred: don't override retval if we already lost the skbJakub Kicinski
If we're redirecting the skb, and haven't called tcf_mirred_forward(), yet, we need to tell the core to drop the skb by setting the retcode to SHOT. If we have called tcf_mirred_forward(), however, the skb is out of our hands and returning SHOT will lead to UaF. Move the retval override to the error path which actually need it. Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Fixes: e5cf1baf92cb ("act_mirred: use TC_ACT_REINSERT when possible") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-16net/sched: act_mirred: use the backlog for mirred ingressJakub Kicinski
The test Davide added in commit ca22da2fbd69 ("act_mirred: use the backlog for nested calls to mirred ingress") hangs our testing VMs every 10 or so runs, with the familiar tcp_v4_rcv -> tcp_v4_rcv deadlock reported by lockdep. The problem as previously described by Davide (see Link) is that if we reverse flow of traffic with the redirect (egress -> ingress) we may reach the same socket which generated the packet. And we may still be holding its socket lock. The common solution to such deadlocks is to put the packet in the Rx backlog, rather than run the Rx path inline. Do that for all egress -> ingress reversals, not just once we started to nest mirred calls. In the past there was a concern that the backlog indirection will lead to loss of error reporting / less accurate stats. But the current workaround does not seem to address the issue. Fixes: 53592b364001 ("net/sched: act_mirred: Implement ingress actions") Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Suggested-by: Davide Caratti <dcaratti@redhat.com> Link: https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c 9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()") 723de3ebef03 ("net: free altname using an RCU callback") net/unix/garbage.c 11498715f266 ("af_unix: Remove io_uring code for GC.") 25236c91b5ab ("af_unix: Fix task hung while purging oob_skb in GC.") drivers/net/ethernet/renesas/ravb_main.c ed4adc07207d ("net: ravb: Count packets instead of descriptors in GbEth RX path" ) c2da9408579d ("ravb: Add Rx checksum offload support for GbEth") net/mptcp/protocol.c bdd70eb68913 ("mptcp: drop the push_pending field") 28e5c1380506 ("mptcp: annotate lockless accesses around read-mostly fields") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-14Merge branch 'x86/bugs' into x86/core, to pick up pending changes before ↵Ingo Molnar
dependent patches Merge in pending alternatives patching infrastructure changes, before applying more patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-02-13net: sched: codel replace GPLv2/BSD boilerplateStephen Hemminger
The prologue to codel is using BSD-3 clause and GPL-2 boiler plate language. Replace it by using SPDX. The automated treewide scan in commit d2912cb15bdd ("treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500") did not pickup dual licensed code. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Dave Taht <dave.taht@gmail.com> Link: https://lore.kernel.org/r/20240211172532.6568-1-stephen@networkplumber.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-13net: sched: Remove NET_ACT_IPT from KconfigHarshit Mogalapalli
After this commit ba24ea129126 ("net/sched: Retire ipt action") NET_ACT_IPT is not needed anymore as the action is retired and the code is removed. Clean the Kconfig part as well. Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240209180656.867546-1-harshit.m.mogalapalli@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-02-09net: fill in MODULE_DESCRIPTION()s for net/schedBreno Leitao
W=1 builds now warn if module is built without a MODULE_DESCRIPTION(). Add descriptions to the network schedulers. Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240208164244.3818498-8-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-09net/sched: act_mirred: Don't zero blockid when net device is being deletedVictor Nogueira
While testing tdc with parallel tests for mirred to block we caught an intermittent bug. The blockid was being zeroed out when a net device was deleted and, thus, giving us an incorrect blockid value whenever we tried to dump the mirred action. Since we don't increment the block refcount in the control path (and only use the ID), we don't need to zero the blockid field whenever a net device is going down. Fixes: 42f39036cda8 ("net/sched: act_mirred: Allow mirred to block") Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20240207222902.1469398-1-victor@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-02net/sched: Remove alias of sch_clsactMichal Koutný
The module sch_ingress stands out among net/sched modules because it provides multiple act/sch functionalities in a single .ko. They have aliases to make autoloading work for any of the provided functionalities. Since the autoloading was changed to uniformly request any functionality under its alias, the non-systemic aliases can be removed now (i.e. assuming the alias were only used to ensure autoloading). Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240201130943.19536-5-mkoutny@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-02net/sched: Load modules via their aliasMichal Koutný
The cls_,sch_,act_ modules may be loaded lazily during network configuration but without user's awareness and control. Switch the lazy loading from canonical module names to a module alias. This allows finer control over lazy loading, the precedent from commit 7f78e0351394 ("fs: Limit sys_mount to only request filesystem modules.") explains it already: Using aliases means user space can control the policy of which filesystem^W net/sched modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. By default, nothing changes. However, if a specific module is blacklisted (its canonical name), it won't be modprobe'd when requested under its alias (i.e. kernel auto-loading). It would appear as if the given module was unknown. The module can still be loaded under its canonical name, which is an explicit (privileged) user action. Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240201130943.19536-4-mkoutny@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-02net/sched: Add module aliases for cls_,sch_,act_ modulesMichal Koutný
No functional change intended, aliases will be used in followup commits. Note for backporters: you may need to add aliases also for modules that are already removed in mainline kernel but still in your version. Patches were generated with the help of Coccinelle scripts like: cat >scripts/coccinelle/misc/tcf_alias.cocci <<EOD virtual patch virtual report @ haskernel @ @@ @ tcf_has_kind depends on report && haskernel @ identifier ops; constant K; @@ static struct tcf_proto_ops ops = { .kind = K, ... }; +char module_alias = K; EOD /usr/bin/spatch -D report --cocci-file scripts/coccinelle/misc/tcf_alias.cocci \ --dir . \ -I ./arch/x86/include -I ./arch/x86/include/generated -I ./include \ -I ./arch/x86/include/uapi -I ./arch/x86/include/generated/uapi \ -I ./include/uapi -I ./include/generated/uapi \ --include ./include/linux/compiler-version.h --include ./include/linux/kconfig.h \ --jobs 8 --chunksize 1 2>/dev/null | \ sed 's/char module_alias = "\([^"]*\)";/MODULE_ALIAS_NET_CLS("\1");/' And analogously for: static struct tc_action_ops ops = { .kind = K, static struct Qdisc_ops ops = { .id = K, (Someone familiar would be able to fit those into one .cocci file without sed post processing.) Signed-off-by: Michal Koutný <mkoutny@suse.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240201130943.19536-3-mkoutny@suse.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-29taprio: validate TCA_TAPRIO_ATTR_FLAGS through policy instead of open-codingAlessandro Marcolini
As of now, the field TCA_TAPRIO_ATTR_FLAGS is being validated by manually checking its value, using the function taprio_flags_valid(). With this patch, the field will be validated through the netlink policy NLA_POLICY_MASK, where the mask is defined by TAPRIO_SUPPORTED_FLAGS. The mutual exclusivity of the two flags TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD and TCA_TAPRIO_ATTR_FLAG_TXTIME_ASSIST is still checked manually. Changes since RFC: - fixed reversed xmas tree - use NL_SET_ERR_MSG_MOD() for both invalid configuration Changes since v1: - Changed NL_SET_ERR_MSG_MOD to NL_SET_ERR_MSG_ATTR when wrong flags issued - Changed __u32 to u32 Changes since v2: - Added the missing parameter for NL_SET_ERR_MSG_ATTR (sorry again for the noise) Signed-off-by: Alessandro Marcolini <alessandromarcolini99@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-24net/sched: flower: Fix chain template offloadIdo Schimmel
When a qdisc is deleted from a net device the stack instructs the underlying driver to remove its flow offload callback from the associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack then continues to replay the removal of the filters in the block for this driver by iterating over the chains in the block and invoking the 'reoffload' operation of the classifier being used. In turn, the classifier in its 'reoffload' operation prepares and emits a 'FLOW_CLS_DESTROY' command for each filter. However, the stack does not do the same for chain templates and the underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when a qdisc is deleted. This results in a memory leak [1] which can be reproduced using [2]. Fix by introducing a 'tmplt_reoffload' operation and have the stack invoke it with the appropriate arguments as part of the replay. Implement the operation in the sole classifier that supports chain templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}' command based on whether a flow offload callback is being bound to a filter block or being unbound from one. As far as I can tell, the issue happens since cited commit which reordered tcf_block_offload_unbind() before tcf_block_flush_all_chains() in __tcf_block_put(). The order cannot be reversed as the filter block is expected to be freed after flushing all the chains. [1] unreferenced object 0xffff888107e28800 (size 2048): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff ..|......[...... 01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff ................ backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab374e>] __kmalloc+0x4e/0x90 [<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0 [<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0 [<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0 unreferenced object 0xffff88816d2c0400 (size 1024): comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s) hex dump (first 32 bytes): 40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00 @.......W.8..... 10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff ..,m......,m.... backtrace: [<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320 [<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90 [<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0 [<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460 [<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0 [<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0 [<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180 [<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280 [<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340 [<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0 [<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170 [<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0 [<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440 [<ffffffff83ac6270>] netlink_unicast+0x540/0x820 [<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0 [<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80 [2] # tc qdisc add dev swp1 clsact # tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32 # tc qdisc del dev swp1 clsact # devlink dev reload pci/0000:06:00.0 Fixes: bbf73830cd48 ("net: sched: traverse chains in block with tcf_get_next_chain()") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-13net: sched: track device in tcf_block_get/put_ext() only for clsact binder typesJiri Pirko
Clsact/ingress qdisc is not the only one using shared block, red is also using it. The device tracking was originally introduced by commit 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra") for clsact/ingress only. Commit 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") mistakenly enabled that for red as well. Fix that by adding a check for the binder type being clsact when adding device to the block->ports xarray. Reported-by: Ido Schimmel <idosch@idosch.org> Closes: https://lore.kernel.org/all/ZZ6JE0odnu1lLPtu@shredder/ Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") Signed-off-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-10x86/bugs: Rename CONFIG_RETPOLINE => CONFIG_MITIGATION_RETPOLINEBreno Leitao
Step 5/10 of the namespace unification of CPU mitigations related Kconfig options. [ mingo: Converted a few more uses in comments/messages as well. ] Suggested-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ariel Miculas <amiculas@cisco.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20231121160740.1249350-6-leitao@debian.org
2024-01-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Merge in late fixes to prepare for the 6.8 net-next PR No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-07net/sched: simplify tc_action_load_ops parametersPedro Tammela
Instead of using two bools derived from a flags passed as arguments to the parent function of tc_action_load_ops, just pass the flags itself to tc_action_load_ops to simplify its parameters. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-05net/sched: act_ct: fix skb leak and crash on ooo fragsTao Liu
act_ct adds skb->users before defragmentation. If frags arrive in order, the last frag's reference is reset in: inet_frag_reasm_prepare skb_morph which is not straightforward. However when frags arrive out of order, nobody unref the last frag, and all frags are leaked. The situation is even worse, as initiating packet capture can lead to a crash[0] when skb has been cloned and shared at the same time. Fix the issue by removing skb_get() before defragmentation. act_ct returns TC_ACT_CONSUMED when defrag failed or in progress. [0]: [ 843.804823] ------------[ cut here ]------------ [ 843.809659] kernel BUG at net/core/skbuff.c:2091! [ 843.814516] invalid opcode: 0000 [#1] PREEMPT SMP [ 843.819296] CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G S 6.7.0-rc3 #2 [ 843.824107] Hardware name: XFUSION 1288H V6/BC13MBSBD, BIOS 1.29 11/25/2022 [ 843.828953] RIP: 0010:pskb_expand_head+0x2ac/0x300 [ 843.833805] Code: 8b 70 28 48 85 f6 74 82 48 83 c6 08 bf 01 00 00 00 e8 38 bd ff ff 8b 83 c0 00 00 00 48 03 83 c8 00 00 00 e9 62 ff ff ff 0f 0b <0f> 0b e8 8d d0 ff ff e9 b3 fd ff ff 81 7c 24 14 40 01 00 00 4c 89 [ 843.843698] RSP: 0018:ffffc9000cce07c0 EFLAGS: 00010202 [ 843.848524] RAX: 0000000000000002 RBX: ffff88811a211d00 RCX: 0000000000000820 [ 843.853299] RDX: 0000000000000640 RSI: 0000000000000000 RDI: ffff88811a211d00 [ 843.857974] RBP: ffff888127d39518 R08: 00000000bee97314 R09: 0000000000000000 [ 843.862584] R10: 0000000000000000 R11: ffff8881109f0000 R12: 0000000000000880 [ 843.867147] R13: ffff888127d39580 R14: 0000000000000640 R15: ffff888170f7b900 [ 843.871680] FS: 0000000000000000(0000) GS:ffff889ffffc0000(0000) knlGS:0000000000000000 [ 843.876242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 843.880778] CR2: 00007fa42affcfb8 CR3: 000000011433a002 CR4: 0000000000770ef0 [ 843.885336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 843.889809] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 843.894229] PKRU: 55555554 [ 843.898539] Call Trace: [ 843.902772] <IRQ> [ 843.906922] ? __die_body+0x1e/0x60 [ 843.911032] ? die+0x3c/0x60 [ 843.915037] ? do_trap+0xe2/0x110 [ 843.918911] ? pskb_expand_head+0x2ac/0x300 [ 843.922687] ? do_error_trap+0x65/0x80 [ 843.926342] ? pskb_expand_head+0x2ac/0x300 [ 843.929905] ? exc_invalid_op+0x50/0x60 [ 843.933398] ? pskb_expand_head+0x2ac/0x300 [ 843.936835] ? asm_exc_invalid_op+0x1a/0x20 [ 843.940226] ? pskb_expand_head+0x2ac/0x300 [ 843.943580] inet_frag_reasm_prepare+0xd1/0x240 [ 843.946904] ip_defrag+0x5d4/0x870 [ 843.950132] nf_ct_handle_fragments+0xec/0x130 [nf_conntrack] [ 843.953334] tcf_ct_act+0x252/0xd90 [act_ct] [ 843.956473] ? tcf_mirred_act+0x516/0x5a0 [act_mirred] [ 843.959657] tcf_action_exec+0xa1/0x160 [ 843.962823] fl_classify+0x1db/0x1f0 [cls_flower] [ 843.966010] ? skb_clone+0x53/0xc0 [ 843.969173] tcf_classify+0x24d/0x420 [ 843.972333] tc_run+0x8f/0xf0 [ 843.975465] __netif_receive_skb_core+0x67a/0x1080 [ 843.978634] ? dev_gro_receive+0x249/0x730 [ 843.981759] __netif_receive_skb_list_core+0x12d/0x260 [ 843.984869] netif_receive_skb_list_internal+0x1cb/0x2f0 [ 843.987957] ? mlx5e_handle_rx_cqe_mpwrq_rep+0xfa/0x1a0 [mlx5_core] [ 843.991170] napi_complete_done+0x72/0x1a0 [ 843.994305] mlx5e_napi_poll+0x28c/0x6d0 [mlx5_core] [ 843.997501] __napi_poll+0x25/0x1b0 [ 844.000627] net_rx_action+0x256/0x330 [ 844.003705] __do_softirq+0xb3/0x29b [ 844.006718] irq_exit_rcu+0x9e/0xc0 [ 844.009672] common_interrupt+0x86/0xa0 [ 844.012537] </IRQ> [ 844.015285] <TASK> [ 844.017937] asm_common_interrupt+0x26/0x40 [ 844.020591] RIP: 0010:acpi_safe_halt+0x1b/0x20 [ 844.023247] Code: ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 65 48 8b 04 25 00 18 03 00 48 8b 00 a8 08 75 0c 66 90 0f 00 2d 81 d0 44 00 fb f4 <fa> c3 0f 1f 00 89 fa ec 48 8b 05 ee 88 ed 00 a9 00 00 00 80 75 11 [ 844.028900] RSP: 0018:ffffc90000533e70 EFLAGS: 00000246 [ 844.031725] RAX: 0000000000004000 RBX: 0000000000000001 RCX: 0000000000000000 [ 844.034553] RDX: ffff889ffffc0000 RSI: ffffffff828b7f20 RDI: ffff88a090f45c64 [ 844.037368] RBP: ffff88a0901a2800 R08: ffff88a090f45c00 R09: 00000000000317c0 [ 844.040155] R10: 00ec812281150475 R11: ffff889fffff0e04 R12: ffffffff828b7fa0 [ 844.042962] R13: ffffffff828b7f20 R14: 0000000000000001 R15: 0000000000000000 [ 844.045819] acpi_idle_enter+0x7b/0xc0 [ 844.048621] cpuidle_enter_state+0x7f/0x430 [ 844.051451] cpuidle_enter+0x2d/0x40 [ 844.054279] do_idle+0x1d4/0x240 [ 844.057096] cpu_startup_entry+0x2a/0x30 [ 844.059934] start_secondary+0x104/0x130 [ 844.062787] secondary_startup_64_no_verify+0x16b/0x16b [ 844.065674] </TASK> Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Tao Liu <taoliu828@163.com> Link: https://lore.kernel.org/r/20231228081457.936732-1-taoliu828@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-05net: sched: move block device tracking into tcf_block_get/put_ext()Jiri Pirko
Inserting the device to block xarray in qdisc_create() is not suitable place to do this. As it requires use of tcf_block() callback, it causes multiple issues. It is called for all qdisc types, which is incorrect. So, instead, move it to more suitable place, which is tcf_block_get_ext() and make sure it is only done for qdiscs that use block infrastructure and also only for blocks which are shared. Symmetrically, alter the cleanup path, move the xarray entry removal into tcf_block_put_ext(). Fixes: 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra") Reported-by: Ido Schimmel <idosch@nvidia.com> Closes: https://lore.kernel.org/all/ZY1hBb8GFwycfgvd@shredder/ Reported-by: Kui-Feng Lee <sinquersw@gmail.com> Closes: https://lore.kernel.org/all/ce8d3e55-b8bc-409c-ace9-5cf1c4f7c88e@gmail.com/ Reported-and-tested-by: syzbot+84339b9e7330daae4d66@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000007c85f5060dcc3a28@google.com/ Reported-and-tested-by: syzbot+806b0572c8d06b66b234@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000082f2f2060dcc3a92@google.com/ Reported-and-tested-by: syzbot+0039110f932d438130f9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000007fbc8c060dcc3a5c@google.com/ Signed-off-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c e009b2efb7a8 ("bnxt_en: Remove mis-applied code from bnxt_cfg_ntp_filters()") 0f2b21477988 ("bnxt_en: Fix compile error without CONFIG_RFS_ACCEL") https://lore.kernel.org/all/20240105115509.225aa8a2@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net/sched: sch_api: conditional netlink notificationsPedro Tammela
Implement conditional netlink notifications for Qdiscs and classes, which were missing in the initial patches that targeted tc filters and actions. Notifications will only be built after passing a check for 'rtnl_notify_needed()'. For both Qdiscs and classes 'get' operations now call a dedicated notification function as it was not possible to distinguish between 'create' and 'get' before. This distinction is necessary because 'get' always send a notification. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20231229132642.1489088-2-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net/sched: introduce ACT_P_BOUND return codePedro Tammela
Bound actions always return '0' and as of today we rely on '0' being returned in order to properly skip bound actions in tcf_idr_insert_many. In order to further improve maintainability, introduce the ACT_P_BOUND return code. Actions are updated to return 'ACT_P_BOUND' instead of plain '0'. tcf_idr_insert_many is then updated to check for 'ACT_P_BOUND'. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20231229132642.1489088-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-03net/sched: cls_api: complement tcf_tfilter_dump_policyLin Ma
In function `tc_dump_tfilter`, the attributes array is parsed via tcf_tfilter_dump_policy which only describes TCA_DUMP_FLAGS. However, the NLA TCA_CHAIN is also accessed with `nla_get_u32`. The access to TCA_CHAIN is introduced in commit 5bc1701881e3 ("net: sched: introduce multichain support for filters") and no nla_policy is provided for parsing at that point. Later on, tcf_tfilter_dump_policy is introduced in commit f8ab1807a9c9 ("net: sched: introduce terse dump flag") while still ignoring the fact that TCA_CHAIN needs a check. This patch does that by complementing the policy to allow the access discussed here can be safe as other cases just choose rtm_tca_policy as the parsing policy. Signed-off-by: Lin Ma <linma@zju.edu.cn> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-02net/sched: Retire ipt actionJamal Hadi Salim
The tc ipt action was intended to run all netfilter/iptables target. Unfortunately it has not benefitted over the years from proper updates when netfilter changes, and for that reason it has remained rudimentary. Pinging a bunch of people that i was aware were using this indicates that removing it wont affect them. Retire it to reduce maintenance efforts. Buh-bye. Reviewed-by: Victor Noguiera <victor@mojatatu.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-01-01net: sched: em_text: fix possible memory leak in em_text_destroy()Hangyu Hua
m->data needs to be freed when em_text_destroy is called. Fixes: d675c989ed2d ("[PKT_SCHED]: Packet classification based on textsearch (ematch)") Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-26net/sched: act_mirred: Allow mirred to blockVictor Nogueira
So far the mirred action has dealt with syntax that handles mirror/redirection for netdev. A matching packet is redirected or mirrored to a target netdev. In this patch we enable mirred to mirror to a tc block as well. IOW, the new syntax looks as follows: ... mirred <ingress | egress> <mirror | redirect> [index INDEX] < <blockid BLOCKID> | <dev <devname>> > Examples of mirroring or redirecting to a tc block: $ tc filter add block 22 protocol ip pref 25 \ flower dst_ip 192.168.0.0/16 action mirred egress mirror blockid 22 $ tc filter add block 22 protocol ip pref 25 \ flower dst_ip 10.10.10.10/32 action mirred egress redirect blockid 22 Co-developed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-26net/sched: act_mirred: Add helper function tcf_mirred_replace_devVictor Nogueira
The act of replacing a device will be repeated by the init logic for the block ID in the patch that allows mirred to a block. Therefore we encapsulate this functionality in a function (tcf_mirred_replace_dev) so that we can reuse it and avoid code repetition. Co-developed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-26net/sched: act_mirred: Create function tcf_mirred_to_dev and improve readabilityVictor Nogueira
As a preparation for adding block ID to mirred, separate the part of mirred that redirect/mirrors to a dev into a specific function so that it can be called by blockcast for each dev. Also improve readability. Eg. rename use_reinsert to dont_clone and skb2 to skb_to_send. Co-developed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-26net/sched: cls_api: Expose tc block to the datapathVictor Nogueira
The datapath can now find the block of the port in which the packet arrived at. In the next patch we show a possible usage of this patch in a new version of mirred that multicasts to all ports except for the port in which the packet arrived on. Co-developed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-26net/sched: Introduce tc block netdev tracking infraVictor Nogueira
This commit makes tc blocks track which ports have been added to them. And, with that, we'll be able to use this new information to send packets to the block's ports. Which will be done in the patch #3 of this series. Suggested-by: Jiri Pirko <jiri@nvidia.com> Co-developed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-20net: sched: Add initial TC error skb drop reasonsVictor Nogueira
Continue expanding Daniel's patch by adding new skb drop reasons that are idiosyncratic to TC. More specifically: - SKB_DROP_REASON_TC_COOKIE_ERROR: An error occurred whilst processing a tc ext cookie. - SKB_DROP_REASON_TC_CHAIN_NOTFOUND: tc chain lookup failed. - SKB_DROP_REASON_TC_RECLASSIFY_LOOP: tc exceeded max reclassify loop iterations Signed-off-by: Victor Nogueira <victor@mojatatu.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-20net: sched: Move drop_reason to struct tc_skb_cbVictor Nogueira
Move drop_reason from struct tcf_result to skb cb - more specifically to struct tc_skb_cb. With that, we'll be able to also set the drop reason for the remaining qdiscs (aside from clsact) that do not have access to tcf_result when time comes to set the skb drop reason. Signed-off-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/intel/iavf/iavf_ethtool.c 3a0b5a2929fd ("iavf: Introduce new state machines for flow director") 95260816b489 ("iavf: use iavf_schedule_aq_request() helper") https://lore.kernel.org/all/84e12519-04dc-bd80-bc34-8cf50d7898ce@intel.com/ drivers/net/ethernet/broadcom/bnxt/bnxt.c c13e268c0768 ("bnxt_en: Fix HWTSTAMP_FILTER_ALL packet timestamp logic") c2f8063309da ("bnxt_en: Refactor RX VLAN acceleration logic.") a7445d69809f ("bnxt_en: Add support for new RX and TPA_START completion types for P7") 1c7fd6ee2fe4 ("bnxt_en: Rename some macros for the P5 chips") https://lore.kernel.org/all/20231211110022.27926ad9@canb.auug.org.au/ drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c bd6781c18cb5 ("bnxt_en: Fix wrong return value check in bnxt_close_nic()") 84793a499578 ("bnxt_en: Skip nic close/open when configuring tstamp filters") https://lore.kernel.org/all/20231214113041.3a0c003c@canb.auug.org.au/ drivers/net/ethernet/mellanox/mlx5/core/fw_reset.c 3d7a3f2612d7 ("net/mlx5: Nack sync reset request when HotPlug is enabled") cecf44ea1a1f ("net/mlx5: Allow sync reset flow when BF MGT interface device is present") https://lore.kernel.org/all/20231211110328.76c925af@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-13net/sched: act_api: skip idr replace on bound actionsPedro Tammela
tcf_idr_insert_many will replace the allocated -EBUSY pointer in tcf_idr_check_alloc with the real action pointer, exposing it to all operations. This operation is only needed when the action pointer is created (ACT_P_CREATED). For actions which are bound to (returned 0), the pointer already resides in the idr making such operation a nop. Even though it's a nop, it's still not a cheap operation as internally the idr code walks the idr and then does a replace on the appropriate slot. So if the action was bound, better skip the idr replace entirely. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20231211181807.96028-3-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-13net/sched: act_api: rely on rcu in tcf_idr_check_allocPedro Tammela
Instead of relying only on the idrinfo->lock mutex for bind/alloc logic, rely on a combination of rcu + mutex + atomics to better scale the case where multiple rtnl-less filters are binding to the same action object. Action binding happens when an action index is specified explicitly and an action exists which such index exists. Example: tc actions add action drop index 1 tc filter add ... matchall action drop index 1 tc filter add ... matchall action drop index 1 tc filter add ... matchall action drop index 1 tc filter ls ... filter protocol all pref 49150 matchall chain 0 filter protocol all pref 49150 matchall chain 0 handle 0x1 not_in_hw action order 1: gact action drop random type none pass val 0 index 1 ref 4 bind 3 filter protocol all pref 49151 matchall chain 0 filter protocol all pref 49151 matchall chain 0 handle 0x1 not_in_hw action order 1: gact action drop random type none pass val 0 index 1 ref 4 bind 3 filter protocol all pref 49152 matchall chain 0 filter protocol all pref 49152 matchall chain 0 handle 0x1 not_in_hw action order 1: gact action drop random type none pass val 0 index 1 ref 4 bind 3 When no index is specified, as before, grab the mutex and allocate in the idr the next available id. In this version, as opposed to before, it's simplified to store the -EBUSY pointer instead of the previous alloc + replace combination. When an index is specified, rely on rcu to find if there's an object in such index. If there's none, fallback to the above, serializing on the mutex and reserving the specified id. If there's one, it can be an -EBUSY pointer, in which case we just try again until it's an action, or an action. Given the rcu guarantees, the action found could be dead and therefore we need to bump the refcount if it's not 0, handling the case it's in fact 0. As bind and the action refcount are already atomics, these increments can happen without the mutex protection while many tcf_idr_check_alloc race to bind to the same action instance. In case binding encounters a parallel delete or add, it will return -EAGAIN in order to try again. Both filter and action apis already have the retry machinery in-place. In case it's an unlocked filter it retries under the rtnl lock. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20231211181807.96028-2-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-11net/sched: cls_api: conditional notification of eventsPedro Tammela
As of today tc-filter/chain events are unconditionally built and sent to RTNLGRP_TC. As with the introduction of rtnl_notify_needed we can check before-hand if they are really needed. This will help to alleviate system pressure when filters are concurrently added without the rtnl lock as in tc-flower. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Link: https://lore.kernel.org/r/20231208192847.714940-8-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-11net/sched: cls_api: remove 'unicast' argument from delete notificationPedro Tammela
This argument is never called while set to true, so remove it as there's no need for it. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231208192847.714940-7-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-11net/sched: act_api: conditional notification of eventsPedro Tammela
As of today tc-action events are unconditionally built and sent to RTNLGRP_TC. As with the introduction of rtnl_notify_needed we can check before-hand if they are really needed. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231208192847.714940-6-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-11net/sched: act_api: don't open code max()Pedro Tammela
Use max() in a couple of places that are open coding it with the ternary operator. Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20231208192847.714940-5-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>