summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)Author
2024-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc8). Conflicts: tools/testing/selftests/net/.gitignore 252e01e68241 ("selftests: net: add netlink-dumps to .gitignore") be43a6b23829 ("selftests: ncdevmem: Move ncdevmem under drivers/net/hw") https://lore.kernel.org/all/20241113122359.1b95180a@canb.auug.org.au/ drivers/net/phy/phylink.c 671154f174e0 ("net: phylink: ensure PHY momentary link-fails are handled") 7530ea26c810 ("net: phylink: remove "using_mac_select_pcs"") Adjacent changes: drivers/net/ethernet/stmicro/stmmac/dwmac-intel-plat.c 5b366eae7193 ("stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines") e96321fad3ad ("net: ethernet: Switch back to struct platform_driver::remove()") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for ↵Alexandre Ferrieux
hnodes. To generate hnode handles (in gen_new_htid()), u32 uses IDR and encodes the returned small integer into a structured 32-bit word. Unfortunately, at disposal time, the needed decoding is not done. As a result, idr_remove() fails, and the IDR fills up. Since its size is 2048, the following script ends up with "Filter already exists": tc filter add dev myve $FILTER1 tc filter add dev myve $FILTER2 for i in {1..2048} do echo $i tc filter del dev myve $FILTER2 tc filter add dev myve $FILTER2 done This patch adds the missing decoding logic for handles that deserve it. Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles") Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20241110172836.331319-1-alexandre.ferrieux@orange.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-11-12net: sched: cls_api: improve the error message for ID allocation failureJakub Kicinski
We run into an exhaustion problem with the kernel-allocated filter IDs. Our allocation problem can be fixed on the user space side, but the error message in this case was quite misleading: "Filter with specified priority/protocol not found" (EINVAL) Specifically when we can't allocate a _new_ ID because filter with lowest ID already _exists_, saying "filter not found", is confusing. Kernel allocates IDs in range of 0xc0000 -> 0x8000, giving out ID one lower than lowest existing in that range. The error message makes sense when tcf_chain_tp_find() gets called for GET and DEL but for NEW we need to provide more specific error messages for all three cases: - user wants the ID to be auto-allocated but filter with ID 0x8000 already exists - filter already exists and can be replaced, but user asked for a protocol change - filter doesn't exist Caller of tcf_chain_tp_insert_unique() doesn't set extack today, so don't bother plumbing it in. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20241108010254.2995438-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-11-11net: convert to nla_get_*_default()Johannes Berg
Most of the original conversion is from the spatch below, but I edited some and left out other instances that were either buggy after conversion (where default values don't fit into the type) or just looked strange. @@ expression attr, def; expression val; identifier fn =~ "^nla_get_.*"; fresh identifier dfn = fn ## "_default"; @@ ( -if (attr) - val = fn(attr); -else - val = def; +val = dfn(attr, def); | -if (!attr) - val = def; -else - val = fn(attr); +val = dfn(attr, def); | -if (!attr) - return def; -return fn(attr); +return dfn(attr, def); | -attr ? fn(attr) : def +dfn(attr, def) | -!attr ? def : fn(attr) +dfn(attr, def) ) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Reviewed-by: Toke Høiland-Jørgensen <toke@kernel.org> Link: https://patch.msgid.link/20241108114145.0580b8684e7f.I740beeaa2f70ebfc19bfca1045a24d6151992790@changeid Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc6). Conflicts: drivers/net/wireless/intel/iwlwifi/mvm/mld-mac80211.c cbe84e9ad5e2 ("wifi: iwlwifi: mvm: really send iwl_txpower_constraints_cmd") 188a1bf89432 ("wifi: mac80211: re-order assigning channel in activate links") https://lore.kernel.org/all/20241028123621.7bbb131b@canb.auug.org.au/ net/mac80211/cfg.c c4382d5ca1af ("wifi: mac80211: update the right link for tx power") 8dd0498983ee ("wifi: mac80211: Fix setting txpower with emulate_chanctx") drivers/net/ethernet/intel/ice/ice_ptp_hw.h 6e58c3310622 ("ice: fix crash on probe for DPLL enabled E810 LOM") e4291b64e118 ("ice: Align E810T GPIO to other products") ebb2693f8fbd ("ice: Read SDP section from NVM for pin definitions") ac532f4f4251 ("ice: Cleanup unused declarations") https://lore.kernel.org/all/20241030120524.1ee1af18@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/sched: sch_api: fix xa_insert() error path in tcf_block_get_ext()Vladimir Oltean
This command: $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact Error: block dev insert failed: -EBUSY. fails because user space requests the same block index to be set for both ingress and egress. [ side note, I don't think it even failed prior to commit 913b47d3424e ("net/sched: Introduce tc block netdev tracking infra"), because this is a command from an old set of notes of mine which used to work, but alas, I did not scientifically bisect this ] The problem is not that it fails, but rather, that the second time around, it fails differently (and irrecoverably): $ tc qdisc replace dev eth0 ingress_block 1 egress_block 1 clsact Error: dsa_core: Flow block cb is busy. [ another note: the extack is added by me for illustration purposes. the context of the problem is that clsact_init() obtains the same &q->ingress_block pointer as &q->egress_block, and since we call tcf_block_get_ext() on both of them, "dev" will be added to the block->ports xarray twice, thus failing the operation: once through the ingress block pointer, and once again through the egress block pointer. the problem itself is that when xa_insert() fails, we have emitted a FLOW_BLOCK_BIND command through ndo_setup_tc(), but the offload never sees a corresponding FLOW_BLOCK_UNBIND. ] Even correcting the bad user input, we still cannot recover: $ tc qdisc replace dev swp3 ingress_block 1 egress_block 2 clsact Error: dsa_core: Flow block cb is busy. Basically the only way to recover is to reboot the system, or unbind and rebind the net device driver. To fix the bug, we need to fill the correct error teardown path which was missed during code movement, and call tcf_block_offload_unbind() when xa_insert() fails. [ last note, fundamentally I blame the label naming convention in tcf_block_get_ext() for the bug. The labels should be named after what they do, not after the error path that jumps to them. This way, it is obviously wrong that two labels pointing to the same code mean something is wrong, and checking the code correctness at the goto site is also easier ] Fixes: 94e2557d086a ("net: sched: move block device tracking into tcf_block_get/put_ext()") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20241023100541.974362-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-29net/sched: stop qdisc_tree_reduce_backlog on TC_H_ROOTPedro Tammela
In qdisc_tree_reduce_backlog, Qdiscs with major handle ffff: are assumed to be either root or ingress. This assumption is bogus since it's valid to create egress qdiscs with major handle ffff: Budimir Markovic found that for qdiscs like DRR that maintain an active class list, it will cause a UAF with a dangling class pointer. In 066a3b5b2346, the concern was to avoid iterating over the ingress qdisc since its parent is itself. The proper fix is to stop when parent TC_H_ROOT is reached because the only way to retrieve ingress is when a hierarchy which does not contain a ffff: major handle call into qdisc_lookup with TC_H_MAJ(TC_H_ROOT). In the scenario where major ffff: is an egress qdisc in any of the tree levels, the updates will also propagate to TC_H_ROOT, which then the iteration must stop. Fixes: 066a3b5b2346 ("[NET_SCHED] sch_api: fix qdisc_tree_decrease_qlen() loop") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> net/sched/sch_api.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20241024165547.418570-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. No conflicts and no adjacent changes. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23net: sched: use RCU read-side critical section in taprio_dump()Dmitry Antipov
Fix possible use-after-free in 'taprio_dump()' by adding RCU read-side critical section there. Never seen on x86 but found on a KASAN-enabled arm64 system when investigating https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa: [T15862] BUG: KASAN: slab-use-after-free in taprio_dump+0xa0c/0xbb0 [T15862] Read of size 4 at addr ffff0000d4bb88f8 by task repro/15862 [T15862] [T15862] CPU: 0 UID: 0 PID: 15862 Comm: repro Not tainted 6.11.0-rc1-00293-gdefaf1a2113a-dirty #2 [T15862] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-20240524-5.fc40 05/24/2024 [T15862] Call trace: [T15862] dump_backtrace+0x20c/0x220 [T15862] show_stack+0x2c/0x40 [T15862] dump_stack_lvl+0xf8/0x174 [T15862] print_report+0x170/0x4d8 [T15862] kasan_report+0xb8/0x1d4 [T15862] __asan_report_load4_noabort+0x20/0x2c [T15862] taprio_dump+0xa0c/0xbb0 [T15862] tc_fill_qdisc+0x540/0x1020 [T15862] qdisc_notify.isra.0+0x330/0x3a0 [T15862] tc_modify_qdisc+0x7b8/0x1838 [T15862] rtnetlink_rcv_msg+0x3c8/0xc20 [T15862] netlink_rcv_skb+0x1f8/0x3d4 [T15862] rtnetlink_rcv+0x28/0x40 [T15862] netlink_unicast+0x51c/0x790 [T15862] netlink_sendmsg+0x79c/0xc20 [T15862] __sock_sendmsg+0xe0/0x1a0 [T15862] ____sys_sendmsg+0x6c0/0x840 [T15862] ___sys_sendmsg+0x1ac/0x1f0 [T15862] __sys_sendmsg+0x110/0x1d0 [T15862] __arm64_sys_sendmsg+0x74/0xb0 [T15862] invoke_syscall+0x88/0x2e0 [T15862] el0_svc_common.constprop.0+0xe4/0x2a0 [T15862] do_el0_svc+0x44/0x60 [T15862] el0_svc+0x50/0x184 [T15862] el0t_64_sync_handler+0x120/0x12c [T15862] el0t_64_sync+0x190/0x194 [T15862] [T15862] Allocated by task 15857: [T15862] kasan_save_stack+0x3c/0x70 [T15862] kasan_save_track+0x20/0x3c [T15862] kasan_save_alloc_info+0x40/0x60 [T15862] __kasan_kmalloc+0xd4/0xe0 [T15862] __kmalloc_cache_noprof+0x194/0x334 [T15862] taprio_change+0x45c/0x2fe0 [T15862] tc_modify_qdisc+0x6a8/0x1838 [T15862] rtnetlink_rcv_msg+0x3c8/0xc20 [T15862] netlink_rcv_skb+0x1f8/0x3d4 [T15862] rtnetlink_rcv+0x28/0x40 [T15862] netlink_unicast+0x51c/0x790 [T15862] netlink_sendmsg+0x79c/0xc20 [T15862] __sock_sendmsg+0xe0/0x1a0 [T15862] ____sys_sendmsg+0x6c0/0x840 [T15862] ___sys_sendmsg+0x1ac/0x1f0 [T15862] __sys_sendmsg+0x110/0x1d0 [T15862] __arm64_sys_sendmsg+0x74/0xb0 [T15862] invoke_syscall+0x88/0x2e0 [T15862] el0_svc_common.constprop.0+0xe4/0x2a0 [T15862] do_el0_svc+0x44/0x60 [T15862] el0_svc+0x50/0x184 [T15862] el0t_64_sync_handler+0x120/0x12c [T15862] el0t_64_sync+0x190/0x194 [T15862] [T15862] Freed by task 6192: [T15862] kasan_save_stack+0x3c/0x70 [T15862] kasan_save_track+0x20/0x3c [T15862] kasan_save_free_info+0x4c/0x80 [T15862] poison_slab_object+0x110/0x160 [T15862] __kasan_slab_free+0x3c/0x74 [T15862] kfree+0x134/0x3c0 [T15862] taprio_free_sched_cb+0x18c/0x220 [T15862] rcu_core+0x920/0x1b7c [T15862] rcu_core_si+0x10/0x1c [T15862] handle_softirqs+0x2e8/0xd64 [T15862] __do_softirq+0x14/0x20 Fixes: 18cdd2f0998a ("net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://patch.msgid.link/20241018051339.418890-2-dmantipov@yandex.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23net: sched: fix use-after-free in taprio_change()Dmitry Antipov
In 'taprio_change()', 'admin' pointer may become dangling due to sched switch / removal caused by 'advance_sched()', and critical section protected by 'q->current_entry_lock' is too small to prevent from such a scenario (which causes use-after-free detected by KASAN). Fix this by prefer 'rcu_replace_pointer()' over 'rcu_assign_pointer()' to update 'admin' immediately before an attempt to schedule freeing. Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Reported-by: syzbot+b65e0af58423fc8a73aa@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://patch.msgid.link/20241018051339.418890-1-dmantipov@yandex.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23net/sched: act_api: unexport tcf_action_dump_1()Vladimir Oltean
This isn't used outside act_api.c, but is called by tcf_dump_walker() prior to its definition. So move it upwards and make it static. Simultaneously, reorder the variable declarations so that they follow the networking "reverse Christmas tree" coding style. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20241017161934.3599046-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions ↵Vladimir Oltean
created by classifiers tcf_action_init() has logic for checking mismatches between action and filter offload flags (skip_sw/skip_hw). AFAIU, this is intended to run on the transition between the new tc_act_bind(flags) returning true (aka now gets bound to classifier) and tc_act_bind(act->tcfa_flags) returning false (aka action was not bound to classifier before). Otherwise, the check is skipped. For the case where an action is not standalone, but rather it was created by a classifier and is bound to it, tcf_action_init() skips the check entirely, and this means it allows mismatched flags to occur. Taking the matchall classifier code path as an example (with mirred as an action), the reason is the following: 1 | mall_change() 2 | -> mall_replace_hw_filter() 3 | -> tcf_exts_validate_ex() 4 | -> flags |= TCA_ACT_FLAGS_BIND; 5 | -> tcf_action_init() 6 | -> tcf_action_init_1() 7 | -> a_o->init() 8 | -> tcf_mirred_init() 9 | -> tcf_idr_create_from_flags() 10 | -> tcf_idr_create() 11 | -> p->tcfa_flags = flags; 12 | -> tc_act_bind(flags)) 13 | -> tc_act_bind(act->tcfa_flags) When invoked from tcf_exts_validate_ex() like matchall does (but other classifiers validate their extensions as well), tcf_action_init() runs in a call path where "flags" always contains TCA_ACT_FLAGS_BIND (set by line 4). So line 12 is always true, and line 13 is always true as well. No transition ever takes place, and the check is skipped. The code was added in this form in commit c86e0209dc77 ("flow_offload: validate flags of filter and actions"), but I'm attributing the blame even earlier in that series, to when TCA_ACT_FLAGS_SKIP_HW and TCA_ACT_FLAGS_SKIP_SW were added to the UAPI. Following the development process of this change, the check did not always exist in this form. A change took place between v3 [1] and v4 [2], AFAIU due to review feedback that it doesn't make sense for action flags to be different than classifier flags. I think I agree with that feedback, but it was translated into code that omits enforcing this for "classic" actions created at the same time with the filters themselves. There are 3 more important cases to discuss. First there is this command: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 which should be allowed, because prior to the concept of dedicated action flags, it used to work and it used to mean the action inherited the skip_sw/skip_hw flags from the classifier. It's not a mismatch. Then we have this command: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 skip_hw where there is a mismatch and it should be rejected. Finally, we have: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 skip_sw where the offload flags coincide, and this should be treated the same as the first command based on inheritance, and accepted. [1]: https://lore.kernel.org/netdev/20211028110646.13791-9-simon.horman@corigine.com/ [2]: https://lore.kernel.org/netdev/20211118130805.23897-10-simon.horman@corigine.com/ Fixes: 7adc57651211 ("flow_offload: add skip_hw and skip_sw to control if offload the action") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20241017161049.3570037-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-21net: fix races in netdev_tx_sent_queue()/dev_watchdog()Eric Dumazet
Some workloads hit the infamous dev_watchdog() message: "NETDEV WATCHDOG: eth0 (xxxx): transmit queue XX timed out" It seems possible to hit this even for perfectly normal BQL enabled drivers: 1) Assume a TX queue was idle for more than dev->watchdog_timeo (5 seconds unless changed by the driver) 2) Assume a big packet is sent, exceeding current BQL limit. 3) Driver ndo_start_xmit() puts the packet in TX ring, and netdev_tx_sent_queue() is called. 4) QUEUE_STATE_STACK_XOFF could be set from netdev_tx_sent_queue() before txq->trans_start has been written. 5) txq->trans_start is written later, from netdev_start_xmit() if (rc == NETDEV_TX_OK) txq_trans_update(txq) dev_watchdog() running on another cpu could read the old txq->trans_start, and then see QUEUE_STATE_STACK_XOFF, because 5) did not happen yet. To solve the issue, write txq->trans_start right before one XOFF bit is set : - _QUEUE_STATE_DRV_XOFF from netif_tx_stop_queue() - __QUEUE_STATE_STACK_XOFF from netdev_tx_sent_queue() From dev_watchdog(), we have to read txq->state before txq->trans_start. Add memory barriers to enforce correct ordering. In the future, we could avoid writing over txq->trans_start for normal operations, and rename this field to txq->xoff_start_time. Fixes: bec251bc8b6a ("net: no longer stop all TX queues in dev_watchdog()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20241015194118.3951657-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-15net: sched: Use rtnl_register_many().Kuniyuki Iwashima
We will remove rtnl_register() in favour of rtnl_register_many(). When it succeeds, rtnl_register_many() guarantees all rtnetlink types in the passed array are supported, and there is no chance that a part of message types is not supported. Let's use rtnl_register_many() instead. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20241014201828.91221-5-kuniyu@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-15net/sched: cbs: Fix integer overflow in cbs_set_port_rate()Elena Salomatkina
The subsequent calculation of port_rate = speed * 1000 * BYTES_PER_KBIT, where the BYTES_PER_KBIT is of type LL, may cause an overflow. At least when speed = SPEED_20000, the expression to the left of port_rate will be greater than INT_MAX. Found by Linux Verification Center (linuxtesting.org) with SVACE. Signed-off-by: Elena Salomatkina <esalomatkina@ispras.ru> Link: https://patch.msgid.link/20241013124529.1043-1-esalomatkina@ispras.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14net_sched: sch_fq: prepare for TIME_WAIT socketsEric Dumazet
TCP stack is not attaching skb to TIME_WAIT sockets yet, but we would like to allow this in the future. Add sk_listener_or_tw() helper to detect the three states that FQ needs to take care. Like NEW_SYN_RECV, TIME_WAIT are not full sockets and do not contain sk->sk_pacing_status, sk->sk_pacing_rate. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Brian Vazquez <brianvv@google.com> Link: https://patch.msgid.link/20241010174817.1543642-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.12-rc3). No conflicts and no adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-09net_sched: sch_sfq: handle bigger packetsEric Dumazet
SFQ has an assumption on dealing with packets smaller than 64KB. Even before BIG TCP, TCA_STAB can provide arbitrary big values in qdisc_pkt_len(skb) It is time to switch (struct sfq_slot)->allot to a 32bit field. sizeof(struct sfq_slot) is now 64 bytes, giving better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20241008111603.653140-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-08net/sched: accept TCA_STAB only for root qdiscEric Dumazet
Most qdiscs maintain their backlog using qdisc_pkt_len(skb) on the assumption it is invariant between the enqueue() and dequeue() handlers. Unfortunately syzbot can crash a host rather easily using a TBF + SFQ combination, with an STAB on SFQ [1] We can't support TCA_STAB on arbitrary level, this would require to maintain per-qdisc storage. [1] [ 88.796496] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 88.798611] #PF: supervisor read access in kernel mode [ 88.799014] #PF: error_code(0x0000) - not-present page [ 88.799506] PGD 0 P4D 0 [ 88.799829] Oops: Oops: 0000 [#1] SMP NOPTI [ 88.800569] CPU: 14 UID: 0 PID: 2053 Comm: b371744477 Not tainted 6.12.0-rc1-virtme #1117 [ 88.801107] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 88.801779] RIP: 0010:sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq [ 88.802544] Code: 0f b7 50 12 48 8d 04 d5 00 00 00 00 48 89 d6 48 29 d0 48 8b 91 c0 01 00 00 48 c1 e0 03 48 01 c2 66 83 7a 1a 00 7e c0 48 8b 3a <4c> 8b 07 4c 89 02 49 89 50 08 48 c7 47 08 00 00 00 00 48 c7 07 00 All code ======== 0: 0f b7 50 12 movzwl 0x12(%rax),%edx 4: 48 8d 04 d5 00 00 00 lea 0x0(,%rdx,8),%rax b: 00 c: 48 89 d6 mov %rdx,%rsi f: 48 29 d0 sub %rdx,%rax 12: 48 8b 91 c0 01 00 00 mov 0x1c0(%rcx),%rdx 19: 48 c1 e0 03 shl $0x3,%rax 1d: 48 01 c2 add %rax,%rdx 20: 66 83 7a 1a 00 cmpw $0x0,0x1a(%rdx) 25: 7e c0 jle 0xffffffffffffffe7 27: 48 8b 3a mov (%rdx),%rdi 2a:* 4c 8b 07 mov (%rdi),%r8 <-- trapping instruction 2d: 4c 89 02 mov %r8,(%rdx) 30: 49 89 50 08 mov %rdx,0x8(%r8) 34: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 3b: 00 3c: 48 rex.W 3d: c7 .byte 0xc7 3e: 07 (bad) ... Code starting with the faulting instruction =========================================== 0: 4c 8b 07 mov (%rdi),%r8 3: 4c 89 02 mov %r8,(%rdx) 6: 49 89 50 08 mov %rdx,0x8(%r8) a: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 11: 00 12: 48 rex.W 13: c7 .byte 0xc7 14: 07 (bad) ... [ 88.803721] RSP: 0018:ffff9a1f892b7d58 EFLAGS: 00000206 [ 88.804032] RAX: 0000000000000000 RBX: ffff9a1f8420c800 RCX: ffff9a1f8420c800 [ 88.804560] RDX: ffff9a1f81bc1440 RSI: 0000000000000000 RDI: 0000000000000000 [ 88.805056] RBP: ffffffffc04bb0e0 R08: 0000000000000001 R09: 00000000ff7f9a1f [ 88.805473] R10: 000000000001001b R11: 0000000000009a1f R12: 0000000000000140 [ 88.806194] R13: 0000000000000001 R14: ffff9a1f886df400 R15: ffff9a1f886df4ac [ 88.806734] FS: 00007f445601a740(0000) GS:ffff9a2e7fd80000(0000) knlGS:0000000000000000 [ 88.807225] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 88.807672] CR2: 0000000000000000 CR3: 000000050cc46000 CR4: 00000000000006f0 [ 88.808165] Call Trace: [ 88.808459] <TASK> [ 88.808710] ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) [ 88.809261] ? page_fault_oops (arch/x86/mm/fault.c:715) [ 88.809561] ? exc_page_fault (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 arch/x86/mm/fault.c:1489 arch/x86/mm/fault.c:1539) [ 88.809806] ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) [ 88.810074] ? sfq_dequeue (net/sched/sch_sfq.c:272 net/sched/sch_sfq.c:499) sch_sfq [ 88.810411] sfq_reset (net/sched/sch_sfq.c:525) sch_sfq [ 88.810671] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036) [ 88.810950] tbf_reset (./include/linux/timekeeping.h:169 net/sched/sch_tbf.c:334) sch_tbf [ 88.811208] qdisc_reset (./include/linux/skbuff.h:2135 ./include/linux/skbuff.h:2441 ./include/linux/skbuff.h:3304 ./include/linux/skbuff.h:3310 net/sched/sch_generic.c:1036) [ 88.811484] netif_set_real_num_tx_queues (./include/linux/spinlock.h:396 ./include/net/sch_generic.h:768 net/core/dev.c:2958) [ 88.811870] __tun_detach (drivers/net/tun.c:590 drivers/net/tun.c:673) [ 88.812271] tun_chr_close (drivers/net/tun.c:702 drivers/net/tun.c:3517) [ 88.812505] __fput (fs/file_table.c:432 (discriminator 1)) [ 88.812735] task_work_run (kernel/task_work.c:230) [ 88.813016] do_exit (kernel/exit.c:940) [ 88.813372] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:58 (discriminator 4)) [ 88.813639] ? handle_mm_fault (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/memcontrol.h:1022 ./include/linux/memcontrol.h:1045 ./include/linux/memcontrol.h:1052 mm/memory.c:5928 mm/memory.c:6088) [ 88.813867] do_group_exit (kernel/exit.c:1070) [ 88.814138] __x64_sys_exit_group (kernel/exit.c:1099) [ 88.814490] x64_sys_call (??:?) [ 88.814791] do_syscall_64 (arch/x86/entry/common.c:52 (discriminator 1) arch/x86/entry/common.c:83 (discriminator 1)) [ 88.815012] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) [ 88.815495] RIP: 0033:0x7f44560f1975 Fixes: 175f9c1bba9b ("net_sched: Add size table for qdiscs") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20241007184130.3960565-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-04net_sched: sch_fq: add the ability to offload pacingJeffrey Ji
Some network devices have the ability to offload EDT (Earliest Departure Time) which is the model used for TCP pacing and FQ packet scheduler. Some of them implement the timing wheel mechanism described in https://saeed.github.io/files/carousel-sigcomm17.pdf with an associated 'timing wheel horizon'. This patchs adds to FQ packet scheduler TCA_FQ_OFFLOAD_HORIZON attribute. Its value is capped by the device max_pacing_offload_horizon, added in the prior patch. It allows FQ to let packets within pacing offload horizon to be delivered to the device, which will handle the needed delay without host involvement. Signed-off-by: Jeffrey Ji <jeffreyji@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20241003121219.2396589-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-03netem: Include <linux/prandom.h> in sch_netem.cUros Bizjak
Include <linux/prandom.h> header to allow the removal of legacy inclusion of <linux/prandom.h> from <linux/random.h>. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2024-10-02move asm/unaligned.h to linux/unaligned.hAl Viro
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-10sch_cake: constify inverse square root cacheDave Taht
sch_cake uses a cache of the first 16 values of the inverse square root calculation for the Cobalt AQM to save some cycles on the fast path. This cache is populated when the qdisc is first loaded, but there's really no reason why it can't just be pre-populated. So change it to be pre-populated with constants, which also makes it possible to constify it. This gives a modest space saving for the module (not counting debug data): .text: -224 bytes .rodata: +80 bytes .bss: -64 bytes Total: -192 bytes Signed-off-by: Dave Taht <dave.taht@gmail.com> [ fixed up comment, rewrote commit message ] Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20240909091630.22177-1-toke@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-08net: sched: consistently use rcu_replace_pointer() in taprio_change()Dmitry Antipov
According to Vinicius (and carefully looking through the whole https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa once again), txtime branch of 'taprio_change()' is not going to race against 'advance_sched()'. But using 'rcu_replace_pointer()' in the former may be a good idea as well. Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-09-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/phy/phy_device.c 2560db6ede1a ("net: phy: Fix missing of_node_put() for leds") 1dce520abd46 ("net: phy: Use for_each_available_child_of_node_scoped()") https://lore.kernel.org/20240904115823.74333648@canb.auug.org.au Adjacent changes: drivers/net/ethernet/xilinx/xilinx_axienet.h drivers/net/ethernet/xilinx/xilinx_axienet_main.c 858430db28a5 ("net: xilinx: axienet: Fix race in axienet_stop") 76abb5d675c4 ("net: xilinx: axienet: Add statistics support") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-09-05sched: sch_cake: fix bulk flow accounting logic for host fairnessToke Høiland-Jørgensen
In sch_cake, we keep track of the count of active bulk flows per host, when running in dst/src host fairness mode, which is used as the round-robin weight when iterating through flows. The count of active bulk flows is updated whenever a flow changes state. This has a peculiar interaction with the hash collision handling: when a hash collision occurs (after the set-associative hashing), the state of the hash bucket is simply updated to match the new packet that collided, and if host fairness is enabled, that also means assigning new per-host state to the flow. For this reason, the bulk flow counters of the host(s) assigned to the flow are decremented, before new state is assigned (and the counters, which may not belong to the same host anymore, are incremented again). Back when this code was introduced, the host fairness mode was always enabled, so the decrement was unconditional. When the configuration flags were introduced the *increment* was made conditional, but the *decrement* was not. Which of course can lead to a spurious decrement (and associated wrap-around to U16_MAX). AFAICT, when host fairness is disabled, the decrement and wrap-around happens as soon as a hash collision occurs (which is not that common in itself, due to the set-associative hashing). However, in most cases this is harmless, as the value is only used when host fairness mode is enabled. So in order to trigger an array overflow, sch_cake has to first be configured with host fairness disabled, and while running in this mode, a hash collision has to occur to cause the overflow. Then, the qdisc has to be reconfigured to enable host fairness, which leads to the array out-of-bounds because the wrapped-around value is retained and used as an array index. It seems that syzbot managed to trigger this, which is quite impressive in its own right. This patch fixes the issue by introducing the same conditional check on decrement as is used on increment. The original bug predates the upstreaming of cake, but the commit listed in the Fixes tag touched that code, meaning that this patch won't apply before that. Fixes: 712639929912 ("sch_cake: Make the dual modes fairer") Reported-by: syzbot+7fe7b81d602cc1e6b94d@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20240903160846.20909-1-toke@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-09-03sch/netem: fix use after free in netem_dequeueStephen Hemminger
If netem_dequeue() enqueues packet to inner qdisc and that qdisc returns __NET_XMIT_STOLEN. The packet is dropped but qdisc_tree_reduce_backlog() is not called to update the parent's q.qlen, leading to the similar use-after-free as Commit e04991a48dbaf382 ("netem: fix return value if duplicate enqueue fails") Commands to trigger KASAN UaF: ip link add type dummy ip link set lo up ip link set dummy0 up tc qdisc add dev lo parent root handle 1: drr tc filter add dev lo parent 1: basic classid 1:1 tc class add dev lo classid 1:1 drr tc qdisc add dev lo parent 1:1 handle 2: netem tc qdisc add dev lo parent 2: handle 3: drr tc filter add dev lo parent 3: basic classid 3:1 action mirred egress redirect dev dummy0 tc class add dev lo classid 3:1 drr ping -c1 -W0.01 localhost # Trigger bug tc class del dev lo classid 1:1 tc class add dev lo classid 1:1 drr ping -c1 -W0.01 localhost # UaF Fixes: 50612537e9ab ("netem: fix classful handling") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Link: https://patch.msgid.link/20240901182438.4992-1-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/faraday/ftgmac100.c 4186c8d9e6af ("net: ftgmac100: Ensure tx descriptor updates are visible") e24a6c874601 ("net: ftgmac100: Get link speed and duplex for NC-SI") https://lore.kernel.org/0b851ec5-f91d-4dd3-99da-e81b98c9ed28@kernel.org net/ipv4/tcp.c bac76cf89816 ("tcp: fix forever orphan socket caused by tcp_abort") edefba66d929 ("tcp: rstreason: introduce SK_RST_REASON_TCP_STATE for active reset") https://lore.kernel.org/20240828112207.5c199d41@canb.auug.org.au No adjacent changes. Link: https://patch.msgid.link/20240829130829.39148-1-pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-27net_sched: sch_fq: fix incorrect behavior for small weightsEric Dumazet
fq_dequeue() has a complex logic to find packets in one of the 3 bands. As Neal found out, it is possible that one band has a deficit smaller than its weight. fq_dequeue() can return NULL while some packets are elligible for immediate transmit. In this case, more than one iteration is needed to refill pband->credit. With default parameters (weights 589824 196608 65536) bug can trigger if large BIG TCP packets are sent to the lowest priority band. Bisected-by: John Sperbeck <jsperbeck@google.com> Diagnosed-by: Neal Cardwell <ncardwell@google.com> Fixes: 29f834aa326e ("net_sched: sch_fq: add 3 bands and WRR scheduling") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20240824181901.953776-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-27tc: adjust network header after 2nd vlan pushBoris Sukholitko
<tldr> skb network header of the single-tagged vlan packet continues to point the vlan payload (e.g. IP) after second vlan tag is pushed by tc act_vlan. This causes problem at the dissector which expects double-tagged packet network header to point to the inner vlan. The fix is to adjust network header in tcf_act_vlan.c but requires refactoring of skb_vlan_push function. </tldr> Consider the following shell script snippet configuring TC rules on the veth interface: ip link add veth0 type veth peer veth1 ip link set veth0 up ip link set veth1 up tc qdisc add dev veth0 clsact tc filter add dev veth0 ingress pref 10 chain 0 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action goto chain 5 tc filter add dev veth0 ingress pref 20 chain 0 flower \ num_of_vlans 1 action vlan push id 100 \ protocol 0x8100 action goto chain 5 tc filter add dev veth0 ingress pref 30 chain 5 flower \ num_of_vlans 2 cvlan_ethtype 0x800 action simple sdata "success" Sending double-tagged vlan packet with the IP payload inside: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 64 ..........."...d 0010 81 00 00 14 08 00 45 04 00 26 04 d2 00 00 7f 11 ......E..&...... 0020 18 ef 0a 00 00 01 14 00 00 02 00 00 00 00 00 12 ................ 0030 e1 c7 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 10, goto rule 30 in chain 5 and correctly emit "success" to the dmesg. OTOH, sending single-tagged vlan packet: cat <<ENDS | text2pcap - - | tcpreplay -i veth1 - 0000 00 00 00 00 00 11 00 00 00 00 00 22 81 00 00 14 ...........".... 0010 08 00 45 04 00 2a 04 d2 00 00 7f 11 18 eb 0a 00 ..E..*.......... 0020 00 01 14 00 00 02 00 00 00 00 00 16 e1 bf 00 00 ................ 0030 00 00 00 00 00 00 00 00 00 00 00 00 ............ ENDS will match rule 20, will push the second vlan tag but will *not* match rule 30. IOW, the match at rule 30 fails if the second vlan was freshly pushed by the kernel. Lets look at __skb_flow_dissect working on the double-tagged vlan packet. Here is the relevant code from around net/core/flow_dissector.c:1277 copy-pasted here for convenience: if (dissector_vlan == FLOW_DISSECTOR_KEY_MAX && skb && skb_vlan_tag_present(skb)) { proto = skb->protocol; } else { vlan = __skb_header_pointer(skb, nhoff, sizeof(_vlan), data, hlen, &_vlan); if (!vlan) { fdret = FLOW_DISSECT_RET_OUT_BAD; break; } proto = vlan->h_vlan_encapsulated_proto; nhoff += sizeof(*vlan); } The "else" clause above gets the protocol of the encapsulated packet from the skb data at the network header location. printk debugging has showed that in the good double-tagged packet case proto is htons(0x800 == ETH_P_IP) as expected. However in the single-tagged packet case proto is garbage leading to the failure to match tc filter 30. proto is being set from the skb header pointed by nhoff parameter which is defined at the beginning of __skb_flow_dissect (net/core/flow_dissector.c:1055 in the current version): nhoff = skb_network_offset(skb); Therefore the culprit seems to be that the skb network offset is different between double-tagged packet received from the interface and single-tagged packet having its vlan tag pushed by TC. Lets look at the interesting points of the lifetime of the single/double tagged packets as they traverse our packet flow. Both of them will start at __netif_receive_skb_core where the first vlan tag will be stripped: if (eth_type_vlan(skb->protocol)) { skb = skb_vlan_untag(skb); if (unlikely(!skb)) goto out; } At this stage in double-tagged case skb->data points to the second vlan tag while in single-tagged case skb->data points to the network (eg. IP) header. Looking at TC vlan push action (net/sched/act_vlan.c) we have the following code at tcf_vlan_act (interesting points are in square brackets): if (skb_at_tc_ingress(skb)) [1] skb_push_rcsum(skb, skb->mac_len); .... case TCA_VLAN_ACT_PUSH: err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid | (p->tcfv_push_prio << VLAN_PRIO_SHIFT), 0); if (err) goto drop; break; .... out: if (skb_at_tc_ingress(skb)) [3] skb_pull_rcsum(skb, skb->mac_len); And skb_vlan_push (net/core/skbuff.c:6204) function does: err = __vlan_insert_tag(skb, skb->vlan_proto, skb_vlan_tag_get(skb)); if (err) return err; skb->protocol = skb->vlan_proto; [2] skb->mac_len += VLAN_HLEN; in the case of pushing the second tag. Lets look at what happens with skb->data of the single-tagged packet at each of the above points: 1. As a result of the skb_push_rcsum, skb->data is moved back to the start of the packet. 2. First VLAN tag is moved from the skb into packet buffer, skb->mac_len is incremented, skb->data still points to the start of the packet. 3. As a result of the skb_pull_rcsum, skb->data is moved forward by the modified skb->mac_len, thus pointing to the network header again. Then __skb_flow_dissect will get confused by having double-tagged vlan packet with the skb->data at the network header. The solution for the bug is to preserve "skb->data at second vlan header" semantics in the skb_vlan_push function. We do this by manipulating skb->network_header rather than skb->mac_len. skb_vlan_push callers are updated to do skb_reset_mac_len. Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-08-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: drivers/net/ethernet/broadcom/bnxt/bnxt.h c948c0973df5 ("bnxt_en: Don't clear ntuple filters and rss contexts during ethtool ops") f2878cdeb754 ("bnxt_en: Add support to call FW to update a VNIC") Link: https://patch.msgid.link/20240822210125.1542769-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-20netem: fix return value if duplicate enqueue failsStephen Hemminger
There is a bug in netem_enqueue() introduced by commit 5845f706388a ("net: netem: fix skb length BUG_ON in __skb_to_sgvec") that can lead to a use-after-free. This commit made netem_enqueue() always return NET_XMIT_SUCCESS when a packet is duplicated, which can cause the parent qdisc's q.qlen to be mistakenly incremented. When this happens qlen_notify() may be skipped on the parent during destruction, leaving a dangling pointer for some classful qdiscs like DRR. There are two ways for the bug happen: - If the duplicated packet is dropped by rootq->enqueue() and then the original packet is also dropped. - If rootq->enqueue() sends the duplicated packet to a different qdisc and the original packet is dropped. In both cases NET_XMIT_SUCCESS is returned even though no packets are enqueued at the netem qdisc. The fix is to defer the enqueue of the duplicate packet until after the original packet has been guaranteed to return NET_XMIT_SUCCESS. Fixes: 5845f706388a ("net: netem: fix skb length BUG_ON in __skb_to_sgvec") Reported-by: Budimir Markovic <markovicbudimir@gmail.com> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240819175753.5151-1-stephen@networkplumber.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-08-12sched: act_ct: avoid -Wflex-array-member-not-at-end warningGustavo A. R. Silva
-Wflex-array-member-not-at-end was introduced in GCC-14, and we are getting ready to enable it, globally. Remove unnecessary flex-array member `pad[]` and refactor the related code a bit. Fix the following warning: net/sched/act_ct.c:57:29: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/ZrY0JMVsImbDbx6r@cute Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-26sched: act_ct: take care of padding in struct zones_ht_keyEric Dumazet
Blamed commit increased lookup key size from 2 bytes to 16 bytes, because zones_ht_key got a struct net pointer. Make sure rhashtable_lookup() is not using the padding bytes which are not initialized. BUG: KMSAN: uninit-value in rht_ptr_rcu include/linux/rhashtable.h:376 [inline] BUG: KMSAN: uninit-value in __rhashtable_lookup include/linux/rhashtable.h:607 [inline] BUG: KMSAN: uninit-value in rhashtable_lookup include/linux/rhashtable.h:646 [inline] BUG: KMSAN: uninit-value in rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline] BUG: KMSAN: uninit-value in tcf_ct_flow_table_get+0x611/0x2260 net/sched/act_ct.c:329 rht_ptr_rcu include/linux/rhashtable.h:376 [inline] __rhashtable_lookup include/linux/rhashtable.h:607 [inline] rhashtable_lookup include/linux/rhashtable.h:646 [inline] rhashtable_lookup_fast include/linux/rhashtable.h:672 [inline] tcf_ct_flow_table_get+0x611/0x2260 net/sched/act_ct.c:329 tcf_ct_init+0xa67/0x2890 net/sched/act_ct.c:1408 tcf_action_init_1+0x6cc/0xb30 net/sched/act_api.c:1425 tcf_action_init+0x458/0xf00 net/sched/act_api.c:1488 tcf_action_add net/sched/act_api.c:2061 [inline] tc_ctl_action+0x4be/0x19d0 net/sched/act_api.c:2118 rtnetlink_rcv_msg+0x12fc/0x1410 net/core/rtnetlink.c:6647 netlink_rcv_skb+0x375/0x650 net/netlink/af_netlink.c:2550 rtnetlink_rcv+0x34/0x40 net/core/rtnetlink.c:6665 netlink_unicast_kernel net/netlink/af_netlink.c:1331 [inline] netlink_unicast+0xf52/0x1260 net/netlink/af_netlink.c:1357 netlink_sendmsg+0x10da/0x11e0 net/netlink/af_netlink.c:1901 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x30f/0x380 net/socket.c:745 ____sys_sendmsg+0x877/0xb60 net/socket.c:2597 ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2651 __sys_sendmsg net/socket.c:2680 [inline] __do_sys_sendmsg net/socket.c:2689 [inline] __se_sys_sendmsg net/socket.c:2687 [inline] __x64_sys_sendmsg+0x307/0x4a0 net/socket.c:2687 x64_sys_call+0x2dd6/0x3c10 arch/x86/include/generated/asm/syscalls_64.h:47 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Local variable key created at: tcf_ct_flow_table_get+0x4a/0x2260 net/sched/act_ct.c:324 tcf_ct_init+0xa67/0x2890 net/sched/act_ct.c:1408 Fixes: 88c67aeb1407 ("sched: act_ct: add netns into the key of tcf_ct_flow_table") Reported-by: syzbot+1b5e4e187cc586d05ea0@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-15net/sched: cls_flower: propagate tca[TCA_OPTIONS] to NL_REQ_ATTR_CHECKAsbjørn Sloth Tønnesen
NL_REQ_ATTR_CHECK() is used in fl_set_key_flags() to set extended attributes about the origin of an error, this patch propagates tca[TCA_OPTIONS] through. Before this patch: $ sudo ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/tc.yaml \ --do newtfilter --json '{ "chain": 0, "family": 0, "handle": 4, "ifindex": 22, "info": 262152, "kind": "flower", "options": { "flags": 0, "key-enc-flags": 8, "key-eth-type": 2048 }, "parent": 4294967283 }' Netlink error: Invalid argument nl_len = 68 (52) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'Missing flags mask', 'miss-type': 111} After this patch: [same cmd] Netlink error: Invalid argument nl_len = 76 (60) nl_flags = 0x300 nl_type = 2 error: -22 extack: {'msg': 'Missing flags mask', 'miss-type': 111, 'miss-nest': 56} Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Link: https://patch.msgid.link/20240713021911.1631517-14-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15flow_dissector: set encapsulation control flags for non-IPAsbjørn Sloth Tønnesen
Make sure to set encapsulated control flags also for non-IP packets, such that it's possible to allow matching on e.g. TUNNEL_OAM on a geneve packet carrying a non-IP packet. Suggested-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20240713021911.1631517-13-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15flow_dissector: cleanup FLOW_DISSECTOR_KEY_ENC_FLAGSAsbjørn Sloth Tønnesen
Now that TCA_FLOWER_KEY_ENC_FLAGS is unused, as it's former data is stored behind TCA_FLOWER_KEY_ENC_CONTROL, then remove the last bits of FLOW_DISSECTOR_KEY_ENC_FLAGS. FLOW_DISSECTOR_KEY_ENC_FLAGS is unreleased, and have been in net-next since 2024-06-04. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20240713021911.1631517-12-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15net/sched: cls_flower: rework TCA_FLOWER_KEY_ENC_FLAGS usageAsbjørn Sloth Tønnesen
This patch changes how TCA_FLOWER_KEY_ENC_FLAGS is used, so that it is used with TCA_FLOWER_KEY_FLAGS_* flags, in the same way as TCA_FLOWER_KEY_FLAGS is currently used. Where TCA_FLOWER_KEY_FLAGS uses {key,mask}->control.flags, then TCA_FLOWER_KEY_ENC_FLAGS now uses {key,mask}->enc_control.flags, therefore {key,mask}->enc_flags is now unused. As the generic fl_set_key_flags/fl_dump_key_flags() is used with encap set to true, then fl_{set,dump}_key_enc_flags() is removed. This breaks unreleased userspace API (net-next since 2024-06-04). Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20240713021911.1631517-10-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15net/sched: cls_flower: add tunnel flags to fl_{set,dump}_key_flags()Asbjørn Sloth Tønnesen
Prepare to set and dump the tunnel flags. This code won't see any of these flags yet, as these flags aren't allowed by the NLA_POLICY_MASK, and the functions doesn't get called with encap set to true yet. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20240713021911.1631517-9-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15net/sched: cls_flower: add policy for TCA_FLOWER_KEY_FLAGSAsbjørn Sloth Tønnesen
This policy guards fl_set_key_flags() from seeing flags not used in the context of TCA_FLOWER_KEY_FLAGS. In order For the policy check to be performed with the correct endianness, then we also needs to change the attribute type to NLA_BE32 (Thanks Davide). TCA_FLOWER_KEY_FLAGS{,_MASK} already has a be32 comment in include/uapi/linux/pkt_cls.h. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20240713021911.1631517-6-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15net/sched: cls_flower: prepare fl_{set,dump}_key_flags() for ENC_FLAGSAsbjørn Sloth Tønnesen
Prepare fl_set_key_flags/fl_dump_key_flags() for use with TCA_FLOWER_KEY_ENC_FLAGS{,_MASK}. This patch adds an encap argument, similar to fl_set_key_ip/ fl_dump_key_ip(), and determine the flower keys based on the encap argument, and use them in the rest of the two functions. Since these functions are so far, only called with encap set false, then there is no functional change. Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Link: https://patch.msgid.link/20240713021911.1631517-5-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15Merge branch 'net-make-timestamping-selectable'Jakub Kicinski
First part of "net: Make timestamping selectable" from Kory Maincent. Change the driver-facing type already to lower rebasing pain. Link: https://lore.kernel.org/20240709-feature_ptp_netnext-v17-0-b5317f50df2a@bootlin.com/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-15net: Add struct kernel_ethtool_ts_infoKory Maincent
In prevision to add new UAPI for hwtstamp we will be limited to the struct ethtool_ts_info that is currently passed in fixed binary format through the ETHTOOL_GET_TS_INFO ethtool ioctl. It would be good if new kernel code already started operating on an extensible kernel variant of that structure, similar in concept to struct kernel_hwtstamp_config vs struct hwtstamp_config. Since struct ethtool_ts_info is in include/uapi/linux/ethtool.h, here we introduce the kernel-only structure in include/linux/ethtool.h. The manual copy is then made in the function called by ETHTOOL_GET_TS_INFO. Acked-by: Shannon Nelson <shannon.nelson@amd.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Link: https://patch.msgid.link/20240709-feature_ptp_netnext-v17-6-b5317f50df2a@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11net/sched: act_skbmod: convert comma to semicolonChen Ni
Replace a comma between expression statements by a semicolon. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240709072838.1152880-1-nichen@iscas.ac.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: net/sched/act_ct.c 26488172b029 ("net/sched: Fix UAF when resolving a clash") 3abbd7ed8b76 ("act_ct: prepare for stolen verdict coming from conntrack and nat engine") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-11net/sched: Fix UAF when resolving a clashChengen Du
KASAN reports the following UAF: BUG: KASAN: slab-use-after-free in tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct] Read of size 1 at addr ffff888c07603600 by task handler130/6469 Call Trace: <IRQ> dump_stack_lvl+0x48/0x70 print_address_description.constprop.0+0x33/0x3d0 print_report+0xc0/0x2b0 kasan_report+0xd0/0x120 __asan_load1+0x6c/0x80 tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct] tcf_ct_act+0x886/0x1350 [act_ct] tcf_action_exec+0xf8/0x1f0 fl_classify+0x355/0x360 [cls_flower] __tcf_classify+0x1fd/0x330 tcf_classify+0x21c/0x3c0 sch_handle_ingress.constprop.0+0x2c5/0x500 __netif_receive_skb_core.constprop.0+0xb25/0x1510 __netif_receive_skb_list_core+0x220/0x4c0 netif_receive_skb_list_internal+0x446/0x620 napi_complete_done+0x157/0x3d0 gro_cell_poll+0xcf/0x100 __napi_poll+0x65/0x310 net_rx_action+0x30c/0x5c0 __do_softirq+0x14f/0x491 __irq_exit_rcu+0x82/0xc0 irq_exit_rcu+0xe/0x20 common_interrupt+0xa1/0xb0 </IRQ> <TASK> asm_common_interrupt+0x27/0x40 Allocated by task 6469: kasan_save_stack+0x38/0x70 kasan_set_track+0x25/0x40 kasan_save_alloc_info+0x1e/0x40 __kasan_krealloc+0x133/0x190 krealloc+0xaa/0x130 nf_ct_ext_add+0xed/0x230 [nf_conntrack] tcf_ct_act+0x1095/0x1350 [act_ct] tcf_action_exec+0xf8/0x1f0 fl_classify+0x355/0x360 [cls_flower] __tcf_classify+0x1fd/0x330 tcf_classify+0x21c/0x3c0 sch_handle_ingress.constprop.0+0x2c5/0x500 __netif_receive_skb_core.constprop.0+0xb25/0x1510 __netif_receive_skb_list_core+0x220/0x4c0 netif_receive_skb_list_internal+0x446/0x620 napi_complete_done+0x157/0x3d0 gro_cell_poll+0xcf/0x100 __napi_poll+0x65/0x310 net_rx_action+0x30c/0x5c0 __do_softirq+0x14f/0x491 Freed by task 6469: kasan_save_stack+0x38/0x70 kasan_set_track+0x25/0x40 kasan_save_free_info+0x2b/0x60 ____kasan_slab_free+0x180/0x1f0 __kasan_slab_free+0x12/0x30 slab_free_freelist_hook+0xd2/0x1a0 __kmem_cache_free+0x1a2/0x2f0 kfree+0x78/0x120 nf_conntrack_free+0x74/0x130 [nf_conntrack] nf_ct_destroy+0xb2/0x140 [nf_conntrack] __nf_ct_resolve_clash+0x529/0x5d0 [nf_conntrack] nf_ct_resolve_clash+0xf6/0x490 [nf_conntrack] __nf_conntrack_confirm+0x2c6/0x770 [nf_conntrack] tcf_ct_act+0x12ad/0x1350 [act_ct] tcf_action_exec+0xf8/0x1f0 fl_classify+0x355/0x360 [cls_flower] __tcf_classify+0x1fd/0x330 tcf_classify+0x21c/0x3c0 sch_handle_ingress.constprop.0+0x2c5/0x500 __netif_receive_skb_core.constprop.0+0xb25/0x1510 __netif_receive_skb_list_core+0x220/0x4c0 netif_receive_skb_list_internal+0x446/0x620 napi_complete_done+0x157/0x3d0 gro_cell_poll+0xcf/0x100 __napi_poll+0x65/0x310 net_rx_action+0x30c/0x5c0 __do_softirq+0x14f/0x491 The ct may be dropped if a clash has been resolved but is still passed to the tcf_ct_flow_table_process_conn function for further usage. This issue can be fixed by retrieving ct from skb again after confirming conntrack. Fixes: 0cc254e5aa37 ("net/sched: act_ct: Offload connections with commit action") Co-developed-by: Gerald Yang <gerald.yang@canonical.com> Signed-off-by: Gerald Yang <gerald.yang@canonical.com> Signed-off-by: Chengen Du <chengen.du@canonical.com> Link: https://patch.msgid.link/20240710053747.13223-1-chengen.du@canonical.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-08bpf: Fix too early release of tcx_entryDaniel Borkmann
Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported an issue that the tcx_entry can be released too early leading to a use after free (UAF) when an active old-style ingress or clsact qdisc with a shared tc block is later replaced by another ingress or clsact instance. Essentially, the sequence to trigger the UAF (one example) can be as follows: 1. A network namespace is created 2. An ingress qdisc is created. This allocates a tcx_entry, and &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the same time, a tcf block with index 1 is created. 3. chain0 is attached to the tcf block. chain0 must be connected to the block linked to the ingress qdisc to later reach the function tcf_chain0_head_change_cb_del() which triggers the UAF. 4. Create and graft a clsact qdisc. This causes the ingress qdisc created in step 1 to be removed, thus freeing the previously linked tcx_entry: rtnetlink_rcv_msg() => tc_modify_qdisc() => qdisc_create() => clsact_init() [a] => qdisc_graft() => qdisc_destroy() => __qdisc_destroy() => ingress_destroy() [b] => tcx_entry_free() => kfree_rcu() // tcx_entry freed 5. Finally, the network namespace is closed. This registers the cleanup_net worker, and during the process of releasing the remaining clsact qdisc, it accesses the tcx_entry that was already freed in step 4, causing the UAF to occur: cleanup_net() => ops_exit_list() => default_device_exit_batch() => unregister_netdevice_many() => unregister_netdevice_many_notify() => dev_shutdown() => qdisc_put() => clsact_destroy() [c] => tcf_block_put_ext() => tcf_chain0_head_change_cb_del() => tcf_chain_head_change_item() => clsact_chain_head_change() => mini_qdisc_pair_swap() // UAF There are also other variants, the gist is to add an ingress (or clsact) qdisc with a specific shared block, then to replace that qdisc, waiting for the tcx_entry kfree_rcu() to be executed and subsequently accessing the current active qdisc's miniq one way or another. The correct fix is to turn the miniq_active boolean into a counter. What can be observed, at step 2 above, the counter transitions from 0->1, at step [a] from 1->2 (in order for the miniq object to remain active during the replacement), then in [b] from 2->1 and finally [c] 1->0 with the eventual release. The reference counter in general ranges from [0,2] and it does not need to be atomic since all access to the counter is protected by the rtnl mutex. With this in place, there is no longer a UAF happening and the tcx_entry is freed at the correct time. Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Reported-by: Pedro Pinto <xten@osec.io> Co-developed-by: Pedro Pinto <xten@osec.io> Signed-off-by: Pedro Pinto <xten@osec.io> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Hyunwoo Kim <v4bel@theori.io> Cc: Wongi Lee <qwerty@theori.io> Cc: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20240708133130.11609-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-07-08act_ct: prepare for stolen verdict coming from conntrack and nat engineFlorian Westphal
At this time, conntrack either returns NF_ACCEPT or NF_DROP. To improve debuging it would be nice to be able to replace NF_DROP verdict with NF_DROP_REASON() helper, This helper releases the skb instantly (so drop_monitor can pinpoint exact location) and returns NF_STOLEN. Prepare call sites to deal with this before introducing such changes in conntrack and nat core. Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-07-05net: sched: act_sample: add action cookie to sampleAdrian Moreno
If the action has a user_cookie, pass it along to the sample so it can be easily identified. Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Link: https://patch.msgid.link/20240704085710.353845-3-amorenoz@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-06-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/broadcom/bnxt/bnxt.c 1e7962114c10 ("bnxt_en: Restore PTP tx_avail count in case of skb_pad() error") 165f87691a89 ("bnxt_en: add timestamping statistics support") No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>