Age | Commit message | Author
2023-02-14 | net/mlx5e: TC, support per action stats | Oz Shlomo
Extend the action stats callback implementation to update stats for actions that are associated with hw counters. Note that the callback may be called from tc action utility or from tc flower. Both apis expect the driver to return the stats difference from the last update. As such, query the raw counter value and maintain the diff from the last api call in the tc layer, instead of the fs_core layer. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
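The "return the difference from the last update" contract described above can be sketched as follows. This is a minimal userspace illustration, not the mlx5 driver code: the `act_stats` and `stats_delta` types and the `act_stats_update()` helper are hypothetical names introduced only for this example.

```c
#include <stdint.h>

/* Hypothetical per-action bookkeeping; not the driver's real structures. */
struct act_stats {
    uint64_t last_packets;  /* raw hw counter value seen at the previous query */
    uint64_t last_bytes;
};

struct stats_delta {
    uint64_t packets;
    uint64_t bytes;
};

/*
 * Report how far the raw counter advanced since the previous call and
 * remember the new raw value, so the next caller only sees new traffic.
 */
static struct stats_delta act_stats_update(struct act_stats *s,
                                           uint64_t raw_packets,
                                           uint64_t raw_bytes)
{
    struct stats_delta d = {
        .packets = raw_packets - s->last_packets,
        .bytes = raw_bytes - s->last_bytes,
    };

    s->last_packets = raw_packets;
    s->last_bytes = raw_bytes;
    return d;
}
```

Keeping the "last seen" value next to the action rather than inside the counter layer means each consumer of the same raw counter could track its own baseline, which appears to be the motivation given in the message.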
2023-02-14 | net/mlx5e: TC, map tc action cookie to a hw counter | Oz Shlomo
Currently a hardware counter is associated with a flow cookie. This does not apply to flows using branching action which are required to return per action stats. A single counter may apply to multiple actions. Scan the flow actions in reverse (from the last to the first action) while caching the last counter. Associate all the flow attribute tc action cookies with the current cached counter. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
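The reverse scan described above can be illustrated with a small sketch. It assumes a flattened array of actions, each with a cookie and an optional counter; the `flow_act` and `cookie_map` types and the `map_cookies()` helper are hypothetical and much simpler than the driver's flow attribute structures.

```c
#include <stddef.h>

/* Hypothetical flattened view of a parsed action. */
struct flow_act {
    unsigned long cookie;  /* tc action cookie */
    int counter_id;        /* hw counter attached to this action, or -1 */
};

struct cookie_map {
    unsigned long cookie;
    int counter_id;
};

/*
 * Walk the actions from last to first, caching the most recent counter
 * seen. Every action is mapped to the cached counter, so one counter can
 * serve several actions. Returns the number of entries written to out,
 * which must have room for n entries.
 */
static size_t map_cookies(const struct flow_act *acts, size_t n,
                          struct cookie_map *out)
{
    int cached = -1;
    size_t written = 0;

    for (size_t i = n; i-- > 0; ) {
        if (acts[i].counter_id >= 0)
            cached = acts[i].counter_id;
        if (cached < 0)
            continue;  /* no counter at or after this action */
        out[written].cookie = acts[i].cookie;
        out[written].counter_id = cached;
        written++;
    }
    return written;
}
```

Scanning backwards means an action without its own counter inherits the nearest counter that follows it, which is the counter every packet passing through that action also reaches.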
2023-02-14 | net/mlx5e: TC, store tc action cookies per attr | Oz Shlomo
The tc parse action phase translates the tc actions to mlx5 flow attributes data structure that is used during the flow offload phase. Currently, the flow offload stage instantiates hw counters while associating them to flow cookie. However, flows with branching actions are required to associate a hardware counter with its action cookies. Store the parsed tc action cookies on the flow attribute. Use the list of cookies in the next patch to associate a tc action cookie with its allocated hw counter. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | net/mlx5e: TC, add hw counter to branching actions | Oz Shlomo
Currently a hw count action is appended to the last action of the action list. However, a branching action may terminate the action list before reaching the last action. Append a count action to a branching action. In the next patches, filters with branching actions will read this counter when reporting stats per action. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | net/sched: support per action hw stats | Oz Shlomo
There are currently two mechanisms for populating hardware stats:
1. Using the flow_offload api to query the flow's statistics. The api assumes that the same stats values apply to all the flow's actions. This assumption breaks when an action drops or jumps over following actions.
2. Using the hw_action api to query specific action stats via a driver callback method. This api assures the correct action stats for the offloaded action; however, it does not apply to the rest of the actions in the flow's actions array.
Extend the flow_offload stats callback to indicate that a per action stats update is required. Use the existing flow_offload_action api to query the action's hw stats. In addition, currently the tc action stats utility only updates hw actions. Reuse the existing action stats cb infrastructure to query any action stats.
Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | net/sched: introduce flow_offload action cookie | Oz Shlomo
Currently a hardware action is uniquely identified by the <id, hw_index> tuple. However, the id is set by the flow_act_setup callback and tc core cannot enforce this, and it is possible that a future change could break this. In addition, <id, hw_index> are not unique across network namespaces. Uniquely identify the action by setting an action cookie by the tc core. Use the unique action cookie to query the action's hardware stats. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | net/sched: pass flow_stats instead of multiple stats args | Oz Shlomo
Instead of passing 6 stats related args, pass the flow_stats. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | net/sched: act_pedit, setup offload action for action stats query | Oz Shlomo
A single tc pedit action may be translated to multiple flow_offload actions. Offload only actions that translate to a single pedit command value. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | net/sched: optimize action stats api calls | Oz Shlomo
Currently the hw action stats update is called from tcf_exts_hw_stats_update, when a tc filter is dumped, and from tcf_action_copy_stats, when a hw action is dumped. However, the tcf_action_copy_stats is also called from tcf_action_dump. As such, the hw action stats update cb is called 3 times for every tc flower filter dump. Move the tc action hw stats update from tcf_action_copy_stats to tcf_dump_walker to update the hw action stats when tc action is dumped. Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | wifi: cfg80211: trace: remove MAC_PR_{FMT,ARG} | Johannes Berg
With %pM, this really is no longer needed, and actually longer to spell out. Remove it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-02-14 | dt-bindings: net: dsa: mediatek,mt7530: improve binding description | Arınç ÜNAL
Fix inaccurate information about PHY muxing, and merge standalone and multi-chip module MT7530 configuration methods. Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20230212131258.47551-1-arinc.unal@arinc9.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-14 | net: stmmac: Restrict warning on disabling DMA store and fwd mode | Cristian Ciocaltea
When setting the 'snps,force_thresh_dma_mode' DT property, the following warning is always emitted, regardless of the status of force_sf_dma_mode:
dwmac-starfive 10020000.ethernet: force_sf_dma_mode is ignored if force_thresh_dma_mode is set.
Do not print the rather misleading message when DMA store and forward mode is already disabled.
Fixes: e2a240c7d3bc ("driver:net:stmmac: Disable DMA store and forward mode if platform data force_thresh_dma_mode is set.") Signed-off-by: Cristian Ciocaltea <cristian.ciocaltea@collabora.com> Link: https://lore.kernel.org/r/20230210202126.877548-1-cristian.ciocaltea@collabora.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-02-13 | Revert "bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25" | Alexei Starovoitov
This reverts commit 0243d3dfe274832aa0a16214499c208122345173. pahole 1.25 is too aggressive removing functions. With clang compiled kernel the following is seen:
WARN: resolve_btfids: unresolved symbol tcp_reno_cong_avoid
WARN: resolve_btfids: unresolved symbol dctcp_update_alpha
WARN: resolve_btfids: unresolved symbol cubictcp_cong_avoid
WARN: resolve_btfids: unresolved symbol bpf_xdp_metadata_rx_timestamp
WARN: resolve_btfids: unresolved symbol bpf_xdp_metadata_rx_hash
WARN: resolve_btfids: unresolved symbol bpf_task_kptr_get
WARN: resolve_btfids: unresolved symbol bpf_task_acquire_not_zero
WARN: resolve_btfids: unresolved symbol bpf_rdonly_cast
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_static_unused_arg
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_ref
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_pass_ctx
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_pass2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_pass1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_mem_len_pass1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_mem_len_fail2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_mem_len_fail1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_kptr_get
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_fail3
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_fail2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test_acquire
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test2
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_test1
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_memb_release
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_memb1_release
WARN: resolve_btfids: unresolved symbol bpf_kfunc_call_int_mem_release
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 | selftests/bpf: Clean up dynptr prog_tests | Joanne Koong
Clean up prog_tests/dynptr.c by removing the unneeded "expected_err_msg" in the dynptr_tests struct, which is a remnant from converting the fail tests cases to use the generic verification tester. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20230214051332.4007131-2-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13 | selftests/bpf: Clean up user_ringbuf, cgrp_kfunc, kfunc_dynptr_param tests | Joanne Koong
Clean up user_ringbuf, cgrp_kfunc, and kfunc_dynptr_param tests to use the generic verification tester for checking verifier rejections. The generic verification tester uses btf_decl_tag-based annotations for verifying that the tests fail with the expected log messages. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Acked-by: David Vernet <void@manifault.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Link: https://lore.kernel.org/r/20230214051332.4007131-1-joannelkoong@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-14 | nvme-pci: always return an ERR_PTR from nvme_pci_alloc_dev | Irvin Cote
Don't mix NULL and ERR_PTR returns. Fixes: 2e87570be9d2 ("nvme-pci: factor out a nvme_pci_alloc_dev helper") Signed-off-by: Irvin Cote <irvin.cote@insa-lyon.fr> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
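Why mixing the two conventions is error-prone can be shown with a userspace model of the kernel's error-pointer encoding. The lower-case `err_ptr()`, `ptr_err()` and `is_err()` helpers below only imitate the kernel's `ERR_PTR()`, `PTR_ERR()` and `IS_ERR()`, and `demo_alloc_dev()` is a hypothetical allocator, not `nvme_pci_alloc_dev()`.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(). */
static inline void *err_ptr(long error) { return (void *)error; }
static inline long ptr_err(const void *ptr) { return (long)ptr; }
static inline int is_err(const void *ptr)
{
    /* Errors live in the top MAX_ERRNO addresses; NULL is NOT an error here. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct demo_dev { int id; };

/* Hypothetical allocator that reports every failure as an error pointer. */
static struct demo_dev *demo_alloc_dev(int id)
{
    struct demo_dev *dev;

    if (id < 0)
        return err_ptr(-EINVAL);
    dev = malloc(sizeof(*dev));
    if (!dev)
        return err_ptr(-ENOMEM);  /* never NULL, so is_err() catches it */
    dev->id = id;
    return dev;
}
```

Because `is_err(NULL)` is false, a caller that only checks `is_err()` would treat a NULL return as success and dereference it. Returning an error pointer on every failure path avoids that.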
2023-02-14 | nvme-pci: set the DMA mask earlier | Christoph Hellwig
Set the DMA mask before calling dma_addressing_limited, which depends on it. Note that this stops checking the return value of dma_set_mask_and_coherent, as this function can only fail for masks < 32-bit.
Fixes: 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl") Reported-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Michael Kelley <mikelley@microsoft.com>
2023-02-13 | net/sched: act_ctinfo: use percpu stats | Pedro Tammela
The tc action act_ctinfo was using shared stats, fix it to use percpu stats since bstats_update() must be called with locks or with a percpu pointer argument.
tdc results:
1..12
ok 1 c826 - Add ctinfo action with default setting
ok 2 0286 - Add ctinfo action with dscp
ok 3 4938 - Add ctinfo action with valid cpmark and zone
ok 4 7593 - Add ctinfo action with drop control
ok 5 2961 - Replace ctinfo action zone and action control
ok 6 e567 - Delete ctinfo action with valid index
ok 7 6a91 - Delete ctinfo action with invalid index
ok 8 5232 - List ctinfo actions
ok 9 7702 - Flush ctinfo actions
ok 10 3201 - Add ctinfo action with duplicate index
ok 11 8295 - Add ctinfo action with invalid index
ok 12 3964 - Replace ctinfo action with invalid goto_chain control
Fixes: 24ec483cec98 ("net: sched: Introduce act_ctinfo action") Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com> Link: https://lore.kernel.org/r/20230210200824.444856-1-pctammela@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | net: stmmac: fix order of dwmac5 FlexPPS parametrization sequence | Johannes Zink
So far, changing the period by just setting new period values while running did not work. The publicly available reference manual of the i.MX8MP [1] specifies the following sequence:
* initiate the programming sequence
* set the values for PPS period and start time
* start the pulse train generation.
This is currently not used in dwmac5_flex_pps_config(), which instead does:
* initiate the programming sequence and immediately start the pulse train generation
* set the values for PPS period and start time
This caused the period values written not to take effect until the FlexPPS output was disabled and re-enabled again. This patch fixes the order and allows the period to be set immediately.
[1] https://www.nxp.com/webapp/Download?colCode=IMX8MPRM
Fixes: 9a8a02c9d46d ("net: stmmac: Add Flexible PPS support") Signed-off-by: Johannes Zink <j.zink@pengutronix.de> Link: https://lore.kernel.org/r/20230210143937.3427483-1-j.zink@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | Merge branch 'ipv6-more-drop-reason' | Jakub Kicinski
Eric Dumazet says:
====================
ipv6: more drop reason
Add more drop reasons to IPv6:
- IPV6_BAD_EXTHDR
- IPV6_NDISC_FRAG
- IPV6_NDISC_HOP_LIMIT
- IPV6_NDISC_BAD_CODE
====================
Link: https://lore.kernel.org/r/20230210184708.2172562-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | ipv6: icmp6: add drop reason support to ndisc_rcv() | Eric Dumazet
Creates three new drop reasons:
SKB_DROP_REASON_IPV6_NDISC_FRAG: invalid frag (suppress_frag_ndisc).
SKB_DROP_REASON_IPV6_NDISC_HOP_LIMIT: invalid hop limit.
SKB_DROP_REASON_IPV6_NDISC_BAD_CODE: invalid NDISC icmp6 code.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | ipv6: icmp6: add drop reason support to icmpv6_notify() | Eric Dumazet
Accurately reports what happened in icmpv6_notify() when handling a packet. This makes use of the new IPV6_BAD_EXTHDR drop reason. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | net: add pskb_may_pull_reason() helper | Eric Dumazet
pskb_may_pull() can fail for two different reasons. Provide pskb_may_pull_reason() helper to distinguish between these reasons. It returns:
SKB_NOT_DROPPED_YET : Success
SKB_DROP_REASON_PKT_TOO_SMALL : packet too small
SKB_DROP_REASON_NOMEM : skb->head could not be resized
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
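The three outcomes listed above can be modelled with a toy packet structure. This is only a sketch of the decision order, under the assumption that "too small" is checked before any resize is attempted; `toy_pkt`, `toy_may_pull_reason()` and the `TOY_*` values are hypothetical stand-ins, not the kernel's sk_buff API.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the three return values named in the message. */
enum toy_drop_reason {
    TOY_NOT_DROPPED_YET,
    TOY_PKT_TOO_SMALL,
    TOY_NOMEM,
};

/* Toy packet: `headlen` bytes of `len` are directly accessible. */
struct toy_pkt {
    size_t len;      /* total packet length */
    size_t headlen;  /* bytes already in the linear area */
    int can_grow;    /* whether the linear area can be resized */
};

/* Make sure the first `need` bytes are accessible, or say why not. */
static enum toy_drop_reason toy_may_pull_reason(struct toy_pkt *p, size_t need)
{
    if (need <= p->headlen)
        return TOY_NOT_DROPPED_YET;  /* already accessible */
    if (need > p->len)
        return TOY_PKT_TOO_SMALL;    /* packet is shorter than requested */
    if (!p->can_grow)
        return TOY_NOMEM;            /* resize failed */
    p->headlen = need;               /* resize succeeded */
    return TOY_NOT_DROPPED_YET;
}
```

A caller can pass the returned reason straight to its drop path instead of collapsing both failures into a single boolean.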
2023-02-13 | net: dropreason: add SKB_DROP_REASON_IPV6_BAD_EXTHDR | Eric Dumazet
This drop reason can be used whenever an IPv6 packet has a malformed extension header. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | net: stmmac: dwc-qos: Make struct dwc_eth_dwmac_data::remove return void | Uwe Kleine-König
All implementations of the remove callback return 0 unconditionally. So in dwc_eth_dwmac_remove() there is no error handling necessary. Simplify accordingly. This is a preparation for making struct platform_driver::remove return void, too. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Link: https://lore.kernel.org/r/20230211112431.214252-2-u.kleine-koenig@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | net: stmmac: Make stmmac_dvr_remove() return void | Uwe Kleine-König
The function returns zero unconditionally. Change it to return void instead, which simplifies some callers as error handling becomes unnecessary. This also makes it more obvious that most platform remove callbacks always return zero.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Link: https://lore.kernel.org/r/20230211112431.214252-1-u.kleine-koenig@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | net: mvneta: do not set xdp_features for hw buffer devices | Lorenzo Bianconi
Devices with hardware buffer management do not support XDP, so do not set xdp_features for them. Fixes: 66c0e13ad236 ("drivers: net: turn on XDP features") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/19b5838bb3e4515750af822edb2fa5e974d0a86b.1676196230.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | hv_netvsc: add missing NETDEV_XDP_ACT_NDO_XMIT xdp-features flag | Lorenzo Bianconi
Add missing ndo_xdp_xmit bit to xdp_features capability flag. Fixes: 66c0e13ad236 ("drivers: net: turn on XDP features") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/8e3747018f0fd0b5d6e6b9aefe8d9448ca3a3288.1676195726.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | net: stmmac: add missing NETDEV_XDP_ACT_XSK_ZEROCOPY bit to xdp_features | Lorenzo Bianconi
Add missing xsk zero-copy bit to xdp_features capability flag. Fixes: 66c0e13ad236 ("drivers: net: turn on XDP features") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://lore.kernel.org/r/c8949baafdf617188dcedb9033ce5a9ca6e9e5ff.1676195440.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13 | net: ethernet: mtk_wed: No need to clear memory after a dma_alloc_coherent() call | Christophe JAILLET
dma_alloc_coherent() already clears the allocated memory, there is no need to explicitly call memset(). Moreover, it is likely that the size in the memset() is incorrect and should be "size * sizeof(*ring->desc)".
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Link: https://lore.kernel.org/r/d5acce7dd108887832c9719f62c7201b4c83b3fb.1676184599.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jakub Kicinski <kuba@kernel.org>
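The size concern raised above is the classic "element count passed where a byte count is expected" pattern. The sketch below illustrates it with a hypothetical `toy_desc` descriptor and two hypothetical helpers; it assumes the removed memset() was passed a descriptor count, which is what the message suggests but which is not confirmed here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 4-word descriptor; the real ring layout is not shown here. */
struct toy_desc {
    unsigned int buf0, ctrl, buf1, info;
};

/* Suspected bug pattern: `count` is a number of descriptors, not bytes. */
static void clear_wrong(struct toy_desc *ring, size_t count)
{
    memset(ring, 0, count);
}

/* Byte count derived from the element size, as the message suggests. */
static void clear_right(struct toy_desc *ring, size_t count)
{
    memset(ring, 0, count * sizeof(*ring));
}
```

With four descriptors, `clear_wrong()` zeroes only the first four bytes of the ring, while `clear_right()` zeroes all of them. When the allocator already returns zeroed memory, dropping the call altogether sidesteps the question.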
2023-02-13 | Merge branch 'BPF rbtree next-gen datastructure' | Alexei Starovoitov
Dave Marchevsky says: ==================== This series adds a rbtree datastructure following the "next-gen datastructure" precedent set by recently-added linked-list [0]. This is a reimplementation of previous rbtree RFC [1] to use kfunc + kptr instead of adding a new map type. This series adds a smaller set of API functions than that RFC - just the minimum needed to support current cgfifo example scheduler in ongoing sched_ext effort [2], namely: bpf_rbtree_add bpf_rbtree_remove bpf_rbtree_first The meat of this series is bugfixes and verifier infra work to support these API functions. Adding more rbtree kfuncs in future patches should be straightforward as a result. First, the series refactors and extends linked_list's release_on_unlock logic. The concept of "reference to node that was added to data structure" is formalized as "non-owning reference". From linked_list's perspective this non-owning reference after linked_list_push_{front,back} has same semantics as release_on_unlock, with the addition of writes to such references being valid in the critical section. Such references are no longer marked PTR_UNTRUSTED. Patches 2 and 13 go into more detail. The series then adds rbtree API kfuncs and necessary verifier support for them - namely support for callback args to kfuncs and some non-owning reference interactions that linked_list didn't need. BPF rbtree uses struct rb_root_cached + existing rbtree lib under the hood. From the BPF program writer's perspective, a BPF rbtree is very similar to existing linked list. 
Consider the following example:

  struct node_data {
    long key;
    long data;
    struct bpf_rb_node node;
  };

  static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
  {
    struct node_data *node_a;
    struct node_data *node_b;

    node_a = container_of(a, struct node_data, node);
    node_b = container_of(b, struct node_data, node);

    return node_a->key < node_b->key;
  }

  private(A) struct bpf_spin_lock glock;
  private(A) struct bpf_rb_root groot __contains(node_data, node);

  /* ... in BPF program */
  struct node_data *n, *m;
  struct bpf_rb_node *res;

  n = bpf_obj_new(typeof(*n));
  if (!n) { /* skip */ }
  n->key = 5;
  n->data = 10;

  bpf_spin_lock(&glock);
  bpf_rbtree_add(&groot, &n->node, less);
  bpf_spin_unlock(&glock);

  bpf_spin_lock(&glock);
  res = bpf_rbtree_first(&groot);
  if (!res) { /* skip */ }
  res = bpf_rbtree_remove(&groot, res);
  if (!res) { /* skip */ }
  bpf_spin_unlock(&glock);

  m = container_of(res, struct node_data, node);
  bpf_obj_drop(m);

Some obvious similarities:
* Special bpf_rb_root and bpf_rb_node types have same semantics as bpf_list_head and bpf_list_node, respectively
* __contains is used to associate node type with root
* The spin_lock associated with a rbtree must be held when using rbtree API kfuncs
* Nodes are allocated via bpf_obj_new and dropped via bpf_obj_drop
* Rbtree takes ownership of node lifetime when a node is added. Removing a node gives ownership back to the program, requiring a bpf_obj_drop before program exit

Some new additions as well:
* Support for callbacks in kfunc args is added to enable 'less' callback use above
* bpf_rbtree_first is the first graph API function to return a non-owning reference instead of converting an arg from own->non-own
* Because all references to nodes already added to the rbtree are non-owning, bpf_rbtree_remove must accept such a reference in order to remove it from the tree

Summary of patches:
Patches 1 - 5 implement the meat of rbtree-specific support in this series, gradually building up to implemented kfuncs that verify as expected.
Patch 6 adds the bpf_rbtree_{add,first,remove} to bpf_experimental.h. Patch 7 adds tests, Patch 9 adds documentation. [0]: lore.kernel.org/bpf/20221118015614.2013203-1-memxor@gmail.com [1]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com [2]: lore.kernel.org/bpf/20221130082313.3241517-1-tj@kernel.org Changelog: v5 -> v6: lore.kernel.org/bpf/20230212092715.1422619-1-davemarchevsky@fb.com/ Patch #'s below refer to the patch's number in v5 unless otherwise stated. * General / Patch 1 * Rebase onto latest bpf-next: "bpf: Migrate release_on_unlock logic to non-owning ref semantics" * This was Patch 1 of v4, was applied, not included in v6 * Patch 3 - "bpf: Add bpf_rbtree_{add,remove,first} kfuncs" * Use bpf_callback_t instead of plain-C fn ptr for bpf_rbtree_add. This necessitated having bpf_rbtree_add duplicate rbtree_add's functionality. Wrapper function was used w/ internal __bpf_rbtree_add helper so that bpf_experimental.h proto could continue to use plain-C fn ptr so BPF progs could benefit from typechecking (Alexei) v4 -> v5: lore.kernel.org/bpf/20230209174144.3280955-1-davemarchevsky@fb.com/ Patch #'s below refer to the patch's number in v4 unless otherwise stated. * General * Rebase onto latest bpf-next: "Merge branch 'bpf, mm: introduce cgroup.memory=nobpf'" * Patches 1-3 are squashed into "bpf: Migrate release_on_unlock logic to non-owning ref semantics". * Added type_is_non_owning_ref helper (Alexei) * Use a NON_OWN_REF type flag instead of separate bool (Alexei) * Patch 8 - "bpf: Special verifier handling for bpf_rbtree_{remove, first}" * When doing btf_parse_fields, reject structs with both bpf_list_node and bpf_rb_node fields. This is a temporary measure that can be removed after "collection identity" followup. 
See comment added in btf_parse_fields for more detail (Kumar, Alexei) * Add linked_list BTF test exercising check added to btf_parse_fields * Minor changes and moving around of some reg type checks due to NON_OWN_REF type flag introduction * Patch 10 - "selftests/bpf: Add rbtree selftests" * Migrate failure tests to RUN_TESTS, __failure, __msg() framework (Alexei) v3 -> v4: lore.kernel.org/bpf/20230131180016.3368305-1-davemarchevsky@fb.com/ Patch #'s below refer to the patch's number in v3 unless otherwise stated. * General * Don't base this series on "bpf: Refactor release_regno searching logic", which was submitted separately as a refactor. * Rebase onto latest bpf-next: "samples/bpf: Add openat2() enter/exit tracepoint to syscall_tp sample" * Patch 2 - "bpf: Improve bpf_reg_state space usage for non-owning ref lock" * print_verifier_state change was adding redundant comma after "non_own_ref", fix it to put comma in correct place * invalidate_non_owning_refs no longer needs to take bpf_active_lock param, since any non-owning ref reg in env's cur_state is assumed to use that state's active_lock (Alexei) * invalidate_non_owning_refs' reg loop should check that the reg being inspected is a PTR_TO_BTF_ID before checking reg->non_owning_ref_lock, since that field is part of a union and may be filled w/ meaningless bytes if reg != PTR_TO_BTF_ID (Alexei) * Patch 3 - "selftests/bpf: Update linked_list tests for non-owning ref semantics" * Change the string searched for by the following tests: * linked_list/incorrect_node_off1 * linked_list/double_push_front * linked_list/double_push_back necessary due to rebase / dropping of "release_regno searching logic" patch (see "General" changes) * Patch 8 - "bpf: Special verifier handling for bpf_rbtree_{remove, first}" * Just call invalidate_non_owning_refs w/ env instead of env, lock. 
(see Patch 2 changes) * Patch 11 - "bpf, documentation: Add graph documentation for non-owning refs" * Fix documentation formatting and improve content (David) * v3's version of patch 11 was missing some changes, v4's patch 11 is still addressing David's feedback from v2 v2 -> v3: lore.kernel.org/bpf/20221217082506.1570898-1-davemarchevsky@fb.com/ Patch #'s below refer to the patch's number in v2 unless otherwise stated. * Patch 1 - "bpf: Support multiple arg regs w/ ref_obj_id for kfuncs" * No longer needed as v3 doesn't have multiple ref_obj_id arg regs * The refactoring pieces were submitted separately (https://lore.kernel.org/bpf/20230121002417.1684602-1-davemarchevsky@fb.com/) * Patch 2 - "bpf: Migrate release_on_unlock logic to non-owning ref semantics" * Remove KF_RELEASE_NON_OWN flag from list API push methods, just match against specific kfuncs for now (Alexei, David) * Separate "release non owning reference" logic from KF_RELEASE logic (Alexei, David) * reg_find_field_offset now correctly tests 'rec' instead of 'reg' after calling reg_btf_record (Dan Carpenter) * New patch added after Patch 2 - "bpf: Improve bpf_reg_state space usage for non-owning ref lock" * Eliminates extra bpf_reg_state memory usage by using a bool instead of copying lock identity * Patch 4 - "bpf: rename list_head -> graph_root in field info types" * v2's version was applied to bpf-next, not including in respins * Patch 6 - "bpf: Add bpf_rbtree_{add,remove,first} kfuncs" * Remove KF_RELEASE_NON_OWN flag from rbtree_add, just add it to specific kfunc matching (Alexei, David) * Patch 9 - "bpf: Special verifier handling for bpf_rbtree_{remove, first}" * Remove KF_INVALIDATE_NON_OWN kfunc flag, just match against specific kfunc for now (Alexei, David) * Patch 11 - "libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type" * Drop for now, will submit separately * Patch 12 - "selftests/bpf: Add rbtree selftests" * Some expected-failure tests have different error messages 
due to "release non-owning reference logic" being separated from KF_RELEASE logic in Patch 2 changes * Patch 13 - "bpf, documentation: Add graph documentation for non-owning refs" * Fix documentation formatting and improve content (David) v1 -> v2: lore.kernel.org/bpf/20221206231000.3180914-1-davemarchevsky@fb.com/ Series-wide changes: * Rename datastructure_{head,node,api} -> graph_{root,node,api} (Alexei) * "graph datastructure" in patch summaries to refer to linked_list + rbtree instead of "next-gen datastructure" (Alexei) * Move from hacky marking of non-owning references as PTR_UNTRUSTED to cleaner implementation (Alexei) * Add invalidation of non-owning refs to rbtree_remove (Kumar, Alexei) Patch #'s below refer to the patch's number in v1 unless otherwise stated. Note that in v1 most of the meaty verifier changes were in the latter half of the series. Here, about half of that complexity has been moved to "bpf: Migrate release_on_unlock logic to non-owning ref semantics" - was Patch 3 in v1. * Patch 1 - "bpf: Loosen alloc obj test in verifier's reg_btf_record" * Was applied, dropped from further iterations * Patch 2 - "bpf: map_check_btf should fail if btf_parse_fields fails" * Dropped in favor of verifier check-on-use: when some normal verifier checking expects the map to have btf_fields correctly parsed, it won't find any and verification will fail * New patch added before Patch 3 - "bpf: Support multiple arg regs w/ ref_obj_id for kfuncs" * Addition of KF_RELEASE_NON_OWN flag, which requires KF_RELEASE, and tagging of bpf_list_push_{front,back} KF_RELEASE | KF_RELEASE_NON_OWN, means that list-in-list push_{front,back} will trigger "only one ref_obj_id arg reg" logic. This is because "head" arg to those functions can be a list-in-list, which itself can be an owning reference with ref_obj_id. So need to support multiple ref_obj_id for release kfuncs. 
* Patch 3 - "bpf: Minor refactor of ref_set_release_on_unlock" * Now a major refactor w/ a rename to reflect this * "bpf: Migrate release_on_unlock logic to non-owning ref semantics" * Replaces release_on_unlock with active_lock logic as discussed in v1 * New patch added after Patch 3 - "selftests/bpf: Update linked_list tests for non_owning_ref logic" * Removes "write after push" linked_list failure tests - no longer failure scenarios. * Patch 4 - "bpf: rename list_head -> datastructure_head in field info types" * rename to graph_root instead. Similar renamings across the series - see series-wide changes. * Patch 5 - "bpf: Add basic bpf_rb_{root,node} support" * OWNER_FIELD_MASK -> GRAPH_ROOT_MASK, OWNEE_FIELD_MASK -> GRAPH_NODE_MASK, and change of "owner"/"ownee" in big btf_check_and_fixup_fields comment to "root"/"node" (Alexei) * Patch 6 - "bpf: Add bpf_rbtree_{add,remove,first} kfuncs" * bpf_rbtree_remove can no longer return NULL. v2 continues v1's "use type system to prevent remove of node that isn't in a datastructure" approach, so rbtree_remove should never have been able to return NULL * Patch 7 - "bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args" * is_bpf_datastructure_api_kfunc -> is_bpf_graph_api_kfunc (Alexei) * Patch 8 - "bpf: Add callback validation to kfunc verifier logic" * Explicitly disallow rbtree_remove in rbtree callback * Explicitly disallow bpf_spin_{lock,unlock} call in rbtree callback, preventing possibility of "unbalanced" unlock (Alexei) * Patch 10 - "bpf, x86: BPF_PROBE_MEM handling for insn->off < 0" * Now that non-owning refs aren't marked PTR_UNTRUSTED it's not necessary to include this patch as part of the series * After conversation w/ Alexei, did another pass and submitted as an independent series (lore.kernel.org/bpf/20221213182726.325137-1-davemarchevsky@fb.com/) * Patch 13 - "selftests/bpf: Add rbtree selftests" * Since bpf_rbtree_remove can no longer return null, remove null checks * Remove test confirming that 
rbtree_first isn't allowed in callback. We want this to be possible * Add failure test confirming that rbtree_remove's new non-owning reference invalidation behavior behaves as expected * Add SEC("license") to rbtree_btf_fail__* progs. They were previously failing due to lack of this section. Now they're failing for correct reasons. * rbtree_btf_fail__add_wrong_type.c - add locking around rbtree_add, rename the bpf prog to something reasonable * New patch added after patch 13 - "bpf, documentation: Add graph documentation for non-owning refs" * Summarizes details of owning and non-owning refs which we hashed out in v1 ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
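The example in the cover letter above relies on two ideas that can be tried outside BPF: recovering the enclosing struct from an embedded node with container_of, and ordering nodes through a 'less' callback. The sketch below is plain userspace C under stated assumptions: `toy_rb_node` is a placeholder for `bpf_rb_node`, and `toy_first()` is a hypothetical linear scan that only mimics what "first" means (the smallest node under `less`), not how the kernel rbtree finds it.

```c
#include <stdbool.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Placeholder for bpf_rb_node; only its address matters in this sketch. */
struct toy_rb_node {
    int unused;
};

struct node_data {
    long key;
    long data;
    struct toy_rb_node node;
};

/* Same shape as the 'less' callback in the cover letter: order by key. */
static bool less(struct toy_rb_node *a, const struct toy_rb_node *b)
{
    struct node_data *node_a = container_of(a, struct node_data, node);
    const struct node_data *node_b = container_of(b, struct node_data, node);

    return node_a->key < node_b->key;
}

/* Hypothetical helper: return the smallest node under `less`, or NULL. */
static struct toy_rb_node *toy_first(struct toy_rb_node **nodes, size_t n)
{
    struct toy_rb_node *min = NULL;

    for (size_t i = 0; i < n; i++)
        if (!min || less(nodes[i], min))
            min = nodes[i];
    return min;
}
```

As in the cover letter, the caller gets back a pointer to the embedded node and uses container_of to reach `key` and `data`. The locking and ownership rules enforced by the verifier are not modelled here.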
2023-02-13bpf, documentation: Add graph documentation for non-owning refsDave Marchevsky
It is difficult to intuit the semantics of owning and non-owning references from verifier code. In order to keep the high-level details from being lost in the mailing list, this patch adds documentation explaining semantics and details. The target audience of doc added in this patch is folks working on BPF internals, as there's focus on "what should the verifier do here". Via reorganization or copy-and-paste, much of the content can probably be repurposed for BPF program writer audience as well. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-9-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13selftests/bpf: Add rbtree selftestsDave Marchevsky
This patch adds selftests exercising the logic changed/added in the previous patches in the series. A variety of successful and unsuccessful rbtree usages are validated:

Success:
  * Add some nodes, let map_value bpf_rbtree_root destructor clean them up
  * Add some nodes, remove one using the non-owning ref leftover by successful rbtree_add() call
  * Add some nodes, remove one using the non-owning ref returned by rbtree_first() call

Failure:
  * BTF where bpf_rb_root owns bpf_list_node should fail to load
  * BTF where node of type X is added to tree containing nodes of type Y should fail to load
  * No calling rbtree api functions in 'less' callback for rbtree_add
  * No releasing lock in 'less' callback for rbtree_add
  * No removing a node which hasn't been added to any tree
  * No adding a node which has already been added to a tree
  * No escaping of non-owning references past their lock's critical section
  * No escaping of non-owning references past other invalidation points (rbtree_remove)

These tests mostly focus on rbtree-specific additions, but some of the failure cases revalidate scenarios common to both linked_list and rbtree which are covered in the former's tests. Better to be a bit redundant in case linked_list and rbtree semantics deviate over time.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-8-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.hDave Marchevsky
These kfuncs will be used by selftests in following patches Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-7-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13bpf: Special verifier handling for bpf_rbtree_{remove, first}Dave Marchevsky
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties that require handling in the verifier:

  * both bpf_rbtree_remove and bpf_rbtree_first return the type containing the bpf_rb_node field, with the offset set to that field's offset, instead of a struct bpf_rb_node *
    * mark_reg_graph_node helper added in previous patch generalizes this logic, use it
  * bpf_rbtree_remove's node input is a node that's been inserted in the tree - a non-owning reference.
  * bpf_rbtree_remove must invalidate non-owning references in order to avoid aliasing issue. Use previously-added invalidate_non_owning_refs helper to mark this function as a non-owning ref invalidation point.
  * Unlike other functions, which convert one of their input arg regs to non-owning reference, bpf_rbtree_first takes no arguments and just returns a non-owning reference (possibly null)
    * For now verifier logic for this is special-cased instead of adding new kfunc flag.

This patch, along with the previous one, completes special verifier handling for all rbtree API functions added in this series.

With functional verifier handling of rbtree_remove, under current non-owning reference scheme, a node type with both bpf_{list,rb}_node fields could cause the verifier to accept programs which remove such nodes from collections they haven't been added to. In order to prevent this, this patch adds a check to btf_parse_fields which rejects structs with both bpf_{list,rb}_node fields. This is a temporary measure that can be removed after "collection identity" followup. See comment added in btf_parse_fields. A linked_list BTF test exercising the new check is added in this patch as well.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230214004017.2534011-6-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13bpf: Add callback validation to kfunc verifier logicDave Marchevsky
Some BPF helpers take a callback function which the helper calls. For each helper that takes such a callback, there's a special call to __check_func_call with a callback-state-setting callback that sets up verifier bpf_func_state for the callback's frame. kfuncs don't have any of this infrastructure yet, so let's add it in this patch, following existing helper pattern as much as possible. To validate functionality of this added plumbing, this patch adds callback handling for the bpf_rbtree_add kfunc and hopes to lay groundwork for future graph datastructure callbacks. In the "general plumbing" category we have: * check_kfunc_call doing callback verification right before clearing CALLER_SAVED_REGS, exactly like check_helper_call * recognition of func_ptr BTF types in kfunc args as KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type In the "rbtree_add / graph datastructure-specific plumbing" category: * Since bpf_rbtree_add must be called while the spin_lock associated with the tree is held, don't complain when callback's func_state doesn't unlock it by frame exit * Mark rbtree_add callback's args with ref_set_non_owning to prevent rbtree api functions from being called in the callback. Semantically this makes sense, as less() takes no ownership of its args when determining which comes first. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
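The role of a 'less' callback can be illustrated with a minimal userspace sketch. This is not the kernel's kfunc API: `struct item`, `item_less` and `sorted_add` are hypothetical names, and a sorted singly linked list stands in for the rbtree. The point it shows is the one made above: the callback only compares its two arguments and takes no ownership of either.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node type; a sorted list stands in for the rbtree. */
struct item {
	long key;
	struct item *next;
};

typedef bool (*less_fn)(const struct item *a, const struct item *b);

/* The callback only reads its arguments; it never frees or relinks them. */
static bool item_less(const struct item *a, const struct item *b)
{
	return a->key < b->key;
}

/* Insert n before the first element that n is "less" than. */
static void sorted_add(struct item **head, struct item *n, less_fn less)
{
	while (*head && !less(n, *head))
		head = &(*head)->next;
	n->next = *head;
	*head = n;
}
```

Because the arguments are `const`, the compiler rejects a callback that tries to modify the nodes, which loosely mirrors the intent of marking the callback's arguments as non-owning.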
2023-02-13bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc argsDave Marchevsky
Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args that contain those types special classification and properly handle these types when checking kfunc args. "Properly handling" these types largely requires generalizing similar handling for bpf_list_{head,node}, with little new logic added in this patch. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13bpf: Add bpf_rbtree_{add,remove,first} kfuncsDave Marchevsky
This patch adds implementations of bpf_rbtree_{add,remove,first} and teaches verifier about their BTF_IDs as well as those of bpf_rb_{root,node}. All three kfuncs have some nonstandard component to their verification that needs to be addressed in future patches before programs can properly use them: * bpf_rbtree_add: Takes 'less' callback, need to verify it * bpf_rbtree_first: Returns ptr_to_node_type(off=rb_node_off) instead of ptr_to_rb_node(off=0). Return value ref is non-owning. * bpf_rbtree_remove: Returns ptr_to_node_type(off=rb_node_off) instead of ptr_to_rb_node(off=0). 2nd arg (node) is a non-owning reference. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
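The "ptr_to_node_type(off=rb_node_off)" notation describes a pointer to the embedded node field inside a larger struct. A minimal userspace sketch of that relationship follows; `struct rb_node_model`, `struct node_data`, `container_of_model` and `node_from_rb` are hypothetical stand-ins, not kernel definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the opaque struct bpf_rb_node. */
struct rb_node_model {
	void *left, *right;
};

/* User-defined node type with the rb node embedded at a non-zero offset. */
struct node_data {
	long key;
	struct rb_node_model node;
};

/* Step back from the embedded field to the start of the containing struct. */
#define container_of_model(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct node_data *node_from_rb(struct rb_node_model *rb)
{
	return container_of_model(rb, struct node_data, node);
}
```

Given a pointer to `n.node`, `node_from_rb()` recovers `&n`; the offset it subtracts is what `rb_node_off` denotes in the message above.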
2023-02-13bpf: Add basic bpf_rb_{root,node} supportDave Marchevsky
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new types, and adds bpf_rb_root_free function for freeing bpf_rb_root in map_values. structs bpf_rb_root and bpf_rb_node are opaque types meant to obscure structs rb_root_cached and rb_node, respectively. btf_struct_access will prevent BPF programs from touching these special fields automatically now that they're recognized. btf_check_and_fixup_fields now groups list_head and rb_root together as "graph root" fields and {list,rb}_node as "graph node", and does same ownership cycle checking as before. Note that this function does _not_ prevent ownership type mixups (e.g. rb_root owning list_node) - that's handled by btf_parse_graph_root. After this patch, a bpf program can have a struct bpf_rb_root in a map_value, but not add anything to nor do anything useful with it. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230214004017.2534011-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-02-13net: lan966x: set xdp_features flagLorenzo Bianconi
Set xdp_features netdevice flag if lan966x nic supports xdp mode. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> Link: https://lore.kernel.org/r/01f4412f28899d97b0054c9c1a63694201301b42.1676055718.git.lorenzo@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13Merge tag 'mm-hotfixes-stable-2023-02-13-13-50' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Twelve hotfixes, mostly against mm/. Five of these fixes are cc:stable"

* tag 'mm-hotfixes-stable-2023-02-13-13-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  of: reserved_mem: Have kmemleak ignore dynamically allocated reserved mem
  scripts/gdb: fix 'lx-current' for x86
  lib: parser: optimize match_NUMBER apis to use local array
  mm: shrinkers: fix deadlock in shrinker debugfs
  mm: hwpoison: support recovery from ksm_might_need_to_copy()
  kasan: fix Oops due to missing calls to kasan_arch_is_ready()
  revert "squashfs: harden sanity check in squashfs_read_xattr_id_table"
  fsdax: dax_unshare_iter() should return a valid length
  mm/gup: add folio to list when folio_isolate_lru() succeed
  aio: fix mremap after fork null-deref
  mailmap: add entry for Alexander Mikhalitsyn
  mm: extend max struct page size for kmsan
2023-02-13bpf: Migrate release_on_unlock logic to non-owning ref semanticsDave Marchevsky
This patch introduces non-owning reference semantics to the verifier, specifically linked_list API kfunc handling. release_on_unlock logic for refs is refactored - with small functional changes - to implement these semantics, and bpf_list_push_{front,back} are migrated to use them.

When a list node is pushed to a list, the program still has a pointer to the node:

  n = bpf_obj_new(typeof(*n));

  bpf_spin_lock(&l);
  bpf_list_push_back(&l, n);
  /* n still points to the just-added node */
  bpf_spin_unlock(&l);

What the verifier considers n to be after the push, and thus what can be done with n, are changed by this patch.

Common properties both before/after this patch:
  * After push, n is only a valid reference to the node until end of critical section
  * After push, n cannot be pushed to any list
  * After push, the program can read the node's fields using n

Before:
  * After push, n retains the ref_obj_id which it received on bpf_obj_new, but the associated bpf_reference_state's release_on_unlock field is set to true
    * release_on_unlock field and associated logic is used to implement "n is only a valid ref until end of critical section"
  * After push, n cannot be written to, the node must be removed from the list before writing to its fields
  * After push, n is marked PTR_UNTRUSTED

After:
  * After push, n's ref is released and ref_obj_id set to 0. NON_OWN_REF type flag is added to reg's type, indicating that it's a non-owning reference.
    * NON_OWN_REF flag and logic is used to implement "n is only a valid ref until end of critical section"
  * n can be written to (except for special fields e.g. bpf_list_node, timer, ...)

Summary of specific implementation changes to achieve the above:
  * release_on_unlock field, ref_set_release_on_unlock helper, and logic to "release on unlock" based on that field are removed
  * The anonymous active_lock struct used by bpf_verifier_state is pulled out into a named struct bpf_active_lock.
  * NON_OWN_REF type flag is introduced along with verifier logic changes to handle non-owning refs
  * Helpers are added to use NON_OWN_REF flag to implement non-owning ref semantics as described above
    * invalidate_non_owning_refs - helper to clobber all non-owning refs matching a particular bpf_active_lock identity. Replaces release_on_unlock logic in process_spin_lock.
    * ref_set_non_owning - set NON_OWN_REF type flag after doing some sanity checking
    * ref_convert_owning_non_owning - convert owning reference w/ specified ref_obj_id to non-owning references. Set NON_OWN_REF flag for each reg with that ref_obj_id and 0-out its ref_obj_id
  * Update linked_list selftests to account for minor semantic differences introduced by this patch
    * Writes to a release_on_unlock node ref are not allowed, while writes to non-owning reference pointees are. As a result the linked_list "write after push" failure tests are no longer scenarios that should fail.
    * The test##missing_lock##op and test##incorrect_lock##op macro-generated failure tests need to have a valid node argument in order to have the same error output as before. Otherwise verification will fail early and the expected error output won't be seen.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Link: https://lore.kernel.org/r/20230212092715.1422619-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
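The "After" rules can be condensed into a toy state machine. The sketch below is a userspace model only, under the assumption that a reference is in exactly one of three states; `enum ref_kind`, `struct ref_model` and the `model_*` functions are hypothetical names and do not correspond to verifier code.

```c
#include <stdbool.h>

/* Hypothetical model of what the verifier believes about one pointer. */
enum ref_kind { REF_OWNING, REF_NON_OWNING, REF_INVALID };

struct ref_model {
	enum ref_kind kind;
};

/* Push: only an owning ref may be pushed; it becomes non-owning. */
static bool model_push(struct ref_model *r)
{
	if (r->kind != REF_OWNING)
		return false;
	r->kind = REF_NON_OWNING;
	return true;
}

/* Writes to ordinary fields are allowed through owning and non-owning refs. */
static bool model_can_write(const struct ref_model *r)
{
	return r->kind != REF_INVALID;
}

/* End of critical section: non-owning refs stop being valid. */
static void model_unlock(struct ref_model *r)
{
	if (r->kind == REF_NON_OWNING)
		r->kind = REF_INVALID;
}
```

In this model a second `model_push()` on the same reference fails, a write between push and unlock succeeds, and any use after `model_unlock()` is rejected, matching the three properties listed for the new semantics.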
2023-02-13ice: fix lost multicast packets in promisc modeJesse Brandeburg
There was a problem reported to us where the addition of a VF with an IPv6 address ending with a particular sequence would cause the parent device on the PF to no longer be able to respond to neighbor discovery packets. In this case, we had an ovs-bridge device living on top of a VLAN, which was on top of a PF, and it would not be able to talk anymore (the neighbor entry would expire and couldn't be restored). The root cause of the issue is that if the PF is asked to be in IFF_PROMISC mode (promiscuous mode) and it had an ipv6 address that needed the 33:33:ff:00:00:04 multicast address to work, then when the VF was added with the need for the same multicast address, the VF would steal all the traffic destined for that address. The ice driver didn't auto-subscribe a request of IFF_PROMISC to the "multicast replication from other port's traffic" meaning that it won't get for instance, packets with an exact destination in the VF, as above. The VF's IPv6 address, which adds a "perfect filter" for 33:33:ff:00:00:04, results in no packets for that multicast address making it to the PF (which is in promisc but NOT "multicast replication"). The fix is to enable "multicast promiscuous" whenever the driver is asked to enable IFF_PROMISC, and make sure to disable it when appropriate. Fixes: e94d44786693 ("ice: Implement filter sync, NDO operations and bump version") Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Rafal Romanowski <rafal.romanowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
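The 33:33:ff:00:00:04 address follows from the standard IPv6 mappings: the solicited-node multicast group keeps the low 24 bits of the unicast address (RFC 4291), and IPv6 multicast is carried on Ethernet with a 33:33 prefix followed by the low 32 bits of the group address (RFC 2464). A minimal sketch, with `solicited_node_mac` as a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Derive the Ethernet multicast MAC used for neighbor discovery of a
 * 16-byte IPv6 unicast address: 33:33:ff followed by its low 24 bits.
 */
static void solicited_node_mac(const uint8_t addr[16], uint8_t mac[6])
{
	mac[0] = 0x33;
	mac[1] = 0x33;
	mac[2] = 0xff;
	mac[3] = addr[13];
	mac[4] = addr[14];
	mac[5] = addr[15];
}
```

Any two addresses whose last three bytes are 00:00:04 (for example a hypothetical fe80::4 on the PF and a different prefix ending in ::4 on the VF) map to the same MAC, which is why the VF's perfect filter could capture the PF's neighbor discovery traffic.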
2023-02-13Merge branch 'xdp-ice-mbuf'Daniel Borkmann
Alexander Lobakin says: ==================== The set grew from the poor performance of %BPF_F_TEST_XDP_LIVE_FRAMES when the ice-backed device is a sender. Initially there were around 3.3 Mpps / thread, while I have 5.5 on skb-based pktgen ... After fixing 0005 (0004 is a prereq for it) first (strange thing nobody noticed that earlier), I started catching random OOMs. This is how 0002 (and partially 0001) appeared. 0003 is a suggestion from Maciej to not waste time on refactoring dead lines. 0006 is a "cherry on top" to get away with the final 6.7 Mpps. 4.5 of 6 are fixes, but only the first three are tagged, since it then starts being tricky. I may backport them manually later on. TL;DR for the series is that shortcuts are good, but only as long as they don't make the driver miss important things. %XDP_TX is purely driver-local, however .ndo_xdp_xmit() is not, and sometimes assumptions can be unsafe there. With that series and also one core code patch[0], "live frames" and xdp-trafficgen are now safe'n'fast on ice (probably more to come). [0] https://lore.kernel.org/all/20230209172827.874728-1-alexandr.lobakin@intel.com ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2023-02-13ice: Micro-optimize .ndo_xdp_xmit() pathAlexander Lobakin
After the recent mbuf changes, ice_xmit_xdp_ring() became a 3-liner. It makes no sense to keep it global in a different file than its caller. Move it just next to the sole call site and mark static. Also, it doesn't need a full xdp_convert_frame_to_buff(). Save several cycles and fill only the fields used by __ice_xmit_xdp_ring() later on. Finally, since it doesn't modify @xdpf anyhow, mark the argument const to save some more (whole -11 bytes of .text! :D). Thanks to 1 jump less and less calcs as well, this yields as many as 6.7 Mpps per queue. `xdp.data_hard_start = xdpf` is fully intentional again (see xdp_convert_buff_to_frame()) and just works when there are no source device's driver issues. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-7-alexandr.lobakin@intel.com
2023-02-13ice: Fix freeing XDP frames backed by Page PoolAlexander Lobakin
As already mentioned, freeing any &xdp_frame via page_frag_free() is wrong, as it assumes the frame is backed by either an order-0 page or a page with no "patrons" behind them, while in fact frames backed by Page Pool can be redirected to a device whose driver doesn't use it. Keep storing a pointer to the raw buffer and then freeing it unconditionally via page_frag_free() for %XDP_TX frames, but introduce a separate type in the enum for frames coming through .ndo_xdp_xmit(), and free them via xdp_return_frame_bulk(). Note that saving xdpf as xdp_buff->data_hard_start is intentional and is always true when everything is configured properly. After this change, %XDP_REDIRECT from a Page Pool based driver to ice becomes zero-alloc as it should be and horrendous 3.3 Mpps / queue turn into 6.6, hehe. Let it go with no "Fixes:" tag as it spans across good 5+ commits and can't be trivially backported. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-6-alexandr.lobakin@intel.com
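The idea of choosing the free routine from an explicit buffer type, rather than from heuristics, can be sketched as below. All identifiers are hypothetical and do not match the driver's enum; the sketch only returns which free path would be taken so that it stays runnable in userspace.

```c
/* Hypothetical buffer types; the driver's real enum members differ. */
enum buf_type_model {
	BUF_EMPTY = 0,
	BUF_XDP_TX,	/* frame produced locally by %XDP_TX */
	BUF_XDP_XMIT,	/* frame received through .ndo_xdp_xmit() */
};

/* Which free routine the message above assigns to each type. */
enum free_path_model {
	FREE_NOTHING,
	FREE_VIA_PAGE_FRAG_FREE,
	FREE_VIA_XDP_RETURN_FRAME_BULK,
};

struct tx_buf_model {
	enum buf_type_model type;
	void *raw_buf;
};

static enum free_path_model pick_free_path(const struct tx_buf_model *buf)
{
	switch (buf->type) {
	case BUF_XDP_TX:
		return FREE_VIA_PAGE_FRAG_FREE;
	case BUF_XDP_XMIT:
		return FREE_VIA_XDP_RETURN_FRAME_BULK;
	default:
		return FREE_NOTHING;
	}
}
```

Setting the type when the frame is queued means the completion path never has to guess the memory model of a frame that came from another driver.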
2023-02-13ice: Robustify cleaning/completing XDP Tx buffersAlexander Lobakin
When queueing frames from a Page Pool for redirecting to a device backed by the ice driver, `perf top` shows heavy load on page_alloc() and page_frag_free(), despite that on a properly working system it must be fully or at least almost zero-alloc. The problem is in fact a bit deeper and arises from how ice cleans up completed Tx buffers. The story so far: when cleaning/freeing the resources related to a particular completed Tx frame (skbs, DMA mappings etc.), ice uses some heuristics only without setting any type explicitly (except for dummy Flow Director packets, which are marked via ice_tx_buf::tx_flags). This kinda works, but only up to some point. For example, currently ice assumes that each frame coming to __ice_xmit_xdp_ring(), is backed by either plain order-0 page or plain page frag, while it may also be backed by Page Pool or any other possible memory models introduced in future. This means any &xdp_frame must be freed properly via xdp_return_frame() family with no assumptions. In order to do that, the whole heuristics must be replaced with setting the Tx buffer/frame type explicitly, just how it's always been done via an enum. Let us reuse 16 bits from ::tx_flags -- 1 bit-and instr won't hurt much -- especially given that sometimes there was a check for %ICE_TX_FLAGS_DUMMY_PKT, which is now turned from a flag to an enum member. The rest of the changes is straightforward and most of it is just a conversion to rely now on the type set in &ice_tx_buf rather than on some secondary properties. For now, no functional changes intended, the change only prepares the ground for starting freeing XDP frames properly next step. And it must be done atomically/synchronously to not break stuff. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-5-alexandr.lobakin@intel.com
2023-02-13ice: Remove two impossible branches on XDP Tx cleaningAlexander Lobakin
The tagged commit started sending %XDP_TX frames from XSk Rx ring directly without converting it to an &xdp_frame. However, when XSk is enabled on a queue pair, it has its separate Tx cleaning functions, so neither ice_clean_xdp_irq() nor ice_unmap_and_free_tx_buf() ever happens there. Remove impossible branches in order to reduce the diffstat of the upcoming change. Fixes: a24b4c6e9aab ("ice: xsk: Do not convert to buff to frame for XDP_TX") Suggested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-4-alexandr.lobakin@intel.com
2023-02-13ice: Fix XDP Tx ring overrunAlexander Lobakin
Sometimes, under heavy XDP Tx traffic, e.g. when using XDP traffic generator (%BPF_F_TEST_XDP_LIVE_FRAMES), the machine can catch OOM due to the driver not freeing all of the pages passed to it by .ndo_xdp_xmit(). Turned out that during the development of the tagged commit, the check, which ensures that we have a free descriptor to queue a frame, moved into the branch happening only when a buffer has frags. Otherwise, we only run a cleaning cycle, but don't check anything. At the same time, there can be situations when the driver gets new frames to send, but there are no buffers that can be cleaned/completed and the ring has no free slots. It's very rare, but still possible (> 6.5 Mpps per ring). The driver then fills the next buffer/descriptor, effectively overwriting the data, which still needs to be freed. Restore the check after the cleaning routine to make sure there is a slot to queue a new frame. When there are frags, there still will be a separate check that we can place all of them, but if the ring is full, there's no point in wasting any more time. (minor: make `!ready_frames` unlikely since it happens ~1-2 times per billion of frames) Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-3-alexandr.lobakin@intel.com
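The ordering of checks described above (clean first, then always confirm a free slot, then confirm room for the frags) can be sketched with a simple occupancy counter. `struct ring_model`, `ring_unused` and `ring_try_queue` are hypothetical names; the real driver tracks ring indices rather than a single count.

```c
#include <stdbool.h>

/* Hypothetical ring model: count descriptors, in_use of them occupied. */
struct ring_model {
	unsigned int count;
	unsigned int in_use;
};

static unsigned int ring_unused(const struct ring_model *r)
{
	return r->count - r->in_use;
}

/* Try to queue one frame made of a head buffer plus nr_frags fragments. */
static bool ring_try_queue(struct ring_model *r, unsigned int completed,
			   unsigned int nr_frags)
{
	/* Cleaning cycle: release descriptors the hardware has completed. */
	if (completed > r->in_use)
		completed = r->in_use;
	r->in_use -= completed;

	/* Restored check: bail out early if the ring is still full. */
	if (ring_unused(r) == 0)
		return false;

	/* Separate check for multi-buffer frames: all parts must fit. */
	if (ring_unused(r) < 1 + nr_frags)
		return false;

	r->in_use += 1 + nr_frags;
	return true;
}
```

Without the early check, a full ring with nothing to clean and a frame without frags would fall through and overwrite a descriptor that still owns memory, which is the leak behind the OOM.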
2023-02-13ice: fix ice_tx_ring::xdp_tx_active underflowAlexander Lobakin
xdp_tx_active is used to indicate whether an XDP ring has any %XDP_TX frames queued to shortcut processing Tx cleaning for XSk-enabled queues. When !XSk, it simply indicates whether the ring has any queued frames in general. It gets increased on each frame placed onto the ring and counts the whole frame, not each frag. However, currently it gets decremented in ice_clean_xdp_tx_buf(), which is called per each buffer, i.e. per each frag. Thus, on completing multi-frag frames, an underflow happens. Move the decrement to the outer function and do it once per frame, not buf. Also, do that on the stack and update the ring counter after the loop is done to save several cycles. XSk rings are fine since there are no frags at the moment. Fixes: 3246a10752a7 ("ice: Add support for XDP multi-buffer on Tx side") Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20230210170618.1973430-2-alexandr.lobakin@intel.com
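The underflow can be reproduced with a small counter model. The sketch assumes a 16-bit unsigned counter (an assumption about the field's width) and uses hypothetical function names; it compares decrementing once per buffer with decrementing once per frame.

```c
#include <stdint.h>

/* Old behaviour: the counter drops once per buffer, i.e. once per frag. */
static uint16_t active_after_per_buf(uint16_t active, unsigned int frames,
				     unsigned int bufs_per_frame)
{
	for (unsigned int f = 0; f < frames; f++)
		for (unsigned int b = 0; b < bufs_per_frame; b++)
			active--;	/* wraps around below zero */
	return active;
}

/* Fixed behaviour: count frames on the stack, update the counter once. */
static uint16_t active_after_per_frame(uint16_t active, unsigned int frames,
				       unsigned int bufs_per_frame)
{
	uint16_t done = 0;

	for (unsigned int f = 0; f < frames; f++) {
		for (unsigned int b = 0; b < bufs_per_frame; b++)
			;	/* per-buffer cleanup would happen here */
		done++;
	}
	return (uint16_t)(active - done);
}
```

With two queued frames of three buffers each, the per-frame version ends at 0, while the per-buffer version decrements six times from 2 and wraps around to a large value.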