summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2023-07-19bpf: Add generic attach/detach/query API for multi-progsDaniel Borkmann
This adds a generic layer called bpf_mprog which can be reused by different attachment layers to enable multi-program attachment and dependency resolution. In-kernel users of the bpf_mprog don't need to care about the dependency resolution internals, they can just consume it with few API calls. The initial idea of having a generic API sparked out of discussion [0] from an earlier revision of this work where tc's priority was reused and exposed via BPF uapi as a way to coordinate dependencies among tc BPF programs, similar as-is for classic tc BPF. The feedback was that priority provides a bad user experience and is hard to use [1], e.g.: I cannot help but feel that priority logic copy-paste from old tc, netfilter and friends is done because "that's how things were done in the past". [...] Priority gets exposed everywhere in uapi all the way to bpftool when it's right there for users to understand. And that's the main problem with it. The user don't want to and don't need to be aware of it, but uapi forces them to pick the priority. [...] Your cover letter [0] example proves that in real life different service pick the same priority. They simply don't know any better. Priority is an unnecessary magic that apps _have_ to pick, so they just copy-paste and everyone ends up using the same. The course of the discussion showed more and more the need for a generic, reusable API where the "same look and feel" can be applied for various other program types beyond just tc BPF, for example XDP today does not have multi- program support in kernel, but also there was interest around this API for improving management of cgroup program types. Such common multi-program management concept is useful for BPF management daemons or user space BPF applications coordinating internally about their attachments. Both from Cilium and Meta side [2], we've collected the following requirements for a generic attach/detach/query API for multi-progs which has been implemented as part of this work: - Support prog-based attach/detach and link API - Dependency directives (can also be combined): - BPF_F_{BEFORE,AFTER} with relative_{fd,id} which can be {prog,link,none} - BPF_F_ID flag as {fd,id} toggle; the rationale for id is so that user space application does not need CAP_SYS_ADMIN to retrieve foreign fds via bpf_*_get_fd_by_id() - BPF_F_LINK flag as {prog,link} toggle - If relative_{fd,id} is none, then BPF_F_BEFORE will just prepend, and BPF_F_AFTER will just append for attaching - Enforced only at attach time - BPF_F_REPLACE with replace_bpf_fd which can be prog, links have their own infra for replacing their internal prog - If no flags are set, then it's default append behavior for attaching - Internal revision counter and optionally being able to pass expected_revision - User space application can query current state with revision, and pass it along for attachment to assert current state before doing updates - Query also gets extension for link_ids array and link_attach_flags: - prog_ids are always filled with program IDs - link_ids are filled with link IDs when link was used, otherwise 0 - {prog,link}_attach_flags for holding {prog,link}-specific flags - Must be easy to integrate/reuse for in-kernel users The uapi-side changes needed for supporting bpf_mprog are rather minimal, consisting of the additions of the attachment flags, revision counter, and expanding existing union with relative_{fd,id} member. The bpf_mprog framework consists of an bpf_mprog_entry object which holds an array of bpf_mprog_fp (fast-path structure). The bpf_mprog_cp (control-path structure) is part of bpf_mprog_bundle. Both have been separated, so that fast-path gets efficient packing of bpf_prog pointers for maximum cache efficiency. Also, array has been chosen instead of linked list or other structures to remove unnecessary indirections for a fast point-to-entry in tc for BPF. The bpf_mprog_entry comes as a pair via bpf_mprog_bundle so that in case of updates the peer bpf_mprog_entry is populated and then just swapped which avoids additional allocations that could otherwise fail, for example, in detach case. bpf_mprog_{fp,cp} arrays are currently static, but they could be converted to dynamic allocation if necessary at a point in future. Locking is deferred to the in-kernel user of bpf_mprog, for example, in case of tcx which uses this API in the next patch, it piggybacks on rtnl. An extensive test suite for checking all aspects of this API for prog-based attach/detach and link API comes as BPF selftests in this series. Thanks also to Andrii Nakryiko for early API discussions wrt Meta's BPF prog management. [0] https://lore.kernel.org/bpf/20221004231143.19190-1-daniel@iogearbox.net [1] https://lore.kernel.org/bpf/CAADnVQ+gEY3FjCR=+DmjDR4gp5bOYZUFJQXj4agKFHT9CQPZBw@mail.gmail.com [2] http://vger.kernel.org/bpfconf2023_material/tcx_meta_netdev_borkmann.pdf Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20230719140858.13224-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: allow any program to use the bpf_map_sum_elem_count kfuncAnton Protopopov
Register the bpf_map_sum_elem_count func for all programs, and update the map_ptr subtest of the test_progs test to test the new functionality. The usage is allowed as long as the pointer to the map is trusted (when using tracing programs) or is a const pointer to map, as in the following example: struct { __uint(type, BPF_MAP_TYPE_HASH); ... } hash SEC(".maps"); ... static inline int some_bpf_prog(void) { struct bpf_map *map = (struct bpf_map *)&hash; __s64 count; count = bpf_map_sum_elem_count(map); ... } Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-5-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: make an argument const in the bpf_map_sum_elem_count kfuncAnton Protopopov
We use the map pointer only to read the counter values, no locking involved, so mark the argument as const. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-4-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: consider CONST_PTR_TO_MAP as trusted pointer to struct bpf_mapAnton Protopopov
Add the BTF id of struct bpf_map to the reg2btf_ids array. This makes the values of the CONST_PTR_TO_MAP type to be considered as trusted by kfuncs. This, in turn, allows users to execute trusted kfuncs which accept `struct bpf_map *` arguments from non-tracing programs. While exporting the btf_bpf_map_id variable, save some bytes by defining it as BTF_ID_LIST_GLOBAL_SINGLE (which is u32[1]) and not as BTF_ID_LIST (which is u32[64]). Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-3-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-19bpf: consider types listed in reg2btf_ids as trustedAnton Protopopov
The reg2btf_ids array contains a list of types for which we can (and need) to find a corresponding static BTF id. All the types in the list can be considered as trusted for purposes of kfuncs. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230719092952.41202-2-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18bpf: Add 'owner' field to bpf_{list,rb}_nodeDave Marchevsky
As described by Kumar in [0], in shared ownership scenarios it is necessary to do runtime tracking of {rb,list} node ownership - and synchronize updates using this ownership information - in order to prevent races. This patch adds an 'owner' field to struct bpf_list_node and bpf_rb_node to implement such runtime tracking. The owner field is a void * that describes the ownership state of a node. It can have the following values: NULL - the node is not owned by any data structure BPF_PTR_POISON - the node is in the process of being added to a data structure ptr_to_root - the pointee is a data structure 'root' (bpf_rb_root / bpf_list_head) which owns this node The field is initially NULL (set by bpf_obj_init_field default behavior) and transitions states in the following sequence: Insertion: NULL -> BPF_PTR_POISON -> ptr_to_root Removal: ptr_to_root -> NULL Before a node has been successfully inserted, it is not protected by any root's lock, and therefore two programs can attempt to add the same node to different roots simultaneously. For this reason the intermediate BPF_PTR_POISON state is necessary. For removal, the node is protected by some root's lock so this intermediate hop isn't necessary. Note that bpf_list_pop_{front,back} helpers don't need to check owner before removing as the node-to-be-removed is not passed in as input and is instead taken directly from the list. Do the check anyways and WARN_ON_ONCE in this unexpected scenario. Selftest changes in this patch are entirely mechanical: some BTF tests have hardcoded struct sizes for structs that contain bpf_{list,rb}_node fields, those were adjusted to account for the new sizes. Selftest additions to validate the owner field are added in a further patch in the series. [0]: https://lore.kernel.org/bpf/d7hyspcow5wtjcmw4fugdgyp3fwhljwuscp3xyut5qnwivyeru@ysdq543otzv2 Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230718083813.3416104-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18bpf: Introduce internal definitions for UAPI-opaque bpf_{rb,list}_nodeDave Marchevsky
Structs bpf_rb_node and bpf_list_node are opaquely defined in uapi/linux/bpf.h, as BPF program writers are not expected to touch their fields - nor does the verifier allow them to do so. Currently these structs are simple wrappers around structs rb_node and list_head and linked_list / rbtree implementation just casts and passes to library functions for those data structures. Later patches in this series, though, will add an "owner" field to bpf_{rb,list}_node, such that they're not just wrapping an underlying node type. Moreover, the bpf linked_list and rbtree implementations will deal with these owner pointers directly in a few different places. To avoid having to do void *owner = (void*)bpf_list_node + sizeof(struct list_head) with opaque UAPI node types, add bpf_{list,rb}_node_kern struct definitions to internal headers and modify linked_list and rbtree to use the internal types where appropriate. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20230718083813.3416104-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18bpf: Repeat check_max_stack_depth for async callbacksKumar Kartikeya Dwivedi
While the check_max_stack_depth function explores call chains emanating from the main prog, which is typically enough to cover all possible call chains, it doesn't explore those rooted at async callbacks unless the async callback will have been directly called, since unlike non-async callbacks it skips their instruction exploration as they don't contribute to stack depth. It could be the case that the async callback leads to a callchain which exceeds the stack depth, but this is never reachable while only exploring the entry point from main subprog. Hence, repeat the check for the main subprog *and* all async callbacks marked by the symbolic execution pass of the verifier, as execution of the program may begin at any of them. Consider functions with following stack depths: main: 256 async: 256 foo: 256 main: rX = async bpf_timer_set_callback(...) async: foo() Here, async is not descended as it does not contribute to stack depth of main (since it is referenced using bpf_pseudo_func and not bpf_pseudo_call). However, when async is invoked asynchronously, it will end up breaching the MAX_BPF_STACK limit by calling foo. Hence, in addition to main, we also need to explore call chains beginning at all async callback subprogs in a program. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230717161530.1238-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-18bpf: Fix subprog idx logic in check_max_stack_depthKumar Kartikeya Dwivedi
The assignment to idx in check_max_stack_depth happens once we see a bpf_pseudo_call or bpf_pseudo_func. This is not an issue as the rest of the code performs a few checks and then pushes the frame to the frame stack, except the case of async callbacks. If the async callback case causes the loop iteration to be skipped, the idx assignment will be incorrect on the next iteration of the loop. The value stored in the frame stack (as the subprogno of the current subprog) will be incorrect. This leads to incorrect checks and incorrect tail_call_reachable marking. Save the target subprog in a new variable and only assign to idx once we are done with the is_async_cb check which may skip pushing of frame to the frame stack and subsequent stack depth checks and tail call markings. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230717161530.1238-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Alexei Starovoitov says: ==================== pull-request: bpf-next 2023-07-13 We've added 67 non-merge commits during the last 15 day(s) which contain a total of 106 files changed, 4444 insertions(+), 619 deletions(-). The main changes are: 1) Fix bpftool build in presence of stale vmlinux.h, from Alexander Lobakin. 2) Introduce bpf_me_mcache_free_rcu() and fix OOM under stress, from Alexei Starovoitov. 3) Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling, from Andrii Nakryiko. 4) Introduce bpf map element count, from Anton Protopopov. 5) Check skb ownership against full socket, from Kui-Feng Lee. 6) Support for up to 12 arguments in BPF trampoline, from Menglong Dong. 7) Export rcu_request_urgent_qs_task, from Paul E. McKenney. 8) Fix BTF walking of unions, from Yafang Shao. 9) Extend link_info for kprobe_multi and perf_event links, from Yafang Shao. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (67 commits) selftests/bpf: Add selftest for PTR_UNTRUSTED bpf: Fix an error in verifying a field in a union selftests/bpf: Add selftests for nested_trust bpf: Fix an error around PTR_UNTRUSTED selftests/bpf: add testcase for TRACING with 6+ arguments bpf, x86: allow function arguments up to 12 for TRACING bpf, x86: save/restore regs with BPF_DW size bpftool: Use "fallthrough;" keyword instead of comments bpf: Add object leak check. bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu. bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu(). selftests/bpf: Improve test coverage of bpf_mem_alloc. rcu: Export rcu_request_urgent_qs_task() bpf: Allow reuse from waiting_for_gp_ttrace list. bpf: Add a hint to allocated objects. bpf: Change bpf_mem_cache draining process. bpf: Further refactor alloc_bulk(). bpf: Factor out inc/dec of active flag into helpers. bpf: Refactor alloc_bulk(). bpf: Let free_all() return the number of freed elements. ... ==================== Link: https://lore.kernel.org/r/20230714020910.80794-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-07-13bpf: Fix an error in verifying a field in a unionYafang Shao
We are utilizing BPF LSM to monitor BPF operations within our container environment. When we add support for raw_tracepoint, it hits below error. ; (const void *)attr->raw_tracepoint.name); 27: (79) r3 = *(u64 *)(r2 +0) access beyond the end of member map_type (mend:4) in struct (anon) with off 0 size 8 It can be reproduced with below BPF prog. SEC("lsm/bpf") int BPF_PROG(bpf_audit, int cmd, union bpf_attr *attr, unsigned int size) { switch (cmd) { case BPF_RAW_TRACEPOINT_OPEN: bpf_printk("raw_tracepoint is %s", attr->raw_tracepoint.name); break; default: break; } return 0; } The reason is that when accessing a field in a union, such as bpf_attr, if the field is located within a nested struct that is not the first member of the union, it can result in incorrect field verification. union bpf_attr { struct { __u32 map_type; <<<< Actually it will find that field. __u32 key_size; __u32 value_size; ... }; ... struct { __u64 name; <<<< We want to verify this field. __u32 prog_fd; } raw_tracepoint; }; Considering the potential deep nesting levels, finding a perfect solution to address this issue has proven challenging. Therefore, I propose a solution where we simply skip the verification process if the field in question is located within a union. Fixes: 7e3617a72df3 ("bpf: Add array support to btf_struct_access") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20230713025642.27477-4-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-13bpf: Fix an error around PTR_UNTRUSTEDYafang Shao
Per discussion with Alexei, the PTR_UNTRUSTED flag should not been cleared when we start to walk a new struct, because the struct in question may be a struct nested in a union. We should also check and set this flag before we walk its each member, in case itself is a union. We will clear this flag if the field is BTF_TYPE_SAFE_RCU_OR_NULL. Fixes: 6fcd486b3a0a ("bpf: Refactor RCU enforcement in the verifier.") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Link: https://lore.kernel.org/r/20230713025642.27477-2-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-12bpf: Add object leak check.Hou Tao
The object leak check is cheap. Do it unconditionally to spot difficult races in bpf_mem_alloc. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230706033447.54696-15-alexei.starovoitov@gmail.com
2023-07-12bpf: Convert bpf_cpumask to bpf_mem_cache_free_rcu.Alexei Starovoitov
Convert bpf_cpumask to bpf_mem_cache_free_rcu. Note that migrate_disable() in bpf_cpumask_release() is still necessary, since bpf_cpumask_release() is a dtor. bpf_obj_free_fields() can be converted to do migrate_disable() there in a follow up. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-14-alexei.starovoitov@gmail.com
2023-07-12bpf: Introduce bpf_mem_free_rcu() similar to kfree_rcu().Alexei Starovoitov
Introduce bpf_mem_[cache_]free_rcu() similar to kfree_rcu(). Unlike bpf_mem_[cache_]free() that links objects for immediate reuse into per-cpu free list the _rcu() flavor waits for RCU grace period and then moves objects into free_by_rcu_ttrace list where they are waiting for RCU task trace grace period to be freed into slab. The life cycle of objects: alloc: dequeue free_llist free: enqeueu free_llist free_rcu: enqueue free_by_rcu -> waiting_for_gp free_llist above high watermark -> free_by_rcu_ttrace after RCU GP waiting_for_gp -> free_by_rcu_ttrace free_by_rcu_ttrace -> waiting_for_gp_ttrace -> slab Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-13-alexei.starovoitov@gmail.com
2023-07-12bpf: Allow reuse from waiting_for_gp_ttrace list.Alexei Starovoitov
alloc_bulk() can reuse elements from free_by_rcu_ttrace. Let it reuse from waiting_for_gp_ttrace as well to avoid unnecessary kmalloc(). Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230706033447.54696-10-alexei.starovoitov@gmail.com
2023-07-12bpf: Add a hint to allocated objects.Alexei Starovoitov
To address OOM issue when one cpu is allocating and another cpu is freeing add a target bpf_mem_cache hint to allocated objects and when local cpu free_llist overflows free to that bpf_mem_cache. The hint addresses the OOM while maintaining the same performance for common case when alloc/free are done on the same cpu. Note that do_call_rcu_ttrace() now has to check 'draining' flag in one more case, since do_call_rcu_ttrace() is called not only for current cpu. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-9-alexei.starovoitov@gmail.com
2023-07-12bpf: Change bpf_mem_cache draining process.Alexei Starovoitov
The next patch will introduce cross-cpu llist access and existing irq_work_sync() + drain_mem_cache() + rcu_barrier_tasks_trace() mechanism will not be enough, since irq_work_sync() + drain_mem_cache() on cpu A won't guarantee that llist on cpu A are empty. The free_bulk() on cpu B might add objects back to llist of cpu A. Add 'bool draining' flag. The modified sequence looks like: for_each_cpu: WRITE_ONCE(c->draining, true); // do_call_rcu_ttrace() won't be doing call_rcu() any more irq_work_sync(); // wait for irq_work callback (free_bulk) to finish drain_mem_cache(); // free all objects rcu_barrier_tasks_trace(); // wait for RCU callbacks to execute Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-8-alexei.starovoitov@gmail.com
2023-07-12bpf: Further refactor alloc_bulk().Alexei Starovoitov
In certain scenarios alloc_bulk() might be taking free objects mainly from free_by_rcu_ttrace list. In such case get_memcg() and set_active_memcg() are redundant, but they show up in perf profile. Split the loop and only set memcg when allocating from slab. No performance difference in this patch alone, but it helps in combination with further patches. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-7-alexei.starovoitov@gmail.com
2023-07-12bpf: Factor out inc/dec of active flag into helpers.Alexei Starovoitov
Factor out local_inc/dec_return(&c->active) into helpers. No functional changes. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-6-alexei.starovoitov@gmail.com
2023-07-12bpf: Refactor alloc_bulk().Alexei Starovoitov
Factor out inner body of alloc_bulk into separate helper. No functional changes. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-5-alexei.starovoitov@gmail.com
2023-07-12bpf: Let free_all() return the number of freed elements.Alexei Starovoitov
Let free_all() helper return the number of freed elements. It's not used in this patch, but helps in debug/development of bpf_mem_alloc. For example this diff for __free_rcu(): - free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size); + printk("cpu %d freed %d objs after tasks trace\n", raw_smp_processor_id(), + free_all(llist_del_all(&c->waiting_for_gp_ttrace), !!c->percpu_size)); would show how busy RCU tasks trace is. In artificial benchmark where one cpu is allocating and different cpu is freeing the RCU tasks trace won't be able to keep up and the list of objects would keep growing from thousands to millions and eventually OOMing. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-4-alexei.starovoitov@gmail.com
2023-07-12bpf: Simplify code of destroy_mem_alloc() with kmemdup().Alexei Starovoitov
Use kmemdup() to simplify the code. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-3-alexei.starovoitov@gmail.com
2023-07-12bpf: Rename few bpf_mem_alloc fields.Alexei Starovoitov
Rename: - struct rcu_head rcu; - struct llist_head free_by_rcu; - struct llist_head waiting_for_gp; - atomic_t call_rcu_in_progress; + struct llist_head free_by_rcu_ttrace; + struct llist_head waiting_for_gp_ttrace; + struct rcu_head rcu_ttrace; + atomic_t call_rcu_ttrace_in_progress; ... - static void do_call_rcu(struct bpf_mem_cache *c) + static void do_call_rcu_ttrace(struct bpf_mem_cache *c) to better indicate intended use. The 'tasks trace' is shortened to 'ttrace' to reduce verbosity. No functional changes. Later patches will add free_by_rcu/waiting_for_gp fields to be used with normal RCU. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20230706033447.54696-2-alexei.starovoitov@gmail.com
2023-07-12bpf: teach verifier actual bounds of bpf_get_smp_processor_id() resultAndrii Nakryiko
bpf_get_smp_processor_id() helper returns current CPU on which BPF program runs. It can't return value that is bigger than maximum allowed number of CPUs (minus one, due to zero indexing). Teach BPF verifier to recognize that. This makes it possible to use bpf_get_smp_processor_id() result to index into arrays without extra checks, as demonstrated in subsequent selftests/bpf patch. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230711232400.1658562-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11bpf: Support ->fill_link_info for perf_eventYafang Shao
By introducing support for ->fill_link_info to the perf_event link, users gain the ability to inspect it using `bpftool link show`. While the current approach involves accessing this information via `bpftool perf show`, consolidating link information for all link types in one place offers greater convenience. Additionally, this patch extends support to the generic perf event, which is not currently accommodated by `bpftool perf show`. While only the perf type and config are exposed to userspace, other attributes such as sample_period and sample_freq are ignored. It's important to note that if kptr_restrict is not permitted, the probed address will not be exposed, maintaining security measures. A new enum bpf_perf_event_type is introduced to help the user understand which struct is relevant. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20230709025630.3735-9-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11bpf: Add a common helper bpf_copy_to_user()Yafang Shao
Add a common helper bpf_copy_to_user(), which will be used at multiple places. No functional change. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230709025630.3735-8-laoar.shao@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-11bpf: cpumap: Fix memory leak in cpu_map_update_elemPu Lehui
Syzkaller reported a memory leak as follows: BUG: memory leak unreferenced object 0xff110001198ef748 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 4a 19 00 00 80 ad e3 e4 fe ff c0 00 ....J........... 00 b2 d3 0c 01 00 11 ff 28 f5 8e 19 01 00 11 ff ........(....... backtrace: [<ffffffffadd28087>] __cpu_map_entry_alloc+0xf7/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff110001198ef528 (size 192): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffffadd281f0>] __cpu_map_entry_alloc+0x260/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 BUG: memory leak unreferenced object 0xff1100010fd93d68 (size 8): comm "syz-executor.3", pid 17672, jiffies 4298118891 (age 9.906s) hex dump (first 8 bytes): 00 00 00 00 00 00 00 00 ........ backtrace: [<ffffffffade5db3e>] kvmalloc_node+0x11e/0x170 [<ffffffffadd28280>] __cpu_map_entry_alloc+0x2f0/0xb00 [<ffffffffadd28d8e>] cpu_map_update_elem+0x2fe/0x3d0 [<ffffffffadc6d0fd>] bpf_map_update_value.isra.0+0x2bd/0x520 [<ffffffffadc7349b>] map_update_elem+0x4cb/0x720 [<ffffffffadc7d983>] __se_sys_bpf+0x8c3/0xb90 [<ffffffffb029cc80>] do_syscall_64+0x30/0x40 [<ffffffffb0400099>] entry_SYSCALL_64_after_hwframe+0x61/0xc6 In the cpu_map_update_elem flow, when kthread_stop is called before calling the threadfn of rcpu->kthread, since the KTHREAD_SHOULD_STOP bit of kthread has been set by kthread_stop, the threadfn of rcpu->kthread will never be executed, and rcpu->refcnt will never be 0, which will lead to the allocated rcpu, rcpu->queue and rcpu->queue->queue cannot be released. Calling kthread_stop before executing kthread's threadfn will return -EINTR. We can complete the release of memory resources in this state. Fixes: 6710e1126934 ("bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP") Signed-off-by: Pu Lehui <pulehui@huawei.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20230711115848.2701559-1-pulehui@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06bpf: make preloaded map iterators to display map elements countAnton Protopopov
Add another column to the /sys/fs/bpf/maps.debug iterator to display cur_entries, the current number of entries in the map as is returned by the bpf_map_sum_elem_count kfunc. Also fix formatting. Example: # cat /sys/fs/bpf/maps.debug id name max_entries cur_entries 2 iterator.rodata 1 0 125 cilium_auth_map 524288 666 126 cilium_runtime_ 256 0 127 cilium_signals 32 0 128 cilium_node_map 16384 1344 129 cilium_events 32 0 ... Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230706133932.45883-5-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06bpf: populate the per-cpu insertions/deletions counters for hashmapsAnton Protopopov
Initialize and utilize the per-cpu insertions/deletions counters for hash-based maps. Non-trivial changes only apply to the preallocated maps for which the {inc,dec}_elem_count functions are not called, as there's no need in counting elements to sustain proper map operations. To increase/decrease percpu counters for preallocated maps we add raw calls to the bpf_map_{inc,dec}_elem_count functions so that the impact is minimal. For dynamically allocated maps we add corresponding calls to the existing {inc,dec}_elem_count functions. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230706133932.45883-4-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-06bpf: add a new kfunc to return current bpf_map elements countAnton Protopopov
A bpf_map_sum_elem_count kfunc was added to simplify getting the sum of the map per-cpu element counters. If a map doesn't implement the counter, then the function will always return 0. Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230706133932.45883-3-aspsk@isovalent.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-05bpf: Fix max stack depth check for async callbacksKumar Kartikeya Dwivedi
The check_max_stack_depth pass happens after the verifier's symbolic execution, and attempts to walk the call graph of the BPF program, ensuring that the stack usage stays within bounds for all possible call chains. There are two cases to consider: bpf_pseudo_func and bpf_pseudo_call. In the former case, the callback pointer is loaded into a register, and is assumed that it is passed to some helper later which calls it (however there is no way to be sure), but the check remains conservative and accounts the stack usage anyway. For this particular case, asynchronous callbacks are skipped as they execute asynchronously when their corresponding event fires. The case of bpf_pseudo_call is simpler and we know that the call is definitely made, hence the stack depth of the subprog is accounted for. However, the current check still skips an asynchronous callback even if a bpf_pseudo_call was made for it. This is erroneous, as it will miss accounting for the stack usage of the asynchronous callback, which can be used to breach the maximum stack depth limit. Fix this by only skipping asynchronous callbacks when the instruction is not a pseudo call to the subprog. Fixes: 7ddc80a476c2 ("bpf: Teach stack depth check about async callbacks.") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230705144730.235802-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-07-05bpf: Remove unnecessary ring buffer size checkHou Tao
The theoretical maximum size of ring buffer is about 64GB, but now the size of ring buffer is specified by max_entries in bpf_attr and its maximum value is (4GB - 1), and it won't be possible for overflow. So just remove the unnecessary size check in ringbuf_map_alloc() but keep the comments for possible extension in future. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Closes: https://lore.kernel.org/bpf/9c636a63-1f3d-442d-9223-96c2dccb9469@moroto.mountain Link: https://lore.kernel.org/bpf/20230704074014.216616-1-houtao@huaweicloud.com
2023-07-03bpf, btf: Warn but return no error for NULL btf from ↵SeongJae Park
__register_btf_kfunc_id_set() __register_btf_kfunc_id_set() assumes .BTF to be part of the module's .ko file if CONFIG_DEBUG_INFO_BTF is enabled. If that's not the case, the function prints an error message and return an error. As a result, such modules cannot be loaded. However, the section could be stripped out during a build process. It would be better to let the modules loaded, because their basic functionalities have no problem [0], though the BTF functionalities will not be supported. Make the function to lower the level of the message from error to warn, and return no error. [0] https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion Fixes: c446fdacb10d ("bpf: fix register_btf_kfunc_id_set for !CONFIG_DEBUG_INFO_BTF") Reported-by: Alexander Egorenkov <Alexander.Egorenkov@ibm.com> Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/87y228q66f.fsf@oc8242746057.ibm.com Link: https://lore.kernel.org/bpf/20220219082037.ow2kbq5brktf4f2u@apollo.legion Link: https://lore.kernel.org/bpf/20230701171447.56464-1-sj@kernel.org
2023-06-30bpf: Resolve modifiers when walking structsStanislav Fomichev
It is impossible to use skb_frag_t in the tracing program. Resolve typedefs when walking structs. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20230626212522.2414485-1-sdf@google.com
2023-06-29bpf: Replace deprecated -target with --target= for ClangFangrui Song
The -target option has been deprecated since clang 3.4 in 2013. Therefore, use the preferred --target=bpf form instead. This also matches how we use --target= in scripts/Makefile.clang. Signed-off-by: Fangrui Song <maskray@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://github.com/llvm/llvm-project/commit/274b6f0c87a6a1798de0a68135afc7f95def6277 Link: https://lore.kernel.org/bpf/20230624001856.1903733-1-maskray@google.com
2023-06-24Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-06-23 We've added 49 non-merge commits during the last 24 day(s) which contain a total of 70 files changed, 1935 insertions(+), 442 deletions(-). The main changes are: 1) Extend bpf_fib_lookup helper to allow passing the route table ID, from Louis DeLosSantos. 2) Fix regsafe() in verifier to call check_ids() for scalar registers, from Eduard Zingerman. 3) Extend the set of cpumask kfuncs with bpf_cpumask_first_and() and a rework of bpf_cpumask_any*() kfuncs. Additionally, add selftests, from David Vernet. 4) Fix socket lookup BPF helpers for tc/XDP to respect VRF bindings, from Gilad Sever. 5) Change bpf_link_put() to use workqueue unconditionally to fix it under PREEMPT_RT, from Sebastian Andrzej Siewior. 6) Follow-ups to address issues in the bpf_refcount shared ownership implementation, from Dave Marchevsky. 7) A few general refactorings to BPF map and program creation permissions checks which were part of the BPF token series, from Andrii Nakryiko. 8) Various fixes for benchmark framework and add a new benchmark for BPF memory allocator to BPF selftests, from Hou Tao. 9) Documentation improvements around iterators and trusted pointers, from Anton Protopopov. 10) Small cleanup in verifier to improve allocated object check, from Daniel T. Lee. 11) Improve performance of bpf_xdp_pointer() by avoiding access to shared_info when XDP packet does not have frags, from Jesper Dangaard Brouer. 12) Silence a harmless syzbot-reported warning in btf_type_id_size(), from Yonghong Song. 13) Remove duplicate bpfilter_umh_cleanup in favor of umd_cleanup_helper, from Jarkko Sakkinen. 14) Fix BPF selftests build for resolve_btfids under custom HOSTCFLAGS, from Viktor Malik. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits) bpf, docs: Document existing macros instead of deprecated bpf, docs: BPF Iterator Document selftests/bpf: Fix compilation failure for prog vrf_socket_lookup selftests/bpf: Add vrf_socket_lookup tests bpf: Fix bpf socket lookup from tc/xdp to respect socket VRF bindings bpf: Call __bpf_sk_lookup()/__bpf_skc_lookup() directly via TC hookpoint bpf: Factor out socket lookup functions for the TC hookpoint. selftests/bpf: Set the default value of consumer_cnt as 0 selftests/bpf: Ensure that next_cpu() returns a valid CPU number selftests/bpf: Output the correct error code for pthread APIs selftests/bpf: Use producer_cnt to allocate local counter array xsk: Remove unused inline function xsk_buff_discard() bpf: Keep BPF_PROG_LOAD permission checks clear of validations bpf: Centralize permissions checks for all BPF map types bpf: Inline map creation logic in map_create() function bpf: Move unprivileged checks into map_create() and bpf_prog_load() bpf: Remove in_atomic() from bpf_link_put(). selftests/bpf: Verify that check_ids() is used for scalars in regsafe() bpf: Verify scalar ids mapping in regsafe() using check_ids() selftests/bpf: Check if mark_chain_precision() follows scalar ids ... ==================== Link: https://lore.kernel.org/r/20230623211256.8409-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. Conflicts: tools/testing/selftests/net/fcnal-test.sh d7a2fc1437f7 ("selftests: net: fcnal-test: check if FIPS mode is enabled") dd017c72dde6 ("selftests: fcnal: Test SO_DONTROUTE on TCP sockets.") https://lore.kernel.org/all/5007b52c-dd16-dbf6-8d64-b9701bfa498b@tessares.net/ https://lore.kernel.org/all/20230619105427.4a0df9b3@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-21bpf: Force kprobe multi expected_attach_type for kprobe_multi linkJiri Olsa
We currently allow to create perf link for program with expected_attach_type == BPF_TRACE_KPROBE_MULTI. This will cause crash when we call helpers like get_attach_cookie or get_func_ip in such program, because it will call the kprobe_multi's version (current->bpf_ctx context setup) of those helpers while it expects perf_link's current->bpf_ctx context setup. Making sure that we use BPF_TRACE_KPROBE_MULTI expected_attach_type only for programs attaching through kprobe_multi link. Fixes: ca74823c6e16 ("bpf: Add cookie support to programs attached with kprobe multi link") Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20230618131414.75649-1-jolsa@kernel.org
2023-06-21bpf/btf: Accept function names that contain dotsFlorent Revest
When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM leaves DWARF tags for the "asan.module_ctor" & co symbols. In turn, pahole creates BTF_KIND_FUNC entries for these and this makes the BTF metadata validation fail because they contain a dot. In a dramatic turn of event, this BTF verification failure can cause the netfilter_bpf initialization to fail, causing netfilter_core to free the netfilter_helper hashmap and netfilter_ftp to trigger a use-after-free. The risk of u-a-f in netfilter will be addressed separately but the existence of "asan.module_ctor" debug info under some build conditions sounds like a good enough reason to accept functions that contain dots in BTF. Although using only LLVM=1 is the recommended way to compile clang-based kernels, users can certainly do LLVM=1, LLVM_IAS=0 as well and we still try to support that combination according to Nick. To clarify: - > v5.10 kernel, LLVM=1 (LLVM_IAS=0 is not the default) is recommended, but user can still have LLVM=1, LLVM_IAS=0 to trigger the issue - <= 5.10 kernel, LLVM=1 (LLVM_IAS=0 is the default) is recommended in which case GNU as will be used Fixes: 1dc92851849c ("bpf: kernel side support for BTF Var and DataSec") Signed-off-by: Florent Revest <revest@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrii Nakryiko <andrii@kernel.org> Cc: Yonghong Song <yhs@meta.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/bpf/20230615145607.3469985-1-revest@chromium.org
2023-06-19bpf: Keep BPF_PROG_LOAD permission checks clear of validationsAndrii Nakryiko
Move out flags validation and license checks out of the permission checks. They were intermingled, which makes subsequent changes harder. Clean this up: perform straightforward flag validation upfront, and fetch and check license later, right where we use it. Also consolidate capabilities check in one block, right after basic attribute sanity checks. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230613223533.3689589-5-andrii@kernel.org
2023-06-19bpf: Centralize permissions checks for all BPF map typesAndrii Nakryiko
This allows to do more centralized decisions later on, and generally makes it very explicit which maps are privileged and which are not (e.g., LRU_HASH and LRU_PERCPU_HASH, which are privileged HASH variants, as opposed to unprivileged HASH and HASH_PERCPU; now this is explicit and easy to verify). Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230613223533.3689589-4-andrii@kernel.org
2023-06-19bpf: Inline map creation logic in map_create() functionAndrii Nakryiko
Currently find_and_alloc_map() performs two separate functions: some argument sanity checking and partial map creation workflow hanling. Neither of those functions are self-sufficient and are augmented by further checks and initialization logic in the caller (map_create() function). So unify all the sanity checks, permission checks, and creation and initialization logic in one linear piece of code in map_create() instead. This also make it easier to further enhance permission checks and keep them located in one place. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230613223533.3689589-3-andrii@kernel.org
2023-06-19bpf: Move unprivileged checks into map_create() and bpf_prog_load()Andrii Nakryiko
Make each bpf() syscall command a bit more self-contained, making it easier to further enhance it. We move sysctl_unprivileged_bpf_disabled handling down to map_create() and bpf_prog_load(), two special commands in this regard. Also swap the order of checks, calling bpf_capable() only if sysctl_unprivileged_bpf_disabled is true, avoiding unnecessary audit messages. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230613223533.3689589-2-andrii@kernel.org
2023-06-16bpf: Remove in_atomic() from bpf_link_put().Sebastian Andrzej Siewior
bpf_free_inode() is invoked as a RCU callback. Usually RCU callbacks are invoked within softirq context. By setting rcutree.use_softirq=0 boot option the RCU callbacks will be invoked in a per-CPU kthread with bottom halves disabled which implies a RCU read section. On PREEMPT_RT the context remains fully preemptible. The RCU read section however does not allow schedule() invocation. The latter happens in mutex_lock() performed by bpf_trampoline_unlink_prog() originated from bpf_link_put(). It was pointed out that the bpf_link_put() invocation should not be delayed if originated from close(). It was also pointed out that other invocations from within a syscall should also avoid the workqueue. Everyone else should use workqueue by default to remain safe in the future (while auditing the code, every caller was preemptible except for the RCU case). Let bpf_link_put() use the worker unconditionally. Add bpf_link_put_direct() which will directly free the resources and is used by close() and from within __sys_bpf(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230614083430.oENawF8f@linutronix.de
2023-06-13bpf: Verify scalar ids mapping in regsafe() using check_ids()Eduard Zingerman
Make sure that the following unsafe example is rejected by verifier: 1: r9 = ... some pointer with range X ... 2: r6 = ... unbound scalar ID=a ... 3: r7 = ... unbound scalar ID=b ... 4: if (r6 > r7) goto +1 5: r6 = r7 6: if (r6 > X) goto ... --- checkpoint --- 7: r9 += r7 8: *(u64 *)r9 = Y This example is unsafe because not all execution paths verify r7 range. Because of the jump at (4) the verifier would arrive at (6) in two states: I. r6{.id=b}, r7{.id=b} via path 1-6; II. r6{.id=a}, r7{.id=b} via path 1-4, 6. Currently regsafe() does not call check_ids() for scalar registers, thus from POV of regsafe() states (I) and (II) are identical. If the path 1-6 is taken by verifier first, and checkpoint is created at (6) the path [1-4, 6] would be considered safe. Changes in this commit: - check_ids() is modified to disallow mapping multiple old_id to the same cur_id. - check_scalar_ids() is added, unlike check_ids() it treats ID zero as a unique scalar ID. - check_scalar_ids() needs to generate temporary unique IDs, field 'tmp_id_gen' is added to bpf_verifier_env::idmap_scratch to facilitate this. - regsafe() is updated to: - use check_scalar_ids() for precise scalar registers. - compare scalar registers using memcmp only for explore_alu_limits branch. This simplifies control flow for scalar case, and has no measurable performance impact. - check_alu_op() is updated to avoid generating bpf_reg_state::id for constant scalar values when processing BPF_MOV. ID is needed to propagate range information for identical values, but there is nothing to propagate for constants. Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230613153824.3324830-4-eddyz87@gmail.com
2023-06-13bpf: Use scalar ids in mark_chain_precision()Eduard Zingerman
Change mark_chain_precision() to track precision in situations like below: r2 = unknown value ... --- state #0 --- ... r1 = r2 // r1 and r2 now share the same ID ... --- state #1 {r1.id = A, r2.id = A} --- ... if (r2 > 10) goto exit; // find_equal_scalars() assigns range to r1 ... --- state #2 {r1.id = A, r2.id = A} --- r3 = r10 r3 += r1 // need to mark both r1 and r2 At the beginning of the processing of each state, ensure that if a register with a scalar ID is marked as precise, all registers sharing this ID are also marked as precise. This property would be used by a follow-up change in regsafe(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230613153824.3324830-2-eddyz87@gmail.com
2023-06-13bpf: ensure main program has an extableKrister Johansen
When subprograms are in use, the main program is not jit'd after the subprograms because jit_subprogs sets a value for prog->bpf_func upon success. Subsequent calls to the JIT are bypassed when this value is non-NULL. This leads to a situation where the main program and its func[0] counterpart are both in the bpf kallsyms tree, but only func[0] has an extable. Extables are only created during JIT. Now there are two nearly identical program ksym entries in the tree, but only one has an extable. Depending upon how the entries are placed, there's a chance that a fault will call search_extable on the aux with the NULL entry. Since jit_subprogs already copies state from func[0] to the main program, include the extable pointer in this state duplication. Additionally, ensure that the copy of the main program in func[0] is not added to the bpf_prog_kallsyms table. Instead, let the main program get added later in bpf_prog_load(). This ensures there is only a single copy of the main program in the kallsyms table, and that its tag matches the tag observed by tooling like bpftool. Cc: stable@vger.kernel.org Fixes: 1c2a088a6626 ("bpf: x64: add JIT support for multi-function programs") Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Ilya Leoshkevich <iii@linux.ibm.com> Tested-by: Ilya Leoshkevich <iii@linux.ibm.com> Link: https://lore.kernel.org/r/6de9b2f4b4724ef56efbb0339daaa66c8b68b1e7.1686616663.git.kjlx@templeofstupid.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-12bpf: Replace bpf_cpumask_any* with bpf_cpumask_any_distribute*David Vernet
We currently export the bpf_cpumask_any() and bpf_cpumask_any_and() kfuncs. Intuitively, one would expect these to choose any CPU in the cpumask, but what they actually do is alias to cpumask_first() and cpmkas_first_and(). This is useless given that we already export bpf_cpumask_first() and bpf_cpumask_first_and(), so this patch replaces them with kfuncs that call cpumask_any_distribute() and cpumask_any_and_distribute(), which actually choose any CPU from the cpumask (or the AND of two cpumasks for the latter). Signed-off-by: David Vernet <void@manifault.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230610035053.117605-3-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-06-12bpf: Add bpf_cpumask_first_and() kfuncDavid Vernet
We currently provide bpf_cpumask_first(), bpf_cpumask_any(), and bpf_cpumask_any_and() kfuncs. bpf_cpumask_any() and bpf_cpumask_any_and() are confusing misnomers in that they actually just call cpumask_first() and cpumask_first_and() respectively. We'll replace them with bpf_cpumask_any_distribute() and bpf_cpumask_any_distribute_and() kfuncs in a subsequent patch, so let's ensure feature parity by adding a bpf_cpumask_first_and() kfunc to account for bpf_cpumask_any_and() being removed. Signed-off-by: David Vernet <void@manifault.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230610035053.117605-1-void@manifault.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>