summaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)Author
2023-10-24bpf: Improve JEQ/JNE branch taken logicAndrii Nakryiko
When determining if an if/else branch will always or never be taken, use signed range knowledge in addition to currently used unsigned range knowledge. If either signed or unsigned range suggests that condition is always/never taken, return corresponding branch_taken verdict. Current use of unsigned range for this seems arbitrary and unnecessarily incomplete. It is possible for *signed* operations to be performed on register, which could "invalidate" unsigned range for that register. In such case branch_taken will be artificially useless, even if we can still tell that some constant is outside of register value range based on its signed bounds. veristat-based validation shows zero differences across selftests, Cilium, and Meta-internal BPF object files. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/bpf/20231022205743.72352-2-andrii@kernel.org
2023-10-24bpf: Fold smp_mb__before_atomic() into atomic_set_release()Paul E. McKenney
The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set() immediately preceded by smp_mb__before_atomic() so as to order storing of ring-buffer consumer and producer positions prior to the atomic_set() call's clearing of the ->busy flag, as follows: smp_mb__before_atomic(); atomic_set(&rb->busy, 0); Although this works given current architectures and implementations, and given that this only needs to order prior writes against a later write. However, it does so by accident because the smp_mb__before_atomic() is only guaranteed to work with read-modify-write atomic operations, and not at all with things like atomic_set() and atomic_read(). Note especially that smp_mb__before_atomic() will not, repeat *not*, order the prior write to "a" before the subsequent non-read-modify-write atomic read from "b", even on strongly ordered systems such as x86: WRITE_ONCE(a, 1); smp_mb__before_atomic(); r1 = atomic_read(&b); Therefore, replace the smp_mb__before_atomic() and atomic_set() with atomic_set_release() as follows: atomic_set_release(&rb->busy, 0); This is no slower (and sometimes is faster) than the original, and also provides a formal guarantee of ordering that the original lacks. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Vernet <void@manifault.com> Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
2023-10-24bpf: Fix unnecessary -EBUSY from htab_lock_bucketSong Liu
htab_lock_bucket uses the following logic to avoid recursion: 1. preempt_disable(); 2. check percpu counter htab->map_locked[hash] for recursion; 2.1. if map_lock[hash] is already taken, return -BUSY; 3. raw_spin_lock_irqsave(); However, if an IRQ hits between 2 and 3, BPF programs attached to the IRQ logic will not able to access the same hash of the hashtab and get -EBUSY. This -EBUSY is not really necessary. Fix it by disabling IRQ before checking map_locked: 1. preempt_disable(); 2. local_irq_save(); 3. check percpu counter htab->map_locked[hash] for recursion; 3.1. if map_lock[hash] is already taken, return -BUSY; 4. raw_spin_lock(). Similarly, use raw_spin_unlock() and local_irq_restore() in htab_unlock_bucket(). Fixes: 20b6cc34ea74 ("bpf: Avoid hashtab deadlock with map_locked") Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
2023-10-23bpf: print full verifier states on infinite loop detectionEduard Zingerman
Additional logging in is_state_visited(): if infinite loop is detected print full verifier state for both current and equivalent states. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: correct loop detection for iterators convergenceEduard Zingerman
It turns out that .branches > 0 in is_state_visited() is not a sufficient condition to identify if two verifier states form a loop when iterators convergence is computed. This commit adds logic to distinguish situations like below: (I) initial (II) initial | | V V .---------> hdr .. | | | | V V | .------... .------.. | | | | | | V V V V | ... ... .-> hdr .. | | | | | | | V V | V V | succ <- cur | succ <- cur | | | | | V | V | ... | ... | | | | '----' '----' For both (I) and (II) successor 'succ' of the current state 'cur' was previously explored and has branches count at 0. However, loop entry 'hdr' corresponding to 'succ' might be a part of current DFS path. If that is the case 'succ' and 'cur' are members of the same loop and have to be compared exactly. Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reviewed-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: exact states comparison for iterator convergence checksEduard Zingerman
Convergence for open coded iterators is computed in is_state_visited() by examining states with branches count > 1 and using states_equal(). states_equal() computes sub-state relation using read and precision marks. Read and precision marks are propagated from children states, thus are not guaranteed to be complete inside a loop when branches count > 1. This could be demonstrated using the following unsafe program: 1. r7 = -16 2. r6 = bpf_get_prandom_u32() 3. while (bpf_iter_num_next(&fp[-8])) { 4. if (r6 != 42) { 5. r7 = -32 6. r6 = bpf_get_prandom_u32() 7. continue 8. } 9. r0 = r10 10. r0 += r7 11. r8 = *(u64 *)(r0 + 0) 12. r6 = bpf_get_prandom_u32() 13. } Here verifier would first visit path 1-3, create a checkpoint at 3 with r7=-16, continue to 4-7,3 with r7=-32. Because instructions at 9-12 had not been visitied yet existing checkpoint at 3 does not have read or precision mark for r7. Thus states_equal() would return true and verifier would discard current state, thus unsafe memory access at 11 would not be caught. This commit fixes this loophole by introducing exact state comparisons for iterator convergence logic: - registers are compared using regs_exact() regardless of read or precision marks; - stack slots have to have identical type. Unfortunately, this is too strict even for simple programs like below: i = 0; while(iter_next(&it)) i++; At each iteration step i++ would produce a new distinct state and eventually instruction processing limit would be reached. To avoid such behavior speculatively forget (widen) range for imprecise scalar registers, if those registers were not precise at the end of the previous iteration and do not match exactly. This a conservative heuristic that allows to verify wide range of programs, however it precludes verification of programs that conjure an imprecise value on the first loop iteration and use it as precise on the second. Test case iter_task_vma_for_each() presents one of such cases: unsigned int seen = 0; ... bpf_for_each(task_vma, vma, task, 0) { if (seen >= 1000) break; ... seen++; } Here clang generates the following code: <LBB0_4>: 24: r8 = r6 ; stash current value of ... body ... 'seen' 29: r1 = r10 30: r1 += -0x8 31: call bpf_iter_task_vma_next 32: r6 += 0x1 ; seen++; 33: if r0 == 0x0 goto +0x2 <LBB0_6> ; exit on next() == NULL 34: r7 += 0x10 35: if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000 <LBB0_6>: ... exit ... Note that counter in r6 is copied to r8 and then incremented, conditional jump is done using r8. Because of this precision mark for r6 lags one state behind of precision mark on r8 and widening logic kicks in. Adding barrier_var(seen) after conditional is sufficient to force clang use the same register for both counting and conditional jump. This issue was discussed in the thread [1] which was started by Andrew Werner <awerner32@gmail.com> demonstrating a similar bug in callback functions handling. The callbacks would be addressed in a followup patch. [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/ Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: extract same_callsites() as utility functionEduard Zingerman
Extract same_callsites() from clean_live_states() as a utility function. This function would be used by the next patch in the set. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf: move explored_state() closer to the beginning of verifier.cEduard Zingerman
Subsequent patches would make use of explored_state() function. Move it up to avoid adding unnecessary prototype. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-23bpf, tcx: Get rid of tcx_link_constDaniel Borkmann
Small clean up to get rid of the extra tcx_link_const() and only retain the tcx_link(). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231023185015.21152-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-20bpf: Use bpf_global_percpu_ma for per-cpu kptr in __bpf_obj_drop_impl()Hou Tao
The following warning was reported when running "./test_progs -t test_bpf_ma/percpu_free_through_map_free": ------------[ cut here ]------------ WARNING: CPU: 1 PID: 68 at kernel/bpf/memalloc.c:342 CPU: 1 PID: 68 Comm: kworker/u16:2 Not tainted 6.6.0-rc2+ #222 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) Workqueue: events_unbound bpf_map_free_deferred RIP: 0010:bpf_mem_refill+0x21c/0x2a0 ...... Call Trace: <IRQ> ? bpf_mem_refill+0x21c/0x2a0 irq_work_single+0x27/0x70 irq_work_run_list+0x2a/0x40 irq_work_run+0x18/0x40 __sysvec_irq_work+0x1c/0xc0 sysvec_irq_work+0x73/0x90 </IRQ> <TASK> asm_sysvec_irq_work+0x1b/0x20 RIP: 0010:unit_free+0x50/0x80 ...... bpf_mem_free+0x46/0x60 __bpf_obj_drop_impl+0x40/0x90 bpf_obj_free_fields+0x17d/0x1a0 array_map_free+0x6b/0x170 bpf_map_free_deferred+0x54/0xa0 process_scheduled_works+0xba/0x370 worker_thread+0x16d/0x2e0 kthread+0x105/0x140 ret_from_fork+0x39/0x60 ret_from_fork_asm+0x1b/0x30 </TASK> ---[ end trace 0000000000000000 ]--- The reason is simple: __bpf_obj_drop_impl() does not know the freeing field is a per-cpu pointer and it uses bpf_global_ma to free the pointer. Because bpf_global_ma is not a per-cpu allocator, so ksize() is used to select the corresponding cache. The bpf_mem_cache with 16-bytes unit_size will always be selected to do the unmatched free and it will trigger the warning in free_bulk() eventually. Because per-cpu kptr doesn't support list or rb-tree now, so fix the problem by only checking whether or not the type of kptr is per-cpu in bpf_obj_free_fields(), and using bpf_global_percpu_ma to these kptrs. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-7-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Move the declaration of __bpf_obj_drop_impl() to bpf.hHou Tao
both syscall.c and helpers.c have the declaration of __bpf_obj_drop_impl(), so just move it to a common header file. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-6-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Use pcpu_alloc_size() in bpf_mem_free{_rcu}()Hou Tao
For bpf_global_percpu_ma, the pointer passed to bpf_mem_free_rcu() is allocated by kmalloc() and its size is fixed (16-bytes on x86-64). So no matter which cache allocates the dynamic per-cpu area, on x86-64 cache[2] will always be used to free the per-cpu area. Fix the unbalance by checking whether the bpf memory allocator is per-cpu or not and use pcpu_alloc_size() instead of ksize() to find the correct cache for per-cpu free. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-5-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-20bpf: Re-enable unit_size checking for global per-cpu allocatorHou Tao
With pcpu_alloc_size() in place, check whether or not the size of the dynamic per-cpu area is matched with unit_size. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Let bpf_iter_task_new accept null task ptrChuyi Zhou
When using task_iter to iterate all threads of a specific task, we enforce that the user must pass a valid task pointer to ensure safety. However, when iterating all threads/process in the system, BPF verifier still require a valid ptr instead of "nullable" pointer, even though it's pointless, which is a kind of surprising from usability standpoint. It would be nice if we could let that kfunc accept a explicit null pointer when we are using BPF_TASK_ITER_ALL_{PROCS, THREADS} and a valid pointer when using BPF_TASK_ITER_THREAD. Given a trival kfunc: __bpf_kfunc void FN(struct TYPE_A *obj); BPF Prog would reject a nullptr for obj. The error info is: "arg#x pointer type xx xx must point to scalar, or struct with scalar" reported by get_kfunc_ptr_arg_type(). The reg->type is SCALAR_VALUE and the btf type of ref_t is not scalar or scalar_struct which leads to the rejection of get_kfunc_ptr_arg_type. This patch add "__nullable" annotation: __bpf_kfunc void FN(struct TYPE_A *obj__nullable); Here __nullable indicates obj can be optional, user can pass a explicit nullptr or a normal TYPE_A pointer. In get_kfunc_ptr_arg_type(), we will detect whether the current arg is optional and register is null, If so, return a new kfunc_ptr_arg_type KF_ARG_PTR_TO_NULL and skip to the next arg in check_kfunc_args(). Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-7-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: teach the verifier to enforce css_iter and task_iter in RCU CSChuyi Zhou
css_iter and task_iter should be used in rcu section. Specifically, in sleepable progs explicit bpf_rcu_read_lock() is needed before use these iters. In normal bpf progs that have implicit rcu_read_lock(), it's OK to use them directly. This patch adds a new a KF flag KF_RCU_PROTECTED for bpf_iter_task_new and bpf_iter_css_new. It means the kfunc should be used in RCU CS. We check whether we are in rcu cs before we want to invoke this kfunc. If the rcu protection is guaranteed, we would let st->type = PTR_TO_STACK | MEM_RCU. Once user do rcu_unlock during the iteration, state MEM_RCU of regs would be cleared. is_iter_reg_valid_init() will reject if reg->type is UNTRUSTED. It is worth noting that currently, bpf_rcu_read_unlock does not clear the state of the STACK_ITER reg, since bpf_for_each_spilled_reg only considers STACK_SPILL. This patch also let bpf_for_each_spilled_reg search STACK_ITER. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-6-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Introduce css open-coded iterator kfuncsChuyi Zhou
This Patch adds kfuncs bpf_iter_css_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css in open-coded iterator style. These kfuncs actually wrapps css_next_descendant_{pre, post}. css_iter can be used to: 1) iterating a sepcific cgroup tree with pre/post/up order 2) iterating cgroup_subsystem in BPF Prog, like for_each_mem_cgroup_tree/cpuset_for_each_descendant_pre in kernel. The API design is consistent with cgroup_iter. bpf_iter_css_new accepts parameters defining iteration order and starting css. Here we also reuse BPF_CGROUP_ITER_DESCENDANTS_PRE, BPF_CGROUP_ITER_DESCENDANTS_POST, BPF_CGROUP_ITER_ANCESTORS_UP enums. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-5-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Introduce task open coded iterator kfuncsChuyi Zhou
This patch adds kfuncs bpf_iter_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task in open-coded iterator style. BPF programs can use these kfuncs or through bpf_for_each macro to iterate all processes in the system. The API design keep consistent with SEC("iter/task"). bpf_iter_task_new() accepts a specific task and iterating type which allows: 1. iterating all process in the system (BPF_TASK_ITER_ALL_PROCS) 2. iterating all threads in the system (BPF_TASK_ITER_ALL_THREADS) 3. iterating all threads of a specific task (BPF_TASK_ITER_PROC_THREADS) Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Link: https://lore.kernel.org/r/20231018061746.111364-4-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Introduce css_task open-coded iterator kfuncsChuyi Zhou
This patch adds kfuncs bpf_iter_css_task_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_css_task in open-coded iterator style. These kfuncs actually wrapps css_task_iter_{start,next, end}. BPF programs can use these kfuncs through bpf_for_each macro for iteration of all tasks under a css. css_task_iter_*() would try to get the global spin-lock *css_set_lock*, so the bpf side has to be careful in where it allows to use this iter. Currently we only allow it in bpf_lsm and bpf iter-s. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20231018061746.111364-3-zhouchuyi@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-10-19bpf: Add sockptr support for setsockoptBreno Leitao
The whole network stack uses sockptr, and while it doesn't move to something more modern, let's use sockptr in setsockptr BPF hooks, so, it could be used by other callers. The main motivation for this change is to use it in the io_uring {g,s}etsockopt(), which will use a userspace pointer for *optval, but, a kernel value for optlen. Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/ Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20231016134750.1381153-3-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19bpf: Add sockptr support for getsockoptBreno Leitao
The whole network stack uses sockptr, and while it doesn't move to something more modern, let's use sockptr in getsockptr BPF hooks, so, it could be used by other callers. The main motivation for this change is to use it in the io_uring {g,s}etsockopt(), which will use a userspace pointer for *optval, but, a kernel value for optlen. Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/ Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20231016134750.1381153-2-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19file: convert to SLAB_TYPESAFE_BY_RCUChristian Brauner
In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first aqcuiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of this is a bug. Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18bpf: convert to new timestamp accessorsJeff Layton
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-79-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-17bpf: Fix missed rcu read lock in bpf_task_under_cgroup()Yafang Shao
When employed within a sleepable program not under RCU protection, the use of 'bpf_task_under_cgroup()' may trigger a warning in the kernel log, particularly when CONFIG_PROVE_RCU is enabled: [ 1259.662357] WARNING: suspicious RCU usage [ 1259.662358] 6.5.0+ #33 Not tainted [ 1259.662360] ----------------------------- [ 1259.662361] include/linux/cgroup.h:423 suspicious rcu_dereference_check() usage! Other info that might help to debug this: [ 1259.662366] rcu_scheduler_active = 2, debug_locks = 1 [ 1259.662368] 1 lock held by trace/72954: [ 1259.662369] #0: ffffffffb5e3eda0 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xb0 Stack backtrace: [ 1259.662385] CPU: 50 PID: 72954 Comm: trace Kdump: loaded Not tainted 6.5.0+ #33 [ 1259.662391] Call Trace: [ 1259.662393] <TASK> [ 1259.662395] dump_stack_lvl+0x6e/0x90 [ 1259.662401] dump_stack+0x10/0x20 [ 1259.662404] lockdep_rcu_suspicious+0x163/0x1b0 [ 1259.662412] task_css_set.part.0+0x23/0x30 [ 1259.662417] bpf_task_under_cgroup+0xe7/0xf0 [ 1259.662422] bpf_prog_7fffba481a3bcf88_lsm_run+0x5c/0x93 [ 1259.662431] bpf_trampoline_6442505574+0x60/0x1000 [ 1259.662439] bpf_lsm_bpf+0x5/0x20 [ 1259.662443] ? security_bpf+0x32/0x50 [ 1259.662452] __sys_bpf+0xe6/0xdd0 [ 1259.662463] __x64_sys_bpf+0x1a/0x30 [ 1259.662467] do_syscall_64+0x38/0x90 [ 1259.662472] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1259.662479] RIP: 0033:0x7f487baf8e29 [...] [ 1259.662504] </TASK> This issue can be reproduced by executing a straightforward program, as demonstrated below: SEC("lsm.s/bpf") int BPF_PROG(lsm_run, int cmd, union bpf_attr *attr, unsigned int size) { struct cgroup *cgrp = NULL; struct task_struct *task; int ret = 0; if (cmd != BPF_LINK_CREATE) return 0; // The cgroup2 should be mounted first cgrp = bpf_cgroup_from_id(1); if (!cgrp) goto out; task = bpf_get_current_task_btf(); if (bpf_task_under_cgroup(task, cgrp)) ret = -1; bpf_cgroup_release(cgrp); out: return ret; } After running the program, if you subsequently execute another BPF program, you will encounter the warning. It's worth noting that task_under_cgroup_hierarchy() is also utilized by bpf_current_task_under_cgroup(). However, bpf_current_task_under_cgroup() doesn't exhibit this issue because it cannot be used in sleepable BPF programs. Fixes: b5ad4cdc46c7 ("bpf: Add bpf_task_under_cgroup() kfunc") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Cc: Feng Zhou <zhoufeng.zf@bytedance.com> Cc: KP Singh <kpsingh@kernel.org> Link: https://lore.kernel.org/bpf/20231007135945.4306-1-laoar.shao@gmail.com
2023-10-17net, bpf: Add a warning if NAPI cb missed xdp_do_flush().Sebastian Andrzej Siewior
A few drivers were missing a xdp_do_flush() invocation after XDP_REDIRECT. Add three helper functions each for one of the per-CPU lists. Return true if the per-CPU list is non-empty and flush the list. Add xdp_do_check_flushed() which invokes each helper functions and creates a warning if one of the functions had a non-empty list. Hide everything behind CONFIG_DEBUG_NET. Suggested-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20231016125738.Yt79p1uF@linutronix.de
2023-10-16Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-16 We've added 90 non-merge commits during the last 25 day(s) which contain a total of 120 files changed, 3519 insertions(+), 895 deletions(-). The main changes are: 1) Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs, from Jiri Olsa. 2) Add cgroup BPF sockaddr hooks for unix sockets. The use case is for systemd to reimplement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services, from Daan De Meyer. 3) Implement BPF CPUv4 support for s390x BPF JIT, from Ilya Leoshkevich. 4) Improve BPF verifier log output for scalar registers to better disambiguate their internal state wrt defaults vs min/max values matching, from Andrii Nakryiko. 5) Extend the BPF fib lookup helpers for IPv4/IPv6 to support retrieving the source IP address with a new BPF_FIB_LOOKUP_SRC flag, from Martynas Pumputis. 6) Add support for open-coded task_vma iterator to help with symbolization for BPF-collected user stacks, from Dave Marchevsky. 7) Add libbpf getters for accessing individual BPF ring buffers which is useful for polling them individually, for example, from Martin Kelly. 8) Extend AF_XDP selftests to validate the SHARED_UMEM feature, from Tushar Vyavahare. 9) Improve BPF selftests cross-building support for riscv arch, from Björn Töpel. 10) Add the ability to pin a BPF timer to the same calling CPU, from David Vernet. 11) Fix libbpf's bpf_tracing.h macros for riscv to use the generic implementation of PT_REGS_SYSCALL_REGS() to access syscall arguments, from Alexandre Ghiti. 12) Extend libbpf to support symbol versioning for uprobes, from Hengqi Chen. 13) Fix bpftool's skeleton code generation to guarantee that ELF data is 8 byte aligned, from Ian Rogers. 14) Inherit system-wide cpu_mitigations_off() setting for Spectre v1/v4 security mitigations in BPF verifier, from Yafang Shao. 15) Annotate struct bpf_stack_map with __counted_by attribute to prepare BPF side for upcoming __counted_by compiler support, from Kees Cook. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (90 commits) bpf: Ensure proper register state printing for cond jumps bpf: Disambiguate SCALAR register state output in verifier logs selftests/bpf: Make align selftests more robust selftests/bpf: Improve missed_kprobe_recursion test robustness selftests/bpf: Improve percpu_alloc test robustness selftests/bpf: Add tests for open-coded task_vma iter bpf: Introduce task_vma open-coded iterator kfuncs selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c bpf: Don't explicitly emit BTF for struct btf_iter_num bpf: Change syscall_nr type to int in struct syscall_tp_t net/bpf: Avoid unused "sin_addr_len" warning when CONFIG_CGROUP_BPF is not set bpf: Avoid unnecessary audit log for CPU security mitigations selftests/bpf: Add tests for cgroup unix socket address hooks selftests/bpf: Make sure mount directory exists documentation/bpf: Document cgroup unix socket address hooks bpftool: Add support for cgroup unix socket address hooks libbpf: Add support for cgroup unix socket address hooks bpf: Implement cgroup sockaddr hooks for unix sockets bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf bpf: Propagate modified uaddrlen from cgroup sockaddr programs ... ==================== Link: https://lore.kernel.org/r/20231016204803.30153-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-16bpf: Ensure proper register state printing for cond jumpsAndrii Nakryiko
Verifier emits relevant register state involved in any given instruction next to it after `;` to the right, if possible. Or, worst case, on the separate line repeating instruction index. E.g., a nice and simple case would be: 2: (d5) if r0 s<= 0x0 goto pc+1 ; R0_w=0 But if there is some intervening extra output (e.g., precision backtracking log) involved, we are supposed to see the state after the precision backtrack log: 4: (75) if r0 s>= 0x0 goto pc+1 mark_precise: frame0: last_idx 4 first_idx 0 subseq_idx -1 mark_precise: frame0: regs=r0 stack= before 2: (d5) if r0 s<= 0x0 goto pc+1 mark_precise: frame0: regs=r0 stack= before 1: (b7) r0 = 0 6: R0_w=0 First off, note that in `6: R0_w=0` instruction index corresponds to the next instruction, not to the conditional jump instruction itself, which is wrong and we'll get to that. But besides that, the above is a happy case that does work today. Yet, if it so happens that precision backtracking had to traverse some of the parent states, this `6: R0_w=0` state output would be missing. This is due to a quirk of print_verifier_state() routine, which performs mark_verifier_state_clean(env) at the end. This marks all registers as "non-scratched", which means that subsequent logic to print *relevant* registers (that is, "scratched ones") fails and doesn't see anything relevant to print and skips the output altogether. print_verifier_state() is used both to print instruction context, but also to print an **entire** verifier state indiscriminately, e.g., during precision backtracking (and in a few other situations, like during entering or exiting subprogram). Which means if we have to print entire parent state before getting to printing instruction context state, instruction context is marked as clean and is omitted. Long story short, this is definitely not intentional. So we fix this behavior in this patch by teaching print_verifier_state() to clear scratch state only if it was used to print instruction state, not the parent/callback state. This is determined by print_all option, so if it's not set, we don't clear scratch state. This fixes missing instruction state for these cases. As for the mismatched instruction index, we fix that by making sure we call print_insn_state() early inside check_cond_jmp_op() before we adjusted insn_idx based on jump branch taken logic. And with that we get desired correct information: 9: (16) if w4 == 0x1 goto pc+9 mark_precise: frame0: last_idx 9 first_idx 9 subseq_idx -1 mark_precise: frame0: parent state regs=r4 stack=: R2_w=1944 R4_rw=P1 R10=fp0 mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx 9 mark_precise: frame0: regs=r4 stack= before 8: (66) if w4 s> 0x3 goto pc+5 mark_precise: frame0: regs=r4 stack= before 7: (b7) r4 = 1 9: R4=1 Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-6-andrii@kernel.org
2023-10-16bpf: Disambiguate SCALAR register state output in verifier logsAndrii Nakryiko
Currently the way that verifier prints SCALAR_VALUE register state (and PTR_TO_PACKET, which can have var_off and ranges info as well) is very ambiguous. In the name of brevity we are trying to eliminate "unnecessary" output of umin/umax, smin/smax, u32_min/u32_max, and s32_min/s32_max values, if possible. Current rules are that if any of those have their default value (which for mins is the minimal value of its respective types: 0, S32_MIN, or S64_MIN, while for maxs it's U32_MAX, S32_MAX, S64_MAX, or U64_MAX) *OR* if there is another min/max value that as matching value. E.g., if smin=100 and umin=100, we'll emit only umin=10, omitting smin altogether. This approach has a few problems, being both ambiguous and sort-of incorrect in some cases. Ambiguity is due to missing value could be either default value or value of umin/umax or smin/smax. This is especially confusing when we mix signed and unsigned ranges. Quite often, umin=0 and smin=0, and so we'll have only `umin=0` leaving anyone reading verifier log to guess whether smin is actually 0 or it's actually -9223372036854775808 (S64_MIN). And often times it's important to know, especially when debugging tricky issues. "Sort-of incorrectness" comes from mixing negative and positive values. E.g., if umin is some large positive number, it can be equal to smin which is, interpreted as signed value, is actually some negative value. Currently, that smin will be omitted and only umin will be emitted with a large positive value, giving an impression that smin is also positive. Anyway, ambiguity is the biggest issue making it impossible to have an exact understanding of register state, preventing any sort of automated testing of verifier state based on verifier log. This patch is attempting to rectify the situation by removing ambiguity, while minimizing the verboseness of register state output. The rules are straightforward: - if some of the values are missing, then it definitely has a default value. I.e., `umin=0` means that umin is zero, but smin is actually S64_MIN; - all the various boundaries that happen to have the same value are emitted in one equality separated sequence. E.g., if umin and smin are both 100, we'll emit `smin=umin=100`, making this explicit; - we do not mix negative and positive values together, and even if they happen to have the same bit-level value, they will be emitted separately with proper sign. I.e., if both umax and smax happen to be 0xffffffffffffffff, we'll emit them both separately as `smax=-1,umax=18446744073709551615`; - in the name of a bit more uniformity and consistency, {u32,s32}_{min,max} are renamed to {s,u}{min,max}32, which seems to improve readability. The above means that in case of all 4 ranges being, say, [50, 100] range, we'd previously see hugely ambiguous: R1=scalar(umin=50,umax=100) Now, we'll be more explicit: R1=scalar(smin=umin=smin32=umin32=50,smax=umax=smax32=umax32=100) This is slightly more verbose, but distinct from the case when we don't know anything about signed boundaries and 32-bit boundaries, which under new rules will match the old case: R1=scalar(umin=50,umax=100) Also, in the name of simplicity of implementation and consistency, order for {s,u}32_{min,max} are emitted *before* var_off. Previously they were emitted afterwards, for unclear reasons. This patch also includes a few fixes to selftests that expect exact register state to accommodate slight changes to verifier format. You can see that the changes are pretty minimal in common cases. Note, the special case when SCALAR_VALUE register is a known constant isn't changed, we'll emit constant value once, interpreted as signed value. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20231011223728.3188086-5-andrii@kernel.org
2023-10-13bpf: Introduce task_vma open-coded iterator kfuncsDave Marchevsky
This patch adds kfuncs bpf_iter_task_vma_{new,next,destroy} which allow creation and manipulation of struct bpf_iter_task_vma in open-coded iterator style. BPF programs can use these kfuncs directly or through bpf_for_each macro for natural-looking iteration of all task vmas. The implementation borrows heavily from bpf_find_vma helper's locking - differing only in that it holds the mmap_read lock for all iterations while the helper only executes its provided callback on a maximum of 1 vma. Aside from locking, struct vma_iterator and vma_next do all the heavy lifting. A pointer to an inner data struct, struct bpf_iter_task_vma_data, is the only field in struct bpf_iter_task_vma. This is because the inner data struct contains a struct vma_iterator (not ptr), whose size is likely to change under us. If bpf_iter_task_vma_kern contained vma_iterator directly such a change would require change in opaque bpf_iter_task_vma struct's size. So better to allocate vma_iterator using BPF allocator, and since that alloc must already succeed, might as well allocate all iter fields, thereby freezing struct bpf_iter_task_vma size. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-4-davemarchevsky@fb.com
2023-10-13bpf: Don't explicitly emit BTF for struct btf_iter_numDave Marchevsky
Commit 6018e1f407cc ("bpf: implement numbers iterator") added the BTF_TYPE_EMIT line that this patch is modifying. The struct btf_iter_num doesn't exist, so only a forward declaration is emitted in BTF: FWD 'btf_iter_num' fwd_kind=struct That commit was probably hoping to ensure that struct bpf_iter_num is emitted in vmlinux BTF. A previous version of this patch changed the line to emit the correct type, but Yonghong confirmed that it would definitely be emitted regardless in [0], so this patch simply removes the line. This isn't marked "Fixes" because the extraneous btf_iter_num FWD wasn't causing any issues that I noticed, aside from mild confusion when I looked through the code. [0]: https://lore.kernel.org/bpf/25d08207-43e6-36a8-5e0f-47a913d4cda5@linux.dev/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231013204426.1074286-2-davemarchevsky@fb.com
2023-10-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: kernel/bpf/verifier.c 829955981c55 ("bpf: Fix verifier log for async callback return values") a923819fb2c5 ("bpf: Treat first argument as return value for bpf_throw") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-10-11bpf: Implement cgroup sockaddr hooks for unix socketsDaan De Meyer
These hooks allows intercepting connect(), getsockname(), getpeername(), sendmsg() and recvmsg() for unix sockets. The unix socket hooks get write access to the address length because the address length is not fixed when dealing with unix sockets and needs to be modified when a unix socket address is modified by the hook. Because abstract socket unix addresses start with a NUL byte, we cannot recalculate the socket address in kernelspace after running the hook by calculating the length of the unix socket path using strlen(). These hooks can be used when users want to multiplex syscall to a single unix socket to multiple different processes behind the scenes by redirecting the connect() and other syscalls to process specific sockets. We do not implement support for intercepting bind() because when using bind() with unix sockets with a pathname address, this creates an inode in the filesystem which must be cleaned up. If we rewrite the address, the user might try to clean up the wrong file, leaking the socket in the filesystem where it is never cleaned up. Until we figure out a solution for this (and a use case for intercepting bind()), we opt to not allow rewriting the sockaddr in bind() calls. We also implement recvmsg() support for connected streams so that after a connect() that is modified by a sockaddr hook, any corresponding recmvsg() on the connected socket can also be modified to make the connected program think it is connected to the "intended" remote. Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpfDaan De Meyer
As prep for adding unix socket support to the cgroup sockaddr hooks, let's add a kfunc bpf_sock_addr_set_sun_path() that allows modifying a unix sockaddr from bpf. While this is already possible for AF_INET and AF_INET6, we'll need this kfunc when we add unix socket support since modifying the address for those requires modifying both the address and the sockaddr length. Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-4-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-11bpf: Propagate modified uaddrlen from cgroup sockaddr programsDaan De Meyer
As prep for adding unix socket support to the cgroup sockaddr hooks, let's propagate the sockaddr length back to the caller after running a bpf cgroup sockaddr hook program. While not important for AF_INET or AF_INET6, the sockaddr length is important when working with AF_UNIX sockaddrs as the size of the sockaddr cannot be determined just from the address family or the sockaddr's contents. __cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as an input/output argument. After running the program, the modified sockaddr length is stored in the uaddrlen pointer. Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com> Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-09bpf: Fix verifier log for async callback return valuesDavid Vernet
The verifier, as part of check_return_code(), verifies that async callbacks such as from e.g. timers, will return 0. It does this by correctly checking that R0->var_off is in tnum_const(0), which effectively checks that it's in a range of 0. If this condition fails, however, it prints an error message which says that the value should have been in (0x0; 0x1). This results in possibly confusing output such as the following in which an async callback returns 1: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x1) The fix is easy -- we should just pass the tnum_const(0) as the correct range to verbose_invalid_scalar(), which will then print the following: At async callback the register R0 has value (0x1; 0x0) should have been in (0x0; 0x0) Fixes: bfc6bb74e4f1 ("bpf: Implement verifier support for validation of async callbacks.") Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20231009161414.235829-1-void@manifault.com
2023-10-09bpf: Add ability to pin bpf timer to calling CPUDavid Vernet
BPF supports creating high resolution timers using bpf_timer_* helper functions. Currently, only the BPF_F_TIMER_ABS flag is supported, which specifies that the timeout should be interpreted as absolute time. It would also be useful to be able to pin that timer to a core. For example, if you wanted to make a subset of cores run without timer interrupts, and only have the timer be invoked on a single core. This patch adds support for this with a new BPF_F_TIMER_CPU_PIN flag. When specified, the HRTIMER_MODE_PINNED flag is passed to hrtimer_start(). A subsequent patch will update selftests to validate. Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <song@kernel.org> Acked-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/bpf/20231004162339.200702-2-void@manifault.com
2023-10-06bpf: Refuse unused attributes in bpf_prog_{attach,detach}Lorenz Bauer
The recently added tcx attachment extended the BPF UAPI for attaching and detaching by a couple of fields. Those fields are currently only supported for tcx, other types like cgroups and flow dissector silently ignore the new fields except for the new flags. This is problematic once we extend bpf_mprog to older attachment types, since it's hard to figure out whether the syscall really was successful if the kernel silently ignores non-zero values. Explicitly reject non-zero fields relevant to bpf_mprog for attachment types which don't use the latter yet. Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Signed-off-by: Lorenz Bauer <lmb@isovalent.com> Co-developed-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231006220655.1653-3-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06bpf: Handle bpf_mprog_query with NULL entryDaniel Borkmann
Improve consistency for bpf_mprog_query() API and let the latter also handle a NULL entry as can be the case for tcx. Instead of returning -ENOENT, we copy a count of 0 and revision of 1 to user space, so that this can be fed into a subsequent bpf_mprog_attach() call as expected_revision. A BPF self- test as part of this series has been added to assert this case. Suggested-by: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231006220655.1653-2-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06bpf: Fix BPF_PROG_QUERY last field checkDaniel Borkmann
While working on the ebpf-go [0] library integration for bpf_mprog and tcx, Lorenz noticed that two subsequent BPF_PROG_QUERY requests currently fail. A typical workflow is to first gather the bpf_mprog count without passing program/ link arrays, followed by the second request which contains the actual array pointers. The initial call populates count and revision fields. The second call gets rejected due to a BPF_PROG_QUERY_LAST_FIELD bug which should point to query.revision instead of query.link_attach_flags since the former is really the last member. It was not noticed in libbpf as bpf_prog_query_opts() always calls bpf(2) with an on-stack bpf_attr that is memset() each time (and therefore query.revision was reset to zero). [0] https://ebpf-go.dev Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support") Reported-by: Lorenz Bauer <lmb@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20231006220655.1653-1-daniel@iogearbox.net Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-10-06bpf: Annotate struct bpf_stack_map with __counted_byKees Cook
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle [1], add __counted_by for struct bpf_stack_map. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1] Link: https://lore.kernel.org/bpf/20231006201657.work.531-kees@kernel.org
2023-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR. No conflicts (or adjacent changes of note). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-09-30bpf: Use kmalloc_size_roundup() to adjust size_indexHou Tao
Commit d52b59315bf5 ("bpf: Adjust size_index according to the value of KMALLOC_MIN_SIZE") uses KMALLOC_MIN_SIZE to adjust size_index, but as reported by Nathan, the adjustment is not enough, because __kmalloc_minalign() also decides the minimal alignment of slab object as shown in new_kmalloc_cache() and its value may be greater than KMALLOC_MIN_SIZE (e.g., 64 bytes vs 8 bytes under a riscv QEMU VM). Instead of invoking __kmalloc_minalign() in bpf subsystem to find the maximal alignment, just using kmalloc_size_roundup() directly to get the corresponding slab object size for each allocation size. If these two sizes are unmatched, adjust size_index to select a bpf_mem_cache with unit_size equal to the object_size of the underlying slab cache for the allocation size. Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.") Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/bpf/20230914181407.GA1000274@dev-arch.thelio-3990X/ Signed-off-by: Hou Tao <houtao1@huawei.com> Tested-by: Emil Renner Berthing <emil.renner.berthing@canonical.com> Link: https://lore.kernel.org/r/20230928101558.2594068-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-29bpf, mprog: Fix maximum program check on mprog attachmentDaniel Borkmann
After Paul's recent improvement to syzkaller to improve coverage for bpf_mprog and tcx, it hit a splat that the program limit was surpassed. What happened is that the maximum number of progs got added, followed by another prog add request which adds with BPF_F_BEFORE flag relative to the last program in the array. The idx >= bpf_mprog_max() check in bpf_mprog_attach() still passes because the index is below the maximum but the maximum will be surpassed. We need to add a check upfront for insertions to catch this situation. Fixes: 053c8e1f235d ("bpf: Add generic attach/detach/query API for multi-progs") Reported-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com Reported-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com Reported-by: syzbot+2558ca3567a77b7af4e3@syzkaller.appspotmail.com Co-developed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: syzbot+baa44e3dbbe48e05c1ad@syzkaller.appspotmail.com Tested-by: syzbot+b97d20ed568ce0951a06@syzkaller.appspotmail.com Link: https://github.com/google/syzkaller/pull/4207 Link: https://lore.kernel.org/bpf/20230929204121.20305-1-daniel@iogearbox.net
2023-09-25bpf: Add missed value to kprobe perf link infoJiri Olsa
Add missed value to kprobe attached through perf link info to hold the stats of missed kprobe handler execution. The kprobe's missed counter gets incremented when kprobe handler is not executed due to another kprobe running on the same cpu. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20230920213145.1941596-4-jolsa@kernel.org
2023-09-21bpf: Disable zero-extension for BPF_MEMSXIlya Leoshkevich
On the architectures that use bpf_jit_needs_zext(), e.g., s390x, the verifier incorrectly inserts a zero-extension after BPF_MEMSX, leading to miscompilations like the one below: 24: 89 1a ff fe 00 00 00 00 "r1 = *(s16 *)(r10 - 2);" # zext_dst set 0x3ff7fdb910e: lgh %r2,-2(%r13,%r0) # load halfword 0x3ff7fdb9114: llgfr %r2,%r2 # wrong! 25: 65 10 00 03 00 00 7f ff if r1 s> 32767 goto +3 <l0_1> # check_cond_jmp_op() Disable such zero-extensions. The JITs need to insert sign-extension themselves, if necessary. Suggested-by: Puranjay Mohan <puranjay12@gmail.com> Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Puranjay Mohan <puranjay12@gmail.com> Link: https://lore.kernel.org/r/20230919101336.2223655-2-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni
Cross-merge networking fixes after downstream PR. No conflicts. Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-09-20bpf: unconditionally reset backtrack_state masks on global func exitAndrii Nakryiko
In mark_chain_precision() logic, when we reach the entry to a global func, it is expected that R1-R5 might be still requested to be marked precise. This would correspond to some integer input arguments being tracked as precise. This is all expected and handled as a special case. What's not expected is that we'll leave backtrack_state structure with some register bits set. This is because for subsequent precision propagations backtrack_state is reused without clearing masks, as all code paths are carefully written in a way to leave empty backtrack_state with zeroed out masks, for speed. The fix is trivial, we always clear register bit in the register mask, and then, optionally, set reg->precise if register is SCALAR_VALUE type. Reported-by: Chris Mason <clm@meta.com> Fixes: be2ef8161572 ("bpf: allow precision tracking for programs with subprogs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20230918210110.2241458-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19bpf: Remove unused variables.Alexei Starovoitov
Remove unused prev_offset, min_size, krec_size variables. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202309190634.fL17FWoT-lkp@intel.com/ Fixes: aaa619ebccb2 ("bpf: Refactor check_btf_func and split into two phases") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-19bpf: Fix bpf_throw warning on 32-bit archKumar Kartikeya Dwivedi
On 32-bit architectures, the pointer width is 32-bit, while we try to cast from a u64 down to it, the compiler complains on mismatch in integer size. Fix this by first casting to long which should match the pointer width on targets supported by Linux. Fixes: ec5290a178b7 ("bpf: Prevent KASAN false positive with bpf_throw") Reported-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230918155233.297024-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-09-17Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Alexei Starovoitov says: ==================== The following pull-request contains BPF updates for your *net-next* tree. We've added 73 non-merge commits during the last 9 day(s) which contain a total of 79 files changed, 5275 insertions(+), 600 deletions(-). The main changes are: 1) Basic BTF validation in libbpf, from Andrii Nakryiko. 2) bpf_assert(), bpf_throw(), exceptions in bpf progs, from Kumar Kartikeya Dwivedi. 3) next_thread cleanups, from Oleg Nesterov. 4) Add mcpu=v4 support to arm32, from Puranjay Mohan. 5) Add support for __percpu pointers in bpf progs, from Yonghong Song. 6) Fix bpf tailcall interaction with bpf trampoline, from Leon Hwang. 7) Raise irq_work in bpf_mem_alloc while irqs are disabled to improve refill probabablity, from Hou Tao. Please consider pulling these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git Thanks a lot! Also thanks to reporters, reviewers and testers of commits in this pull-request: Alan Maguire, Andrey Konovalov, Dave Marchevsky, "Eric W. Biederman", Jiri Olsa, Maciej Fijalkowski, Quentin Monnet, Russell King (Oracle), Song Liu, Stanislav Fomichev, Yonghong Song ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-16bpf: Fix kfunc callback register type handlingKumar Kartikeya Dwivedi
The kfunc code to handle KF_ARG_PTR_TO_CALLBACK does not check the reg type before using reg->subprogno. This can accidently permit invalid pointers from being passed into callback helpers (e.g. silently from different paths). Likewise, reg->subprogno from the per-register type union may not be meaningful either. We need to reject any other type except PTR_TO_FUNC. Acked-by: Dave Marchevsky <davemarchevsky@fb.com> Fixes: 5d92ddc3de1b ("bpf: Add callback validation to kfunc verifier logic") Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20230912233214.1518551-14-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>