path: root/kernel/bpf
2023-12-05  bpf: fix check for attempt to corrupt spilled pointer  (Andrii Nakryiko)

When a register is spilled onto the stack as a 1/2/4-byte register, we set slot_type[BPF_REG_SIZE - 1] (plus potentially a few more below it, depending on the actual spill size). So to check whether some stack slot has a spilled register we need to consult slot_type[7], not slot_type[0]. To avoid the need to remember and double-check this in the future, just use the is_spilled_reg() helper.

Fixes: 27113c59b6d0 ("bpf: Check the other end of slot_type for STACK_SPILL")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
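For illustration, the check pattern described above boils down to a helper along these lines (a sketch based on the commit text, not a verbatim copy of the verifier code):

  /* A (partial) spill marks the *last* byte of the 8-byte slot, so checking
   * slot_type[0] would miss 1/2/4-byte spills. */
  static bool is_spilled_reg(const struct bpf_stack_state *stack)
  {
          return stack->slot_type[BPF_REG_SIZE - 1] == STACK_SPILL;
  }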
2023-12-05  bpf: support non-r10 register spill/fill to/from stack in precision tracking  (Andrii Nakryiko)

Use instruction (jump) history to record instructions that performed register spill/fill to/from stack, regardless of whether this was done through the read-only r10 register or any other register after copying r10 into it *and* potentially adjusting the offset.

To make this work reliably, we push extra per-instruction flags into instruction history, encoding the stack slot index (spi) and stack frame number in extra 10 bit flags we take away from prev_idx in instruction history. We don't touch the idx field for maximum performance, as it's checked most frequently during backtracking.

This change removes basically the last remaining practical limitation of precision backtracking logic in the BPF verifier. It fixes known deficiencies, but also opens up new opportunities to reduce the number of verified states, explored in the subsequent patches.

There are only three differences in selftests' BPF object files according to veristat, all in the positive direction (less states).

  File                                     Program        Insns (A)  Insns (B)  Insns (DIFF)   States (A)  States (B)  States (DIFF)
  ---------------------------------------  -------------  ---------  ---------  -------------  ----------  ----------  -------------
  test_cls_redirect_dynptr.bpf.linked3.o   cls_redirect        2987       2864  -123 (-4.12%)         240         231     -9 (-3.75%)
  xdp_synproxy_kern.bpf.linked3.o          syncookie_tc       82848      82661  -187 (-0.23%)        5107        5073    -34 (-0.67%)
  xdp_synproxy_kern.bpf.linked3.o          syncookie_xdp      85116      84964  -152 (-0.18%)        5162        5130    -32 (-0.62%)

Note, I avoided renaming jmp_history to the more generic insn_hist to minimize the number of lines changed and potential merge conflicts between bpf and bpf-next trees.

Notice also the cur_hist_entry pointer reset to NULL at the beginning of the instruction verification loop. This pointer avoids the problem of relying on the last jump history entry's insn_idx to determine whether we already have an entry for the current instruction or not. It can happen that we added a jump history entry because the current instruction is_jmp_point(), but we also need to add instruction flags for stack access. In this case, we don't want two entries, so we need to reuse the last added entry, if it is present. Relying on insn_idx comparison has the same ambiguity problem as the one that was fixed recently in [0], so we avoid that.

[0] https://patchwork.kernel.org/project/netdevbpf/patch/20231110002638.4168352-3-andrii@kernel.org/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Reported-by: Tao Lyu <tao.lyu@epfl.ch>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231205184248.1502704-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04  bpf: Optimize the free of inner map  (Hou Tao)

When removing the inner map from the outer map, the inner map will be freed after one RCU grace period and one RCU tasks trace grace period, so it is certain that the bpf program, which may access the inner map, has exited before the inner map is freed.

However there is no need to wait for one RCU tasks trace grace period if the outer map is only accessed by non-sleepable programs. So add sleepable_refcnt in bpf_map and increase sleepable_refcnt when adding the outer map into env->used_maps for a sleepable program. Although the max number of bpf programs is INT_MAX - 1, the number of bpf programs which are being loaded may be greater than INT_MAX, so use atomic64_t instead of atomic_t for sleepable_refcnt. When removing the inner map from the outer map, use sleepable_refcnt to decide whether or not a RCU tasks trace grace period is needed before freeing the inner map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
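In rough pseudo-kernel-C, the decision looks like this (a sketch of the idea only; the callback name is hypothetical, and the real code chains the grace periods inside the callback):

  /* Only pay for a tasks-trace RCU grace period when the map was ever
   * referenced by a sleepable program; the callback is expected to chain a
   * regular RCU grace period before the actual free. */
  if (atomic64_read(&map->sleepable_refcnt))
          call_rcu_tasks_trace(&map->rcu, bpf_map_free_deferred_rcu); /* hypothetical cb */
  else
          call_rcu(&map->rcu, bpf_map_free_deferred_rcu);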
2023-12-04  bpf: Defer the free of inner map when necessary  (Hou Tao)

When updating or deleting an inner map in a map array or map htab, the map may still be accessed by a non-sleepable program or a sleepable program. However bpf_map_fd_put_ptr() decreases the ref-counter of the inner map directly through bpf_map_put(); if the ref-counter is the last one (which is true for most cases), the inner map will be freed by ops->map_free() in a kworker. But for now, most .map_free() callbacks don't use synchronize_rcu() or its variants to wait for the elapse of a RCU grace period, so after the invocation of ops->map_free completes, the bpf program which is accessing the inner map may incur a use-after-free problem.

Fix the free of the inner map by invoking bpf_map_free_deferred() after both one RCU grace period and one tasks trace RCU grace period if the inner map has been removed from the outer map before. The deferment is accomplished by using call_rcu() or call_rcu_tasks_trace() when releasing the last ref-counter of the bpf map. The newly-added rcu_head field in bpf_map shares the same storage space with the work field to reduce the size of bpf_map.

Fixes: bba1dc0b55ac ("bpf: Remove redundant synchronize_rcu.")
Fixes: 638e4b825d52 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
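The storage sharing mentioned in the last paragraph can be pictured as a union inside struct bpf_map (sketched from the description, not copied from the patch):

  struct bpf_map {
          /* ... */
          union {
                  struct work_struct work;  /* deferred free via kworker */
                  struct rcu_head rcu;      /* deferred free after RCU / tasks-trace RCU GP */
          };
          /* ... */
  };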
2023-12-04  bpf: Set need_defer as false when clearing fd array during map free  (Hou Tao)

The map deletion, map release and map free operations all use fd_array_map_delete_elem() to remove the element from the fd array, and need_defer is always true in fd_array_map_delete_elem(). For the map deletion and map release operations, need_defer=true is necessary, because the bpf program which accesses the element in the fd array may still be alive. However for the map free operation, it is certain that the bpf program which owns the fd array has already exited, so setting need_defer to false is appropriate for map free.

So fix it by adding a need_defer parameter to bpf_fd_array_map_clear() and adding a new helper __fd_array_map_delete_elem() to handle the map deletion, map release and map free operations correspondingly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-04  bpf: Add map and need_defer parameters to .map_fd_put_ptr()  (Hou Tao)

map is the pointer to the outer map, and need_defer needs some explanation. need_defer tells the implementation to defer the reference release of the passed element and ensure that the element is still alive before the bpf program, which may manipulate it, exits.

The following three cases will invoke map_fd_put_ptr() and different need_defer values will be passed to these callers:

1) release the reference of the old element in the map during map update or map deletion. The release must be deferred, otherwise the bpf program may incur a use-after-free problem, so need_defer needs to be true.
2) release the reference of the to-be-added element in the error path of map update. The to-be-added element is not visible to any bpf program, so it is OK to pass false for the need_defer parameter.
3) release the references of all elements in the map during map release. Any bpf program which has access to the map must have exited and been released, so need_defer=false will be OK.

These two parameters will be used by the following patches to fix the potential use-after-free problem for map-in-map.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
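After this change the callback signature is, roughly, the following (sketched from the description above; see cases 1)-3) for the meaning of need_defer):

  /* ptr is the element being dropped (e.g. an inner map), map is the outer map. */
  void (*map_fd_put_ptr)(struct bpf_map *map, void *ptr, bool need_defer);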
2023-12-04  bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers  (Hou Tao)

These three bpf_map_{lookup,update,delete}_elem() helpers are also available for sleepable bpf program, so add the corresponding lock assertion for sleepable bpf program, otherwise the following warning will be reported when a sleepable bpf program manipulates bpf map under interpreter mode (aka bpf_jit_enable=0):

  WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
  CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:bpf_map_lookup_elem+0x54/0x60
  ......
  Call Trace:
   <TASK>
   ? __warn+0xa5/0x240
   ? bpf_map_lookup_elem+0x54/0x60
   ? report_bug+0x1ba/0x1f0
   ? handle_bug+0x40/0x80
   ? exc_invalid_op+0x18/0x50
   ? asm_exc_invalid_op+0x1b/0x20
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ? rcu_lockdep_current_cpu_online+0x65/0xb0
   ? rcu_is_watching+0x23/0x50
   ? bpf_map_lookup_elem+0x54/0x60
   ? __pfx_bpf_map_lookup_elem+0x10/0x10
   ___bpf_prog_run+0x513/0x3b70
   __bpf_prog_run32+0x9d/0xd0
   ? __bpf_prog_enter_sleepable_recur+0xad/0x120
   ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
   bpf_trampoline_6442580665+0x4d/0x1000
   __x64_sys_getpgid+0x5/0x30
   ? do_syscall_64+0x36/0xb0
   entry_SYSCALL_64_after_hwframe+0x6e/0x76
   </TASK>

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231204140425.1480317-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
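In spirit, the added assertion looks like this (a sketch, not the literal diff):

  /* Sleepable programs run under rcu_read_lock_trace(), non-sleepable ones
   * under rcu_read_lock()/rcu_read_lock_bh(); warn if none of them is held. */
  WARN_ON_ONCE(!rcu_read_lock_held() &&
               !rcu_read_lock_trace_held() &&
               !rcu_read_lock_bh_held());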
2023-12-04  bpf: Minor logging improvement  (Andrei Matei)

One place where we were logging a register was only logging the variable part, not also the fixed part.

Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231204011248.2040084-1-andreimatei1@gmail.com
2023-12-02  bpf: simplify tnum output if a fully known constant  (Andrii Nakryiko)

Emit tnum representation as just a constant if all bits are known. Use decimal-vs-hex logic to determine exact format of emitted constant value, just like it's done for register range values. For that move tnum_strn() to kernel/bpf/log.c to reuse decimal-vs-hex determination logic and constants.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-12-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
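A sketch of the output simplification (the exact formatting helpers differ; tn_buf is assumed to already hold the tnum_strn() output):

  if (tnum_is_const(reg->var_off))
          /* all bits known: print just the constant; decimal-vs-hex choice
           * is made the same way as for range values (elided here) */
          verbose(env, "%lld", (long long)reg->var_off.value);
  else
          verbose(env, "var_off=%s", tn_buf);  /* usual (value; mask) form */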
2023-12-02  bpf: enforce precision of R0 on program/async callback return  (Andrii Nakryiko)

Given we enforce a valid range for program and async callback return value, we must mark R0 as precise to avoid incorrect state pruning.

Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02  bpf: unify async callback and program retval checks  (Andrii Nakryiko)

Use common logic to verify program return values and async callback return values. This allows to avoid duplication of any extra steps necessary, like precision marking, which will be added in the next patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02  bpf: enforce precise retval range on program exit  (Andrii Nakryiko)

Similarly to subprog/callback logic, enforce return value of BPF program using more precise smin/smax range. We need to adjust a bunch of tests due to a changed format of an error message.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02  bpf: enforce exact retval range on subprog/callback exit  (Andrii Nakryiko)

Instead of relying on potentially imprecise tnum representation of expected return value range for callbacks and subprogs, validate that smin/smax range satisfy exact expected range of return values. E.g., if callback would need to return [0, 2] range, tnum can't represent this precisely and instead will allow [0, 3] range. By checking smin/smax range, we can make sure that subprog/callback indeed returns only valid [0, 2] range.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
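The [0, 2] example works because a tnum covering 0..2 must leave the two low bits unknown, which also admits 3. The exact check is conceptually as simple as the following (field names are the real bpf_reg_state ones; the expected-range struct is illustrative):

  /* smin/smax must sit fully inside the expected return range, e.g. [0, 2];
   * a tnum-based check would have accepted [0, 3] here. */
  if (reg->smin_value < range.minval || reg->smax_value > range.maxval)
          return -EINVAL;  /* rejected with a "has value ... should have been in ..." error */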
2023-12-02  bpf: enforce precision of R0 on callback return  (Andrii Nakryiko)

Given verifier checks actual value, r0 has to be precise, so we need to propagate precision properly. r0 also has to be marked as read, otherwise subsequent state comparisons will ignore such register as unimportant and precision won't really help here.

Fixes: 69c087ba6225 ("bpf: Add bpf_for_each_map_elem() helper")
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-02  bpf: provide correct register name for exception callback retval check  (Andrii Nakryiko)

bpf_throw() is checking R1, so let's report R1 in the log.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231202175705.885270-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-12-01  bpf: Fix a verifier bug due to incorrect branch offset comparison with cpu=v4  (Yonghong Song)

Bpf cpu=v4 support is introduced in [1] and commit 4cd58e9af8b9 ("bpf: Support new 32bit offset jmp instruction") added support for the new 32bit offset jmp instruction. Unfortunately, in function bpf_adj_delta_to_off(), for a new branch insn with 32bit offset, the offset (plus/minus a small delta) is compared against the 16-bit offset bound [S16_MIN, S16_MAX], which caused the following verification failure:

  $ ./test_progs-cpuv4 -t verif_scale_pyperf180
  ...
  insn 10 cannot be patched due to 16-bit range
  ...
  libbpf: failed to load object 'pyperf180.bpf.o'
  scale_test:FAIL:expect_success unexpected error: -12 (errno 12)
  #405 verif_scale_pyperf180:FAIL

Note that due to recent llvm18 development, the patch [2] (already applied in bpf-next) needs to be applied to the bpf tree for testing purposes.

The fix is rather simple. For a 32bit offset branch insn, the adjusted offset is compared against [S32_MIN, S32_MAX], and verification then succeeds.

[1] https://lore.kernel.org/all/20230728011143.3710005-1-yonghong.song@linux.dev
[2] https://lore.kernel.org/bpf/20231110193644.3130906-1-yonghong.song@linux.dev

Fixes: 4cd58e9af8b9 ("bpf: Support new 32bit offset jmp instruction")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231201024640.3417057-1-yonghong.song@linux.dev
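Conceptually, the fix picks the bound that matches the instruction's offset width (a sketch only; the 'has_32bit_off' predicate and surrounding details are illustrative):

  s64 off = (s64)old_off + delta;                  /* simplified */
  s64 min = has_32bit_off ? S32_MIN : S16_MIN;     /* 'has_32bit_off' is illustrative */
  s64 max = has_32bit_off ? S32_MAX : S16_MAX;

  if (off < min || off > max)
          return -ERANGE;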
2023-11-30  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-30

We've added 30 non-merge commits during the last 7 day(s) which contain a total of 58 files changed, 1598 insertions(+), 154 deletions(-).

The main changes are:

1) Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload, from Stanislav Fomichev with stmmac implementation from Song Yoong Siang.

2) Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques, from Andrii Nakryiko.

3) Add BPF link_info support for uprobe multi link along with bpftool integration for the latter, from Jiri Olsa.

4) Use pkg-config in BPF selftests to determine ld flags which is in particular needed for linking statically, from Akihiko Odaki.

5) Fix a few BPF selftest failures to adapt to the upcoming LLVM18, from Yonghong Song.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (30 commits)
  bpf/tests: Remove duplicate JSGT tests
  selftests/bpf: Add TX side to xdp_hw_metadata
  selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
  selftests/bpf: Add TX side to xdp_metadata
  selftests/bpf: Add csum helpers
  selftests/xsk: Support tx_metadata_len
  xsk: Add option to calculate TX checksum in SW
  xsk: Validate xsk_tx_metadata flags
  xsk: Document tx_metadata_len layout
  net: stmmac: Add Tx HWTS support to XDP ZC
  net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  tools: ynl: Print xsk-features from the sample
  xsk: Add TX timestamp and TX checksum offload support
  xsk: Support tx_metadata_len
  selftests/bpf: Use pkg-config for libelf
  selftests/bpf: Override PKG_CONFIG for static builds
  selftests/bpf: Choose pkg-config for the target
  bpftool: Add support to display uprobe_multi links
  selftests/bpf: Add link_info test for uprobe_multi link
  selftests/bpf: Use bpf_link__destroy in fill_link_info tests
  ...
====================

Conflicts:

Documentation/netlink/specs/netdev.yaml:
  839ff60df3ab ("net: page_pool: add nlspec for basic access to page pools")
  48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
https://lore.kernel.org/all/20231201094705.1ee3cab8@canb.auug.org.au/

While at it also regen, tree is dirty after:
  48eb03dd2630 ("xsk: Add TX timestamp and TX checksum offload support")
looks like code wasn't re-rendered after "render-max" was removed.

Link: https://lore.kernel.org/r/20231130145708.32573-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-30  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)

Cross-merge networking fixes after downstream PR.

No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-26  bpf: Add missed allocation hint for bpf_mem_cache_alloc_flags()  (Hou Tao)

bpf_mem_cache_alloc_flags() may call __alloc() directly when there is no free object in free list, but it doesn't initialize the allocation hint for the returned pointer. It may lead to bad memory dereference when freeing the pointer, so fix it by initializing the allocation hint.

Fixes: 822fb26bdb55 ("bpf: Add a hint to allocated objects.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231111043821.2258513-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-24  bpf: Validate global subprogs lazily  (Andrii Nakryiko)

Slightly change BPF verifier logic around eagerness and order of global subprog validation. Instead of going over every global subprog eagerly and validating it before the main (entry) BPF program is verified, turn it around. Validate the main program first, mark subprogs that were called from the main program for later verification, but otherwise assume they are valid. Afterwards, go over marked global subprogs and validate those, potentially marking some more global functions as being called. Continue this process until all (transitively) callable global subprogs are validated. It's a BFS traversal at its heart and will always converge.

This is an important change because it allows to feature-gate some subprograms that might not be verifiable on some older kernel, depending on the supported set of features. E.g., at some point, global functions were allowed to accept a pointer to memory whose size is identified by a user-provided type. Unfortunately, older kernels don't support this feature. With the BPF CO-RE approach, the natural way would be to still compile the BPF object file once and guard calls to this global subprog with some CO-RE check or using .rodata variables. That's what people do to guard usage of new helpers or kfuncs, and any other new BPF-side feature that might be missing on old kernels. That's currently impossible to do with global subprogs, unfortunately, because they are eagerly and unconditionally validated. This patch set aims to change this, so that in the future when global funcs gain new features, those can be guarded using BPF CO-RE techniques in the same fashion as any other new kernel feature.

Two selftests had to be adjusted in sync with these changes. test_global_func12 relied on eager global subprog validation failing before a main program failure is detected (unknown return value). Fix by making sure that the main program is always valid. verifier_subprog_precision's parent_stack_slot_precise subtest relied on the verifier checkpointing heuristic to do a checkpoint at instruction #5, but that's no longer true because we don't have enough jumps validated before reaching insn #5 due to global subprogs being validated later. Other than that, no changes, as one would expect.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-3-andrii@kernel.org
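A sketch of the CO-RE-style feature gating this enables on the BPF side (illustrative program, not from the patch; the section, variable and function names are made up):

  #include <bpf/bpf_helpers.h>

  char LICENSE[] SEC("license") = "GPL";

  /* Set from user space before load. If it stays 0, the verifier proves the
   * guarded call dead, and with lazy validation the global subprog below is
   * never verified, so it may rely on features older kernels can't handle. */
  const volatile int use_new_feature = 0;

  __noinline int new_feature_subprog(int x)
  {
          return x + 1;  /* stand-in for logic needing a newer verifier */
  }

  SEC("raw_tp/sys_enter")
  int prog(void *ctx)
  {
          if (use_new_feature)
                  return new_feature_subprog(1);
          return 0;
  }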
2023-11-24  bpf: Emit global subprog name in verifier logs  (Andrii Nakryiko)

We have the name, so instead of emitting just func#N to identify a global subprog, augment verifier log messages with the actual function name to make them more user-friendly.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20231124035937.403208-2-andrii@kernel.org
2023-11-23  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)

Cross-merge networking fixes after downstream PR.

Conflicts:

drivers/net/ethernet/intel/ice/ice_main.c
  c9663f79cd82 ("ice: adjust switchdev rebuild path")
  7758017911a4 ("ice: restore timestamp configuration after device reset")
https://lore.kernel.org/all/20231121211259.3348630-1-anthony.l.nguyen@intel.com/

Adjacent changes:

kernel/bpf/verifier.c
  bb124da69c47 ("bpf: keep track of max number of bpf_loop callback iterations")
  5f99f312bd3b ("bpf: add register bounds sanity checks and sanitization")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-21  Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  (Jakub Kicinski)

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-11-21

We've added 85 non-merge commits during the last 12 day(s) which contain a total of 63 files changed, 4464 insertions(+), 1484 deletions(-).

The main changes are:

1) Huge batch of verifier changes to improve BPF register bounds logic and range support along with a large test suite, and verifier log improvements, all from Andrii Nakryiko.

2) Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id, from Yafang Shao.

3) Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext, from Dave Marchevsky.

4) Fix bpf_get_task_stack() helper to add the correct crosstask check for the get_perf_callchain(), from Jordan Rome.

5) Fix BPF task_iter internals where lockless usage of next_thread() was wrong. The rework also simplifies the code, from Oleg Nesterov.

6) Fix uninitialized tail padding via LIBBPF_OPTS_RESET, and another fix for certain BPF UAPI structs to fix verifier failures seen in bpf_dynptr usage, from Yonghong Song.

7) Add BPF selftest fixes for map_percpu_stats flakes due to per-CPU BPF memory allocator not being able to allocate per-CPU pointer successfully, from Hou Tao.

8) Add prep work around dynptr and string handling for kfuncs which is later going to be used by file verification via BPF LSM and fsverity, from Song Liu.

9) Improve BPF selftests to update multiple prog_tests to use ASSERT_* macros, from Yuran Pereira.

10) Optimize LPM trie lookup to check prefixlen before walking the trie, from Florian Lehner.

11) Consolidate virtio/9p configs from BPF selftests in config.vm file given they are needed consistently across archs, from Manu Bretelle.

12) Small BPF verifier refactor to remove register_is_const(), from Shung-Hsi Yu.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (85 commits)
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in vmlinux
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_obj_id
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bind_perm
  selftests/bpf: Replaces the usage of CHECK calls for ASSERTs in bpf_tcp_ca
  selftests/bpf: reduce verboseness of reg_bounds selftest logs
  bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)
  bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()
  bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()
  bpf: emit frameno for PTR_TO_STACK regs if it differs from current one
  bpf: smarter verifier log number printing logic
  bpf: omit default off=0 and imm=0 in register state log
  bpf: emit map name in register state if applicable and available
  bpf: print spilled register state in stack slot
  bpf: extract register state printing
  bpf: move verifier state printing code to kernel/bpf/log.c
  bpf: move verbose_linfo() into kernel/bpf/log.c
  bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS
  bpf: Remove test for MOVSX32 with offset=32
  selftests/bpf: add iter test requiring range x range logic
  veristat: add ability to set BPF_F_TEST_SANITY_STRICT flag with -r flag
  ...
====================

Link: https://lore.kernel.org/r/20231122000500.28126-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-11-20  bpf: keep track of max number of bpf_loop callback iterations  (Eduard Zingerman)

In some cases the verifier can't infer convergence of the bpf_loop() iteration. E.g. for the following program:

  static int cb(__u32 idx, struct num_context* ctx)
  {
          ctx->i++;
          return 0;
  }

  SEC("?raw_tp")
  int prog(void *_)
  {
          struct num_context ctx = { .i = 0 };
          __u8 choice_arr[2] = { 0, 1 };

          bpf_loop(2, cb, &ctx, 0);
          return choice_arr[ctx.i];
  }

Each 'cb' simulation would eventually return to 'prog' and reach the 'return choice_arr[ctx.i]' statement. At which point ctx.i would be marked precise, thus forcing the verifier to track a multitude of separate states with {.i=0}, {.i=1}, ... at bpf_loop() callback entry.

This commit allows "brute force" handling for such cases by limiting the number of callback body simulations using the 'umax' value of the first bpf_loop() parameter.

For this, extend bpf_func_state with a 'callback_depth' field. Increment this field when the callback visiting state is pushed to the states traversal stack. For frame #N its 'callback_depth' field counts how many times a callback with frame depth N+1 has been executed. Use bpf_func_state specifically to allow independent tracking of callback depths when multiple nested bpf_loop() calls are present.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-11-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20  bpf: widening for callback iterators  (Eduard Zingerman)

Callbacks are similar to open coded iterators, so add imprecise widening logic for callback body processing. This makes callback based loops behave identically to open coded iterators, e.g. allowing to verify programs like below:

  struct ctx { u32 i; };

  int cb(u32 idx, struct ctx* ctx)
  {
          ++ctx->i;
          return 0;
  }

  ...
  struct ctx ctx = { .i = 0 };
  bpf_loop(100, cb, &ctx, 0);
  ...

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-9-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20  bpf: verify callbacks as if they are called unknown number of times  (Eduard Zingerman)

Prior to this patch callbacks were handled as regular function calls, execution of callback body was modeled exactly once. This patch updates callbacks handling logic as follows:
- introduces a function push_callback_call() that schedules callback body verification in env->head stack;
- updates prepare_func_exit() to reschedule callback body verification upon BPF_EXIT;
- as calls to bpf_*_iter_next(), calls to callback invoking functions are marked as checkpoints;
- is_state_visited() is updated to stop callback based iteration when some identical parent state is found.

Paths with callback function invoked zero times are now verified first, which leads to necessity to modify some selftests:
- the following negative tests required adding release/unlock/drop calls to avoid previously masked unrelated error reports:
  - cb_refs.c:underflow_prog
  - exceptions_fail.c:reject_rbtree_add_throw
  - exceptions_fail.c:reject_with_cp_reference
- the following precision tracking selftests needed change in expected log trace:
  - verifier_subprog_precision.c:callback_result_precise (note: r0 precision is no longer propagated inside callback and I think this is a correct behavior)
  - verifier_subprog_precision.c:parent_callee_saved_reg_precise_with_callback
  - verifier_subprog_precision.c:parent_stack_slot_precise_with_callback

Reported-by: Andrew Werner <awerner32@gmail.com>
Closes: https://lore.kernel.org/bpf/CA+vRuzPChFNXmouzGG+wsy=6eMcfr1mFG0F3g7rbg-sedGKW3w@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20  bpf: extract setup_func_entry() utility function  (Eduard Zingerman)

Move code for simulated stack frame creation to a separate utility function. This function would be used in the follow-up change for callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-20  bpf: extract __check_reg_arg() utility function  (Eduard Zingerman)

Split check_reg_arg() into two utility functions:
- check_reg_arg() operating on registers from current verifier state;
- __check_reg_arg() operating on a specific set of registers passed as a parameter;

The __check_reg_arg() function would be used by a follow-up change for callbacks handling.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20231121020701.26440-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19  bpf: bpf_iter_task_next: use next_task(kit->task) rather than next_task(kit->pos)  (Oleg Nesterov)

This looks more clear and simplifies the code. While at it, remove the unnecessary initialization of pos/task at the start of bpf_iter_task_new().

Note that we can even kill kit->task, we can just use pos->group_leader, but I don't understand the BUILD_BUG_ON() checks in bpf_iter_task_new().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163239.GA903@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19  bpf: bpf_iter_task_next: use __next_thread() rather than next_thread()  (Oleg Nesterov)

Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c is the last user and the usage is wrong.

bpf_iter_task_next() can loop forever, "kit->pos == kit->task" can never happen if kit->pos execs. Change this code to use __next_thread().

With or without this change the usage of kit->pos/task and next_task() doesn't look nice, see the next patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163237.GA897@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-19  bpf: task_group_seq_get_next: use __next_thread() rather than next_thread()  (Oleg Nesterov)

Lockless use of next_thread() should be avoided, kernel/bpf/task_iter.c is the last user and the usage is wrong.

task_group_seq_get_next() can return the group leader twice if it races with mt-thread exec which changes the group->leader's pid.

Change the main loop to use __next_thread(), kill "next_tid == common->pid" check. __next_thread() can't loop forever, we can also change this code to retry if next_tid == 0.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20231114163234.GA890@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18  bpf: emit frameno for PTR_TO_STACK regs if it differs from current one  (Andrii Nakryiko)

It's possible to pass a pointer to a parent's stack to child subprogs. In such a case verifier state output is ambiguous, not showing whether the register contains a pointer to the "current" stack, belonging to the current subprog (frame), or it's actually a pointer to one of the parent frames. So emit this information when the register's frame number differs from the current one. E.g., if the current state is in frame 2 and it has a register pointing to the stack in a grand parent state (frame #0), we'll see something like 'R1=fp[0]-16', while the "local stack pointer" will be just 'R2=fp-16'.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18  bpf: smarter verifier log number printing logic  (Andrii Nakryiko)

Instead of always printing numbers as either decimals (and in some cases, like for "imm=%llx", in hexadecimals), decide the form based on actual values. For numbers in a reasonably small range (currently, [0, U16_MAX] for unsigned values, and [S16_MIN, S16_MAX] for signed ones), emit them as decimals. In all other cases, even for signed values, emit them in hexadecimals.

For large values hex form is often times way more useful: it's easier to see an exact difference between 0xffffffff80000000 and 0xffffffff7fffffff, than between 18446744071562067966 and 18446744071562067967, as one particular example.

Small values representing small pointer offsets or application constants, on the other hand, are way more useful to be represented in decimal notation.

Adjust reg_bounds register state parsing logic to take into account this change.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-8-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
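The heuristic for signed values is essentially the following (a sketch following the thresholds described above; the helper name is illustrative):

  static void verbose_snum(struct bpf_verifier_env *env, s64 num)
  {
          if (num >= S16_MIN && num <= S16_MAX)
                  verbose(env, "%lld", num);    /* small value: decimal */
          else
                  verbose(env, "%#llx", num);   /* large value: hex */
  }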
2023-11-18  bpf: omit default off=0 and imm=0 in register state log  (Andrii Nakryiko)

Simplify BPF verifier log further by omitting default (and frequently irrelevant) off=0 and imm=0 parts for non-SCALAR_VALUE registers. As can be seen from fixed tests, this is often a visual noise for PTR_TO_CTX register and even for PTR_TO_PACKET registers.

Omitting default values follows the rest of register state logic: we omit default values to keep verifier log succinct and to highlight interesting state that deviates from default one. E.g., we do the same for var_off, when it's unknown, which gives no additional information.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18  bpf: emit map name in register state if applicable and available  (Andrii Nakryiko)

In complicated real-world applications, whenever debugging some verification error through verifier log, it often would be very useful to see map name for PTR_TO_MAP_VALUE register. Usually this needs to be inferred from key/value sizes and maybe trying to guess C code location, but it's not always clear.

Given verifier has the name, and it's never too long, let's just emit it for ptr_to_map_key, ptr_to_map_value, and const_ptr_to_map registers. We reshuffle the order a bit, so that map name, key size, and value size appear before offset and immediate values, which seems like a more logical order.

Current output:

  R1_w=map_ptr(map=array_map,ks=4,vs=8,off=0,imm=0)

But we'll get rid of useless off=0 and imm=0 parts in the next patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18  bpf: print spilled register state in stack slot  (Andrii Nakryiko)

Print the same register state representation when printing stack state, as we do for normal registers. Note that if a stack slot contains a subregister spill (1, 2, or 4 bytes long), we'll still emit the "m0?" mask for those bytes that are not part of the spilled register. Which means we can get something like fp-8=0000scalar() for a 4-byte spill with the other 4 bytes still being STACK_ZERO.

Some examples before and after, taken from the log of pyperf_subprogs.bpf.o:

  49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx
  49: (7b) *(u64 *)(r10 -256) = r1      ; frame1: R1_w=ctx(off=0,imm=0) R10=fp0 fp-256_w=ctx(off=0,imm=0)

  150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null
  150: (7b) *(u64 *)(r10 -264) = r0     ; frame1: R0_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0) R10=fp0 fp-264_w=map_value_or_null(id=6,off=0,ks=192,vs=4,imm=0)

  5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=
  5192: (61) r1 = *(u32 *)(r10 -272)    ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf)) R10=fp0 fp-272=????scalar(smin=smin32=0,smax=umax=smax32=umax32=15,var_off=(0x0; 0xf))

While at it, do a few other simple clean ups:
- skip a slot if it's not scratched before detecting whether it's valid;
- move taking the spilled_reg pointer outside of switch (only DYNPTR has to adjust that to get to the "main" slot);
- don't recalculate types_buf a second time for the MISC/ZERO/default case.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18  bpf: extract register state printing  (Andrii Nakryiko)

Extract printing register state representation logic into a separate helper, as we are going to reuse it for spilled register state printing in the next patch. This also nicely reduces code nestedness. No functional changes.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18  bpf: move verifier state printing code to kernel/bpf/log.c  (Andrii Nakryiko)

Move a good chunk of code from verifier.c to log.c: verifier state verbose printing logic. This is important and very much logging/debugging oriented code. It fits log.c's overall focus on verifier logging, and moving it allows us to keep growing it without unnecessarily adding to verifier.c code that otherwise contains core verification logic.

There are not many shared dependencies between this code and the rest of verifier.c code, except a few single-line helpers for various register type checks and a bit of state "scratching" helpers. We move all such trivial helpers into include/bpf/bpf_verifier.h as static inlines.

No functional changes in this patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-18  bpf: move verbose_linfo() into kernel/bpf/log.c  (Andrii Nakryiko)

verifier.c is huge. Let's try to move out parts that are logging-related into log.c, as we previously did with bpf_log() and other related stuff. This patch moves line info verbose output routines: it's pretty self-contained and isolated code, so there is no problem with this.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231118034623.3320920-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-17  bpf: rename BPF_F_TEST_SANITY_STRICT to BPF_F_TEST_REG_INVARIANTS  (Andrii Nakryiko)

Rename verifier internal flag BPF_F_TEST_SANITY_STRICT to more neutral BPF_F_TEST_REG_INVARIANTS. This is a follow up to [0].

A few selftests and veristat need to be adjusted in the same patch as well.

[0] https://patchwork.kernel.org/project/netdevbpf/patch/20231112010609.848406-5-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231117171404.225508-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15  bpf: make __reg{32,64}_deduce_bounds logic more robust  (Andrii Nakryiko)

This change doesn't seem to have any effect on selftests and production BPF object files, but we preemptively try to make it more robust.

First, the "learn sign from signed bounds" comment is misleading, as we are learning not just the sign, but also values.

Second, we simplify the check for determining whether the entire range is positive or negative similarly to other checks added earlier, using appropriate u32/u64 casts and single comparisons. As explained in the comments in __reg64_deduce_bounds(), the checks are equivalent.

Last but not least, smin/smax and s32_min/s32_max reassignment based on min/max of both umin/umax and smin/smax (and 32-bit equivalents) is hard to explain and justify. We are updating unsigned bounds from signed bounds, why would we update signed bounds at the same time? This might be correct, but it's far from obvious why, and the code or comments don't try to justify this. Given we've added a separate deduction of signed bounds from unsigned bounds earlier, this seems at least redundant, if not just wrong.

In short, we remove doubtful pieces, and streamline the rest to follow the logic and approach of the rest of reg_bounds_sync() checks.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-7-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15  bpf: remove redundant s{32,64} -> u{32,64} deduction logic  (Andrii Nakryiko)

Equivalent checks were recently added in more succinct and, arguably, safer form in:
- f188765f23a5 ("bpf: derive smin32/smax32 from umin32/umax32 bounds");
- 2e74aef782d3 ("bpf: derive smin/smax from umin/max bounds").

The checks we are removing in this patch set do similar checks to detect if the entire u32/u64 range has the sign bit set or not set, but do it with two separate checks.

Further, we forcefully overwrite either smin or smax (and 32-bit equivalents) without applying normal min/max intersection logic. It's not clear why that would be correct in all cases and it seems to work by accident. This logic is also "gated" by the previous signed -> unsigned derivation, which returns early.

All this is quite confusing and seems error-prone, while we already have at least equivalent checks happening earlier. So remove this duplicate and error-prone logic to simplify things a bit.

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15  bpf: add register bounds sanity checks and sanitization  (Andrii Nakryiko)

Add simple sanity checks that validate well-formed ranges (min <= max) across u64, s64, u32, and s32 ranges. Also for cases when the value is constant (either 64-bit or 32-bit), we validate that ranges and tnums are in agreement.

These bounds checks are performed at the end of BPF_ALU/BPF_ALU64 operations, on conditional jumps, and for LDX instructions (where subreg zero/sign extension is probably the most important to check). This covers most of the interesting cases.

Also, we validate the sanity of the return register when manually adjusting it for some special helpers.

By default, sanity violation will trigger a warning in verifier log and resetting register bounds to "unbounded" ones. But to aid development and debugging, BPF_F_TEST_SANITY_STRICT flag is added, which will trigger hard failure of verification with -EFAULT on register bounds violations. This allows selftests to catch such issues. veristat will also gain a CLI option to enable this behavior.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
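The well-formedness part of the check amounts to the following (the bpf_reg_state field names are real; the handler is hypothetical and error handling is simplified):

  /* A sane range always has min <= max in every width/signedness. */
  if (reg->umin_value > reg->umax_value ||
      reg->smin_value > reg->smax_value ||
      reg->u32_min_value > reg->u32_max_value ||
      reg->s32_min_value > reg->s32_max_value)
          /* reset to unbounded, or fail with -EFAULT under the strict test flag */
          reset_reg_range(env, reg);  /* hypothetical handler */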
2023-11-15  bpf: enhance BPF_JEQ/BPF_JNE is_branch_taken logic  (Andrii Nakryiko)

Use 32-bit subranges to prune some 64-bit BPF_JEQ/BPF_JNE conditions that otherwise would be "inconclusive" (i.e., is_branch_taken() would return -1). This can happen, for example, when registers are initialized as 64-bit u64/s64, then compared for inequality as 32-bit subregisters, and then followed by 64-bit equality/inequality check. That 32-bit inequality can establish some pattern for lower 32 bits of a register (e.g., s< 0 condition determines whether the bit #31 is zero or not), while overall 64-bit value could be anything (according to a value range representation).

This is not a fancy quirky special case, but actually a handling that's necessary to prevent correctness issue with BPF verifier's range tracking: set_range_min_max() assumes that register ranges are non-overlapping, and if that condition is not guaranteed by is_branch_taken() we can end up with invalid ranges, where min > max.

[0] https://lore.kernel.org/bpf/CACkBjsY2q1_fUohD7hRmKGqv1MV=eP2f6XK8kjkYNw7BaiF8iQ@mail.gmail.com/

Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231112010609.848406-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
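The kind of program pattern this helps with looks roughly like the following (illustrative BPF-style C; 'get_u64()' and 'unreachable_branch()' are stand-ins):

  __u64 x = get_u64();            /* full 64-bit unknown scalar */

  if ((int)x >= 0)                /* 32-bit signed check: learns bit #31 of the low half */
          return 0;
  /* Here the low 32 bits are known to have bit #31 set, while the full
   * 64-bit range is still unbounded ... */
  if (x == 0)                     /* ... yet this 64-bit JEQ is now conclusively false,
                                   * because the 32-bit subrange excludes 0. */
          unreachable_branch();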
2023-11-15  bpf: generalize is_scalar_branch_taken() logic  (Andrii Nakryiko)

Generalize is_branch_taken logic for SCALAR_VALUE register to handle cases when both registers are not constants. Previously supported <range> vs <scalar> cases are a natural subset of more generic <range> vs <range> set of cases.

Generalized logic relies on straightforward segment intersection checks.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
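For example, for an unsigned 64-bit BPF_JEQ between two ranges, the segment-intersection check is simply (a sketch of the idea, not the full function):

  /* r1 == r2 can only be true if [r1.umin, r1.umax] and [r2.umin, r2.umax]
   * overlap; disjoint segments mean the branch is never taken. */
  if (reg1->umin_value > reg2->umax_value ||
      reg2->umin_value > reg1->umax_value)
          return 0;       /* branch not taken */
  /* ... otherwise inconclusive (-1), unless both are the same constant */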
2023-11-15  bpf: generalize reg_set_min_max() to handle non-const register comparisons  (Andrii Nakryiko)

Generalize the bounds adjustment logic of reg_set_min_max() to handle not just the register vs constant case, but in general any register vs any register cases. For most of the operations it's a trivial extension based on range vs range comparison logic, we just need to properly pick min/max of a range to compare against min/max of the other range.

For BPF_JSET we keep the original capabilities, just make sure JSET is integrated in the common framework. This is manifested in the internal-only BPF_JSET + BPF_X "opcode" to allow for simpler and more uniform rev_opcode() handling. See the code for details. This allows to reuse the same code exactly both for TRUE and FALSE branches without explicitly handling both conditions with custom code.

Note also that now we don't need a special handling of the BPF_JEQ/BPF_JNE case when none of the registers are constants. This is now just a normal generic case handled by reg_set_min_max().

To make tnum handling cleaner, a tnum_with_subreg() helper is added, as that's a common operation when dealing with 32-bit subregister bounds. This keeps the overall logic much less noisy when it comes to tnums.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20231112010609.848406-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-15  bpf: Do not allocate percpu memory at init stage  (Yonghong Song)

Kirill Shutemov reported a significant percpu memory consumption increase after booting in a 288-cpu VM ([1]) due to commit 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation"). The percpu memory consumption is increased from 111MB to 969MB. The number is from /proc/meminfo.

I tried to reproduce the issue with my local VM which at most supports up to 255 cpus. With 252 cpus, without the above commit, the percpu memory consumption immediately after boot is 57MB while with the above commit the percpu memory consumption is 231MB.

This is not good since so far percpu memory from the bpf memory allocator is not widely used yet. Let us change pre-allocation at the init stage to on-demand allocation when the verifier detects there is a need for percpu memory for a bpf program. With this change, percpu memory consumption after boot can be reduced significantly.

[1] https://lore.kernel.org/lkml/20231109154934.4saimljtqx625l3v@box.shutemov.name/

Fixes: 41a5db8d8161 ("bpf: Add support for non-fix-size percpu mem allocation")
Reported-and-tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20231111013928.948838-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-14  bpf: Add a new kfunc for cgroup1 hierarchy  (Yafang Shao)

A new kfunc is added to acquire cgroup1 of a task:

- bpf_task_get_cgroup1
  Acquires the associated cgroup of a task within a specific cgroup1 hierarchy. The cgroup1 hierarchy is identified by its hierarchy ID.

This new kfunc enables the tracing of tasks within a designated container or cgroup directory in BPF programs.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20231111090034.4248-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
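An illustrative BPF-side usage sketch (the __ksym declarations are written from the description here rather than copied from the patch, and the hierarchy ID is just an example value):

  extern struct cgroup *bpf_task_get_cgroup1(struct task_struct *task,
                                             int hierarchy_id) __ksym;
  extern void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

  /* somewhere in a tracing program with a valid 'task' pointer: */
  struct cgroup *cgrp;

  cgrp = bpf_task_get_cgroup1(task, 1 /* example hierarchy ID */);
  if (!cgrp)
          return 0;
  /* ... filter or trace based on cgrp ... */
  bpf_cgroup_release(cgrp);       /* the kfunc acquires a reference; release it */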
2023-11-10  bpf: Add crosstask check to __bpf_get_stack  (Jordan Rome)

Currently get_perf_callchain only supports user stack walking for the current task. Passing the correct *crosstask* param will return 0 frames if the task passed to __bpf_get_stack isn't the current one instead of a single incorrect frame/address. This change passes the correct *crosstask* param but also does a preemptive check in __bpf_get_stack if the task is current and returns -EOPNOTSUPP if it is not.

This issue was found using bpf_get_task_stack inside a BPF iterator ("iter/task"), which iterates over all tasks. bpf_get_task_stack works fine for fetching kernel stacks but because get_perf_callchain relies on the caller to know if the requested *task* is the current one (via *crosstask*) it was failing in a confusing way.

It might be possible to get user stacks for all tasks utilizing something like access_process_vm but that requires the bpf program calling bpf_get_task_stack to be sleepable and would therefore be a breaking change.

Fixes: fa28dcb82a38 ("bpf: Introduce helper bpf_get_task_stack()")
Signed-off-by: Jordan Rome <jordalgo@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20231108112334.3433136-1-jordalgo@meta.com
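The preemptive check amounts to roughly the following (a sketch, not the exact hunk):

  /* User-stack walking only works for the current task; bail out early
   * instead of silently returning 0 frames. */
  if (task != current && (flags & BPF_F_USER_STACK))
          return -EOPNOTSUPP;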
2023-11-09  bpf: fix control-flow graph checking in privileged mode  (Andrii Nakryiko)

When a BPF program is verified in privileged mode, the BPF verifier allows bounded loops. This means that from the CFG point of view there are definitely some back-edges. The original commit adjusted check_cfg() logic to not detect back-edges in the control flow graph if they result from conditional jumps, with the idea that the subsequent full BPF verification process will determine whether such loops are bounded or not, and either accept or reject the BPF program. At least that's my reading of the intent.

Unfortunately, the implementation of this idea doesn't work correctly in all possible situations. A conditional jump might not result in an immediate back-edge, but just a few unconditional instructions later we can arrive at a back-edge. In such situations check_cfg() would reject the BPF program even in privileged mode, despite it possibly being a bounded loop. The next patch adds one simple program demonstrating such a scenario.

To keep things simple, instead of trying to detect back edges in privileged mode, just assume every back edge is valid and let subsequent BPF verification prove or reject bounded loops.

Note a few test changes. For unknown reason, we have a few tests that are specified to detect a back-edge in privileged mode, but looking at their code it seems like the right outcome is passing check_cfg() and letting subsequent verification make a decision about bounded or not bounded looping.

The bounded recursion case is also interesting. The example should pass, as recursion is limited to just a few levels and so we never reach the maximum number of nested frames and never exhaust the maximum stack depth. But the way that the max stack depth logic works today it falsely detects this as exceeding the max nested frame count. This patch series doesn't attempt to fix this orthogonal problem, so we just adjust the expected verifier failure.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
Reported-by: Hao Sun <sunhao.th@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20231110061412.2995786-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>