linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2023-11-09	Merge branch 'bpf-control-flow-graph-and-precision-backtrack-fixes'	Alexei Starovoitov
	Andrii Nakryiko says: ==================== BPF control flow graph and precision backtrack fixes A small fix to BPF verifier's CFG logic around handling and reporting ldimm64 instructions. Patch #1 was previously submitted separately ([0]), and so this patch set supersedes that patch. Second patch is fixing obscure corner case in mark_chain_precise() logic. See patch for details. Patch #3 adds a dedicated test, however fragile it might. [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231101205626.119243-1-andrii@kernel.org/ ==================== Link: https://lore.kernel.org/r/20231110002638.4168352-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: add edge case backtracking logic test	Andrii Nakryiko
	Add a dedicated selftests to try to set up conditions to have a state with same first and last instruction index, but it actually is a loop 3->4->1->2->3. This confuses mark_chain_precision() if verifier doesn't take into account jump history. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231110002638.4168352-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: fix precision backtracking instruction iteration	Andrii Nakryiko
	Fix an edge case in __mark_chain_precision() which prematurely stops backtracking instructions in a state if it happens that state's first and last instruction indexes are the same. This situations doesn't necessarily mean that there were no instructions simulated in a state, but rather that we starting from the instruction, jumped around a bit, and then ended up at the same instruction before checkpointing or marking precision. To distinguish between these two possible situations, we need to consult jump history. If it's empty or contain a single record "bridging" parent state and first instruction of processed state, then we indeed backtracked all instructions in this state. But if history is not empty, we are definitely not done yet. Move this logic inside get_prev_insn_idx() to contain it more nicely. Use -ENOENT return code to denote "we are out of instructions" situation. This bug was exposed by verifier_loop1.c's bounded_recursion subtest, once the next fix in this patch set is applied. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Fixes: b5dc0163d8fd ("bpf: precise scalar_value tracking") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231110002638.4168352-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: handle ldimm64 properly in check_cfg()	Andrii Nakryiko
	ldimm64 instructions are 16-byte long, and so have to be handled appropriately in check_cfg(), just like the rest of BPF verifier does. This has implications in three places: - when determining next instruction for non-jump instructions; - when determining next instruction for callback address ldimm64 instructions (in visit_func_call_insn()); - when checking for unreachable instructions, where second half of ldimm64 is expected to be unreachable; We take this also as an opportunity to report jump into the middle of ldimm64. And adjust few test_verifier tests accordingly. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reported-by: Hao Sun <sunhao.th@gmail.com> Fixes: 475fb78fbf48 ("bpf: verifier (add branch/goto checks)") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231110002638.4168352-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests: bpf: xskxceiver: ksft_print_msg: fix format type error	Anders Roxell
	Crossbuilding selftests/bpf for architecture arm64, format specifies type error show up like. xskxceiver.c:912:34: error: format specifies type 'int' but the argument has type '__u64' (aka 'unsigned long long') [-Werror,-Wformat] ksft_print_msg("[%s] expected meta_count [%d], got meta_count [%d]\n", ~~ %llu __func__, pkt->pkt_nb, meta->count); ^~~~~~~~~~~ xskxceiver.c:929:55: error: format specifies type 'unsigned long long' but the argument has type 'u64' (aka 'unsigned long') [-Werror,-Wformat] ksft_print_msg("Frag invalid addr: %llx len: %u\n", addr, len); ~~~~ ^~~~ Fixing the issues by casting to (unsigned long long) and changing the specifiers to be %llu from %d and %u, since with u64s it might be %llx or %lx, depending on architecture. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Link: https://lore.kernel.org/r/20231109174328.1774571-1-anders.roxell@linaro.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Use named fields for certain bpf uapi structs	Yonghong Song
	Martin and Vadim reported a verifier failure with bpf_dynptr usage. The issue is mentioned but Vadim workarounded the issue with source change ([1]). The below describes what is the issue and why there is a verification failure. int BPF_PROG(skb_crypto_setup) { struct bpf_dynptr algo, key; ... bpf_dynptr_from_mem(..., ..., 0, &algo); ... } The bpf program is using vmlinux.h, so we have the following definition in vmlinux.h: struct bpf_dynptr { long: 64; long: 64; }; Note that in uapi header bpf.h, we have struct bpf_dynptr { long: 64; long: 64; } __attribute__((aligned(8))); So we lost alignment information for struct bpf_dynptr by using vmlinux.h. Let us take a look at a simple program below: $ cat align.c typedef unsigned long long __u64; struct bpf_dynptr_no_align { __u64 :64; __u64 :64; }; struct bpf_dynptr_yes_align { __u64 :64; __u64 :64; } __attribute__((aligned(8))); void bar(void , void ); int foo() { struct bpf_dynptr_no_align a; struct bpf_dynptr_yes_align b; bar(&a, &b); return 0; } $ clang --target=bpf -O2 -S -emit-llvm align.c Look at the generated IR file align.ll: ... %a = alloca %struct.bpf_dynptr_no_align, align 1 %b = alloca %struct.bpf_dynptr_yes_align, align 8 ... The compiler dictates the alignment for struct bpf_dynptr_no_align is 1 and the alignment for struct bpf_dynptr_yes_align is 8. So theoretically compiler could allocate variable %a with alignment 1 although in reallity the compiler may choose a different alignment by considering other local variables. In [1], the verification failure happens because variable 'algo' is allocated on the stack with alignment 4 (fp-28). But the verifer wants its alignment to be 8. To fix the issue, the RFC patch ([1]) tried to add '__attribute__((aligned(8)))' to struct bpf_dynptr plus other similar structs. Andrii suggested that we could directly modify uapi struct with named fields like struct 'bpf_iter_num': struct bpf_iter_num { /* opaque iterator state; having __u64 here allows to preserve correct * alignment requirements in vmlinux.h, generated from BTF */ __u64 __opaque[1]; } __attribute__((aligned(8))); Indeed, adding named fields for those affected structs in this patch can preserve alignment when bpf program references them in vmlinux.h. With this patch, the verification failure in [1] can also be resolved. [1] https://lore.kernel.org/bpf/1b100f73-7625-4c1f-3ae5-50ecf84d3ff0@linux.dev/ [2] https://lore.kernel.org/bpf/20231103055218.2395034-1-yonghong.song@linux.dev/ Cc: Vadim Fedorenko <vadfed@meta.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231104024900.1539182-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	Merge branch 'allow-bpf_refcount_acquire-of-mapval-obtained-via-direct-ld'	Alexei Starovoitov
	Dave Marchevsky says: ==================== Allow bpf_refcount_acquire of mapval obtained via direct LD Consider this BPF program: struct cgv_node { int d; struct bpf_refcount r; struct bpf_rb_node rb; }; struct val_stash { struct cgv_node __kptr v; }; struct { __uint(type, BPF_MAP_TYPE_ARRAY); __type(key, int); __type(value, struct val_stash); __uint(max_entries, 10); } array_map SEC(".maps"); long bpf_program(void ctx) { struct val_stash mapval; struct cgv_node p; int idx = 0; mapval = bpf_map_lookup_elem(&array_map, &idx); if (!mapval \|\| !mapval->v) { /* omitted / } p = bpf_refcount_acquire(mapval->v); / Verification FAILs here / / Add p to some tree / return 0; } Verification fails on the refcount_acquire: 160: (79) r1 = (u64 )(r8 +8) ; R1_w=untrusted_ptr_or_null_cgv_node(id=11,off=0,imm=0) R8_w=map_value(id=10,off=0,ks=8,vs=16,imm=0) refs=6 161: (b7) r2 = 0 ; R2_w=0 refs=6 162: (85) call bpf_refcount_acquire_impl#117824 arg#0 is neither owning or non-owning ref The above verifier dump is actually from sched_ext's scx_flatcg [0], which is the motivating usecase for this series' changes. Specifically, scx_flatcg stashes a rb_node type w/ cgroup-specific info (struct cgv_node) in a map when the cgroup is created, then later puts that cgroup's node in a rbtree in .enqueue . Making struct cgv_node refcounted would simplify the code a bit by virtue of allowing us to remove the kptr_xchg's, but "later puts that cgroups node in a rbtree" is not possible without a refcount_acquire, which suffers from the above verification failure. If we get rid of PTR_UNTRUSTED flag, and add MEM_ALLOC \| NON_OWN_REF, mapval->v would be a non-owning ref and verification would succeed. Due to the most recent set of refcount changes [1], which modified bpf_obj_drop behavior to not reuse refcounted graph node's underlying memory until after RCU grace period, this is safe to do. Once mapval->v has the correct flags it _is_ a non-owning reference and verification of the motivating example will succeed. [0]: https://github.com/sched-ext/sched_ext/blob/52911e1040a0f94b9c426dddcc00be5364a7ae9f/tools/sched_ext/scx_flatcg.bpf.c#L275 [1]: https://lore.kernel.org/bpf/20230821193311.3290257-1-davemarchevsky@fb.com/ Summary of patches: Patch 1 fixes an issue with bpf_refcount_acquire verification letting MAYBE_NULL ptrs through * Patch 2 tests Patch 1's fix * Patch 3 broadens the use of "free only after RCU GP" to all user-allocated types * Patch 4 is a small nonfunctional refactoring * Patch 5 changes verifier to mark direct LD of stashed graph node kptr as non-owning ref * Patch 6 tests Patch 5's verifier changes Changelog: v1 -> v2: https://lore.kernel.org/bpf/20231025214007.2920506-1-davemarchevsky@fb.com/ Series title changed to "Allow bpf_refcount_acquire of mapval obtained via direct LD". V1's title was mistakenly truncated. * Patch 5 ("bpf: Mark direct ld of stashed bpf_{rb,list}_node as non-owning ref") * Direct LD of percpu kptr should not have MEM_ALLOC flag (Yonghong) * Patch 6 ("selftests/bpf: Test bpf_refcount_acquire of node obtained via direct ld") * Test read from stashed value as well ==================== Link: https://lore.kernel.org/r/20231107085639.3016113-1-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: replace register_is_const() with is_reg_const()	Shung-Hsi Yu
	The addition of is_reg_const() in commit 171de12646d2 ("bpf: generalize is_branch_taken to handle all conditional jumps in one place") has made the register_is_const() redundant. Give the former has more feature, plus the fact the latter is only used in one place, replace register_is_const() with is_reg_const(), and remove the definition of register_is_const. This requires moving the definition of is_reg_const() further up. And since the comment of reg_const_value() reference is_reg_const(), move it up as well. Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231108140043.12282-1-shung-hsi.yu@suse.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Test bpf_refcount_acquire of node obtained via direct ld	Dave Marchevsky
	This patch demonstrates that verifier changes earlier in this series result in bpf_refcount_acquire(mapval->stashed_kptr) passing verification. The added test additionally validates that stashing a kptr in mapval and - in a separate BPF program - refcount_acquiring the kptr without unstashing works as expected at runtime. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20231107085639.3016113-7-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	veristat: add ability to filter top N results	Andrii Nakryiko
	Add ability to filter top B results, both in replay/verifier mode and comparison mode. Just adding `-n10` will emit only first 10 rows, or less, if there is not enough rows. This is not just a shortcut instead of passing veristat output through `head`, though. Filtering out all the other rows influences final table formatting, as table column widths are calculated based on actual emitted test. To demonstrate the difference, compare two "equivalent" forms below, one using head and another using -n argument. TOP N FEATURE ============= [vmuser@archvm bpf]$ sudo ./veristat -C ~/baseline-results-selftests.csv ~/sanity2-results-selftests.csv -e file,prog,insns,states -s '\|insns_diff\|' -n10 File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ---------------------------------------- --------------------- --------- --------- ------------ ---------- ---------- ------------- test_seg6_loop.bpf.linked3.o __add_egr_x 12440 12360 -80 (-0.64%) 364 357 -7 (-1.92%) async_stack_depth.bpf.linked3.o async_call_root_check 145 145 +0 (+0.00%) 3 3 +0 (+0.00%) async_stack_depth.bpf.linked3.o pseudo_call_check 139 139 +0 (+0.00%) 3 3 +0 (+0.00%) atomic_bounds.bpf.linked3.o sub 7 7 +0 (+0.00%) 0 0 +0 (+0.00%) bench_local_storage_create.bpf.linked3.o kmalloc 5 5 +0 (+0.00%) 0 0 +0 (+0.00%) bench_local_storage_create.bpf.linked3.o sched_process_fork 22 22 +0 (+0.00%) 2 2 +0 (+0.00%) bench_local_storage_create.bpf.linked3.o socket_post_create 23 23 +0 (+0.00%) 2 2 +0 (+0.00%) bind4_prog.bpf.linked3.o bind_v4_prog 358 358 +0 (+0.00%) 33 33 +0 (+0.00%) bind6_prog.bpf.linked3.o bind_v6_prog 429 429 +0 (+0.00%) 37 37 +0 (+0.00%) bind_perm.bpf.linked3.o bind_v4_prog 15 15 +0 (+0.00%) 1 1 +0 (+0.00%) PIPING TO HEAD ============== [vmuser@archvm bpf]$ sudo ./veristat -C ~/baseline-results-selftests.csv ~/sanity2-results-selftests.csv -e file,prog,insns,states -s '\|insns_diff\|' \| head -n12 File Program Insns (A) Insns (B) Insns (DIFF) States (A) States (B) States (DIFF) ----------------------------------------------------- ---------------------------------------------------- --------- --------- ------------ ---------- ---------- ------------- test_seg6_loop.bpf.linked3.o __add_egr_x 12440 12360 -80 (-0.64%) 364 357 -7 (-1.92%) async_stack_depth.bpf.linked3.o async_call_root_check 145 145 +0 (+0.00%) 3 3 +0 (+0.00%) async_stack_depth.bpf.linked3.o pseudo_call_check 139 139 +0 (+0.00%) 3 3 +0 (+0.00%) atomic_bounds.bpf.linked3.o sub 7 7 +0 (+0.00%) 0 0 +0 (+0.00%) bench_local_storage_create.bpf.linked3.o kmalloc 5 5 +0 (+0.00%) 0 0 +0 (+0.00%) bench_local_storage_create.bpf.linked3.o sched_process_fork 22 22 +0 (+0.00%) 2 2 +0 (+0.00%) bench_local_storage_create.bpf.linked3.o socket_post_create 23 23 +0 (+0.00%) 2 2 +0 (+0.00%) bind4_prog.bpf.linked3.o bind_v4_prog 358 358 +0 (+0.00%) 33 33 +0 (+0.00%) bind6_prog.bpf.linked3.o bind_v6_prog 429 429 +0 (+0.00%) 37 37 +0 (+0.00%) bind_perm.bpf.linked3.o bind_v4_prog 15 15 +0 (+0.00%) 1 1 +0 (+0.00%) Note all the wasted whitespace in the "PIPING TO HEAD" variant. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231108051430.1830950-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Mark direct ld of stashed bpf_{rb,list}_node as non-owning ref	Dave Marchevsky
	This patch enables the following pattern: /* mapval contains a __kptr pointing to refcounted local kptr / mapval = bpf_map_lookup_elem(&map, &idx); if (!mapval \|\| !mapval->some_kptr) { / omitted */ } p = bpf_refcount_acquire(&mapval->some_kptr); Currently this doesn't work because bpf_refcount_acquire expects an owning or non-owning ref. The verifier defines non-owning ref as a type: PTR_TO_BTF_ID \| MEM_ALLOC \| NON_OWN_REF while mapval->some_kptr is PTR_TO_BTF_ID \| PTR_UNTRUSTED. It's possible to do the refcount_acquire by first bpf_kptr_xchg'ing mapval->some_kptr into a temp kptr, refcount_acquiring that, and xchg'ing back into mapval, but this is unwieldy and shouldn't be necessary. This patch modifies btf_ld_kptr_type such that user-allocated types are marked MEM_ALLOC and if those types have a bpf_{rb,list}_node they're marked NON_OWN_REF as well. Additionally, due to changes to bpf_obj_drop_impl earlier in this series, rcu_protected_object now returns true for all user-allocated types, resulting in mapval->some_kptr being marked MEM_RCU. After this patch's changes, mapval->some_kptr is now: PTR_TO_BTF_ID \| MEM_ALLOC \| NON_OWN_REF \| MEM_RCU which results in it passing the non-owning ref test, and the motivating example passing verification. Future work will likely get rid of special non-owning ref lifetime logic in the verifier, at which point we'll be able to delete the NON_OWN_REF flag entirely. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20231107085639.3016113-6-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	veristat: add ability to sort by stat's absolute value	Andrii Nakryiko
	Add ability to sort results by absolute values of specified stats. This is especially useful to find biggest deviations in comparison mode. When comparing verifier change effect against a large base of BPF object files, it's necessary to see big changes both in positive and negative directions, as both might be a signal for regressions or bugs. The syntax is natural, e.g., adding `-s '\|insns_diff\|'^` will instruct veristat to sort by absolute value of instructions difference in ascending order. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231108051430.1830950-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Move GRAPH_{ROOT,NODE}_MASK macros into btf_field_type enum	Dave Marchevsky
	This refactoring patch removes the unused BPF_GRAPH_NODE_OR_ROOT btf_field_type and moves BPF_GRAPH_{NODE,ROOT} macros into the btf_field_type enum. Further patches in the series will use BPF_GRAPH_NODE, so let's move this useful definition out of btf.c. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20231107085639.3016113-5-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	libbpf: Fix potential uninitialized tail padding with LIBBPF_OPTS_RESET	Yonghong Song
	Martin reported that there is a libbpf complaining of non-zero-value tail padding with LIBBPF_OPTS_RESET macro if struct bpf_netkit_opts is modified to have a 4-byte tail padding. This only happens to clang compiler. The commend line is: ./test_progs -t tc_netkit_multi_links Martin and I did some investigation and found this indeed the case and the following are the investigation details. Clang: clang version 18.0.0 <I tried clang15/16/17 and they all have similar results> tools/lib/bpf/libbpf_common.h: #define LIBBPF_OPTS_RESET(NAME, ...) \ do { \ memset(&NAME, 0, sizeof(NAME)); \ NAME = (typeof(NAME)) { \ .sz = sizeof(NAME), \ __VA_ARGS__ \ }; \ } while (0) #endif tools/lib/bpf/libbpf.h: struct bpf_netkit_opts { /* size of this struct, for forward/backward compatibility / size_t sz; __u32 flags; __u32 relative_fd; __u32 relative_id; __u64 expected_revision; size_t :0; }; #define bpf_netkit_opts__last_field expected_revision In the above struct bpf_netkit_opts, there is no tail padding. prog_tests/tc_netkit.c: static void serial_test_tc_netkit_multi_links_target(int mode, int target) { ... LIBBPF_OPTS(bpf_netkit_opts, optl); ... LIBBPF_OPTS_RESET(optl, .flags = BPF_F_BEFORE, .relative_fd = bpf_program__fd(skel->progs.tc1), ); ... } Let us make the following source change, note that we have a 4-byte tailing padding now. diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 6cd9c501624f..0dd83910ae9a 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -803,13 +803,13 @@ bpf_program__attach_tcx(const struct bpf_program prog, int ifindex, struct bpf_netkit_opts { /* size of this struct, for forward/backward compatibility */ size_t sz; - __u32 flags; __u32 relative_fd; __u32 relative_id; __u64 expected_revision; + __u32 flags; size_t :0; }; -#define bpf_netkit_opts__last_field expected_revision +#define bpf_netkit_opts__last_field flags The clang 18 generated asm code looks like below: ; LIBBPF_OPTS_RESET(optl, 55e3: 48 8d 7d 98 leaq -0x68(%rbp), %rdi 55e7: 31 f6 xorl %esi, %esi 55e9: ba 20 00 00 00 movl $0x20, %edx 55ee: e8 00 00 00 00 callq 0x55f3 <serial_test_tc_netkit_multi_links_target+0x18d3> 55f3: 48 c7 85 10 fd ff ff 20 00 00 00 movq $0x20, -0x2f0(%rbp) 55fe: 48 8b 85 68 ff ff ff movq -0x98(%rbp), %rax 5605: 48 8b 78 18 movq 0x18(%rax), %rdi 5609: e8 00 00 00 00 callq 0x560e <serial_test_tc_netkit_multi_links_target+0x18ee> 560e: 89 85 18 fd ff ff movl %eax, -0x2e8(%rbp) 5614: c7 85 1c fd ff ff 00 00 00 00 movl $0x0, -0x2e4(%rbp) 561e: 48 c7 85 20 fd ff ff 00 00 00 00 movq $0x0, -0x2e0(%rbp) 5629: c7 85 28 fd ff ff 08 00 00 00 movl $0x8, -0x2d8(%rbp) 5633: 48 8b 85 10 fd ff ff movq -0x2f0(%rbp), %rax 563a: 48 89 45 98 movq %rax, -0x68(%rbp) 563e: 48 8b 85 18 fd ff ff movq -0x2e8(%rbp), %rax 5645: 48 89 45 a0 movq %rax, -0x60(%rbp) 5649: 48 8b 85 20 fd ff ff movq -0x2e0(%rbp), %rax 5650: 48 89 45 a8 movq %rax, -0x58(%rbp) 5654: 48 8b 85 28 fd ff ff movq -0x2d8(%rbp), %rax 565b: 48 89 45 b0 movq %rax, -0x50(%rbp) ; link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl); At -O0 level, the clang compiler creates an intermediate copy. We have below to store 'flags' with 4-byte store and leave another 4 byte in the same 8-byte-aligned storage undefined, 5629: c7 85 28 fd ff ff 08 00 00 00 movl $0x8, -0x2d8(%rbp) and later we store 8-byte to the original zero'ed buffer 5654: 48 8b 85 28 fd ff ff movq -0x2d8(%rbp), %rax 565b: 48 89 45 b0 movq %rax, -0x50(%rbp) This caused a problem as the 4-byte value at [%rbp-0x2dc, %rbp-0x2e0) may be garbage. gcc (gcc 11.4) does not have this issue as it does zeroing struct first before doing assignments: ; LIBBPF_OPTS_RESET(optl, 50fd: 48 8d 85 40 fc ff ff leaq -0x3c0(%rbp), %rax 5104: ba 20 00 00 00 movl $0x20, %edx 5109: be 00 00 00 00 movl $0x0, %esi 510e: 48 89 c7 movq %rax, %rdi 5111: e8 00 00 00 00 callq 0x5116 <serial_test_tc_netkit_multi_links_target+0x1522> 5116: 48 8b 45 f0 movq -0x10(%rbp), %rax 511a: 48 8b 40 18 movq 0x18(%rax), %rax 511e: 48 89 c7 movq %rax, %rdi 5121: e8 00 00 00 00 callq 0x5126 <serial_test_tc_netkit_multi_links_target+0x1532> 5126: 48 c7 85 40 fc ff ff 00 00 00 00 movq $0x0, -0x3c0(%rbp) 5131: 48 c7 85 48 fc ff ff 00 00 00 00 movq $0x0, -0x3b8(%rbp) 513c: 48 c7 85 50 fc ff ff 00 00 00 00 movq $0x0, -0x3b0(%rbp) 5147: 48 c7 85 58 fc ff ff 00 00 00 00 movq $0x0, -0x3a8(%rbp) 5152: 48 c7 85 40 fc ff ff 20 00 00 00 movq $0x20, -0x3c0(%rbp) 515d: 89 85 48 fc ff ff movl %eax, -0x3b8(%rbp) 5163: c7 85 58 fc ff ff 08 00 00 00 movl $0x8, -0x3a8(%rbp) ; link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl); It is not clear how to resolve the compiler code generation as the compiler generates correct code w.r.t. how to handle unnamed padding in C standard. So this patch changed LIBBPF_OPTS_RESET macro to avoid uninitialized tail padding. We already knows LIBBPF_OPTS macro works on both gcc and clang, even with tail padding. So LIBBPF_OPTS_RESET is changed to be a LIBBPF_OPTS followed by a memcpy(), thus avoiding uninitialized tail padding. The below is asm code generated with this patch and with clang compiler: ; LIBBPF_OPTS_RESET(optl, 55e3: 48 8d bd 10 fd ff ff leaq -0x2f0(%rbp), %rdi 55ea: 31 f6 xorl %esi, %esi 55ec: ba 20 00 00 00 movl $0x20, %edx 55f1: e8 00 00 00 00 callq 0x55f6 <serial_test_tc_netkit_multi_links_target+0x18d6> 55f6: 48 c7 85 10 fd ff ff 20 00 00 00 movq $0x20, -0x2f0(%rbp) 5601: 48 8b 85 68 ff ff ff movq -0x98(%rbp), %rax 5608: 48 8b 78 18 movq 0x18(%rax), %rdi 560c: e8 00 00 00 00 callq 0x5611 <serial_test_tc_netkit_multi_links_target+0x18f1> 5611: 89 85 18 fd ff ff movl %eax, -0x2e8(%rbp) 5617: c7 85 1c fd ff ff 00 00 00 00 movl $0x0, -0x2e4(%rbp) 5621: 48 c7 85 20 fd ff ff 00 00 00 00 movq $0x0, -0x2e0(%rbp) 562c: c7 85 28 fd ff ff 08 00 00 00 movl $0x8, -0x2d8(%rbp) 5636: 48 8b 85 10 fd ff ff movq -0x2f0(%rbp), %rax 563d: 48 89 45 98 movq %rax, -0x68(%rbp) 5641: 48 8b 85 18 fd ff ff movq -0x2e8(%rbp), %rax 5648: 48 89 45 a0 movq %rax, -0x60(%rbp) 564c: 48 8b 85 20 fd ff ff movq -0x2e0(%rbp), %rax 5653: 48 89 45 a8 movq %rax, -0x58(%rbp) 5657: 48 8b 85 28 fd ff ff movq -0x2d8(%rbp), %rax 565e: 48 89 45 b0 movq %rax, -0x50(%rbp) ; link = bpf_program__attach_netkit(skel->progs.tc2, ifindex, &optl); In the above code, a temporary buffer is zeroed and then has proper value assigned. Finally, values in temporary buffer are copied to the original variable buffer, hence tail padding is guaranteed to be 0. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/bpf/20231107201511.2548645-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Use bpf_mem_free_rcu when bpf_obj_dropping non-refcounted nodes	Dave Marchevsky
	The use of bpf_mem_free_rcu to free refcounted local kptrs was added in commit 7e26cd12ad1c ("bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes"). In the cover letter for the series containing that patch [0] I commented: Perhaps it makes sense to move to mem_free_rcu for _all_ non-owning refs in the future, not just refcounted. This might allow custom non-owning ref lifetime + invalidation logic to be entirely subsumed by MEM_RCU handling. IMO this needs a bit more thought and should be tackled outside of a fix series, so it's not attempted here. It's time to start moving in the "non-owning refs have MEM_RCU lifetime" direction. As mentioned in that comment, using bpf_mem_free_rcu for all local kptrs - not just refcounted - is necessarily the first step towards that goal. This patch does so. After this patch the memory pointed to by all local kptrs will not be reused until RCU grace period elapses. The verifier's understanding of non-owning ref validity and the clobbering logic it uses to enforce that understanding are not changed here, that'll happen gradually in future work, including further patches in the series. [0]: https://lore.kernel.org/all/20230821193311.3290257-1-davemarchevsky@fb.com/ Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20231107085639.3016113-4-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Add test passing MAYBE_NULL reg to bpf_refcount_acquire	Dave Marchevsky
	The test added in this patch exercises the logic fixed in the previous patch in this series. Before the previous patch's changes, bpf_refcount_acquire accepts MAYBE_NULL local kptrs; after the change the verifier correctly rejects the such a call. Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Link: https://lore.kernel.org/r/20231107085639.3016113-3-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Add KF_RCU flag to bpf_refcount_acquire_impl	Dave Marchevsky
	Refcounted local kptrs are kptrs to user-defined types with a bpf_refcount field. Recent commits ([0], [1]) modified the lifetime of refcounted local kptrs such that the underlying memory is not reused until RCU grace period has elapsed. Separately, verification of bpf_refcount_acquire calls currently succeeds for MAYBE_NULL non-owning reference input, which is a problem as bpf_refcount_acquire_impl has no handling for this case. This patch takes advantage of aforementioned lifetime changes to tag bpf_refcount_acquire_impl kfunc KF_RCU, thereby preventing MAYBE_NULL input to the kfunc. The KF_RCU flag applies to all kfunc params; it's fine for it to apply to the void *meta__ign param as that's populated by the verifier and is tagged __ign regardless. [0]: commit 7e26cd12ad1c ("bpf: Use bpf_mem_free_rcu when bpf_obj_dropping refcounted nodes") is the actual change to allocation behaivor [1]: commit 0816b8c6bf7f ("bpf: Consider non-owning refs to refcounted nodes RCU protected") modified verifier understanding of refcounted local kptrs to match [0]'s changes Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com> Fixes: 7c50b1cb76ac ("bpf: Add bpf_refcount_acquire kfunc") Link: https://lore.kernel.org/r/20231107085639.3016113-2-davemarchevsky@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	Merge branch 'bpf: __bpf_dynptr_data* and __str annotation'	Andrii Nakryiko
	Song Liu says: ==================== This set contains the first 3 patches of set [1]. Currently, [1] is waiting for [3] to be merged to bpf-next tree. So send these 3 patches first to unblock other works, such as [2]. This set is verified with new version of [1] in CI run [4]. Changes since v12 of [1]: 1. Reuse bpf_dynptr_slice() in __bpf_dynptr_data(). (Andrii) 2. Add Acked-by from Vadim Fedorenko. [1] https://lore.kernel.org/bpf/20231104001313.3538201-1-song@kernel.org/ [2] https://lore.kernel.org/bpf/20231031134900.1432945-1-vadfed@meta.com/ [3] https://lore.kernel.org/bpf/20231031215625.2343848-1-davemarchevsky@fb.com/ [4] https://github.com/kernel-patches/bpf/actions/runs/6779945063/job/18427926114 ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf, lpm: Fix check prefixlen before walking trie	Florian Lehner
	When looking up an element in LPM trie, the condition 'matchlen == trie->max_prefixlen' will never return true, if key->prefixlen is larger than trie->max_prefixlen. Consequently all elements in the LPM trie will be visited and no element is returned in the end. To resolve this, check key->prefixlen first before walking the LPM trie. Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation") Signed-off-by: Florian Lehner <dev@der-flo.net> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231105085801.3742-1-dev@der-flo.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Introduce KF_ARG_PTR_TO_CONST_STR	Song Liu
	Similar to ARG_PTR_TO_CONST_STR for BPF helpers, KF_ARG_PTR_TO_CONST_STR specifies kfunc args that point to const strings. Annotation "__str" is used to specify kfunc arg of type KF_ARG_PTR_TO_CONST_STR. Also, add documentation for the "__str" annotation. bpf_get_file_xattr() will be the first kfunc that uses this type. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/bpf/20231107045725.2278852-4-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Disable CONFIG_DEBUG_INFO_REDUCED in config.aarch64	Anders Roxell
	Building an arm64 kernel and seftests/bpf with defconfig + selftests/bpf/config and selftests/bpf/config.aarch64 the fragment CONFIG_DEBUG_INFO_REDUCED is enabled in arm64's defconfig, it should be disabled in file sefltests/bpf/config.aarch64 since if its not disabled CONFIG_DEBUG_INFO_BTF wont be enabled. Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231103220912.333930-1-anders.roxell@linaro.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Factor out helper check_reg_const_str()	Song Liu
	ARG_PTR_TO_CONST_STR is used to specify constant string args for BPF helpers. The logic that verifies a reg is ARG_PTR_TO_CONST_STR is implemented in check_func_arg(). As we introduce kfuncs with constant string args, it is necessary to do the same check for kfuncs (in check_kfunc_args). Factor out the logic for ARG_PTR_TO_CONST_STR to a new check_reg_const_str() so that it can be reused. check_func_arg() ensures check_reg_const_str() is only called with reg of type PTR_TO_MAP_VALUE. Add a redundent type check in check_reg_const_str() to avoid misuse in the future. Other than this redundent check, there is no change in behavior. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/bpf/20231107045725.2278852-3-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpftool: Fix prog object type in manpage	Artem Savkov
	bpftool's man page lists "program" as one of possible values for OBJECT, while in fact bpftool accepts "prog" instead. Reported-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Artem Savkov <asavkov@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Quentin Monnet <quentin@isovalent.com> Link: https://lore.kernel.org/bpf/20231103081126.170034-1-asavkov@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: Add __bpf_dynptr_data* for in kernel use	Song Liu
	Different types of bpf dynptr have different internal data storage. Specifically, SKB and XDP type of dynptr may have non-continuous data. Therefore, it is not always safe to directly access dynptr->data. Add __bpf_dynptr_data and __bpf_dynptr_data_rw to replace direct access to dynptr->data. Update bpf_verify_pkcs7_signature to use __bpf_dynptr_data instead of dynptr->data. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://lore.kernel.org/bpf/20231107045725.2278852-2-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Consolidate VIRTIO/9P configs in config.vm file	Manu Bretelle
	Those configs are needed to be able to run VM somewhat consistently. For instance, ATM, s390x is missing the `CONFIG_VIRTIO_CONSOLE` which prevents s390x kernels built in CI to leverage qemu-guest-agent. By moving them to `config,vm`, we should have selftest kernels which are equal in term of VM functionalities when they include this file. The set of config unabled were picked using grep -h -E '(_9P\|_VIRTIO)' config.x86_64 config \| sort \| uniq added to `config.vm` and then grep -vE '(_9P\|_VIRTIO)' config.{x86_64,aarch64,s390x} as a side-effect, some config may have disappeared to the aarch64 and s390x kernels, but they should not be needed. CI will tell. Signed-off-by: Manu Bretelle <chantr4@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231031212717.4037892-1-chantr4@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	Merge branch 'selftests/bpf: Fixes for map_percpu_stats test'	Andrii Nakryiko
	Hou Tao says: ==================== From: Hou Tao <houtao1@huawei.com> Hi, BPF CI failed due to map_percpu_stats_percpu_hash from time to time [1]. It seems that the failure reason is per-cpu bpf memory allocator may not be able to allocate per-cpu pointer successfully and it can not refill free llist timely, and bpf_map_update_elem() will return -ENOMEM. Patch #1 fixes the size of value passed to per-cpu map update API. The problem was found when fixing the ENOMEM problem, so also post it in this patchset. Patch #2 & #3 mitigates the ENOMEM problem by retrying the update operation for non-preallocated per-cpu map. Please see individual patches for more details. And comments are always welcome. Regards, Tao [1]: https://github.com/kernel-patches/bpf/actions/runs/6713177520/job/18244865326?pr=5909 ==================== Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	Merge branch 'bpf-register-bounds-logic-and-testing-improvements'	Alexei Starovoitov
	Andrii Nakryiko says: ==================== BPF register bounds logic and testing improvements This patch set adds a big set of manual and auto-generated test cases validating BPF verifier's register bounds tracking and deduction logic. See details in the last patch. We start with building a tester that validates existing <range> vs <scalar> verifier logic for range bounds. To make all this work, BPF verifier's logic needed a bunch of improvements to handle some cases that previously were not covered. This had no implications as to correctness of verifier logic, but it was incomplete enough to cause significant disagreements with alternative implementation of register bounds logic that tests in this patch set implement. So we need BPF verifier logic improvements to make all the tests pass. This is what we do in patches #3 through #9. The end goal of this work, though, is to extend BPF verifier range state tracking such as to allow to derive new range bounds when comparing non-const registers. There is some more investigative work required to investigate and fix existing potential issues with range tracking as part of ALU/ALU64 operations, so <range> x <range> part of v5 patch set ([0]) is dropped until these issues are sorted out. For now, we include preparatory refactorings and clean ups, that set up BPF verifier code base to extend the logic to <range> vs <range> logic in subsequent patch set. Patches #10-#16 perform preliminary refactorings without functionally changing anything. But they do clean up check_cond_jmp_op() logic and generalize a bunch of other pieces in is_branch_taken() logic. [0] https://patchwork.kernel.org/project/netdevbpf/list/?series=797178&state=* v5->v6: - dropped <range> vs <range> patches (original patches #18 through #23) to add more register range sanity checks and fix preexisting issues; - comments improvements, addressing other feedback on first 17 patches (Eduard, Alexei); v4->v5: - added entirety of verifier reg bounds tracking changes, now handling <range> vs <range> cases (Alexei); - added way more comments trying to explain why deductions added are correct, hopefully they are useful and clarify things a bit (Daniel, Shung-Hsi); - added two preliminary selftests fixes necessary for RELEASE=1 build to work again, it keeps breaking. v3->v4: - improvements to reg_bounds tester (progress report, split 32-bit and 64-bit ranges, fix various verbosity output issues, etc); v2->v3: - fix a subtle little-endianness assumption inside parge_reg_state() (CI); v1->v2: - fix compilation when building selftests with llvm-16 toolchain (CI). ==================== Link: https://lore.kernel.org/r/20231102033759.2541186-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftsets/bpf: Retry map update for non-preallocated per-cpu map	Hou Tao
	BPF CI failed due to map_percpu_stats_percpu_hash from time to time [1]. It seems that the failure reason is per-cpu bpf memory allocator may not be able to allocate per-cpu pointer successfully and it can not refill free llist timely, and bpf_map_update_elem() will return -ENOMEM. So mitigate the problem by retrying the update operation for non-preallocated per-cpu map. [1]: https://github.com/kernel-patches/bpf/actions/runs/6713177520/job/18244865326?pr=5909 Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231101032455.3808547-4-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: generalize reg_set_min_max() to handle two sets of two registers	Andrii Nakryiko
	Change reg_set_min_max() to take FALSE/TRUE sets of two registers each, instead of assuming that we are always comparing to a constant. For now we still assume that right-hand side registers are constants (and make sure that's the case by swapping src/dst regs, if necessary), but subsequent patches will remove this limitation. reg_set_min_max() is now called unconditionally for any register comparison, so that might include pointer vs pointer. This makes it consistent with is_branch_taken() generality. But we currently only support adjustments based on SCALAR vs SCALAR comparisons, so reg_set_min_max() has to guard itself againts pointers. Taking two by two registers allows to further unify and simplify check_cond_jmp_op() logic. We utilize fake register for BPF_K conditional jump case, just like with is_branch_taken() part. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-18-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Export map_update_retriable()	Hou Tao
	Export map_update_retriable() to make it usable for other map_test cases. These cases may only need retry for specific errno, so add a new callback parameter to let map_update_retriable() decide whether or not the errno is retriable. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231101032455.3808547-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: prepare reg_set_min_max for second set of registers	Andrii Nakryiko
	Similarly to is_branch_taken()-related refactorings, start preparing reg_set_min_max() to handle more generic case of two non-const registers. Start with renaming arguments to accommodate later addition of second register as an input argument. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-17-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Use value with enough-size when updating per-cpu map	Hou Tao
	When updating per-cpu map in map_percpu_stats test, patch_map_thread() only passes 4-bytes-sized value to bpf_map_update_elem(). The expected size of the value is 8 * num_possible_cpus(), so fix it by passing a value with enough-size for per-cpu map update. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20231101032455.3808547-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: unify 32-bit and 64-bit is_branch_taken logic	Andrii Nakryiko
	Combine 32-bit and 64-bit is_branch_taken logic for SCALAR_VALUE registers. It makes it easier to see parallels between two domains (32-bit and 64-bit), and makes subsequent refactoring more straightforward. No functional changes. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-16-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: generalize is_branch_taken to handle all conditional jumps in one place	Andrii Nakryiko
	Make is_branch_taken() a single entry point for branch pruning decision making, handling both pointer vs pointer, pointer vs scalar, and scalar vs scalar cases in one place. This also nicely cleans up check_cond_jmp_op(). Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-15-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: move is_branch_taken() down	Andrii Nakryiko
	Move is_branch_taken() slightly down. In subsequent patched we'll need both flip_opcode() and is_pkt_ptr_branch_taken() for is_branch_taken(), but instead of sprinkling forward declarations around, it makes more sense to move is_branch_taken() lower below is_pkt_ptr_branch_taken(), and also keep it closer to very tightly related reg_set_min_max(), as they are two critical parts of the same SCALAR range tracking logic. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-14-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: generalize is_branch_taken() to work with two registers	Andrii Nakryiko
	While still assuming that second register is a constant, generalize is_branch_taken-related code to accept two registers instead of register plus explicit constant value. This also, as a side effect, allows to simplify check_cond_jmp_op() by unifying BPF_K case with BPF_X case, for which we use a fake register to represent BPF_K's imm constant as a register. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-13-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: rename is_branch_taken reg arguments to prepare for the second one	Andrii Nakryiko
	Just taking mundane refactoring bits out into a separate patch. No functional changes. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-12-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: drop knowledge-losing __reg_combine_{32,64}_into_{64,32} logic	Andrii Nakryiko
	When performing 32-bit conditional operation operating on lower 32 bits of a full 64-bit register, register full value isn't changed. We just potentially gain new knowledge about that register's lower 32 bits. Unfortunately, __reg_combine_{32,64}_into_{64,32} logic that reg_set_min_max() performs as a last step, can lose information in some cases due to __mark_reg64_unbounded() and __reg_assign_32_into_64(). That's bad and completely unnecessary. Especially __reg_assign_32_into_64() looks completely out of place here, because we are not performing zero-extending subregister assignment during conditional jump. So this patch replaced __reg_combine_* with just a normal reg_bounds_sync() which will do a proper job of deriving u64/s64 bounds from u32/s32, and vice versa (among all other combinations). __reg_combine_64_into_32() is also used in one more place, coerce_reg_to_size(), while handling 1- and 2-byte register loads. Looking into this, it seems like besides marking subregister as unbounded before performing reg_bounds_sync(), we were also performing deduction of smin32/smax32 and umin32/umax32 bounds from respective smin/smax and umin/umax bounds. It's now redundant as reg_bounds_sync() performs all the same logic more generically (e.g., without unnecessary assumption that upper 32 bits of full register should be zero). Long story short, we remove __reg_combine_64_into_32() completely, and coerce_reg_to_size() now only does resetting subreg to unbounded and then performing reg_bounds_sync() to recover as much information as possible from 64-bit umin/umax and smin/smax bounds, set explicitly in coerce_reg_to_size() earlier. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-10-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: try harder to deduce register bounds from different numeric domains	Andrii Nakryiko
	There are cases (caught by subsequent reg_bounds tests in selftests/bpf) where performing one round of __reg_deduce_bounds() doesn't propagate all the information from, say, s32 to u32 bounds and than from newly learned u32 bounds back to u64 and s64. So perform __reg_deduce_bounds() twice to make sure such derivations are propagated fully after reg_bounds_sync(). One such example is test `(s64)[0xffffffff00000001; 0] (u64)< 0xffffffff00000000` from selftest patch from this patch set. It demonstrates an intricate dance of u64 -> s64 -> u64 -> u32 bounds adjustments, which requires two rounds of __reg_deduce_bounds(). Here are corresponding refinement log from selftest, showing evolution of knowledge. REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (u64)DST_OLD=[0; U64_MAX] (u64)DST_NEW=[0xffffffff00000000; U64_MAX] REFINING (FALSE R1) (u64)SRC=[0xffffffff00000000; U64_MAX] (s64)DST_OLD=[0xffffffff00000001; 0] (s64)DST_NEW=[0xffffffff00000001; -1] REFINING (FALSE R1) (s64)SRC=[0xffffffff00000001; -1] (u64)DST_OLD=[0xffffffff00000000; U64_MAX] (u64)DST_NEW=[0xffffffff00000001; U64_MAX] REFINING (FALSE R1) (u64)SRC=[0xffffffff00000001; U64_MAX] (u32)DST_OLD=[0; U32_MAX] (u32)DST_NEW=[1; U32_MAX] R1 initially has smin/smax set to [0xffffffff00000001; -1], while umin/umax is unknown. After (u64)< comparison, in FALSE branch we gain knowledge that umin/umax is [0xffffffff00000000; U64_MAX]. That causes smin/smax to learn that zero can't happen and upper bound is -1. Then smin/smax is adjusted from umin/umax improving lower bound from 0xffffffff00000000 to 0xffffffff00000001. And then eventually umin32/umax32 bounds are drived from umin/umax and become [1; U32_MAX]. Selftest in the last patch is actually implementing a multi-round fixed-point convergence logic, but so far all the tests are handled by two rounds of reg_bounds_sync() on the verifier state, so we keep it simple for now. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-9-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: improve deduction of 64-bit bounds from 32-bit bounds	Andrii Nakryiko
	Add a few interesting cases in which we can tighten 64-bit bounds based on newly learnt information about 32-bit bounds. E.g., when full u64/s64 registers are used in BPF program, and then eventually compared as u32/s32. The latter comparison doesn't change the value of full register, but it does impose new restrictions on possible lower 32 bits of such full registers. And we can use that to derive additional full register bounds information. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Link: https://lore.kernel.org/r/20231102033759.2541186-8-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: add special smin32/smax32 derivation from 64-bit bounds	Andrii Nakryiko
	Add a special case where we can derive valid s32 bounds from umin/umax or smin/smax by stitching together negative s32 subrange and non-negative s32 subrange. That requires upper 32 bits to form a [N, N+1] range in u32 domain (taking into account wrap around, so 0xffffffff to 0x00000000 is a valid [N, N+1] range in this sense). See code comment for concrete examples. Eduard Zingerman also provided an alternative explanation ([0]) for more mathematically inclined readers: Suppose: . there are numbers a, b, c . 231 <= b < 232 . 0 <= c < 231 . umin = 232 * a + b . umax = 2*32 (a + 1) + c The number of values in the range represented by [umin; umax] is: . N = umax - umin + 1 = 232 + c - b + 1 . min(N) = 232 + 0 - (232-1) + 1 = 2, with b = 232-1, c = 0 . max(N) = 232 + (231 - 1) - 231 + 1 = 232, with b = 231, c = 231-1 Hence [(s32)b; (s32)c] forms a valid range. [0] https://lore.kernel.org/bpf/d7af631802f0cfae20df77fe70068702d24bbd31.camel@gmail.com/ Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-7-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: derive subreg bounds from full bounds when upper 32 bits are constant	Andrii Nakryiko
	Comments in code try to explain the idea behind why this is correct. Please check the code and comments. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: derive smin32/smax32 from umin32/umax32 bounds	Andrii Nakryiko
	All the logic that applies to u64 vs s64, equally applies for u32 vs s32 relationships (just taken in a smaller 32-bit numeric space). So do the same deduction of smin32/smax32 from umin32/umax32, if we can. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	bpf: derive smin/smax from umin/max bounds	Andrii Nakryiko
	Add smin/smax derivation from appropriate umin/umax values. Previously the logic was surprisingly asymmetric, trying to derive umin/umax from smin/smax (if possible), but not trying to do the same in the other direction. A simple addition to __reg64_deduce_bounds() fixes this. Added also generic comment about u64/s64 ranges and their relationship. Hopefully that helps readers to understand all the bounds deductions a bit better. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Add malloc failure checks in bpf_iter	Yuran Pereira
	Since some malloc calls in bpf_iter may at times fail, this patch adds the appropriate fail checks, and ensures that any previously allocated resource is appropriately destroyed before returning the function. Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/DB3PR10MB6835F0ECA792265FA41FC39BE8A3A@DB3PR10MB6835.EURPRD10.PROD.OUTLOOK.COM Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: satisfy compiler by having explicit return in btf test	Andrii Nakryiko
	Some compilers complain about get_pprint_mapv_size() not returning value in some code paths. Fix with explicit return. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: Convert CHECK macros to ASSERT_* macros in bpf_iter	Yuran Pereira
	As it was pointed out by Yonghong Song [1], in the bpf selftests the use of the ASSERT_* series of macros is preferred over the CHECK macro. This patch replaces all CHECK calls in bpf_iter with the appropriate ASSERT_* macros. [1] https://lore.kernel.org/lkml/0a142924-633c-44e6-9a92-2dc019656bf2@linux.dev Suggested-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Yuran Pereira <yuran.pereira@hotmail.com> Acked-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Kui-Feng Lee <thinker.li@gmail.com> Link: https://lore.kernel.org/r/DB3PR10MB6835E9C8DFCA226DD6FEF914E8A3A@DB3PR10MB6835.EURPRD10.PROD.OUTLOOK.COM Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	selftests/bpf: fix RELEASE=1 build for tc_opts	Andrii Nakryiko
	Compiler complains about malloc(). We also don't need to dynamically allocate anything, so make the life easier by using statically sized buffer. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20231102033759.2541186-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2023-11-09	Merge tag 'net-6.7-rc1' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter and bpf. Current release - regressions: - sched: fix SKB_NOT_DROPPED_YET splat under debug config Current release - new code bugs: - tcp: - fix usec timestamps with TCP fastopen - fix possible out-of-bounds reads in tcp_hash_fail() - fix SYN option room calculation for TCP-AO - tcp_sigpool: fix some off by one bugs - bpf: fix compilation error without CGROUPS - ptp: - ptp_read() should not release queue - fix tsevqs corruption Previous releases - regressions: - llc: verify mac len before reading mac header Previous releases - always broken: - bpf: - fix check_stack_write_fixed_off() to correctly spill imm - fix precision tracking for BPF_ALU \| BPF_TO_BE \| BPF_END - check map->usercnt after timer->timer is assigned - dsa: lan9303: consequently nested-lock physical MDIO - dccp/tcp: call security_inet_conn_request() after setting IP addr - tg3: fix the TX ring stall due to incorrect full ring handling - phylink: initialize carrier state at creation - ice: fix direction of VF rules in switchdev mode Misc: - fill in a bunch of missing MODULE_DESCRIPTION()s, more to come" * tag 'net-6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) net: ti: icss-iep: fix setting counter value ptp: fix corrupted list in ptp_open ptp: ptp_read should not release queue net_sched: sch_fq: better validate TCA_FQ_WEIGHTS and TCA_FQ_PRIOMAP net: kcm: fill in MODULE_DESCRIPTION() net/sched: act_ct: Always fill offloading tuple iifidx netfilter: nat: fix ipv6 nat redirect with mapped and scoped addresses netfilter: xt_recent: fix (increase) ipv6 literal buffer length ipvs: add missing module descriptions netfilter: nf_tables: remove catchall element in GC sync path netfilter: add missing module descriptions drivers/net/ppp: use standard array-copy-function net: enetc: shorten enetc_setup_xdp_prog() error message to fit NETLINK_MAX_FMTMSG_LEN virtio/vsock: Fix uninit-value in virtio_transport_recv_pkt() r8169: respect userspace disabling IFF_MULTICAST selftests/bpf: get trusted cgrp from bpf_iter__cgroup directly bpf: Let verifier consider {task,cgroup} is trusted in bpf_iter_reg net: phylink: initialize carrier state at creation test/vsock: add dobule bind connect test test/vsock: refactor vsock_accept ...
2023-11-09	Merge tag 'v6.7-p2' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "This fixes a regression in ahash and hides the Kconfig sub-options for the jitter RNG" * tag 'v6.7-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: ahash - Set using_shash for cloned ahash wrapper over shash crypto: jitterentropy - Hide esoteric Kconfig options under FIPS and EXPERT