summaryrefslogtreecommitdiff
path: root/tools
AgeCommit message (Collapse)Author
2025-06-16tools headers UAPI: Sync KVM's vmx.h header with the kernel sourcesArnaldo Carvalho de Melo
To pick the changes in: 6c441e4d6e729616 ("KVM: TDX: Handle EXIT_REASON_OTHER_SMI") c42856af8f70d983 ("KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)") That makes 'perf kvm-stat' aware of this new TDCALL exit reason, thus addressing the following perf build warning: Warning: Kernel ABI header differences: diff -u tools/arch/x86/include/uapi/asm/vmx.h arch/x86/include/uapi/asm/vmx.h Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Isaku Yamahata <isaku.yamahata@intel.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/aErcVn_4plQyODR1@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-16tools kvm headers arm64: Update KVM header from the kernel sourcesArnaldo Carvalho de Melo
To pick the changes from: b7628c7973765c85 ("KVM: arm64: Allow userspace to limit the number of PMU counters for EL2 VMs") That doesn't result in any changes in tooling (built on a Raspberry PI 5 running Debian GNU/Linux 12 (bookworm)), only addresses this perf build warning: Warning: Kernel ABI header differences: diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/ Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-16tools headers UAPI: Sync linux/prctl.h with the kernel sources to pick FUTEX ↵Arnaldo Carvalho de Melo
knob To pick the changes in: 63e8595c060a1fef ("futex: Allow to make the private hash immutable") 80367ad01d93ac78 ("futex: Add basic infrastructure for local task local hash") That adds a FUTEX knob: $ tools/perf/trace/beauty/prctl_option.sh > before $ cp include/uapi/linux/prctl.h tools/perf/trace/beauty/include/uapi/linux/prctl.h $ tools/perf/trace/beauty/prctl_option.sh > after $ diff -u before after --- before 2025-06-09 14:50:45.162579336 -0300 +++ after 2025-06-09 14:50:52.797660024 -0300 @@ -72,6 +72,7 @@ [75] = "SET_SHADOW_STACK_STATUS", [76] = "LOCK_SHADOW_STACK_STATUS", [77] = "TIMER_CREATE_RESTORE_IDS", + [78] = "FUTEX_HASH", }; static const char *prctl_set_mm_options[] = { [1] = "START_CODE", $ That now will be used to decode the syscall option and also to compose filters, for instance: [root@five ~]# perf trace -e syscalls:sys_enter_prctl --filter option==SET_NAME 0.000 Isolated Servi/3474327 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23f13b7aee) 0.032 DOM Worker/3474327 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23deb25670) 7.920 :3474328/3474328 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24fbb10) 7.935 StreamT~s #374/3474328 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24fb970) 8.400 Isolated Servi/3474329 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24bab10) 8.418 StreamT~s #374/3474329 syscalls:sys_enter_prctl(option: SET_NAME, arg2: 0x7f23e24ba970) ^C[root@five ~]# This addresses this perf build warning: Warning: Kernel ABI header differences: diff -u tools/perf/trace/beauty/include/uapi/linux/prctl.h include/uapi/linux/prctl.h Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/aEiYOtKkrVDT03hZ@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-16perf mem: Document new output fields (op, cache, mem, dtlb, snoop)Namhyung Kim
Update the documentation of the new fields with examples and caveats. Also update the related documentation for AMD IBS. Reviewed-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250610005742.2173050-1-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-16tools headers: Update the fs headers with the kernel sourcesArnaldo Carvalho de Melo
To pick up changes from: 5d894321c49e6137 ("fs: add atomic write unit max opt to statx") a516403787e08119 ("fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions") c07d3aede2b26830 ("fscrypt: add support for hardware-wrapped keys") These are used to beautify fs syscall arguments, albeit the changes in this update are not affecting those beautifiers. This addresses these tools/ build warnings: Warning: Kernel ABI header differences: diff -u tools/include/uapi/linux/fscrypt.h include/uapi/linux/fscrypt.h diff -u tools/include/uapi/linux/stat.h include/uapi/linux/stat.h diff -u tools/perf/trace/beauty/include/uapi/linux/fs.h include/uapi/linux/fs.h diff -u tools/perf/trace/beauty/include/uapi/linux/stat.h include/uapi/linux/stat.h Please see tools/include/uapi/README for details (it's in the first patch of this series). Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Eric Biggers <ebiggers@google.com> Cc: Ian Rogers <irogers@google.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Garry <john.g.garry@oracle.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Liam R. Howlett <liam.howlett@oracle.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/aEce1keWdO-vGeqe@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-16perf test: Restrict uniquifying test to machines with 'uncore_imc'Chun-Tse Shao
The test would fail if target machine does not have 'uncore_imc' devices. Since event uniquifying behavior is similar among different architectures, we are restricting the test to only run on machines with `uncore_imc` devices. Suggested-by: Ian Rogers <irogers@google.com> Signed-off-by: Chun-Tse Shao <ctshao@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250521224513.1104129-1-ctshao@google.com [ Skip the test, i.e. return 2, instead of returning 0 as if the test had succeed ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-06-16selftests/coredump: make sure invalid paths are rejectedChristian Brauner
Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-8-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-14Merge tag 'mm-hotfixes-stable-2025-06-13-21-56' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "9 hotfixes. 3 are cc:stable and the remainder address post-6.15 issues or aren't considered necessary for -stable kernels. Only 4 are for MM" * tag 'mm-hotfixes-stable-2025-06-13-21-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: add mmap_prepare() compatibility layer for nested file systems init: fix build warnings about export.h MAINTAINERS: add Barry as a THP reviewer drivers/rapidio/rio_cm.c: prevent possible heap overwrite mm: close theoretical race where stale TLB entries could linger mm/vma: reset VMA iterator on commit_merge() OOM failure docs: proc: update VmFlags documentation in smaps scatterlist: fix extraneous '@'-sign kernel-doc notation selftests/mm: skip failed memfd setups in gup_longterm
2025-06-13Merge tag 'pm-6.16-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix the cpupower utility installation, fix up the recently added Rust abstractions for cpufreq and OPP, restore the x86 update eliminating mwait_play_dead_cpuid_hint() that has been reverted during the 6.16 merge window along with preventing the failure caused by it from happening, and clean up mwait_idle_with_hints() usage in intel_idle: - Implement CpuId Rust abstraction and use it to fix doctest failure related to the recently introduced cpumask abstraction (Viresh Kumar) - Do minor cleanups in the `# Safety` sections for cpufreq abstractions added recently (Viresh Kumar) - Unbreak cpupower systemd service units installation on some systems by adding a unitdir variable for specifying the location to install them (Francesco Poli) - Eliminate mwait_play_dead_cpuid_hint() again after reverting its elimination during the 6.16 merge window due to a problem with handling "dead" SMT siblings, but this time prevent leaving them in C1 after initialization by taking them online and back offline when a proper cpuidle driver for the platform has been registered (Rafael Wysocki) - Update data types of variables passed as arguments to mwait_idle_with_hints() to match the function definition after recent changes (Uros Bizjak)" * tag 'pm-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: rust: cpu: Add CpuId::current() to retrieve current CPU ID rust: Use CpuId in place of raw CPU numbers rust: cpu: Introduce CpuId abstraction intel_idle: Update arguments of mwait_idle_with_hints() cpufreq: Convert `/// SAFETY` lines to `# Safety` sections cpupower: split unitdir from libdir in Makefile Reapply "x86/smp: Eliminate mwait_play_dead_cpuid_hint()" ACPI: processor: Rescan "dead" SMT siblings during initialization intel_idle: Rescan "dead" SMT siblings during initialization x86/smp: PM/hibernate: Split arch_resume_nosmt() intel_idle: Use subsys_initcall_sync() for initialization
2025-06-13selftests/bpf: verify jset handling in CFG computationEduard Zingerman
A test case to check if both branches of jset are explored when computing program CFG. At 'if r1 & 0x7 ...': - register 'r2' is computed alive only if jump branch of jset instruction is followed; - register 'r0' is computed alive only if fallthrough branch of jset instruction is followed. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250613175331.3238739-2-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-13veristat: Memory accounting for bpf programsEduard Zingerman
This commit adds a new field mem_peak / "Peak memory (MiB)" field to a set of gathered statistics. The field is intended as an estimate for peak verifier memory consumption for processing of a given program. Mechanically stat is collected as follows: - At the beginning of handle_verif_mode() a new cgroup is created and veristat process is moved into this cgroup. - At each program load: - bpf_object__load() is split into bpf_object__prepare() and bpf_object__load() to avoid accounting for memory allocated for maps; - before bpf_object__load(): - a write to "memory.peak" file of the new cgroup is used to reset cgroup statistics; - updated value is read from "memory.peak" file and stashed; - after bpf_object__load() "memory.peak" is read again and difference between new and stashed values is used as a metric. If any of the above steps fails veristat proceeds w/o collecting mem_peak information for a program, reporting mem_peak as -1. While memcg provides data in bytes (converted from pages), veristat converts it to megabytes to avoid jitter when comparing results of different executions. The change has no measurable impact on veristat running time. A correlation between "Peak states" and "Peak memory" fields provides a sanity check for gathered statistics, e.g. a sample of data for sched_ext programs: Program Peak states Peak memory (MiB) ------------------------ ----------- ----------------- lavd_select_cpu 2153 44 lavd_enqueue 1982 41 lavd_dispatch 3480 28 layered_dispatch 1417 17 layered_enqueue 760 11 lavd_cpu_offline 349 6 lavd_cpu_online 349 6 lavd_init 394 6 rusty_init 350 5 layered_select_cpu 391 4 ... rusty_stopping 134 1 arena_topology_node_init 170 0 Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250613072147.3938139-3-eddyz87@gmail.com
2025-06-13bpf/veristat: Fix veristat for map type BPF_MAP_TYPE_CGRP_STORAGESong Liu
BPF_MAP_TYPE_CGRP_STORAGE doesn't allow non-zero max_entries. So veristat should not set it to 1. Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250613050001.1058733-1-song@kernel.org
2025-06-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Rework of system register accessors for system registers that are directly writen to memory, so that sanitisation of the in-memory value happens at the correct time (after the read, or before the write). For convenience, RMW-style accessors are also provided. - Multiple fixes for the so-called "arch-timer-edge-cases' selftest, which was always broken. x86: - Make KVM_PRE_FAULT_MEMORY stricter for TDX, allowing userspace to pass only the "untouched" addresses and flipping the shared/private bit in the implementation. - Disable SEV-SNP support on initialization failure * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86/mmu: Reject direct bits in gpa passed to KVM_PRE_FAULT_MEMORY KVM: x86/mmu: Embed direct bits into gpa for KVM_PRE_FAULT_MEMORY KVM: SEV: Disable SEV-SNP support on initialization failure KVM: arm64: selftests: Determine effective counter width in arch_timer_edge_cases KVM: arm64: selftests: Fix xVAL init in arch_timer_edge_cases KVM: arm64: selftests: Fix thread migration in arch_timer_edge_cases KVM: arm64: selftests: Fix help text for arch_timer_edge_cases KVM: arm64: Make __vcpu_sys_reg() a pure rvalue operand KVM: arm64: Don't use __vcpu_sys_reg() to get the address of a sysreg KVM: arm64: Add RMW specific sysreg accessor KVM: arm64: Add assignment-specific sysreg accessor
2025-06-13selftests: Add tests for PR_SYS_DISPATCH_INCLUSIVE_ONDmitry Vyukov
Add tests for PR_SYS_DISPATCH_INCLUSIVE_ON correct/incorrect args, and a test that ensures that the specified range is respected by both PR_SYS_DISPATCH_EXCLUSIVE_ON and PR_SYS_DISPATCH_INCLUSIVE_ON. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/8df4b50b176b073550bab5b3f3faff752c5f8e17.1747839857.git.dvyukov@google.com
2025-06-13syscall_user_dispatch: Add PR_SYS_DISPATCH_INCLUSIVE_ONDmitry Vyukov
There are two possible scenarios for syscall filtering: - having a trusted/allowed range of PCs, and intercepting everything else - or the opposite: a single untrusted/intercepted range and allowing everything else (this is relevant for any kind of sandboxing scenario, or monitoring behavior of a single library) The current API only allows the former use case due to allowed range wrap-around check. Add PR_SYS_DISPATCH_INCLUSIVE_ON that enables the second use case. Add PR_SYS_DISPATCH_EXCLUSIVE_ON alias for PR_SYS_DISPATCH_ON to make it clear how it's different from the new PR_SYS_DISPATCH_INCLUSIVE_ON. Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/97947cc8e205ff49675826d7b0327ef2e2c66eea.1747839857.git.dvyukov@google.com
2025-06-13selftests: Fix errno checking in syscall_user_dispatch testDmitry Vyukov
Successful syscalls don't change errno, so checking errno is wrong to ensure that a syscall has failed. For example for the following sequence: prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0xff, 0); EXPECT_EQ(EINVAL, errno); prctl(PR_SET_SYSCALL_USER_DISPATCH, op, 0x0, 0x0, &sel); EXPECT_EQ(EINVAL, errno); only the first syscall may fail and set errno, but the second may succeed and keep errno intact, and the check will falsely pass. Or if errno happened to be EINVAL before, even the first check may falsely pass. Also use EXPECT/ASSERT consistently. Currently there is an inconsistent mix without obvious reasons for usage of one or another. Fixes: 179ef035992e ("selftests: Add kselftest for syscall user dispatch") Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/af6a04dbfef9af8570f5bab43e3ef1416b62699a.1747839857.git.dvyukov@google.com
2025-06-12mm: add mmap_prepare() compatibility layer for nested file systemsLorenzo Stoakes
Nested file systems, that is those which invoke call_mmap() within their own f_op->mmap() handlers, may encounter underlying file systems which provide the f_op->mmap_prepare() hook introduced by commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). We have a chicken-and-egg scenario here - until all file systems are converted to using .mmap_prepare(), we cannot convert these nested handlers, as we can't call f_op->mmap from an .mmap_prepare() hook. So we have to do it the other way round - invoke the .mmap_prepare() hook from an .mmap() one. in order to do so, we need to convert VMA state into a struct vm_area_desc descriptor, invoking the underlying file system's f_op->mmap_prepare() callback passing a pointer to this, and then setting VMA state accordingly and safely. This patch achieves this via the compat_vma_mmap_prepare() function, which we invoke from call_mmap() if f_op->mmap_prepare() is specified in the passed in file pointer. We place the fundamental logic into mm/vma.h where VMA manipulation belongs. We also update the VMA userland tests to accommodate the changes. The compat_vma_mmap_prepare() function and its associated machinery is temporary, and will be removed once the conversion of file systems is complete. We carefully place this code so it can be used with CONFIG_MMU and also with cutting edge nommu silicon. [akpm@linux-foundation.org: export compat_vma_mmap_prepare tp fix build] [lorenzo.stoakes@oracle.com: remove unused declarations] Link: https://lkml.kernel.org/r/ac3ae324-4c65-432a-8c6d-2af988b18ac8@lucifer.local Link: https://lkml.kernel.org/r/20250609165749.344976-1-lorenzo.stoakes@oracle.com Fixes: c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file callback"). Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/linux-mm/CAG48ez04yOEVx1ekzOChARDDBZzAKwet8PEoPM4Ln3_rk91AzQ@mail.gmail.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-12tools/bpf_jit_disasm: Fix potential negative tpath index in get_exec_path()Ruslan Semchenko
If readlink() fails, len will be -1, which can cause negative indexing and undefined behavior. This patch ensures that len is set to 0 on readlink failure, preventing such issues. Signed-off-by: Ruslan Semchenko <uncleruc2075@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20250612131816.1870-1-uncleruc2075@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-12selftests/bpf: Fix xdp_do_redirect failure with 64KB page sizeYonghong Song
On arm64 with 64KB page size, the selftest xdp_do_redirect failed like below: ... test_xdp_do_redirect:PASS:pkt_count_tc 0 nsec test_max_pkt_size:PASS:prog_run_max_size 0 nsec test_max_pkt_size:FAIL:prog_run_too_big unexpected prog_run_too_big: actual -28 != expected -22 With 64KB page size, the xdp frame size will be much bigger so the existing test will fail. Adjust various parameters so the test can also work on 64K page size. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250612035042.2208630-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-12selftests/bpf: Fix two net related test failures with 64K page sizeYonghong Song
When running BPF selftests on arm64 with a 64K page size, I encountered the following two test failures: sockmap_basic/sockmap skb_verdict change tail:FAIL tc_change_tail:FAIL With further debugging, I identified the root cause in the following kernel code within __bpf_skb_change_tail(): u32 max_len = BPF_SKB_MAX_LEN; u32 min_len = __bpf_skb_min_len(skb); int ret; if (unlikely(flags || new_len > max_len || new_len < min_len)) return -EINVAL; With a 4K page size, new_len = 65535 and max_len = 16064, the function returns -EINVAL. However, With a 64K page size, max_len increases to 261824, allowing execution to proceed further in the function. This is because BPF_SKB_MAX_LEN scales with the page size and larger page sizes result in higher max_len values. Updating the new_len parameter in both tests based on actual kernel page size resolved both failures. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250612035037.2207911-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-12bpf: Fix an issue in bpf_prog_test_run_xdp when page size greater than 4KYonghong Song
The bpf selftest xdp_adjust_tail/xdp_adjust_frags_tail_grow failed on arm64 with 64KB page: xdp_adjust_tail/xdp_adjust_frags_tail_grow:FAIL In bpf_prog_test_run_xdp(), the xdp->frame_sz is set to 4K, but later on when constructing frags, with 64K page size, the frag data_len could be more than 4K. This will cause problems in bpf_xdp_frags_increase_tail(). To fix the failure, the xdp->frame_sz is set to be PAGE_SIZE so kernel can test different page size properly. With the kernel change, the user space and bpf prog needs adjustment. Currently, the MAX_SKB_FRAGS default value is 17, so for 4K page, the maximum packet size will be less than 68K. To test 64K page, a bigger maximum packet size than 68K is desired. So two different functions are implemented for subtest xdp_adjust_frags_tail_grow. Depending on different page size, different data input/output sizes are used to adapt with different page size. Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250612035032.2207498-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-12selftests: tcp_ao: fix spelling in seq-ext.c commentAnkit Chauhan
Spelling fix: conneciton --> connection This is a non-functional change aimed at improving code clarity. Signed-off-by: Ankit Chauhan <ankitchauhan2065@gmail.com> Link: https://patch.msgid.link/20250610071903.67180-1-ankitchauhan2065@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12selftests/bpf: fix signedness bug in redir_partial()Fushuai Wang
When xsend() returns -1 (error), the check 'n < sizeof(buf)' incorrectly treats it as success due to unsigned promotion. Explicitly check for -1 first. Fixes: a4b7193d8efd ("selftests/bpf: Add sockmap test for redirecting partial skb data") Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Link: https://lore.kernel.org/r/20250612084208.27722-1-wangfushuai@baidu.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-12selftests/bpf: tests with a loop state missing read/precision markEduard Zingerman
The test case absent_mark_in_the_middle_state is equivalent of the following C program: 1: r8 = bpf_get_prandom_u32(); 2: r6 = -32; 3: bpf_iter_num_new(&fp[-8], 0, 10); 4: if (unlikely(bpf_get_prandom_u32())) 5: r6 = -31; 6: for (;;) { 7: if (!bpf_iter_num_next(&fp[-8])) 8: break; 9: if (unlikely(bpf_get_prandom_u32())) 10: *(u64 *)(fp + r6) = 7; 11: } 12: bpf_iter_num_destroy(&fp[-8]); 13: return 0; W/o a fix that instructs verifier to ignore branches count for loop entries verification proceeds as follows: - 1-4, state is {r6=-32,fp-8=active}; - 6, checkpoint A is created with {r6=-32,fp-8=active}; - 7, checkpoint B is created with {r6=-32,fp-8=active}, push state {r6=-32,fp-8=active} from 7 to 9; - 8,12,13, {r6=-32,fp-8=drained}, exit; - pop state with {r6=-32,fp-8=active} from 7 to 9; - 9, push state {r6=-32,fp-8=active} from 9 to 10; - 6, checkpoint C is created with {r6=-32,fp-8=active}; - 7, checkpoint A is hit, no precision propagated for r6 to C; - pop state {r6=-32,fp-8=active} from 9 to 10; - 10, state is {r6=-31,fp-8=active}, r6 is marked as read and precise, these marks are propagated to checkpoints A and B (but not C, as it is not the parent of current state; - 6, {r6=-31,fp-8=active} checkpoint C is hit, because r6 is not marked precise for this checkpoint; - the program is accepted, despite a possibility of unaligned u64 stack access at offset -31. The test case absent_mark_in_the_middle_state2 is similar except the following change: r8 = bpf_get_prandom_u32(); r6 = -32; bpf_iter_num_new(&fp[-8], 0, 10); if (unlikely(bpf_get_prandom_u32())) { r6 = -31; + jump_into_loop: + goto +0; + goto loop; + } + if (unlikely(bpf_get_prandom_u32())) + goto jump_into_loop; + loop: for (;;) { if (!bpf_iter_num_next(&fp[-8])) break; if (unlikely(bpf_get_prandom_u32())) *(u64 *)(fp + r6) = 7; } bpf_iter_num_destroy(&fp[-8]) return 0 The goal is to check that read/precision marks are propagated to checkpoint created at 'goto +0' that resides outside of the loop. The test case absent_mark_in_the_middle_state3 is a bit different and is equivalent to the C program below: int absent_mark_in_the_middle_state3(void) { bpf_iter_num_new(&fp[-8], 0, 10) loop1(-32, &fp[-8]) loop1_wrapper(&fp[-8]) bpf_iter_num_destroy(&fp[-8]) } int loop1(num, iter) { while (bpf_iter_num_next(iter)) { if (unlikely(bpf_get_prandom_u32())) *(fp + num) = 7; } return 0 } int loop1_wrapper(iter) { r6 = -32; if (unlikely(bpf_get_prandom_u32())) r6 = -31; loop1(r6, iter); return 0; } The unsafe state is reached in a similar manner, but the loop is located inside a subprogram that is called from two locations in the main subprogram. This detail is important for exercising bpf_scc_visit->backedges memory management. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20250611200836.4135542-11-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc2). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12Merge tag 'net-6.16-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bluetooth and wireless. Current release - regressions: - af_unix: allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD Current release - new code bugs: - eth: airoha: correct enable mask for RX queues 16-31 - veth: prevent NULL pointer dereference in veth_xdp_rcv when peer disappears under traffic - ipv6: move fib6_config_validate() to ip6_route_add(), prevent invalid routes Previous releases - regressions: - phy: phy_caps: don't skip better duplex match on non-exact match - dsa: b53: fix untagged traffic sent via cpu tagged with VID 0 - Revert "wifi: mwifiex: Fix HT40 bandwidth issue.", it caused transient packet loss, exact reason not fully understood, yet Previous releases - always broken: - net: clear the dst when BPF is changing skb protocol (IPv4 <> IPv6) - sched: sfq: fix a potential crash on gso_skb handling - Bluetooth: intel: improve rx buffer posting to avoid causing issues in the firmware - eth: intel: i40e: make reset handling robust against multiple requests - eth: mlx5: ensure FW pages are always allocated on the local NUMA node, even when device is configure to 'serve' another node - wifi: ath12k: fix GCC_GCC_PCIE_HOT_RST definition for WCN7850, prevent kernel crashes - wifi: ath11k: avoid burning CPU in ath11k_debugfs_fw_stats_request() for 3 sec if fw_stats_done is not set" * tag 'net-6.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits) selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS context net: ethtool: Don't check if RSS context exists in case of context 0 af_unix: Allow passing cred for embryo without SO_PASSCRED/SO_PASSPIDFD. ipv6: Move fib6_config_validate() to ip6_route_add(). net: drv: netdevsim: don't napi_complete() from netpoll net/mlx5: HWS, Add error checking to hws_bwc_rule_complex_hash_node_get() veth: prevent NULL pointer dereference in veth_xdp_rcv net_sched: remove qdisc_tree_flush_backlog() net_sched: ets: fix a race in ets_qdisc_change() net_sched: tbf: fix a race in tbf_change() net_sched: red: fix a race in __red_change() net_sched: prio: fix a race in prio_tune() net_sched: sch_sfq: reject invalid perturb period net: phy: phy_caps: Don't skip better duplex macth on non-exact match MAINTAINERS: Update Kuniyuki Iwashima's email address. selftests: net: add test case for NAT46 looping back dst net: clear the dst when changing skb protocol net/mlx5e: Fix number of lanes to UNKNOWN when using data_rate_oper net/mlx5e: Fix leak of Geneve TLV option object net/mlx5: HWS, make sure the uplink is the last destination ...
2025-06-12selftests: drv-net: rss_ctx: Add test for ntuple rules targeting default RSS ↵Gal Pressman
context Add test_rss_default_context_rule() to verify that ntuple rules can correctly direct traffic to the default RSS context (context 0). The test creates two ntuple rules with explicit location priorities: - A high-priority rule (loc 0) directing specific port traffic to context 0. - A low-priority rule (loc 1) directing all other TCP traffic to context 1. This validates that: 1. Rules targeting the default context function properly. 2. Traffic steering works as expected when mixing default and additional RSS contexts. The test was written by AI, and reviewed by humans. Reviewed-by: Nimrod Oren <noren@nvidia.com> Signed-off-by: Gal Pressman <gal@nvidia.com> Link: https://patch.msgid.link/20250612071958.1696361-3-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-12selftests/coredump: add coredump server selftestsChristian Brauner
This adds extensive tests for the coredump server. Link: https://lore.kernel.org/20250603-work-coredump-socket-protocol-v2-5-05a5f0c18ecc@kernel.org Acked-by: Lennart Poettering <lennart@poettering.net> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-12tools: add coredump.h headerChristian Brauner
Copy the coredump header so we can rely on it in the selftests. Link: https://lore.kernel.org/20250603-work-coredump-socket-protocol-v2-4-05a5f0c18ecc@kernel.org Acked-by: Lennart Poettering <lennart@poettering.net> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-12selftests/coredump: cleanup coredump testsChristian Brauner
Make the selftests we added this cycle easier to read. Link: https://lore.kernel.org/20250603-work-coredump-socket-protocol-v2-3-05a5f0c18ecc@kernel.org Acked-by: Lennart Poettering <lennart@poettering.net> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-12selftests/coredump: fix buildChristian Brauner
Fix various warnings in the selftest build. Link: https://lore.kernel.org/20250603-work-coredump-socket-protocol-v2-2-05a5f0c18ecc@kernel.org Acked-by: Lennart Poettering <lennart@poettering.net> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-11selftests/mm: skip failed memfd setups in gup_longtermMark Brown
Unlike the other cases gup_longterm's memfd tests previously skipped the test when failing to set up the file descriptor to test. Restore this behavior to avoid hitting failures when hugetlb isn't configured. Link: https://lkml.kernel.org/r/20250605-selftest-mm-gup-longterm-tweaks-v1-1-2fae34b05958@kernel.org Fixes: 66bce7afbaca ("selftests/mm: fix test result reporting in gup_longterm") Signed-off-by: Mark Brown <broonie@kernel.org> Reported-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Closes: https://lkml.kernel.org/r/a76fc252-0fe3-4d4b-a9a1-4a2895c2680d@lucifer.local Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-11Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull BPF fixes from Alexei Starovoitov - Fix libbpf backward compatibility (Andrii Nakryiko) - Add Stanislav Fomichev as bpf/net reviewer - Fix resolve_btfid build when cross compiling (Suleiman Souhlal) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: MAINTAINERS: Add myself as bpf networking reviewer tools/resolve_btfids: Fix build when cross compiling kernel with clang. libbpf: Handle unsupported mmap-based /sys/kernel/btf/vmlinux correctly
2025-06-11selftests: net: add test case for NAT46 looping back dstJakub Kicinski
Simple test for crash involving multicast loopback and stale dst. Reuse exising NAT46 program. Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250610001245.1981782-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11selftests/vsock: add initial vmtest.sh for vsockBobby Eshleman
This commit introduces a new vmtest.sh runner for vsock. It uses virtme-ng/qemu to run tests in a VM. The tests validate G2H, H2G, and loopback. The testing tools from tools/testing/vsock/ are reused. Currently, only vsock_test is used. VMCI and hyperv support is included in the config file to be built with the -b option, though not used in the tests. Only tested on x86. To run: $ make -C tools/testing/selftests TARGETS=vsock $ tools/testing/selftests/vsock/vmtest.sh or $ make -C tools/testing/selftests TARGETS=vsock run_tests Example runs (after make -C tools/testing/selftests TARGETS=vsock): $ ./tools/testing/selftests/vsock/vmtest.sh 1..3 ok 0 vm_server_host_client ok 1 vm_client_host_server ok 2 vm_loopback SUMMARY: PASS=3 SKIP=0 FAIL=0 Log: /tmp/vsock_vmtest_m7DI.log $ ./tools/testing/selftests/vsock/vmtest.sh vm_loopback 1..1 ok 0 vm_loopback SUMMARY: PASS=1 SKIP=0 FAIL=0 Log: /tmp/vsock_vmtest_a1IO.log $ mkdir -p ~/scratch $ make -C tools/testing/selftests install TARGETS=vsock INSTALL_PATH=~/scratch [... omitted ...] $ cd ~/scratch $ ./run_kselftest.sh TAP version 13 1..1 # timeout set to 300 # selftests: vsock: vmtest.sh # 1..3 # ok 0 vm_server_host_client # ok 1 vm_client_host_server # ok 2 vm_loopback # SUMMARY: PASS=3 SKIP=0 FAIL=0 # Log: /tmp/vsock_vmtest_svEl.log ok 1 selftests: vsock: vmtest.sh Future work can include vsock_diag_test. Because vsock requires a VM to test anything other than loopback, this patch adds vmtest.sh as a kselftest itself. This is different than other systems that have a "vmtest.sh", where it is used as a utility script to spin up a VM to run the selftests as a guest (but isn't hooked into kselftest). Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250609-vsock-vmtest-v10-1-7f37198e1cd4@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-11selftests/bpf: Fix cgroup_mprog_ordering failure due to uninitialized variableYonghong Song
On arm64, the cgroup_mprog_ordering selftest failed with test_progs run when building with clang compiler. The reason is due to socklen_t optlen not initialized. In kernel function do_ip_getsockopt(), we have if (copy_from_sockptr(&len, optlen, sizeof(int))) return -EFAULT; if (len < 0) return -EINVAL; The above 'len' variable is a negative value and hence the test failed. But the test is okay on x86_64. I checked the x86_64 asm code and I didn't see explicit initialization of 'optlen' but its value is 0 so kernel didn't return error. This should be a pure luck. Fix the bug by initializing 'oplen' var properly. Fixes: e422d5f118e4 ("selftests/bpf: Add two selftests for mprog API based cgroup progs") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250611162103.1623692-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-11Merge tag 'kvmarm-fixes-6.16-2' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.16, take #2 - Rework of system register accessors for system registers that are directly writen to memory, so that sanitisation of the in-memory value happens at the correct time (after the read, or before the write). For convenience, RMW-style accessors are also provided. - Multiple fixes for the so-called "arch-timer-edge-cases' selftest, which was always broken.
2025-06-11selftests/bpf: Add test to cover ktls with bpf_msg_pop_dataJiayuan Chen
The selftest can reproduce an issue where using bpf_msg_pop_data() in ktls causes errors on the receiving end. Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20250609020910.397930-3-jiayuan.chen@linux.dev
2025-06-10selftests/net: packetdrill: more xfail changesJakub Kicinski
Most of the packetdrill tests have not flaked once last week. Add the few which did to the XFAIL list. Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250610000001.1970934-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10net: stop napi kthreads when THREADED napi is disabledSamiullah Khawaja
Once the THREADED napi is disabled, the napi kthread should also be stopped. Keeping the kthread intact after disabling THREADED napi makes the PID of this kthread show up in the output of netlink 'napi-get' and ps -ef output. The is discussed in the patch below: https://lore.kernel.org/all/20250502191548.559cc416@kernel.org NAPI kthread should stop only if, - There are no pending napi poll scheduled for this thread. - There are no new napi poll scheduled for this thread while it has stopped. - The ____napi_schedule can correctly fallback to the softirq for napi polling. Since napi_schedule_prep provides mutual exclusion over STATE_SCHED bit, it is safe to unset the STATE_THREADED when SCHED_THREADED is set or the SCHED bit is not set. SCHED_THREADED being set means that SCHED is already set and the kthread owns this napi. To disable threaded napi, unset STATE_THREADED bit safely if SCHED_THREADED is set or SCHED is unset. Once STATE_THREADED is unset safely then wait for the kthread to unset the SCHED_THREADED bit so it safe to stop the kthread. Add a new test in nl_netdev to verify this behaviour. Tested: ./tools/testing/selftests/net/nl_netdev.py TAP version 13 1..6 ok 1 nl_netdev.empty_check ok 2 nl_netdev.lo_check ok 3 nl_netdev.page_pool_check ok 4 nl_netdev.napi_list_check ok 5 nl_netdev.dev_set_threaded ok 6 nl_netdev.nsim_rxq_reset_down # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0 Ran neper for 300 seconds and did enable/disable of thread napi in a loop continuously. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Link: https://patch.msgid.link/20250609173015.3851695-1-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10selftests: netconsole: Add support for basic netconsole target formatBreno Leitao
Extend the netconsole selftest to validate both basic and extended target formats. The basic format is a simpler variant that doesn't support userdata or release functionality. The test now validates that netconsole works correctly in both configurations, improving test coverage for different netconsole deployment scenarios. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250609-netcons_ext-v3-4-5336fa670326@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10selftests: netconsole: Do not exit from inside the validation functionBreno Leitao
Remove the exit call from validate_result() function and move the test exit logic to the main script. This allows the function to be reused in scenarios where the test needs to continue execution after validation, rather than terminating immediately. The validate_result() function should focus on validation logic only, while the calling script maintains control over program flow and exit conditions. This change improves code modularity and prepares for potential future enhancements where multiple validations might be needed in a single test run. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://patch.msgid.link/20250609-netcons_ext-v3-3-5336fa670326@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-10tools/resolve_btfids: Fix build when cross compiling kernel with clang.Suleiman Souhlal
When cross compiling the kernel with clang, we need to override CLANG_CROSS_FLAGS when preparing the step libraries. Prior to commit d1d096312176 ("tools: fix annoying "mkdir -p ..." logs when building tools in parallel"), MAKEFLAGS would have been set to a value that wouldn't set a value for CLANG_CROSS_FLAGS, hiding the fact that we weren't properly overriding it. Fixes: 56a2df7615fa ("tools/resolve_btfids: Compile resolve_btfids as host program") Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/bpf/20250606074538.1608546-1-suleiman@google.com
2025-06-10selftests/nolibc: make stackprotector probing more robustThomas Weißschuh
nolibc only supports symbol-based stackprotectors, based on the global variable __stack_chk_guard. Support for this differs between architectures and toolchains. Some use the symbol mode by default, some require a flag to enable it and some don't support it at all. Before the nolibc test Makefile required the availability of "-mstack-protector-guard=global" to enable stackprotectors. While this flag makes sure that the correct mode is available it doesn't work where the correct mode is the only supported one and therefore the flag is not implemented. Switch to a more dynamic probing mechanism. This correctly enables stack protectors for mips, loongarch and m68k. Acked-by: Willy Tarreau <w@1wt.eu> Link: https://lore.kernel.org/r/20250609-nolibc-stackprotector-robust-v1-1-a1cfc92a568a@weissschuh.net Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
2025-06-10bpf: adjust path to trace_output sample eBPF programTobias Klauser
The sample file was renamed from trace_output_kern.c to trace_output.bpf.c in commit d4fffba4d04b ("samples/bpf: Change _kern suffix to .bpf with syscall tracing program"). Adjust the path in the documentation comment for bpf_perf_event_output. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Link: https://lore.kernel.org/r/20250610140756.16332-1-tklauser@distanz.ch Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-09selftests/bpf: Add test for Spectre v1 mitigationLuis Gerhorst
This is based on the gadget from the description of commit 9183671af6db ("bpf: Fix leakage under speculation on mispredicted branches"). Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250603212814.338867-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-09bpf: Fall back to nospec for Spectre v1Luis Gerhorst
This implements the core of the series and causes the verifier to fall back to mitigating Spectre v1 using speculation barriers. The approach was presented at LPC'24 [1] and RAID'24 [2]. If we find any forbidden behavior on a speculative path, we insert a nospec (e.g., lfence speculation barrier on x86) before the instruction and stop verifying the path. While verifying a speculative path, we can furthermore stop verification of that path whenever we encounter a nospec instruction. A minimal example program would look as follows: A = true B = true if A goto e f() if B goto e unsafe() e: exit There are the following speculative and non-speculative paths (`cur->speculative` and `speculative` referring to the value of the push_stack() parameters): - A = true - B = true - if A goto e - A && !cur->speculative && !speculative - exit - !A && !cur->speculative && speculative - f() - if B goto e - B && cur->speculative && !speculative - exit - !B && cur->speculative && speculative - unsafe() If f() contains any unsafe behavior under Spectre v1 and the unsafe behavior matches `state->speculative && error_recoverable_with_nospec(err)`, do_check() will now add a nospec before f() instead of rejecting the program: A = true B = true if A goto e nospec f() if B goto e unsafe() e: exit Alternatively, the algorithm also takes advantage of nospec instructions inserted for other reasons (e.g., Spectre v4). Taking the program above as an example, speculative path exploration can stop before f() if a nospec was inserted there because of Spectre v4 sanitization. In this example, all instructions after the nospec are dead code (and with the nospec they are also dead code speculatively). For this, it relies on the fact that speculation barriers generally prevent all later instructions from executing if the speculation was not correct: * On Intel x86_64, lfence acts as full speculation barrier, not only as a load fence [3]: An LFENCE instruction or a serializing instruction will ensure that no later instructions execute, even speculatively, until all prior instructions complete locally. [...] Inserting an LFENCE instruction after a bounds check prevents later operations from executing before the bound check completes. This was experimentally confirmed in [4]. * On AMD x86_64, lfence is dispatch-serializing [5] (requires MSR C001_1029[1] to be set if the MSR is supported, this happens in init_amd()). AMD further specifies "A dispatch serializing instruction forces the processor to retire the serializing instruction and all previous instructions before the next instruction is executed" [8]. As dispatch is not specific to memory loads or branches, lfence therefore also affects all instructions there. Also, if retiring a branch means it's PC change becomes architectural (should be), this means any "wrong" speculation is aborted as required for this series. * ARM's SB speculation barrier instruction also affects "any instruction that appears later in the program order than the barrier" [6]. * PowerPC's barrier also affects all subsequent instructions [7]: [...] executing an ori R31,R31,0 instruction ensures that all instructions preceding the ori R31,R31,0 instruction have completed before the ori R31,R31,0 instruction completes, and that no subsequent instructions are initiated, even out-of-order, until after the ori R31,R31,0 instruction completes. The ori R31,R31,0 instruction may complete before storage accesses associated with instructions preceding the ori R31,R31,0 instruction have been performed Regarding the example, this implies that `if B goto e` will not execute before `if A goto e` completes. Once `if A goto e` completes, the CPU should find that the speculation was wrong and continue with `exit`. If there is any other path that leads to `if B goto e` (and therefore `unsafe()`) without going through `if A goto e`, then a nospec will still be needed there. However, this patch assumes this other path will be explored separately and therefore be discovered by the verifier even if the exploration discussed here stops at the nospec. This patch furthermore has the unfortunate consequence that Spectre v1 mitigations now only support architectures which implement BPF_NOSPEC. Before this commit, Spectre v1 mitigations prevented exploits by rejecting the programs on all architectures. Because some JITs do not implement BPF_NOSPEC, this patch therefore may regress unpriv BPF's security to a limited extent: * The regression is limited to systems vulnerable to Spectre v1, have unprivileged BPF enabled, and do NOT emit insns for BPF_NOSPEC. The latter is not the case for x86 64- and 32-bit, arm64, and powerpc 64-bit and they are therefore not affected by the regression. According to commit a6f6a95f2580 ("LoongArch, bpf: Fix jit to skip speculation barrier opcode"), LoongArch is not vulnerable to Spectre v1 and therefore also not affected by the regression. * To the best of my knowledge this regression may therefore only affect MIPS. This is deemed acceptable because unpriv BPF is still disabled there by default. As stated in a previous commit, BPF_NOSPEC could be implemented for MIPS based on GCC's speculation_barrier implementation. * It is unclear which other architectures (besides x86 64- and 32-bit, ARM64, PowerPC 64-bit, LoongArch, and MIPS) supported by the kernel are vulnerable to Spectre v1. Also, it is not clear if barriers are available on these architectures. Implementing BPF_NOSPEC on these architectures therefore is non-trivial. Searching GCC and the kernel for speculation barrier implementations for these architectures yielded no result. * If any of those regressed systems is also vulnerable to Spectre v4, the system was already vulnerable to Spectre v4 attacks based on unpriv BPF before this patch and the impact is therefore further limited. As an alternative to regressing security, one could still reject programs if the architecture does not emit BPF_NOSPEC (e.g., by removing the empty BPF_NOSPEC-case from all JITs except for LoongArch where it appears justified). However, this will cause rejections on these archs that are likely unfounded in the vast majority of cases. In the tests, some are now successful where we previously had a false-positive (i.e., rejection). Change them to reflect where the nospec should be inserted (using __xlated_unpriv) and modify the error message if the nospec is able to mitigate a problem that previously shadowed another problem (in that case __xlated_unpriv does not work, therefore just add a comment). Define SPEC_V1 to avoid duplicating this ifdef whenever we check for nospec insns using __xlated_unpriv, define it here once. This also improves readability. PowerPC can probably also be added here. However, omit it for now because the BPF CI currently does not include a test. Limit it to EPERM, EACCES, and EINVAL (and not everything except for EFAULT and ENOMEM) as it already has the desired effect for most real-world programs. Briefly went through all the occurrences of EPERM, EINVAL, and EACCESS in verifier.c to validate that catching them like this makes sense. Thanks to Dustin for their help in checking the vendor documentation. [1] https://lpc.events/event/18/contributions/1954/ ("Mitigating Spectre-PHT using Speculation Barriers in Linux eBPF") [2] https://arxiv.org/pdf/2405.00078 ("VeriFence: Lightweight and Precise Spectre Defenses for Untrusted Linux Kernel Extensions") [3] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/runtime-speculative-side-channel-mitigations.html ("Managed Runtime Speculative Execution Side Channel Mitigations") [4] https://dl.acm.org/doi/pdf/10.1145/3359789.3359837 ("Speculator: a tool to analyze speculative execution attacks and mitigations" - Section 4.6 "Stopping Speculative Execution") [5] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf ("White Paper - SOFTWARE TECHNIQUES FOR MANAGING SPECULATION ON AMD PROCESSORS - REVISION 5.09.23") [6] https://developer.arm.com/documentation/ddi0597/2020-12/Base-Instructions/SB--Speculation-Barrier- ("SB - Speculation Barrier - Arm Armv8-A A32/T32 Instruction Set Architecture (2020-12)") [7] https://wiki.raptorcs.com/w/images/5/5f/OPF_PowerISA_v3.1C.pdf ("Power ISA™ - Version 3.1C - May 26, 2024 - Section 9.2.1 of Book III") [8] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/40332.pdf ("AMD64 Architecture Programmer’s Manual Volumes 1–5 - Revision 4.08 - April 2024 - 7.6.4 Serializing Instructions") Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Henriette Herzog <henriette.herzog@rub.de> Cc: Dustin Nguyen <nguyen@cs.fau.de> Cc: Maximilian Ott <ott@cs.fau.de> Cc: Milan Stephan <milan.stephan@fau.de> Link: https://lore.kernel.org/r/20250603212428.338473-1-luis.gerhorst@fau.de Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-09bpftool: Display cookie for tracing link probeTao Chen
Display cookie for tracing link probe, in plain mode: #bpftool link 5: tracing prog 34 prog_type tracing attach_type trace_fentry target_obj_id 1 target_btf_id 60355 cookie 4503599627370496 pids test_progs(176) And in json mode: #bpftool link -j | jq { "id": 5, "type": "tracing", "prog_id": 34, "prog_type": "tracing", "attach_type": "trace_fentry", "target_obj_id": 1, "target_btf_id": 60355, "cookie": 4503599627370496, "pids": [ { "pid": 176, "comm": "test_progs" } ] } Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-3-chen.dylane@linux.dev
2025-06-09selftests/bpf: Add cookies check for tracing fill_link_info testTao Chen
Adding tests for getting cookie with fill_link_info for tracing. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-2-chen.dylane@linux.dev
2025-06-09bpf: Add cookie to tracing bpf_link_infoTao Chen
bpf_tramp_link includes cookie info, we can add it in bpf_link_info. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250606165818.3394397-1-chen.dylane@linux.dev