path: root/tools
Age  Commit message  Author
2025-05-13  perf list: Display the PMU name associated with a perf metric in JSON  Ian Rogers
The 'perf stat --cputype' option can be used to filter which metrics will be applied; for this reason the JSON metrics have an associated PMU. List this PMU name in the 'perf list' output in JSON mode so that tooling may access it.

An example of the new field is:
```
{
	"MetricGroup": "Backend",
	"MetricName": "tma_core_bound",
	"MetricExpr": "max(0, tma_backend_bound - tma_memory_bound)",
	"MetricThreshold": "tma_core_bound > 0.1 & tma_backend_bound > 0.2",
	"ScaleUnit": "100%",
	"BriefDescription": "This metric represents fraction of slots where ...
	"PublicDescription": "This metric represents fraction of slots where ...
	"Unit": "cpu_core"
},
```

Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Weilin Wang <weilin.wang@intel.com>
Link: https://lore.kernel.org/r/20250512184700.11691-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
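Tooling that reads this field might, for instance, group metrics by PMU. The following is a minimal sketch, assuming the JSON-mode output is an array of objects shaped like the example in the commit message. The second entry, the placeholder descriptions, and the helper name `metrics_by_pmu` are invented for illustration and are not part of perf.

```python
import json
from collections import defaultdict

# Sample shaped like the example in the commit message. The descriptions are
# placeholders and the second entry is invented for illustration.
SAMPLE = """
[
  {"MetricGroup": "Backend", "MetricName": "tma_core_bound",
   "ScaleUnit": "100%", "BriefDescription": "placeholder", "Unit": "cpu_core"},
  {"MetricGroup": "Backend", "MetricName": "example_atom_metric",
   "ScaleUnit": "100%", "BriefDescription": "placeholder", "Unit": "cpu_atom"}
]
"""


def metrics_by_pmu(text):
    """Group metric names by the PMU named in the "Unit" field."""
    groups = defaultdict(list)
    for entry in json.loads(text):
        # Only metric entries carry "MetricName"; skip anything else.
        if "MetricName" in entry:
            groups[entry.get("Unit", "unknown")].append(entry["MetricName"])
    return dict(groups)


print(metrics_by_pmu(SAMPLE))
```

Entries without a "Unit" field are filed under "unknown" here; that fallback is a choice made for this sketch, not documented perf behaviour.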
2025-05-13  perf metricgroup: Binary search when resolving referred to metrics  Ian Rogers
Unlike with events, metrics can be matched by name or a list of metric groups. However, when a metric refers to another metric it isn't referring to a group but the singular metric in question. Prior to this change every "id" in a metric expression is checked to see if it is a metric by scanning all the metrics in the metrics table. As the table is sorted by metric name we can speed the search in the resolution case by binary searching for the metric. Rename some of the metricgroup functions to make it clearer whether they match a metric by name or by both name and group.

Before:
```
$ time perf test -v 10
 10: PMU JSON event tests                                  :
 10.1: PMU event table sanity                              : Ok
 10.2: PMU event map aliases                               : Ok
 10.3: Parsing of PMU event table metrics                  : Ok
 10.4: Parsing of PMU event table metrics with fake PMUs   : Ok
 10.5: Parsing of metric thresholds with fake PMUs         : Ok

real    0m15.972s
user    0m13.176s
sys     0m3.001s
```

After:
```
$ time perf test -v 10
 10: PMU JSON event tests                                  :
 10.1: PMU event table sanity                              : Ok
 10.2: PMU event map aliases                               : Ok
 10.3: Parsing of PMU event table metrics                  : Ok
 10.4: Parsing of PMU event table metrics with fake PMUs   : Ok
 10.5: Parsing of metric thresholds with fake PMUs         : Ok

real    0m5.343s
user    0m1.871s
sys     0m2.128s
```

Committer testing:

  root@number:~# grep -m1 'model name' /proc/cpuinfo
  model name : AMD Ryzen 9 9950X3D 16-Core Processor
  root@number:~#

Before:

  root@number:~# time perf test "Parsing of PMU event table metrics"
   10.3: Parsing of PMU event table metrics                  : Ok
   10.4: Parsing of PMU event table metrics with fake PMUs   : Ok

  real    0m9.286s
  user    0m9.354s
  sys     0m0.062s
  root@number:~#

After:

  root@number:~# time perf test "Parsing of PMU event table metrics"
   10.3: Parsing of PMU event table metrics                  : Ok
   10.4: Parsing of PMU event table metrics with fake PMUs   : Ok

  real    0m0.689s
  user    0m0.766s
  sys     0m0.042s
  root@number:~# time perf test 10
   10: PMU JSON event tests                                  :
   10.1: PMU event table sanity                              : Ok
   10.2: PMU event map aliases                               : Ok
   10.3: Parsing of PMU event table metrics                  : Ok
   10.4: Parsing of PMU event table metrics with fake PMUs   : Ok
   10.5: Parsing of metric thresholds with fake PMUs         : Ok

  real    0m0.696s
  user    0m0.807s
  sys     0m0.064s
  root@number:~#

Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Thomas Richter <tmricht@linux.ibm.com>
Cc: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20250512194622.33258-4-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
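The difference between scanning the table and binary searching it can be sketched as follows. This is an illustration only, not the perf implementation: the table contents are a made-up sample, the function names are hypothetical, and plain string comparison is assumed as the sort order.

```python
from bisect import bisect_left

# Hypothetical stand-in for the metrics table, kept sorted by metric name.
METRICS = sorted([
    "tma_backend_bound",
    "tma_core_bound",
    "tma_memory_bound",
    "tma_retiring",
])


def find_metric_linear(name):
    """Scan every entry until the name matches: O(n) per lookup."""
    for i, metric in enumerate(METRICS):
        if metric == name:
            return i
    return -1


def find_metric_bsearch(name):
    """Binary search: O(log n), valid only because METRICS is sorted."""
    i = bisect_left(METRICS, name)
    return i if i < len(METRICS) and METRICS[i] == name else -1


print(find_metric_linear("tma_core_bound"), find_metric_bsearch("tma_core_bound"))
```

Both functions return the same index for a name that is present and -1 for one that is not; only the number of comparisons differs, which matters when every "id" in every metric expression triggers a lookup.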
2025-05-13  perf pmu: Change aliases from list to hashmap  Ian Rogers
Finding an alias for things like perf_pmu__have_event() would need to search the aliases list, whilst this happens relatively infrequently it can be a significant overhead in testing. Switch to using a hashmap. Move common initialization code to perf_pmu__init(). Refactor the test 'struct perf_pmu_test_pmu' to not have perf pmu within it to better support the perf_pmu__init() function. Before: ``` $ time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m13.287s user 0m13.026s sys 0m0.532s ``` After: ``` $ time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m13.011s user 0m12.885s sys 0m0.485s ``` Committer testing: root@number:~# grep -m1 'model name' /proc/cpuinfo model name : AMD Ryzen 9 9950X3D 16-Core Processor root@number:~# Before: root@number:~# time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m9.296s user 0m9.361s sys 0m0.063s root@number:~# After: root@number:~# time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m9.286s user 0m9.354s sys 0m0.062s root@number:~# Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Thomas Richter 
<tmricht@linux.ibm.com> Cc: Xu Yang <xu.yang_2@nxp.com> Link: https://lore.kernel.org/r/20250512194622.33258-3-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-13  perf fncache: Switch to using hashmap  Ian Rogers
The existing fncache can get large in testing situations. As the bucket array is a fixed size this leads to it degrading to O(n) performance. Use a regular hashmap that can dynamically reallocate its array. Before: ``` $ time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m14.132s user 0m17.806s sys 0m0.557s ``` After: ``` $ time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m13.287s user 0m13.026s sys 0m0.532s ``` Committer notes: root@number:~# grep -m1 'model name' /proc/cpuinfo model name : AMD Ryzen 9 9950X3D 16-Core Processor root@number:~# Before: root@number:~# time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m9.277s user 0m9.979s sys 0m0.055s root@number:~# After: root@number:~# time perf test "Parsing of PMU event table metrics" 10.3: Parsing of PMU event table metrics : Ok 10.4: Parsing of PMU event table metrics with fake PMUs : Ok real 0m9.296s user 0m9.361s sys 0m0.063s root@number:~# Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@amd.com> Cc: Thomas Richter <tmricht@linux.ibm.com> Cc: Xu Yang <xu.yang_2@nxp.com> Link: https://lore.kernel.org/r/20250512194622.33258-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
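Why a fixed-size bucket array degrades to O(n) can be seen in a toy model. The sketch below is an assumption-laden illustration, not the fncache or perf hashmap code: both classes are hypothetical, and integer keys are used (instead of file names) only to keep the result deterministic.

```python
class FixedBuckets:
    """Hash table with a fixed number of buckets; chains grow without bound."""

    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))

    def longest_chain(self):
        return max(len(bucket) for bucket in self.buckets)


class GrowingBuckets(FixedBuckets):
    """Same table, but the bucket array doubles once entries outnumber buckets."""

    def __init__(self, nbuckets=8):
        super().__init__(nbuckets)
        self.count = 0

    def insert(self, key, value):
        if self.count >= len(self.buckets):
            # Rehash every existing entry into an array twice the size.
            old = [item for bucket in self.buckets for item in bucket]
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            for k, v in old:
                FixedBuckets.insert(self, k, v)
        FixedBuckets.insert(self, key, value)
        self.count += 1


fixed, growing = FixedBuckets(), GrowingBuckets()
for i in range(1024):
    fixed.insert(i, True)
    growing.insert(i, True)
print(fixed.longest_chain(), growing.longest_chain())
```

With 1024 entries in 8 buckets, at least one chain must hold 128 entries, so each lookup walks a long list; the resizing table keeps chains short at the cost of occasional rehashing.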
2025-05-13  tools: ynl-gen: support struct for binary attributes  Jakub Kicinski
Support using a struct pointer for binary attrs. Len field is maintained because the structs may grow with newer kernel versions. Or, which matters more, be shorter if the binary is built against newer uAPI than kernel against which it's executed. Since we are storing a pointer to a struct type - always allocate at least the amount of memory needed by the struct per current uAPI headers (unused mem is zeroed). Technically users should check the length field but per modern ASAN checks storing a short object under a pointer seems like a bad idea. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Link: https://patch.msgid.link/20250509154213.1747885-4-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
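The allocation rule described above (keep the received length, but always back the pointer with at least a full, zero-filled struct) can be modelled in a few lines. This is a sketch under assumptions: the struct layout and the helper `store_binary_attr` are hypothetical, and the generated C code is not reproduced here.

```python
import struct

# Hypothetical uAPI struct as known to the headers this program was built
# against: two u32 fields followed by one u64 (16 bytes, no padding).
CURRENT_LAYOUT = struct.Struct("=IIQ")


def store_binary_attr(payload: bytes):
    """Return (buffer, length): buffer is at least struct-sized, zero padded."""
    length = len(payload)
    buf = bytearray(max(length, CURRENT_LAYOUT.size))  # bytearray is zero-filled
    buf[:length] = payload
    return bytes(buf), length


# An older kernel that only knows the first two fields sends 8 bytes.
short_payload = struct.pack("=II", 1, 2)
buf, length = store_binary_attr(short_payload)
print(length, CURRENT_LAYOUT.unpack(buf))
```

The length still tells the caller how much the kernel actually sent, while reading any field of the current struct stays within the allocation and yields zero for fields the kernel did not provide.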
2025-05-13  tools: ynl-gen: auto-indent else  Jakub Kicinski
We auto-indent if statements (increase the indent of the subsequent line by 1); do the same thing for else branches without a block. There haven't been any else branches before, but we're about to add one. Reviewed-by: Donald Hunter <donald.hunter@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Link: https://patch.msgid.link/20250509154213.1747885-3-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
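The auto-indent behaviour can be illustrated with a toy line writer. This is a sketch of the idea only; the class `CodeWriter` and its method `p` are hypothetical names chosen for this example and do not claim to match the ynl-gen source.

```python
class CodeWriter:
    """Toy C line writer that indents the line after a braceless if/else."""

    def __init__(self):
        self.lines = []
        self._indent = 0
        self._bump_next = False  # previous line was an if/else without a block

    def p(self, line):
        if line.startswith("}"):
            self._indent -= 1
        extra = 1 if self._bump_next else 0
        self.lines.append("\t" * (self._indent + extra) + line)
        # A braceless `if (...)` or a bare `else` indents only the next line.
        opens_block = line.endswith("{")
        self._bump_next = (line.startswith("if ") or line == "else") and not opens_block
        if opens_block:
            self._indent += 1


writer = CodeWriter()
for text in ["if (a)", "x = 1;", "else", "x = 2;", "return x;"]:
    writer.p(text)
print("\n".join(writer.lines))
```

Handling `else` in the same place as `if` keeps the callers free of manual indentation bookkeeping, which is the convenience the commit extends.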
2025-05-13  tools: ynl-gen: support sub-type for binary attributes  Jakub Kicinski
Sub-type annotation on binary attributes may indicate that the attribute carries an array of simple types (also referred to as "C array" in docs). Support rendering them as such in the C user code. For example for u32, instead of:

	struct {
		u32 arr;
	} _len;

	void *arr;

render:

	struct {
		u32 arr;
	} _count;

	__u32 *arr;

Note that count is the number of elements while len was the length in bytes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Donald Hunter <donald.hunter@gmail.com>
Link: https://patch.msgid.link/20250509154213.1747885-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
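The relationship between the byte length and the element count is simple arithmetic, shown here for a u32 array. The helper `decode_u32_array` is a hypothetical name used for this sketch; it models the payload as raw bytes in native byte order and is not the generated C code.

```python
import struct


def decode_u32_array(payload: bytes):
    """Return (count, values) for a binary attribute holding a u32 C array."""
    elem_size = struct.calcsize("=I")  # 4 bytes per u32
    count = len(payload) // elem_size  # count is elements, not bytes
    values = list(struct.unpack(f"={count}I", payload[: count * elem_size]))
    return count, values


payload = struct.pack("=3I", 10, 20, 30)
count, values = decode_u32_array(payload)
print(len(payload), count, values)
```

Here a 12-byte payload yields a count of 3, which is the value the `_count` member would carry where `_len` previously carried 12.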
2025-05-13  selftests: ncdevmem: Implement devmem TCP TX  Mina Almasry
Add support for devmem TX in ncdevmem. This is a combination of the ncdevmem from the devmem TCP series RFCv1 which included the TX path, and work by Stan to include the netlink API and refactored on top of his generic memory_provider support. Signed-off-by: Mina Almasry <almasrymina@google.com> Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://patch.msgid.link/20250508004830.4100853-10-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13  net: devmem: TCP tx netlink api  Stanislav Fomichev
Add bind-tx netlink call to attach dmabuf for TX; queue is not required, only ifindex and dmabuf fd for attachment. Signed-off-by: Stanislav Fomichev <sdf@fomichev.me> Signed-off-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20250508004830.4100853-4-almasrymina@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-05-13  Merge commit 'its-for-linus-20250509-merge' into x86/core, to resolve conflicts  Ingo Molnar
Conflicts: Documentation/admin-guide/hw-vuln/index.rst arch/x86/include/asm/cpufeatures.h arch/x86/kernel/alternative.c arch/x86/kernel/cpu/bugs.c arch/x86/kernel/cpu/common.c drivers/base/cpu.c include/linux/cpu.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13  Merge branch 'x86/mm' into x86/core, to resolve conflicts  Ingo Molnar
Conflicts: arch/x86/mm/numa.c arch/x86/mm/pgtable.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13  Merge branch 'x86/fpu' into x86/core, to merge dependent commits  Ingo Molnar
Prepare to resolve conflicts with an upstream series of fixes that conflict with pending x86 changes: 6f5bf947bab0 Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13  Merge branch 'x86/cpu' into x86/core, to resolve conflicts  Ingo Molnar
Conflicts: arch/x86/kernel/cpu/bugs.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13  Merge branch 'x86/boot' into x86/core, to merge dependent commits  Ingo Molnar
Prepare to resolve conflicts with an upstream series of fixes that conflict with pending x86 changes: 6f5bf947bab0 Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13  Merge branch 'x86/asm' into x86/core, to merge dependent commits  Ingo Molnar
Prepare to resolve conflicts with an upstream series of fixes that conflict with pending x86 changes: 6f5bf947bab0 Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-12  mm: perform VMA allocation, freeing, duplication in mm  Lorenzo Stoakes
Right now these are performed in kernel/fork.c which is odd and a violation of separation of concerns, as well as preventing us from integrating this and related logic into userland VMA testing going forward. There is a fly in the ointment - nommu - mmap.c is not compiled if CONFIG_MMU not set, and neither is vma.c. To square the circle, let's add a new file - vma_init.c. This will be compiled for both CONFIG_MMU and nommu builds, and will also form part of the VMA userland testing. This allows us to de-duplicate code, while maintaining separation of concerns and the ability for us to userland test this logic. Update the VMA userland tests accordingly, additionally adding a detach_free_vma() helper function to correctly detach VMAs before freeing them in test code, as this change was triggering the assert for this. [akpm@linux-foundation.org: remove stray newline, per Liam] Link: https://lkml.kernel.org/r/f97b3a85a6da0196b28070df331b99e22b263be8.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12  mm: abstract initial stack setup to mm subsystem  Lorenzo Stoakes
There are peculiarities within the kernel where what is very clearly mm code is performed elsewhere arbitrarily. This violates separation of concerns and makes it harder to refactor code to make changes to how fundamental initialisation and operation of mm logic is performed. One such case is the creation of the VMA containing the initial stack upon execve()'ing a new process. This is currently performed in __bprm_mm_init() in fs/exec.c. Abstract this operation to create_init_stack_vma(). This allows us to limit use of vma allocation and free code to fork and mm only. We previously did the same for the step at which we relocate the initial stack VMA downwards via relocate_vma_down(), now we move the initial VMA establishment too. Take the opportunity to also move insert_vm_struct() to mm/vma.c as it's no longer needed anywhere outside of mm. Link: https://lkml.kernel.org/r/118c950ef7a8dd19ab20a23a68c3603751acd30e.1745853549.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Kees Cook <kees@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12  mm: establish mm/vma_exec.c for shared exec/mm VMA functionality  Lorenzo Stoakes
Patch series "move all VMA allocation, freeing and duplication logic to mm", v3.

Currently VMA allocation, freeing and duplication exist in kernel/fork.c, which is a violation of separation of concerns, and leaves these functions exposed to the rest of the kernel when they are in fact internal implementation details.

Resolve this by moving this logic to mm, and making it internal to vma.c, vma.h. This also allows us, in future, to provide userland testing around this functionality.

We additionally abstract dup_mmap() to mm, being careful to ensure kernel/fork.c accesses this via the mm internal header so it is not exposed elsewhere in the kernel.

As part of this change, also abstract initial stack allocation performed in __bprm_mm_init() out of fs code into mm via create_init_stack_vma(), as this code uses vm_area_alloc() and vm_area_free().

In order to do so sensibly, we introduce a new mm/vma_exec.c file, which contains the code that is shared by mm and exec. This file is added to both memory mapping and exec sections in MAINTAINERS so both sets of maintainers can maintain oversight.

As part of this change, we also move relocate_vma_down() to mm/vma_exec.c so all shared mm/exec functionality is kept in one place.

We add code shared between nommu and mmu-enabled configurations in order to share VMA allocation, freeing and duplication code correctly while also keeping these functions available in userland VMA testing. This is achieved by adding a mm/vma_init.c file which is also compiled by the userland tests.

This patch (of 4):

There is functionality that overlaps the exec and memory mapping subsystems. While it properly belongs in mm, it is important that exec maintainers maintain oversight of this functionality correctly.

We can achieve both goals by adding a new mm/vma_exec.c file which contains these 'glue' functions, and have fs/exec.c import them.

As a part of this change, to ensure that proper oversight is achieved, add the file to both the MEMORY MAPPING and EXEC & BINFMT API, ELF sections. scripts/get_maintainer.pl can correctly handle files in multiple entries and this neatly handles the cross-over.

[akpm@linux-foundation.org: fix comment typo]
Link: https://lkml.kernel.org/r/80f0d0c6-0b68-47f9-ab78-0ab7f74677fc@lucifer.local
Link: https://lkml.kernel.org/r/cover.1745853549.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/91f2cee8f17d65214a9d83abb7011aa15f1ea690.1745853549.git.lorenzo.stoakes@oracle.com
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: Pedro Falcato <pfalcato@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Kees Cook <kees@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12  mm/selftests: add a test to verify mmap_changing race with -EAGAIN  Peter Xu
Add a unit test to verify the recent mmap_changing ABI breakage. Note that I used some tricks here and there to make the test simple, e.g. I abused UFFDIO_MOVE on top of shmem with the fact that I know what I want to test will be even earlier than the vma type check. Rich comments were added to explain trivial details. Before that fix, -EAGAIN would have been written to the copy field most of the time but not always; the test should be able to reliably trigger the outlier case. After the fix, it is always written; the test verifies this by making sure the corresponding field (e.g. copy.copy for UFFDIO_COPY) is updated. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20250424215729.194656-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-12  memblock: add MEMBLOCK_RSRV_KERN flag  Mike Rapoport (Microsoft)
Patch series "kexec: introduce Kexec HandOver (KHO)", v8. Kexec today considers itself purely a boot loader: When we enter the new kernel, any state the previous kernel left behind is irrelevant and the new kernel reinitializes the system. However, there are use cases where this mode of operation is not what we actually want. In virtualization hosts for example, we want to use kexec to update the host kernel while virtual machine memory stays untouched. When we add device assignment to the mix, we also need to ensure that IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we need to do the same for the PCI subsystem. If we want to kexec while an SEV-SNP enabled virtual machine is running, we need to preserve the VM context pages and physical memory. See "pkernfs: Persisting guest memory and kernel/device state safely across kexec" Linux Plumbers Conference 2023 presentation for details: https://lpc.events/event/17/contributions/1485/ To start us on the journey to support all the use cases above, this patch implements basic infrastructure to allow hand over of kernel state across kexec (Kexec HandOver, aka KHO). As a really simple example target, we use memblock's reserve_mem. With this patchset applied, memory that was reserved using "reserve_mem" command line options remains intact after kexec and it is guaranteed to reside at the same physical address. 
== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page specified via command line

All of the approaches above fundamentally have the same problem: They require the administrator to explicitly carve out a physical memory location because they have no mechanism outside of the kernel command line to pass data (including memory reservations) between kexec'ing kernels.

KHO provides that base foundation. We will determine later whether we still need any of the approaches above for fast bulk memory handover of, for example, IOMMU page tables. But IMHO they would all be users of KHO, with KHO providing the foundational primitive to pass metadata and bulk memory reservations as well as provide easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. How they pass it is architecture specific. The file's format is a Flattened Device Tree (fdt) which has a generator and parser already included in Linux. KHO is enabled in the kernel command line by `kho=on`. When the root user enables KHO through /sys/kernel/debug/kho/out/finalize, the kernel invokes callbacks for every KHO user to register preserved memory regions, which contain drivers' states.

When the actual kexec happens, the fdt is part of the image set that we boot into. In addition, we keep "scratch regions" available for kexec: physically contiguous memory regions that are guaranteed to not have any memory that KHO would preserve. The new kernel bootstraps itself using the scratch regions and sets all handed over memory as in use.
When drivers that support KHO initialize, they introspect the fdt, restore preserved memory regions, and retrieve their states stored in the preserved memory.

== Limitations ==

Currently KHO is only implemented for file based kexec. The kernel interfaces in the patch set are already in place to support user space kexec as well, but it is not yet implemented inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line parameter. KHO will automatically create scratch regions. If you want to set the scratch size explicitly you can use the "kho_scratch=" command line parameter. For instance, "kho_scratch=16M,512M,256M" will reserve a 16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB per NUMA node scratch regions on boot.

Make sure to have a reserved memory range requested with the reserve_mem command line option, for example, "reserve_mem=64m:4k:n1".

Then before you invoke file based "kexec -l", finalize KHO FDT:

  # echo 1 > /sys/kernel/debug/kho/out/finalize

You can preview the generated FDT using `dtc`:

  # dtc /sys/kernel/debug/kho/out/fdt
  # dtc /sys/kernel/debug/kho/out/sub_fdts/memblock

`dtc` is available on Ubuntu via `sudo apt-get install device-tree-compiler`.

Now kexec into the new kernel:

  # kexec -l Image --initrd=initrd -s
  # kexec -e

(The order of KHO finalization and "kexec -l" does not matter.)

The new kernel will boot up and contain the previous kernel's reserve_mem contents at the same physical address as the first kernel.

You can also review the FDT passed from the old kernel:

  # dtc /sys/kernel/debug/kho/in/fdt
  # dtc /sys/kernel/debug/kho/in/sub_fdts/memblock

This patch (of 17):

Add the MEMBLOCK_RSRV_KERN flag to denote areas that were reserved for kernel use either directly with memblock_reserve_kern() or via memblock allocations.
Link: https://lore.kernel.org/lkml/20250424083258.2228122-1-changyuanl@google.com/ Link: https://lore.kernel.org/lkml/aAeaJ2iqkrv_ffhT@kernel.org/ Link: https://lore.kernel.org/lkml/35c58191-f774-40cf-8d66-d1e2aaf11a62@intel.com/ Link: https://lore.kernel.org/lkml/20250424093302.3894961-1-arnd@kernel.org/ Link: https://lkml.kernel.org/r/20250509074635.3187114-1-changyuanl@google.com Link: https://lkml.kernel.org/r/20250509074635.3187114-2-changyuanl@google.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Changyuan Lyu <changyuanl@google.com> Cc: Alexander Graf <graf@amazon.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Gowans <jgowans@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Pratyush Yadav <ptyadav@amazon.de> Cc: Rob Herring <robh@kernel.org> Cc: Saravana Kannan <saravanak@google.com> Cc: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: Will Deacon <will@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
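The "kho_scratch=16M,512M,256M" example from the commit message can be read as three sizes: low memory, global, and per NUMA node. The sketch below shows that arithmetic only. It is not the kernel's parser: the function names and dictionary keys are hypothetical, and only the K/M/G suffixes and the three-size form from the example are assumed to be valid input.

```python
# Binary multipliers for the suffixes handled by this sketch.
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert a string such as '16M' into a byte count."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def parse_kho_scratch(value):
    """Split 'lowmem,global,pernode' into byte counts (hypothetical key names)."""
    lowmem, global_size, pernode = (parse_size(part) for part in value.split(","))
    return {"lowmem": lowmem, "global": global_size, "pernode": pernode}


print(parse_kho_scratch("16M,512M,256M"))
```

Any other form the real parameter may accept is outside the scope of this sketch.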
2025-05-12  selftests/mm: use long for dwRegionSize  Siddarth G
Change the type of 'dwRegionSize' in wp_init() and wp_free() from int to long to match callers that pass long or unsigned long long values. wp_addr_range function is left unchanged because it passes 'dwRegionSize' parameter directly to pagemap_ioctl, which expects an int. This patch does not fix any actual known issues. It aligns parameter types with their actual usage and avoids any potential future issues. Link: https://lkml.kernel.org/r/20250427102639.39978-1-siddarthsgml@gmail.com Signed-off-by: Siddarth G <siddarthsgml@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
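The kind of future issue the type change guards against is silent truncation when a large size is passed through an int parameter. A small sketch using ctypes, which performs no overflow checking on integer types; the 3 GiB region size is a made-up value for illustration and the helper names are hypothetical:

```python
import ctypes


def as_c_int(value):
    """Value as seen by a callee whose parameter is a C int."""
    return ctypes.c_int(value).value


def as_c_long(value):
    """Value as seen by a callee whose parameter is a C long."""
    return ctypes.c_long(value).value


region_size = 3 << 30  # hypothetical 3 GiB region
print(as_c_int(region_size), as_c_long(region_size))
```

On platforms where long is 64 bits wide the second value is preserved, while the int version wraps to a negative number.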
2025-05-12  selftests/bpf: introduce tests for dynptr copy kfuncs  Mykyta Yatsenko
Introduce selftests verifying the newly added dynptr copy kfuncs, covering dynptrs backed by contiguous and non-contiguous memory. Disable test_probe_read_user_str_dynptr, which triggers a bug in strncpy_from_user_nofault; a patch to fix the issue is at [1]. [1] https://patchwork.kernel.org/project/linux-mm/patch/20250422131449.57177-1-mykyta.yatsenko5@gmail.com/ Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20250512205348.191079-4-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12  selftests: mptcp: remove rp_filter configuration  Hangbin Liu
Remove the rp_filter configuration from MPTCP tests, as it is now handled by setup_ns. Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250508081910.84216-7-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12  selftests: netfilter: remove rp_filter configuration  Hangbin Liu
Remove the rp_filter configuration in netfilter lib, as setup_ns already sets it appropriately by default. Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20250508081910.84216-6-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12  selftests: net: use setup_ns for SRv6 tests and remove rp_filter configuration  Hangbin Liu
Some SRv6 tests manually set up network namespaces and disable rp_filter. Since the setup_ns library function already handles rp_filter configuration, convert these SRv6 tests to use setup_ns and remove the redundant rp_filter settings. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Andrea Mayer <andrea.mayer@uniroma2.it> Link: https://patch.msgid.link/20250508081910.84216-5-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12  selftests: net: use setup_ns for bareudp testing  Hangbin Liu
Switch bareudp testing to use setup_ns, which sets up rp_filter by default. This allows us to remove the manual rp_filter configuration from the script. Additionally, since setup_ns handles namespace naming and cleanup, we no longer need a separate cleanup function. We also move the trap setup earlier in the script, before the test setup begins. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250508081910.84216-4-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12  selftests: net: remove redundant rp_filter configuration  Hangbin Liu
The following tests use setup_ns to create a network namespace, which disables rp_filter immediately after namespace creation. Therefore, it is no longer necessary to disable rp_filter again within these individual tests. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250508081910.84216-3-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12  selftests: net: disable rp_filter after namespace initialization  Hangbin Liu
Some distributions enable rp_filter globally by default. To ensure consistent behavior across environments, we explicitly disable it in several test cases. This patch moves the rp_filter disabling logic to immediately after the network namespace is initialized. With this change, individual test cases that create namespaces via setup_ns no longer need to disable rp_filter again. This helps avoid redundancy and ensures test consistency. Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250508081910.84216-2-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-12selftests: drv-net: ping: make sure the ping test restores checksum offloadJakub Kicinski
The ping test flips checksum offload on and off. Make sure the original value is restored if test fails. Reviewed-by: David Wei <dw@davidwei.uk> Link: https://patch.msgid.link/20250508214005.1518013-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
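The restore-on-failure pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the `features` dict stands in for a device's offload settings, and `run_ping_with_csum_toggle` and the `ping` callable are hypothetical names, not the helpers used by the actual selftest.

```python
def run_ping_with_csum_toggle(features, ping):
    """Flip checksum offload off and on around a ping, always restoring it.

    `features` is a plain dict standing in for device settings (assumption);
    `ping` is any callable that raises on failure (assumption).
    """
    original = features["tx-checksumming"]
    try:
        for state in (False, True):
            features["tx-checksumming"] = state
            ping()
    finally:
        # Runs on success and on failure, so the device is left as found.
        features["tx-checksumming"] = original
```

The `finally` clause is what guarantees the original value survives a failing ping.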
2025-05-12libbpf: Use proper errno value in nlattrAnton Protopopov
Return value of the validate_nla() function can be propagated all the way up to users of libbpf API. In case of error this libbpf version of validate_nla returns -1 which will be seen as -EPERM from user's point of view. Instead, return a more reasonable -EINVAL. Fixes: bbf48c18ee0c ("libbpf: add error reporting in XDP") Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250510182011.2246631-1-a.s.protopopov@gmail.com
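Why a bare -1 reads as -EPERM can be checked with a short sketch. It assumes the usual convention that a negative return value is a negated errno, and that EPERM has the value 1 as it does on Linux; `describe` is a hypothetical helper, not part of libbpf.

```python
import errno

def describe(ret):
    """Map a negative return value to its errno name, treating it as -errno."""
    return errno.errorcode.get(-ret, "UNKNOWN")

# A literal -1 is indistinguishable from -EPERM for the caller.
old = describe(-1)
# Returning -EINVAL names the real problem: invalid input.
new = describe(-errno.EINVAL)
print(old, new)
```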
2025-05-12selftests/bpf: Allow skipping docs compilationMykyta Yatsenko
Currently rst2man is required to build bpf selftests, as the tool is used by Makefile.docs. rst2man may be missing in some build environments and is not essential for selftests. It makes sense to allow user to skip building docs. This patch adds SKIP_DOCS variable into bpf selftests Makefile that when set to 1 allows skipping building docs, for example: make -C tools/testing/selftests TARGETS=bpf SKIP_DOCS=1 Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250510002450.365613-1-mykyta.yatsenko5@gmail.com
2025-05-12perf tests: Harden branch stack sampling testIan Rogers
On continuous testing the perf script output can be empty, or nearly empty, causing tr/grep to exit and due to "set -e" the test traps and fails. Add some empty file handling that sets the test to skip and make grep and other text rewriting failures non-fatal by adding "|| true". Committer testing: root@number:~# grep -m1 "model name" /proc/cpuinfo model name : AMD Ryzen 9 9950X3D 16-Core Processor root@number:~# perf test "Check branch stack sampling" 104: Check branch stack sampling : Ok root@number:~# root@number:~# perf test -vvvvvvv "Check branch stack sampling" 104: Check branch stack sampling: --- start --- test child forked, pid 396047 142d22-142da0 l brstack_bench perf does have symbol 'brstack_bench' Testing user branch stack sampling Testing branch stack filtering permutation (any_call,CALL|IND_CALL|COND_CALL|SYSCALL|IRQ) Testing branch stack filtering permutation (call,CALL|SYSCALL) Testing branch stack filtering permutation (cond,COND) Testing branch stack filtering permutation (any_ret,RET|COND_RET|SYSRET|ERET) Testing branch stack filtering permutation (call,cond,CALL|SYSCALL|COND) Testing branch stack filtering permutation (any_call,cond,CALL|IND_CALL|COND_CALL|IRQ|SYSCALL|COND) Testing branch stack filtering permutation (cond,any_call,any_ret,COND|CALL|IND_CALL|COND_CALL|SYSCALL|IRQ|RET|COND_RET|SYSRET|ERET) ---- end(0) ---- 104: Check branch stack sampling : Ok root@number:~# Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: German Gomez <german.gomez@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: 
https://lore.kernel.org/r/20250318161639.34446-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12selftests/bpf: test_verifier verbose log overflowsGregory Bell
Tests:
 - 458/p ld_dw: xor semi-random 64-bit imms, test 5
 - 501/p scale: scale test 1
 - 502/p scale: scale test 2
fail in verbose mode due to bpf_vlog[] overflowing. These tests generate large verifier logs that exceed the current buffer size, causing them to fail to load. Increase the size of the bpf_vlog[] buffer to accommodate larger logs and prevent false failures during test runs with verbose output. Signed-off-by: Gregory Bell <grbell@redhat.com> Link: https://lore.kernel.org/r/e49267100f07f099a5877a3a5fc797b702bbaf0c.1747058195.git.grbell@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-12selftests/bpf: test_verifier verbose causes erroneous failuresGregory Bell
When running test_verifier with the -v flag and a test with `expected_ret==VERBOSE_ACCEPT`, the opts.log_level is unintentionally overwritten because the verbose flag takes precedence. This leads to a mismatch in the expected and actual contents of bpf_vlog, causing tests to fail incorrectly. Reorder the conditional logic that sets opts.log_level to preserve the expected log level and prevent it from being overridden by -v. Signed-off-by: Gregory Bell <grbell@redhat.com> Link: https://lore.kernel.org/r/182bf00474f817c99f968a9edb119882f62be0f8.1747058195.git.grbell@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
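The ordering problem can be illustrated with a small sketch. The constants and the function name below are hypothetical stand-ins, not the values or identifiers used in test_verifier.c; the only point is that the check for a test that expects a specific log must come before the check for -v.

```python
# Hypothetical log levels, for illustration only.
DEFAULT_LOG_LEVEL = 0
EXPECTED_LOG_LEVEL = 1   # level a VERBOSE_ACCEPT-style test compares against
VERBOSE_FLAG_LOG_LEVEL = 2  # level selected by -v

def pick_log_level(expects_verbose_log, verbose_flag):
    """Choose the log level, giving the test's expectation priority over -v."""
    if expects_verbose_log:
        return EXPECTED_LOG_LEVEL
    if verbose_flag:
        return VERBOSE_FLAG_LOG_LEVEL
    return DEFAULT_LOG_LEVEL
```

With the branches in the opposite order, a test that expects a specific log would get the -v level whenever -v is passed, and its log comparison would no longer match.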
2025-05-12perf parse-events: Add "cpu" term to set the CPU an event is recorded onIan Rogers
The -C option allows the CPUs for a list of events to be specified, but it's not possible to set the CPU for a single event. Add a term to allow this. The term isn't a general CPU list because ',' is already a special character in event parsing; instead, multiple cpu= terms may be provided and they will be merged/unioned together. An example of mixing different types of events counted on different CPUs: ``` $ perf stat -A -C 0,4-5,8 -e "instructions/cpu=0/,l1d-misses/cpu=4,cpu=5/,inst_retired.any/cpu=8/,cycles" -a sleep 0.1 Performance counter stats for 'system wide': CPU0 6,979,225 instructions/cpu=0/ # 0.89 insn per cycle CPU4 75,138 cpu/l1d-misses/ CPU5 1,418,939 cpu/l1d-misses/ CPU8 797,553 cpu/inst_retired.any,cpu=8/ CPU0 7,845,302 cycles CPU4 6,546,859 cycles CPU5 185,915,438 cycles CPU8 2,065,668 cycles 0.112449242 seconds time elapsed ``` Committer testing: root@number:~# grep -m1 "model name" /proc/cpuinfo model name : AMD Ryzen 9 9950X3D 16-Core Processor root@number:~# perf stat -A -e "instructions/cpu=0/,instructions,l1d-misses/cpu=4,cpu=5/,cycles" -a sleep 0.1 Performance counter stats for 'system wide': CPU0 2,398,351 instructions/cpu=0/ # 0.44 insn per cycle CPU0 2,398,152 instructions # 0.44 insn per cycle CPU1 1,265,634 instructions # 0.49 insn per cycle CPU2 606,087 instructions # 0.50 insn per cycle CPU3 4,025,752 instructions # 0.52 insn per cycle CPU4 4,236,810 instructions # 0.53 insn per cycle CPU5 3,984,832 instructions # 0.66 insn per cycle CPU6 434,132 instructions # 0.44 insn per cycle CPU7 65,752 instructions # 0.41 insn per cycle CPU8 459,083 instructions # 0.48 insn per cycle CPU9 6,464,161 instructions # 1.31 insn per cycle <SNIP> root@number:~# perf stat -e "instructions/cpu=0/,instructions,l1d-misses/cpu=4,cpu=5/,cycles" -a sleep 0.
Performance counter stats for 'system wide': 144,822 instructions/cpu=0/ # 0.03 insn per cycle 4,666,114 instructions # 0.93 insn per cycle 2,583 l1d-misses 4,993,633 cycles 0.000868512 seconds time elapsed root@number:~# Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Weilin Wang <weilin.wang@intel.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20250403194337.40202-5-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
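The merge/union behaviour of repeated cpu= terms can be sketched as below. This is not perf's parser: `merge_cpu_terms` is a hypothetical function that takes only the text between the slashes of an event and handles only integer cpu= values.

```python
def merge_cpu_terms(terms):
    """Union every cpu=N term in a comma-separated term list.

    Other terms are ignored; only integer CPU numbers are handled (assumption).
    """
    cpus = set()
    for term in terms.split(","):
        key, sep, value = term.strip().partition("=")
        if key == "cpu" and sep:
            cpus.add(int(value))
    return sorted(cpus)

# The l1d-misses event from the example above carries two cpu= terms.
print(merge_cpu_terms("cpu=4,cpu=5"))
```

Because ',' already separates terms, a list such as cpu=4,5 could not be told apart from two separate terms, which is why each CPU gets its own cpu= term.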
2025-05-12perf parse-events: Set is_pmu_core for legacy hardware eventsIan Rogers
Also set the CPU map to all online CPU maps. This is done so the behavior of legacy hardware and hardware cache events better matches that of sysfs and JSON events during __perf_evlist__propagate_maps(). Fix missing cpumap put in "Synthesize attr update" test. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Weilin Wang <weilin.wang@intel.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20250403194337.40202-4-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12perf stat: Use counter cpumask to skip zero valuesIan Rogers
When a counter is 0 it may or may not be skipped. For uncore counters it is common they are only valid on 1 logical CPU and all other CPUs should be skipped. The PMU's cpumask was used for the skip calculation, but that cpumask may not reflect user overrides. Similarly a counter on a core PMU may explicitly not request a CPU be gathered. If the counter on this CPU's value is 0 then the counter should be skipped as it wasn't requested. Switch from using the PMU cpumask to that associated with the evsel to support these cases. Avoid potential crash with --per-thread mode where config->aggr_get_id is NULL. Add some examples for the tool event 0 counter skipping. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Weilin Wang <weilin.wang@intel.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20250403194337.40202-3-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
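A simplified sketch of the skip decision, for the per-CPU (no aggregation) case only. The function and parameter names are hypothetical, and the real code works on aggregation ids rather than plain CPU numbers.

```python
def should_skip_zero(value, cpu, evsel_cpus):
    """Skip a zero count only when this CPU was not requested for the event.

    `evsel_cpus` stands in for the event's own CPU mask (assumption),
    which reflects user overrides, unlike the PMU-wide mask.
    """
    return value == 0 and cpu not in evsel_cpus
```

A zero on a requested CPU is still printed, since it is a real measurement rather than an unrequested slot.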
2025-05-12libperf cpumap: Add ability to create CPU from a single CPU numberIan Rogers
Add perf_cpu_map__new_int() so that a CPU map can be created from a single integer. Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Weilin Wang <weilin.wang@intel.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Link: https://lore.kernel.org/r/20250403194337.40202-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12perf tests metrics: Permission related fixesIan Rogers
When permissions are limited running sleep without system wide isn't a good benchmark to run to achieve samples, switch to running noploop. Remove indent for non-success cases. Allow skip for the not counted case. Minor debug changes. Signed-off-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Weilin Wang <weilin.wang@intel.com> Link: https://lore.kernel.org/r/20250412004704.2297939-2-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12perf evsel: Add per-thread warning for EOPNOTSUPP open failuesIan Rogers
The mrvl_ddr_pmu will return EOPNOTSUPP if opened in per-thread mode. Give a warning for this similar to EINVAL. Doing this better supports metric testing with limited permissions when the mrvl_ddr_pmu is present, as the failure to open causes the test to skip and not fail. Signed-off-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Weilin Wang <weilin.wang@intel.com> Link: https://lore.kernel.org/r/20250412004704.2297939-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12perf scripts python: exported-sql-viewer.py: Fix pattern matching with Python 3Adrian Hunter
The script allows the user to enter patterns to find symbols. The pattern matching characters are converted for use in SQL. For PostgreSQL the conversion involves using the Python maketrans() method which is slightly different in Python 3 compared with Python 2. Fix to work in Python 3. Fixes: beda0e725e5f06ac ("perf script python: Add Python3 support to exported-sql-viewer.py") Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Tony Jones <tonyj@suse.de> Link: https://lore.kernel.org/r/20250512093932.79854-4-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
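A minimal sketch of the Python 3 form of the conversion, assuming the usual mapping of the glob characters * and ? to the SQL LIKE wildcards % and _. The function name is hypothetical and this is not the script's actual code. In Python 2 the translation table came from string.maketrans(), whereas in Python 3 it is the static method str.maketrans().

```python
def glob_to_like(pattern):
    """Convert a glob-style pattern to an SQL LIKE pattern (Python 3)."""
    # Escape characters that are already special in LIKE patterns.
    escaped = pattern.replace("%", "\\%").replace("_", "\\_")
    # str.maketrans builds the table that str.translate expects in Python 3.
    return escaped.translate(str.maketrans("*?", "%_"))
```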
2025-05-12perf intel-pt: Do not default to recording all switch eventsAdrian Hunter
On systems with many CPUs, recording extra context switch events can be excessive and unnecessary. Add perf config intel-pt.all-switch-events=false to control the behaviour. Example: # perf config intel-pt.all-switch-events=false # perf record -eintel_pt//u uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.082 MB perf.data ] # perf script -D | grep PERF_RECORD_SWITCH | awk '{print $5}' | uniq -c 5 PERF_RECORD_SWITCH # perf config intel-pt.all-switch-events=true # perf record -eintel_pt//u uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.102 MB perf.data ] # perf script -D | grep PERF_RECORD_SWITCH | awk '{print $5}' | uniq -c 180 PERF_RECORD_SWITCH_CPU_WIDE Committer testing: While doing a make -j28 allmodconfig: root@five:~# grep "model name" -m1 /proc/cpuinfo model name : Intel(R) Core(TM) i7-14700K root@five:~# root@five:~# perf config intel-pt.all-switch-events=false root@five:~# perf record -e intel_pt//u uname Linux [ perf record: Woken up 2 times to write data ] [ perf record: Captured and wrote 0.019 MB perf.data ] root@five:~# perf report --stats | grep SWITCH_CPU_WIDE root@five:~# root@five:~# perf config intel-pt.all-switch-events=true root@five:~# perf record -e intel_pt//u uname Linux [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.047 MB perf.data ] root@five:~# perf report --stats | grep SWITCH_CPU_WIDE SWITCH_CPU_WIDE events: 542 (96.4%) root@five:~# Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250512093932.79854-3-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12perf intel-pt: Fix PEBS-via-PT data_srcAdrian Hunter
The Fixes commit did not add support for decoding PEBS-via-PT data_src. Fix by adding support. PEBS-via-PT is a feature of some E-core processors, starting with processors based on Tremont microarchitecture. Because the kernel only supports Intel PT features that are on all processors, there is no support for PEBS-via-PT on hybrids. Currently that leaves processors based on Tremont, Gracemont and Crestmont, however there are no events on Tremont that produce data_src information, and for Gracemont and Crestmont there are only: mem-loads event=0xd0,umask=0x5,ldlat=3 mem-stores event=0xd0,umask=0x6 Affected processors include Alder Lake N (Gracemont), Sierra Forest (Crestmont) and Grand Ridge (Crestmont). Example: # perf record -d -e intel_pt/branch=0/ -e mem-loads/aux-output/pp uname Before: # perf.before script --itrace=o -Fdata_src 0 |OP No|LVL N/A|SNP N/A|TLB N/A|LCK No|BLK N/A 0 |OP No|LVL N/A|SNP N/A|TLB N/A|LCK No|BLK N/A After: # perf script --itrace=o -Fdata_src 10268100142 |OP LOAD|LVL L1 hit|SNP None|TLB L1 or L2 hit|LCK No|BLK N/A 10450100442 |OP LOAD|LVL L2 hit|SNP None|TLB L2 miss|LCK No|BLK N/A Fixes: 975846eddf907297 ("perf intel-pt: Add memory information to synthesized PEBS sample") Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250512093932.79854-2-adrian.hunter@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2025-05-12ACPICA: Update copyright yearSaket Dumbre
ACPICA commit 45253be18b3f37d46cd0072aa3f8a0a21a70e0a4 Changes needed by acpisrc to update copyright year when building for release. Link: https://github.com/acpica/acpica/commit/45253be1 Signed-off-by: Saket Dumbre <saket.dumbre@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-12ACPICA: Apply ACPI_NONSTRING in more placesAhmed Salem
ACPICA commit 1035a3d453f7dd49a235a59ee84ebda9d2d2f41b Add ACPI_NONSTRING for destination char arrays without a terminating NUL character. This is a follow-up to commit 35ad99236f3a ("ACPICA: Apply ACPI_NONSTRING") where not all instances received the same treatment, in preparation for replacing strncpy() calls with memcpy() Link: https://github.com/acpica/acpica/commit/1035a3d4 Signed-off-by: Ahmed Salem <x0rw3ll@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://patch.msgid.link/3833065.MHq7AAxBmi@rjwysocki.net
2025-05-12selftests/fs/mount-notify: add a test variant running inside usernsAmir Goldstein
unshare userns in addition to mntns and verify that:
1. watching tmpfs mounted inside userns is allowed with any mark type
2. watching orig root with filesystem mark type is not allowed
3. watching mntns of orig userns is not allowed
4. watching mntns in userns where fanotify_init was called is allowed
mount events are only tested with the last case of mntns mark. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250509133240.529330-9-amir73il@gmail.com Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-12selftests/filesystems: create setup_userns() helperAmir Goldstein
Add helper to utils.c and use it in statmount userns tests. Reviewed-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250509133240.529330-8-amir73il@gmail.com Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-12selftests/filesystems: create get_unique_mnt_id() helperAmir Goldstein
Add helper to utils.c and use it in mount-notify and statmount tests. Linking with utils.c drags in a dependency on libcap, so add it to the Makefile of the tests. Reviewed-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250509133240.529330-7-amir73il@gmail.com Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-12selftests/fs/mount-notify: build with tools include dirAmir Goldstein
Copy the fanotify uapi header files to the tools include dir and define __kernel_fsid_t to decouple dependency with headers_install and then remove the redundant re-definitions of fanotify macros. Reviewed-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250509133240.529330-6-amir73il@gmail.com Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-05-12selftests/mount_settattr: remove duplicate syscall definitionsAmir Goldstein
Which are already defined in wrappers.h. For now, the syscall definitions of mount_setattr() itself remain in the test, which is the only test to use them. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/20250509133240.529330-5-amir73il@gmail.com Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>