Age | Commit message (Collapse) | Author |
|
Pre-population has been requested several times to mitigate KVM page faults
during guest boot or after live migration. It is also required by TDX
before filling in the initial guest memory with measured contents.
Introduce it as a generic API.
|
|
Wire KVM_PRE_FAULT_MEMORY ioctl to kvm_mmu_do_page_fault() to populate guest
memory. It can be called right after KVM_CREATE_VCPU creates a vCPU,
since at that point kvm_mmu_create() and kvm_init_mmu() are called and
the vCPU is ready to invoke the KVM page fault handler.
The helper function kvm_tdp_map_page() takes care of the logic to
process RET_PF_* return values and convert them to success or errno.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The guest memory population logic will need to know what page size or level
(4K, 2M, ...) is mapped.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <eabc3f3e5eb03b370cadf6e1901ea34d7a020adc.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move the accounting of the result of kvm_mmu_do_page_fault() to its
callers, as only pf_fixed is common to guest page faults and async #PFs,
and upcoming support KVM_PRE_FAULT_MEMORY won't bump _any_ stats.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Account stat.pf_taken in kvm_mmu_page_fault(), i.e. the actual page fault
handler, instead of conditionally bumping it in kvm_mmu_do_page_fault().
The "real" page fault handler is the only path that should ever increment
the number of taken page faults, as all other paths that "do page fault"
are by definition not handling faults that occurred in the guest.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The xen_nopvspin boot parameter is deprecated since 2019. nopvspin
can be used instead.
Remove the xen_nopvspin boot parameter and replace the xen_pvspin
variable use cases with nopvspin.
This requires to move the nopvspin variable out of the .initdata
section, as it needs to be accessed for cpuhotplug, too.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Message-ID: <20240710110139.22300-1-jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Under arch/x86/xen there is one large private header file xen-ops.h
containing most of the Xen-private x86 related declarations, and then
there are several small headers with a handful of declarations each.
Merge the small headers into xen-ops.h.
While doing that, move the declaration of xen_fifo_events from
xen-ops.h into include/xen/events.h where it should have been from the
beginning.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Message-ID: <20240710093718.14552-3-jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Some functions and variables in arch/x86/xen are used locally only,
make them static.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Message-ID: <20240710093718.14552-2-jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
Today Xen multicall debugging needs to be enabled via modifying a
define in a source file for getting debug data of multicall errors
encountered by users.
Switch multicall debugging to depend on a boot parameter "xen_mc_debug"
instead, enabling affected users to boot with the new parameter set in
order to get better diagnostics.
With debugging enabled print the following information in case at least
one of the batched calls failed:
- all calls of the batch with operation, result and caller
- all parameters of each call
- all parameters stored in the multicall data for each call
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Message-ID: <20240710092749.13595-1-jgross@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
|
|
The RMP table is probed early in the boot process before max_pfn has been
set, so the logic to check if the RMP covers all of system memory is not
valid.
Move the RMP memory coverage check from snp_probe_rmptable_info() into
snp_rmptable_init(), which is well after max_pfn has been set. Also, fix
the calculation to use PFN_UP instead of PHYS_PFN, in order to compute
the required RMP size properly.
Fixes: 216d106c7ff7 ("x86/sev: Add SEV-SNP host initialization support")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/bec4364c7e34358cc576f01bb197a7796a109169.1718984524.git.thomas.lendacky@amd.com
|
|
A long time ago it was agreed upon that the coco stuff needs to go where
it belongs:
https://lore.kernel.org/all/Yg5nh1RknPRwIrb8@zn.tnic
and not keep it in arch/x86/kernel. TDX did that and SEV can't find time
to do so. So lemme do it. If people have trouble converting their
ongoing featuritis patches, ask me for a sed script.
No functional changes.
Move the instrumentation exclusion bits too, as helpfully caught and
reported by the 0day folks.
Closes: https://lore.kernel.org/oe-kbuild-all/202406220748.hG3qlmDx-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202407091342.46d7dbb-oliver.sang@intel.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>
Reviewed-by: Ashish Kalra <ashish.kalra@amd.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/r/20240619093014.17962-1-bp@kernel.org
|
|
When clone3() was introduced, it was not obvious how each architecture
deals with setting up the stack and keeping the register contents in
a fork()-like system call, so this was left for the architecture
maintainers to implement, with __ARCH_WANT_SYS_CLONE3 defined by those
that already implement it.
Five years later, we still have a few architectures left that are missing
clone3(), and the macro keeps getting in the way as it's fundamentally
different from all the other __ARCH_WANT_SYS_* macros that are meant
to provide backwards-compatibility with applications using older
syscalls that are no longer provided by default.
Address this by reversing the polarity of the macro, adding an
__ARCH_BROKEN_SYS_CLONE3 macro to all architectures that don't
already provide the syscall, and remove __ARCH_WANT_SYS_CLONE3
from all the other ones.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2024-07-08
The following pull-request contains BPF updates for your *net-next* tree.
We've added 102 non-merge commits during the last 28 day(s) which contain
a total of 127 files changed, 4606 insertions(+), 980 deletions(-).
The main changes are:
1) Support resilient split BTF which cuts down on duplication and makes BTF
as compact as possible wrt BTF from modules, from Alan Maguire & Eduard Zingerman.
2) Add support for dumping kfunc prototypes from BTF which enables both detecting
as well as dumping compilable prototypes for kfuncs, from Daniel Xu.
3) Batch of s390x BPF JIT improvements to add support for BPF arena and to implement
support for BPF exceptions, from Ilya Leoshkevich.
4) Batch of riscv64 BPF JIT improvements in particular to add 12-argument support
for BPF trampolines and to utilize bpf_prog_pack for the latter, from Pu Lehui.
5) Extend BPF test infrastructure to add a CHECKSUM_COMPLETE validation option
for skbs and add coverage along with it, from Vadim Fedorenko.
6) Inline bpf_get_current_task/_btf() helpers in the arm64 BPF JIT which gives
a small 1% performance improvement in micro-benchmarks, from Puranjay Mohan.
7) Extend the BPF verifier to track the delta between linked registers in order
to better deal with recent LLVM code optimizations, from Alexei Starovoitov.
8) Fix bpf_wq_set_callback_impl() kfunc signature where the third argument should
have been a pointer to the map value, from Benjamin Tissoires.
9) Extend BPF selftests to add regular expression support for test output matching
and adjust some of the selftest when compiled under gcc, from Cupertino Miranda.
10) Simplify task_file_seq_get_next() and remove an unnecessary loop which always
iterates exactly once anyway, from Dan Carpenter.
11) Add the capability to offload the netfilter flowtable in XDP layer through
kfuncs, from Florian Westphal & Lorenzo Bianconi.
12) Various cleanups in networking helpers in BPF selftests to shave off a few
lines of open-coded functions on client/server handling, from Geliang Tang.
13) Properly propagate prog->aux->tail_call_reachable out of BPF verifier, so
that x86 JIT does not need to implement detection, from Leon Hwang.
14) Fix BPF verifier to add a missing check_func_arg_reg_off() to prevent an
out-of-bounds memory access for dynpointers, from Matt Bobrowski.
15) Fix bpf_session_cookie() kfunc to return __u64 instead of long pointer as
it might lead to problems on 32-bit archs, from Jiri Olsa.
16) Enhance traffic validation and dynamic batch size support in xsk selftests,
from Tushar Vyavahare.
bpf-next-for-netdev
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (102 commits)
selftests/bpf: DENYLIST.aarch64: Remove fexit_sleep
selftests/bpf: amend for wrong bpf_wq_set_callback_impl signature
bpf: helpers: fix bpf_wq_set_callback_impl signature
libbpf: Add NULL checks to bpf_object__{prev_map,next_map}
selftests/bpf: Remove exceptions tests from DENYLIST.s390x
s390/bpf: Implement exceptions
s390/bpf: Change seen_reg to a mask
bpf: Remove unnecessary loop in task_file_seq_get_next()
riscv, bpf: Optimize stack usage of trampoline
bpf, devmap: Add .map_alloc_check
selftests/bpf: Remove arena tests from DENYLIST.s390x
selftests/bpf: Add UAF tests for arena atomics
selftests/bpf: Introduce __arena_global
s390/bpf: Support arena atomics
s390/bpf: Enable arena
s390/bpf: Support address space cast instruction
s390/bpf: Support BPF_PROBE_MEM32
s390/bpf: Land on the next JITed instruction after exception
s390/bpf: Introduce pre- and post- probe functions
s390/bpf: Get rid of get_probe_mem_regno()
...
====================
Link: https://patch.msgid.link/20240708221438.10974-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Currently, the Sapphire Rapids and Granite Rapids share the same PMU
name, sapphire_rapids. Because from the kernel’s perspective, GNR is
similar to SPR. The only key difference is that they support different
extra MSRs. The code path and the PMU name are shared.
However, from end users' perspective, they are quite different. Besides
the extra MSRs, GNR has a newer PEBS format, supports Retire Latency,
supports new CPUID enumeration architecture, doesn't required the
load-latency AUX event, has additional TMA Level 1 Architectural Events,
etc. The differences can be enumerated by CPUID or the PERF_CAPABILITIES
MSR. They weren't reflected in the model-specific kernel setup.
But it is worth to have a distinct PMU name for GNR.
Fixes: a6742cb90b56 ("perf/x86/intel: Fix the FRONTEND encoding on GNR and MTL")
Suggested-by: Ahmad Yasin <ahmad.yasin@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20240708193336.1192217-3-kan.liang@linux.intel.com
|
|
A non-0 retire latency can be observed on a Raptorlake which doesn't
support the retire latency feature.
By design, the retire latency shares the PERF_SAMPLE_WEIGHT_STRUCT
sample type with other types of latency. That could avoid adding too
many different sample types to support all kinds of latency. For the
machine which doesn't support some kind of latency, 0 should be
returned.
Perf doesn’t clear/init all the fields of a sample data for the sake
of performance. It expects the later perf_{prepare,output}_sample() to
update the uninitialized field. However, the current implementation
doesn't touch the field of the retire latency if the feature is not
supported. The memory garbage is dumped into the perf data.
Clear the retire latency if the feature is not supported.
Fixes: c87a31093c70 ("perf/x86: Support Retire Latency")
Reported-by: "Bayduraev, Alexey V" <alexey.v.bayduraev@intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Bayduraev, Alexey V" <alexey.v.bayduraev@intel.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20240708193336.1192217-4-kan.liang@linux.intel.com
|
|
The below error is observed on Ice Lake VM.
$ perf stat
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument)
for event (slots).
/bin/dmesg | grep -i perf may provide additional information.
In a virtualization env, the Topdown metrics and the slots event haven't
been supported yet. The guest CPUID doesn't enumerate them. However, the
current kernel unconditionally exposes the slots event and the Topdown
metrics events to sysfs, which misleads the perf tool and triggers the
error.
Hide the perf-metrics topdown events and the slots event if the
perf-metrics feature is not enumerated.
The big core of a hybrid platform can also supports the perf-metrics
feature. Fix the hybrid platform as well.
Closes: https://lore.kernel.org/lkml/CAM9d7cj8z+ryyzUHR+P1Dcpot2jjW+Qcc4CPQpfafTXN=LEU0Q@mail.gmail.com/
Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lkml.kernel.org/r/20240708193336.1192217-2-kan.liang@linux.intel.com
|
|
The perf stat errors out with UNC_CHA_TOR_INSERTS.IA_HIT_CXL_ACC_LOCAL
event.
$perf stat -e uncore_cha_55/event=0x35,umask=0x10c0008101/ -a -- ls
event syntax error: '..0x35,umask=0x10c0008101/'
\___ Bad event or PMU
The definition of the CHA umask is config:8-15,32-55, which is 32bit.
However, the umask of the event is bigger than 32bit.
This is an error in the original uncore spec.
Add a new umask_ext5 for the new CHA umask range.
Fixes: 949b11381f81 ("perf/x86/intel/uncore: Add Sapphire Rapids server CHA support")
Closes: https://lore.kernel.org/linux-perf-users/alpine.LRH.2.20.2401300733310.11354@Diego/
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20240708185524.1185505-1-kan.liang@linux.intel.com
|
|
The smbios.c source file is not currently included in the x86 build, and
before we can do so, it needs some tweaks to build correctly in
combination with the EFI mixed mode support.
Reviewed-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
The exynos-next pull is based on a newer -rc than drm-next. hence
backmerge first to make sure the unrelated conflicts we accumulated
don't end up randomly in the exynos merge pull, but are separated out.
Conflicts are all benign: Adjacent changes in amdgpu and fbdev-dma
code, and cherry-pick conflict in xe.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
|
For uncore PMUs, a single context is shared across all CPUs in a domain.
The domain can be a CCX, like in the case of the L3 PMU, or a socket,
like in the case of DF and UMC PMUs. This information is available via
the PMU's cpumask.
For contexts shared across a socket, the domain is currently determined
from topology_die_id() which is incorrect after the introduction of
commit 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD
0x80000026 leaf") as it now returns a CCX identifier on Zen 4 and later
systems which support CPUID leaf 0x80000026.
Use topology_logical_package_id() instead as it always returns a socket
identifier irrespective of the availability of CPUID leaf 0x80000026.
Fixes: 63edbaa48a57 ("x86/cpu/topology: Add support for the AMD 0x80000026 leaf")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240626074942.1044818-1-sandipan.das@amd.com
|
|
X86_FEATURE_PERFCTR_NB and X86_FEATURE_PERFCTR_LLC are derived from
CPUID leaf 0x80000001 ECX bits 24 and 28 respectively and denote the
availability of DF and L3 counters. When these bits are not set, the
corresponding PMUs have no counters and hence, should not be registered.
Fixes: 07888daa056e ("perf/x86/amd/uncore: Move discovery and registration")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240626074404.1044230-1-sandipan.das@amd.com
|
|
The architectural performance monitoring V6 supports a new range of
counters' MSRs in the 19xxH address range. They include all the GP
counter MSRs, the GP control MSRs, and the fixed counter MSRs.
The step between each sibling counter is 4. Add intel_pmu_addr_offset()
to calculate the correct offset.
Add fixedctr in struct x86_pmu to store the address of the fixed counter
0. It can be used to calculate the rest of the fixed counters.
The MSR address of the fixed counter control is not changed.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-9-kan.liang@linux.intel.com
|
|
Two new fields (the unit mask2, and the equal flag) are added in the
IA32_PERFEVTSELx MSRs. They can be enumerated by the CPUID.23H.0.EBX.
Update the config_mask in x86_pmu and x86_hybrid_pmu for the true layout
of the PERFEVTSEL.
Expose the new formats into sysfs if they are available. The umask
extension reuses the same format attr name "umask" as the previous
umask. Add umask2_show to determine/display the correct format
for the current machine.
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-8-kan.liang@linux.intel.com
|
|
Different vendors may support different fields in EVENTSEL MSR, such as
Intel would introduce new fields umask2 and eq bits in EVENTSEL MSR
since Perfmon version 6. However, a fixed mask X86_RAW_EVENT_MASK is
used to filter the attr.config.
Introduce a new config_mask to record the real supported EVENTSEL
bitmask.
Only apply it to the existing code now. No functional change.
Co-developed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-7-kan.liang@linux.intel.com
|
|
A new PEBS data source format is introduced for the p-core of Lunar
Lake. The data source field is extended to 8 bits with new encodings.
A new layout is introduced into the union intel_x86_pebs_dse.
Introduce the lnl_latency_data() to parse the new format.
Enlarge the pebs_data_source[] accordingly to include new encodings.
Only the mem load and the mem store events can generate the data source.
Introduce INTEL_HYBRID_LDLAT_CONSTRAINT and
INTEL_HYBRID_STLAT_CONSTRAINT to mark them.
Add two new bits for the new cache-related data src, L2_MHB and MSC.
The L2_MHB is short for L2 Miss Handling Buffer, which is similar to
LFB (Line Fill Buffer), but to track the L2 Cache misses.
The MSC stands for the memory-side cache.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-6-kan.liang@linux.intel.com
|
|
The model-specific pebs_latency_data functions of ADL and MTL use the
"small" as a postfix to indicate the e-core. The postfix is too generic
for a model-specific function. It cannot provide useful information that
can directly map it to a specific uarch, which can facilitate the
development and maintenance.
Use the abbr of the uarch to rename the model-specific functions.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-5-kan.liang@linux.intel.com
|
|
From PMU's perspective, Lunar Lake and Arrow Lake are similar to the
previous generation Meteor Lake. Both are hybrid platforms, with e-core
and p-core.
The key differences include:
- The e-core supports 3 new fixed counters
- The p-core supports an updated PEBS Data Source format
- More GP counters (Updated event constraint table)
- New Architectural performance monitoring V6
(New Perfmon MSRs aliasing, umask2, eq).
- New PEBS format V6 (Counters Snapshotting group)
- New RDPMC metrics clear mode
The legacy features, the 3 new fixed counters and updated event
constraint table are enabled in this patch.
The new PEBS data source format, the architectural performance
monitoring V6, the PEBS format V6, and the new RDPMC metrics clear mode
are supported in the following patches.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-4-kan.liang@linux.intel.com
|
|
The current perf assumes that both GP and fixed counters are contiguous.
But it's not guaranteed on newer Intel platforms or in a virtualization
environment.
Use the counter mask to replace the number of counters for both GP and
the fixed counters. For the other ARCHs or old platforms which don't
support a counter mask, using GENMASK_ULL(num_counter - 1, 0) to
replace. There is no functional change for them.
The interface to KVM is not changed. The number of counters still be
passed to KVM. It can be updated later separately.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-3-kan.liang@linux.intel.com
|
|
The current perf assumes that the counters that support PEBS are
contiguous. But it's not guaranteed with the new leaf 0x23 introduced.
The counters are enumerated with a counter mask. There may be holes in
the counter mask for future platforms or in a virtualization
environment.
Store the PEBS event mask rather than the maximum number of PEBS
counters in the x86 PMU structures.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lkml.kernel.org/r/20240626143545.480761-2-kan.liang@linux.intel.com
|
|
Compared with previous client platforms, PC8 is removed from Lunarlake.
It supports CC1/CC6/CC7 and PC2/PC3/PC6/PC10 residency counters.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20240628031758.43103-4-rui.zhang@intel.com
|
|
Like Alderlake, Arrowlake supports CC1/CC6/CC7 and PC2/PC3/PC6/PC8/PC10.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20240628031758.43103-3-rui.zhang@intel.com
|
|
For Alderlake, the spec changes after the patch submitted and PC7/PC9
are removed.
Raptorlake and Meteorlake, which copy the Alderlake cstate PMU, also
don't have PC7/PC9.
Remove PC7/PC9 support for Alderlake/Raptorlake/Meteorlake.
Fixes: d0ca946bcf84 ("perf/x86/cstate: Add Alder Lake CPU support")
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20240628031758.43103-2-rui.zhang@intel.com
|
|
The Lunarlake patches rely on the new VFM stuff.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
|
|
Currently, perf allocates an array of page pointers which is limited in
size by MAX_PAGE_ORDER. That in turn limits the maximum Intel PT buffer
size to 2GiB. Should that limitation be lifted, the Intel PT driver can
support larger sizes, except for one calculation in
pt_topa_entry_for_page(), which is limited to 32-bits.
Fix pt_topa_entry_for_page() address calculation by adding a cast.
Fixes: 39152ee51b77 ("perf/x86/intel/pt: Get rid of reverse lookup table for ToPA")
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240624201101.60186-4-adrian.hunter@intel.com
|
|
topa_entry->base is a bit-field. Bit-fields are not promoted to a 64-bit
type, even if the underlying type is 64-bit, and so, if necessary, must
be cast to a larger type when calculations are done.
Fix a topa_entry->base address calculation by adding a cast.
Without the cast, the address was limited to 36-bits i.e. 64GiB.
The address calculation is used on systems that do not support Multiple
Entry ToPA (only Broadwell), and affects physical addresses on or above
64GiB. Instead of writing to the correct address, the address comprising
the first 36 bits would be written to.
Intel PT snapshot and sampling modes are not affected.
Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver")
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240624201101.60186-3-adrian.hunter@intel.com
|
|
topa_entry->base needs to store a pfn. It obviously needs to be
large enough to store the largest possible x86 pfn which is
MAXPHYADDR-PAGE_SIZE (52-12). So it is 4 bits too small.
Increase the size of topa_entry->base from 36 bits to 40 bits.
Note, systems where physical addresses can be 256TiB or more are affected.
[ Adrian: Amend commit message as suggested by Dave Hansen ]
Fixes: 52ca9ced3f70 ("perf/x86/intel/pt: Add Intel PT PMU driver")
Signed-off-by: Marco Cavenati <cavenati.marco@gmail.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240624201101.60186-2-adrian.hunter@intel.com
|
|
In free_pagetable() we use the non-atomic version for clearing the
PageReserved bit from the page. free_pagetable() will either call
free_reserved_page() or put_page_bootmem(), which will eventually end up
calling free_reserved_page(), and in there we already clear the
PageReserved flag.
Link: https://lkml.kernel.org/r/20240527044523.29207-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
pxx_huge() has been removed in recent commit 9636f055dae1 ("mm/treewide:
remove pXd_huge()"), however there are still three comments referencing
the API that got overlooked. Remove them.
Link: https://lkml.kernel.org/r/20240527154855.528816-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
x86_64 is already using the node's cpu as maximum threads. Make that the
default for all archs setting DEFERRED_STRUCT_PAGE_INIT.
This returns to the behavior prior making the function arch-specific with
commit ecd096506922 ("mm: make deferred init's max threads
arch-specific").
Setting DEFERRED_STRUCT_PAGE_INIT and testing on a few arm64 platforms
shows faster deferred_init_memmap completions:
| | x13s | SA8775p-ride | Ampere R137-P31 | Ampere HR330 |
| | Metal, 32GB | VM, 36GB | VM, 58GB | Metal, 128GB |
| | 8cpus | 8cpus | 8cpus | 32cpus |
|---------|-------------|--------------|-----------------|--------------|
| threads | ms (%) | ms (%) | ms (%) | ms (%) |
|---------|-------------|--------------|-----------------|--------------|
| 1 | 108 (0%) | 72 (0%) | 224 (0%) | 324 (0%) |
| cpus | 24 (-77%) | 36 (-50%) | 40 (-82%) | 56 (-82%) |
Michael Ellerman reported:
: On a machine here (1TB, 40 cores, 4KB pages) the existing code gives:
:
: [ 0.500124] node 2 deferred pages initialised in 210ms
: [ 0.515790] node 3 deferred pages initialised in 230ms
: [ 0.516061] node 0 deferred pages initialised in 230ms
: [ 0.516522] node 7 deferred pages initialised in 230ms
: [ 0.516672] node 4 deferred pages initialised in 230ms
: [ 0.516798] node 6 deferred pages initialised in 230ms
: [ 0.517051] node 5 deferred pages initialised in 230ms
: [ 0.523887] node 1 deferred pages initialised in 240ms
:
: vs with the patch:
:
: [ 0.379613] node 0 deferred pages initialised in 90ms
: [ 0.380388] node 1 deferred pages initialised in 90ms
: [ 0.380540] node 4 deferred pages initialised in 100ms
: [ 0.390239] node 6 deferred pages initialised in 100ms
: [ 0.390249] node 2 deferred pages initialised in 100ms
: [ 0.390786] node 3 deferred pages initialised in 110ms
: [ 0.396721] node 5 deferred pages initialised in 110ms
: [ 0.397095] node 7 deferred pages initialised in 110ms
:
: Which is a nice speedup.
[echanude@redhat.com: v3]
Link: https://lkml.kernel.org/r/20240528185455.643227-4-echanude@redhat.com
Link: https://lkml.kernel.org/r/20240522203758.626932-4-echanude@redhat.com
Signed-off-by: Eric Chanudet <echanude@redhat.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Baoquan He <bhe@redhat.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Including hrtimer.h is not required and is probably a historical
leftover. Remove it.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20240701-vdso-cleanup-v1-5-36eb64e7ece2@linutronix.de
|
|
The typedef gtod_long_t is not used anymore so remove it.
The header file contains then only includes dependent on
CONFIG_GENERIC_GETTIMEOFDAY to not break ARCH=um. Nevertheless, keep the
header file only with those includes to prevent spreading ifdeffery all
over the place.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20240701-vdso-cleanup-v1-4-36eb64e7ece2@linutronix.de
|
|
Replace the reference to the non-existent function arch_vdso_cycles_valid()
by the proper function arch_vdso_cycles_ok().
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Link: https://lore.kernel.org/r/20240701-vdso-cleanup-v1-3-36eb64e7ece2@linutronix.de
|
|
Since userspace state is saved in the MM process, kernel using
FPU still doesn't really need to do anything, so this really
is as simple as enabling preemption. The irq critical section
in sigio_handler() needs preempt_disable()/preempt_enable().
Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Link: https://patch.msgid.link/20240702102549.d2fcea450854.I12f5a53d80ec1e425e66ef272b1e95cb523b608e@changeid
[rebase, remove FPU save/restore, fix x86/um Makefile,
rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The kernel flushes the memory ranges anyway for CoW and does not assume
that the userspace process has anything set up already. So, start with a
fresh process for the new mm context.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20240703134536.1161108-8-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The current LDT code has a few issues that mean it should be redone in a
different way once we always start with a fresh MM even when cloning.
In a new and better world, the kernel would just ensure its own LDT is
clear at startup. At that point, all that is needed is a simple function
to populate the LDT from another MM in arch_dup_mmap combined with some
tracking of the installed LDT entries for each MM.
Note that the old implementation was even incorrect with regard to
reading, as it copied out the LDT entries in the internal format rather
than converting them to the userspace structure.
Removal should be fine as the LDT is not used for thread-local storage
anymore.
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Link: https://patch.msgid.link/20240703134536.1161108-7-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Rework syscall handling to be platform independent. Also create a clean
split between queueing of syscalls and flushing them out, removing the
need to keep state in the code that triggers the syscalls.
The code adds syscall_data_len to the global mm_id structure. This will
be used later to allow surrounding code to track whether syscalls still
need to run and if errors occurred.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Link: https://patch.msgid.link/20240703134536.1161108-5-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This function will be used by the new syscall handling code.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Link: https://patch.msgid.link/20240703134536.1161108-4-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Further commits will require values from common-offsets.h inside
stub-data.h. Resolve the possible circular dependency and simply use
offsetof() inside stub_32.h and stub_64.h.
Signed-off-by: Benjamin Berg <benjamin@sipsolutions.net>
Link: https://patch.msgid.link/20240703134536.1161108-2-benjamin@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When BHI mitigation is enabled, if SYSENTER is invoked with the TF flag set
then entry_SYSENTER_compat() uses CLEAR_BRANCH_HISTORY and calls the
clear_bhb_loop() before the TF flag is cleared. This causes the #DB handler
(exc_debug_kernel()) to issue a warning because single-step is used outside the
entry_SYSENTER_compat() function.
To address this issue, entry_SYSENTER_compat() should use CLEAR_BRANCH_HISTORY
after making sure the TF flag is cleared.
The problem can be reproduced with the following sequence:
$ cat sysenter_step.c
int main()
{ asm("pushf; pop %ax; bts $8,%ax; push %ax; popf; sysenter"); }
$ gcc -o sysenter_step sysenter_step.c
$ ./sysenter_step
Segmentation fault (core dumped)
The program is expected to crash, and the #DB handler will issue a warning.
Kernel log:
WARNING: CPU: 27 PID: 7000 at arch/x86/kernel/traps.c:1009 exc_debug_kernel+0xd2/0x160
...
RIP: 0010:exc_debug_kernel+0xd2/0x160
...
Call Trace:
<#DB>
? show_regs+0x68/0x80
? __warn+0x8c/0x140
? exc_debug_kernel+0xd2/0x160
? report_bug+0x175/0x1a0
? handle_bug+0x44/0x90
? exc_invalid_op+0x1c/0x70
? asm_exc_invalid_op+0x1f/0x30
? exc_debug_kernel+0xd2/0x160
exc_debug+0x43/0x50
asm_exc_debug+0x1e/0x40
RIP: 0010:clear_bhb_loop+0x0/0xb0
...
</#DB>
<TASK>
? entry_SYSENTER_compat_after_hwframe+0x6e/0x8d
</TASK>
[ bp: Massage commit message. ]
Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")
Reported-by: Suman Maity <suman.m.maity@oracle.com>
Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lore.kernel.org/r/20240524070459.3674025-1-alexandre.chartre@oracle.com
|