linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2021-04-19	perf/x86: Hybrid PMU support for unconstrained	Kan Liang
	The unconstrained value depends on the number of GP and fixed counters. Each hybrid PMU should use its own unconstrained. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-8-git-send-email-kan.liang@linux.intel.com
2021-04-19	perf/x86: Hybrid PMU support for counters	Kan Liang
	The number of GP and fixed counters are different among hybrid PMUs. Each hybrid PMU should use its own counter related information. When handling a certain hybrid PMU, apply the number of counters from the corresponding hybrid PMU. When reserving the counters in the initialization of a new event, reserve all possible counters. The number of counter recored in the global x86_pmu is for the architecture counters which are available for all hybrid PMUs. KVM doesn't support the hybrid PMU yet. Return the number of the architecture counters for now. For the functions only available for the old platforms, e.g., intel_pmu_drain_pebs_nhm(), nothing is changed. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-7-git-send-email-kan.liang@linux.intel.com
2021-04-19	perf/x86: Hybrid PMU support for intel_ctrl	Kan Liang
	The intel_ctrl is the counter mask of a PMU. The PMU counter information may be different among hybrid PMUs, each hybrid PMU should use its own intel_ctrl to check and access the counters. When handling a certain hybrid PMU, apply the intel_ctrl from the corresponding hybrid PMU. When checking the HW existence, apply the PMU and number of counters from the corresponding hybrid PMU as well. Perf will check the HW existence for each Hybrid PMU before registration. Expose the check_hw_exists() for a later patch. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/1618237865-33448-6-git-send-email-kan.liang@linux.intel.com
2021-04-19	perf/x86/intel: Hybrid PMU support for perf capabilities	Kan Liang
	Some platforms, e.g. Alder Lake, have hybrid architecture. Although most PMU capabilities are the same, there are still some unique PMU capabilities for different hybrid PMUs. Perf should register a dedicated pmu for each hybrid PMU. Add a new struct x86_hybrid_pmu, which saves the dedicated pmu and capabilities for each hybrid PMU. The architecture MSR, MSR_IA32_PERF_CAPABILITIES, only indicates the architecture features which are available on all hybrid PMUs. The architecture features are stored in the global x86_pmu.intel_cap. For Alder Lake, the model-specific features are perf metrics and PEBS-via-PT. The corresponding bits of the global x86_pmu.intel_cap should be 0 for these two features. Perf should not use the global intel_cap to check the features on a hybrid system. Add a dedicated intel_cap in the x86_hybrid_pmu to store the model-specific capabilities. Use the dedicated intel_cap to replace the global intel_cap for thse two features. The dedicated intel_cap will be set in the following "Add Alder Lake Hybrid support" patch. Add is_hybrid() to distinguish a hybrid system. ADL may have an alternative configuration. With that configuration, the X86_FEATURE_HYBRID_CPU is not set. Perf cannot rely on the feature bit. Add a new static_key_false, perf_is_hybrid, to indicate a hybrid system. It will be assigned in the following "Add Alder Lake Hybrid support" patch as well. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-5-git-send-email-kan.liang@linux.intel.com
2021-04-19	perf/x86: Track pmu in per-CPU cpu_hw_events	Kan Liang
	Some platforms, e.g. Alder Lake, have hybrid architecture. In the same package, there may be more than one type of CPU. The PMU capabilities are different among different types of CPU. Perf will register a dedicated PMU for each type of CPU. Add a 'pmu' variable in the struct cpu_hw_events to track the dedicated PMU of the current CPU. Current x86_get_pmu() use the global 'pmu', which will be broken on a hybrid platform. Modify it to apply the 'pmu' of the specific CPU. Initialize the per-CPU 'pmu' variable with the global 'pmu'. There is nothing changed for the non-hybrid platforms. The is_x86_event() will be updated in the later patch ("perf/x86: Register hybrid PMUs") for hybrid platforms. For the non-hybrid platforms, nothing is changed here. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618237865-33448-4-git-send-email-kan.liang@linux.intel.com
2021-04-16	perf/amd/uncore: Fix sysfs type mismatch	Nathan Chancellor
	dev_attr_show() calls the __uncore_*_show() functions via an indirect call but their type does not currently match the type of the show() member in 'struct device_attribute', resulting in a Control Flow Integrity violation. $ cat /sys/devices/amd_l3/format/umask config:8-15 $ dmesg \| grep "CFI failure" [ 1258.174653] CFI failure (target: __uncore_umask_show...): Update the type in the DEFINE_UNCORE_FORMAT_ATTR macro to match 'struct device_attribute' so that there is no more CFI violation. Fixes: 06f2c24584f3 ("perf/amd/uncore: Prepare to scale for more attributes that vary per family") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415001112.3024673-2-nathan@kernel.org
2021-04-16	x86/events/amd/iommu: Fix sysfs type mismatch	Nathan Chancellor
	dev_attr_show() calls _iommu_event_show() via an indirect call but _iommu_event_show()'s type does not currently match the type of the show() member in 'struct device_attribute', resulting in a Control Flow Integrity violation. $ cat /sys/devices/amd_iommu_1/events/mem_dte_hit csource=0x0a $ dmesg \| grep "CFI failure" [ 3526.735140] CFI failure (target: _iommu_event_show...): Change _iommu_event_show() and 'struct amd_iommu_event_desc' to 'struct device_attribute' so that there is no more CFI violation. Fixes: 7be6296fdd75 ("perf/x86/amd: AMD IOMMU Performance Counter PERF uncore PMU implementation") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210415001112.3024673-1-nathan@kernel.org
2021-04-16	Merge branches 'iommu/fixes', 'arm/mediatek', 'arm/smmu', 'arm/exynos', ↵	Joerg Roedel
	'unisoc', 'x86/vt-d', 'x86/amd' and 'core' into next
2021-04-16	perf/x86: Move cpuc->running into P4 specific code	Kan Liang
	The 'running' variable is only used in the P4 PMU. Current perf sets the variable in the critical function x86_pmu_start(), which wastes cycles for everybody not running on P4. Move cpuc->running into the P4 specific p4_pmu_enable_event(). Add a static per-CPU 'p4_running' variable to replace the 'running' variable in the struct cpu_hw_events. Saves space for the generic structure. The p4_pmu_enable_all() also invokes the p4_pmu_enable_event(), but it should not set cpuc->running. Factor out __p4_pmu_enable_event() for p4_pmu_enable_all(). Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1618410990-21383-1-git-send-email-kan.liang@linux.intel.com
2021-04-07	iommu/amd: Move a few prototypes to include/linux/amd-iommu.h	Christoph Hellwig
	A few functions that were intentended for the perf events support are currently declared in arch/x86/events/amd/iommu.h, which mens they are not in scope for the actual function definition. Also amdkfd has started using a few of them using externs in a .c file. End that misery by moving the prototypes to the proper header. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210402143312.372386-5-hch@lst.de Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-04-02	Merge tag 'v5.12-rc5' into WIP.x86/core, to pick up recent NOP related changes	Ingo Molnar
	In particular we want to have this upstream commit: b90829704780: ("bpf: Use NOP_ATOMIC5 instead of emit_nops(&prog, 5) for BPF_TRAMP_F_CALL_ORIG") ... before merging in x86/cpu changes and the removal of the NOP optimizations, and applying PeterZ's !retpoline objtool series. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-04-02	perf/x86/intel/uncore: Enable IIO stacks to PMON mapping for multi-segment SKX	Alexander Antonov
	IIO stacks to PMON mapping on Skylake servers is exposed through introduced early attributes /sys/devices/uncore_iio_<pmu_idx>/dieX, where dieX is a file which holds "Segment:Root Bus" for PCIe root port which can be monitored by that IIO PMON block. These sysfs attributes are disabled for multiple segment topologies except VMD domains which start at 0x10000. This patch removes the limitation and enables IIO stacks to PMON mapping for multi-segment Skylake servers by introducing segment-aware intel_uncore_topology structure and attributing the topology configuration to the segment in skx_iio_get_topology() function. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Tested-by: Kyle Meyer <kyle.meyer@hpe.com> Link: https://lkml.kernel.org/r/20210323150507.2013-1-alexander.antonov@linux.intel.com
2021-04-02	perf/x86/intel/uncore: Generic support for the MMIO type of uncore blocks	Kan Liang
	The discovery table provides the generic uncore block information for the MMIO type of uncore blocks, which is good enough to provide basic uncore support. The box control field is composed of the BAR address and box control offset. When initializing the uncore blocks, perf should ioremap the address from the box control field. Implement the generic support for the MMIO type of uncore block. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-6-git-send-email-kan.liang@linux.intel.com
2021-04-02	perf/x86/intel/uncore: Generic support for the PCI type of uncore blocks	Kan Liang
	The discovery table provides the generic uncore block information for the PCI type of uncore blocks, which is good enough to provide basic uncore support. The PCI BUS and DEVFN information can be retrieved from the box control field. Introduce the uncore_pci_pmus_register() to register all the PCICFG type of uncore blocks. The old PCI probe/remove way is dropped. The PCI BUS and DEVFN information are different among dies. Add box_ctls to store the box control field of each die. Add a new BUS notifier for the PCI type of uncore block to support the hotplug. If the device is "hot remove", the corresponding registered PMU has to be unregistered. Perf cannot locate the PMU by searching a const pci_device_id table, because the discovery tables don't provide such information. Introduce uncore_pci_find_dev_pmu_from_types() to search the whole uncore_pci_uncores for the PMU. Implement generic support for the PCI type of uncore block. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-5-git-send-email-kan.liang@linux.intel.com
2021-04-02	perf/x86/intel/uncore: Rename uncore_notifier to uncore_pci_sub_notifier	Kan Liang
	Perf will use a similar method to the PCI sub driver to register the PMUs for the PCI type of uncore blocks. The method requires a BUS notifier to support hotplug. The current BUS notifier cannot be reused, because it searches a const id_table for the corresponding registered PMU. The PCI type of uncore blocks in the discovery tables doesn't provide an id_table. Factor out uncore_bus_notify() and add the pointer of an id_table as a parameter. The uncore_bus_notify() will be reused in the following patch. The current BUS notifier is only used by the PCI sub driver. Its name is too generic. Rename it to uncore_pci_sub_notifier, which is specific for the PCI sub driver. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-4-git-send-email-kan.liang@linux.intel.com
2021-04-02	perf/x86/intel/uncore: Generic support for the MSR type of uncore blocks	Kan Liang
	The discovery table provides the generic uncore block information for the MSR type of uncore blocks, e.g., the counter width, the number of counters, the location of control/counter registers, which is good enough to provide basic uncore support. It can be used as a fallback solution when the kernel doesn't support a platform. The name of the uncore box cannot be retrieved from the discovery table. uncore_type_&typeID_&boxID will be used as its name. Save the type ID and the box ID information in the struct intel_uncore_type. Factor out uncore_get_pmu_name() to handle different naming methods. Implement generic support for the MSR type of uncore block. Some advanced features, such as filters and constraints, cannot be retrieved from discovery tables. Features that rely on that information are not be supported here. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-3-git-send-email-kan.liang@linux.intel.com
2021-04-02	perf/x86/intel/uncore: Parse uncore discovery tables	Kan Liang
	A self-describing mechanism for the uncore PerfMon hardware has been introduced with the latest Intel platforms. By reading through an MMIO page worth of information, perf can 'discover' all the standard uncore PerfMon registers in a machine. The discovery mechanism relies on BIOS's support. With a proper BIOS, a PCI device with the unique capability ID 0x23 can be found on each die. Perf can retrieve the information of all available uncore PerfMons from the device via MMIO. The information is composed of one global discovery table and several unit discovery tables. - The global discovery table includes global uncore information of the die, e.g., the address of the global control register, the offset of the global status register, the number of uncore units, the offset of unit discovery tables, etc. - The unit discovery table includes generic uncore unit information, e.g., the access type, the counter width, the address of counters, the address of the counter control, the unit ID, the unit type, etc. The unit is also called "box" in the code. Perf can provide basic uncore support based on this information with the following patches. To locate the PCI device with the discovery tables, check the generic PCI ID first. If it doesn't match, go through the entire PCI device tree and locate the device with the unique capability ID. The uncore information is similar among dies. To save parsing time and space, only completely parse and store the discovery tables on the first die and the first box of each die. The parsed information is stored in an RB tree structure, intel_uncore_discovery_type. The size of the stored discovery tables varies among platforms. It's around 4KB for a Sapphire Rapids server. If a BIOS doesn't support the 'discovery' mechanism, the uncore driver will exit with -ENODEV. There is nothing changed. Add a module parameter to disable the discovery feature. If a BIOS gets the discovery tables wrong, users can have an option to disable the feature. For the current patchset, the uncore driver will exit with -ENODEV. In the future, it may fall back to the hardcode uncore driver on a known platform. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1616003977-90612-2-git-send-email-kan.liang@linux.intel.com
2021-03-21	x86: Fix various typos in comments, take #2	Ingo Molnar
	Fix another ~42 single-word typos in arch/x86/ code comments, missed a few in the first pass, in particular in .S files. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-21	x86: Remove unusual Unicode characters from comments	Ingo Molnar
	We've accumulated a few unusual Unicode characters in arch/x86/ over the years, substitute them with their proper ASCII equivalents. A few of them were a whitespace equivalent: ' ' - the use was harmless. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-kernel@vger.kernel.org
2021-03-21	Merge branch 'linus' into x86/cleanups, to resolve conflict	Ingo Molnar
	Conflicts: arch/x86/kernel/kprobes/ftrace.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2021-03-18	x86: Fix various typos in comments	Ingo Molnar
	Fix ~144 single-word typos in arch/x86/ code comments. Doing this in a single commit should reduce the churn. Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: linux-kernel@vger.kernel.org
2021-03-16	perf/x86/intel: Fix unchecked MSR access error caused by VLBR_EVENT	Kan Liang
	On a Haswell machine, the perf_fuzzer managed to trigger this message: [117248.075892] unchecked MSR access error: WRMSR to 0x3f1 (tried to write 0x0400000000000000) at rIP: 0xffffffff8106e4f4 (native_write_msr+0x4/0x20) [117248.089957] Call Trace: [117248.092685] intel_pmu_pebs_enable_all+0x31/0x40 [117248.097737] intel_pmu_enable_all+0xa/0x10 [117248.102210] __perf_event_task_sched_in+0x2df/0x2f0 [117248.107511] finish_task_switch.isra.0+0x15f/0x280 [117248.112765] schedule_tail+0xc/0x40 [117248.116562] ret_from_fork+0x8/0x30 A fake event called VLBR_EVENT may use the bit 58 of the PEBS_ENABLE, if the precise_ip is set. The bit 58 is reserved by the HW. Accessing the bit causes the unchecked MSR access error. The fake event doesn't support PEBS. The case should be rejected. Fixes: 097e4311cda9 ("perf/x86: Add constraint to create guest LBR event without hw counter") Reported-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1615555298-140216-2-git-send-email-kan.liang@linux.intel.com
2021-03-16	perf/x86/intel: Fix a crash caused by zero PEBS status	Kan Liang
	A repeatable crash can be triggered by the perf_fuzzer on some Haswell system. https://lore.kernel.org/lkml/7170d3b-c17f-1ded-52aa-cc6d9ae999f4@maine.edu/ For some old CPUs (HSW and earlier), the PEBS status in a PEBS record may be mistakenly set to 0. To minimize the impact of the defect, the commit was introduced to try to avoid dropping the PEBS record for some cases. It adds a check in the intel_pmu_drain_pebs_nhm(), and updates the local pebs_status accordingly. However, it doesn't correct the PEBS status in the PEBS record, which may trigger the crash, especially for the large PEBS. It's possible that all the PEBS records in a large PEBS have the PEBS status 0. If so, the first get_next_pebs_record_by_bit() in the __intel_pmu_pebs_event() returns NULL. The at = NULL. Since it's a large PEBS, the 'count' parameter must > 1. The second get_next_pebs_record_by_bit() will crash. Besides the local pebs_status, correct the PEBS status in the PEBS record as well. Fixes: 01330d7288e0 ("perf/x86: Allow zero PEBS status with only single active event") Reported-by: Vince Weaver <vincent.weaver@maine.edu> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1615555298-140216-1-git-send-email-kan.liang@linux.intel.com
2021-03-15	perf/x86/intel/ds: Check return values of insn decoder functions	Borislav Petkov
	branch_type() doesn't need to call the full insn_decode() because it doesn't need it in all cases thus leave the calls separate. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-9-bp@alien8.de
2021-03-15	perf/x86/intel/ds: Check insn_get_length() retval	Borislav Petkov
	intel_pmu_pebs_fixup_ip() needs only the insn length so use the appropriate helper instead of a full decode. A full decode differs only in running insn_complete() on the decoded insn but that is not needed here. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210304174237.31945-8-bp@alien8.de
2021-03-10	x86/perf: Use RET0 as default for guest_get_msrs to handle "no PMU" case	Sean Christopherson
	Initialize x86_pmu.guest_get_msrs to return 0/NULL to handle the "nop" case. Patching in perf_guest_get_msrs_nop() during setup does not work if there is no PMU, as setup bails before updating the static calls, leaving x86_pmu.guest_get_msrs NULL and thus a complete nop. Ultimately, this causes VMX abort on VM-Exit due to KVM putting random garbage from the stack into the MSR load list. Add a comment in KVM to note that nr_msrs is valid if and only if the return value is non-NULL. Fixes: abd562df94d1 ("x86/perf: Use static_call for x86_pmu.guest_get_msrs") Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: syzbot+cce9ef2dd25246f815ee@syzkaller.appspotmail.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210309171019.1125243-1-seanjc@google.com
2021-03-06	perf/x86/intel: Set PERF_ATTACH_SCHED_CB for large PEBS and LBR	Kan Liang
	To supply a PID/TID for large PEBS, it requires flushing the PEBS buffer in a context switch. For normal LBRs, a context switch can flip the address space and LBR entries are not tagged with an identifier, we need to wipe the LBR, even for per-cpu events. For LBR callstack, save/restore the stack is required during a context switch. Set PERF_ATTACH_SCHED_CB for the event with large PEBS & LBR. Fixes: 9c964efa4330 ("perf/x86/intel: Drain the PEBS buffer during context switches") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20201130193842.10569-2-kan.liang@linux.intel.com
2021-02-10	perf/x86/rapl: Fix psys-energy event on Intel SPR platform	Zhang Rui
	There are several things special for the RAPL Psys energy counter, on Intel Sapphire Rapids platform. 1. it contains one Psys master package, and only CPUs on the master package can read valid value of the Psys energy counter, reading the MSR on CPUs in the slave package returns 0. 2. The master package does not have to be Physical package 0. And when all the CPUs on the Psys master package are offlined, we lose the Psys energy counter, at runtime. 3. The Psys energy counter can be disabled by BIOS, while all the other energy counters are not affected. It is not easy to handle all of these in the current RAPL PMU design because a) perf_msr_probe() validates the MSR on some random CPU, which may either be in the Psys master package or in the Psys slave package. b) all the RAPL events share the same PMU, and there is not API to remove the psys-energy event cleanly, without affecting the other events in the same PMU. This patch addresses the problems in a simple way. First, by setting .no_check bit for RAPL Psys MSR, the psys-energy event is always added, so we don't have to check the Psys ENERGY_STATUS MSR on master package. Then, by removing rapl_not_visible(), the psys-energy event is always available in sysfs. This does not affect the previous code because, for the RAPL MSRs with .no_check cleared, the .is_visible() callback is always overriden in the perf_msr_probe() function. Note, although RAPL PMU is die-based, and the Psys energy counter MSR on Intel SPR is package scope, this is not a problem because there is only one die in each package on SPR. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210204161816.12649-3-rui.zhang@intel.com
2021-02-10	perf/x86/rapl: Only check lower 32bits for RAPL energy counters	Zhang Rui
	In the RAPL ENERGY_COUNTER MSR, only the lower 32bits represent the energy counter. On previous platforms, the higher 32bits are reverved and always return Zero. But on Intel SapphireRapids platform, the higher 32bits are reused for other purpose and return non-zero value. Thus check the lower 32bits only for these ENERGY_COUTNER MSRs, to make sure the RAPL PMU events are not added erroneously when higher 32bits contain non-zero value. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210204161816.12649-2-rui.zhang@intel.com
2021-02-10	perf/x86/rapl: Add msr mask support	Zhang Rui
	In some cases, when probing a perf MSR, we're probing certain bits of the MSR instead of the whole register, thus only these bits should be checked. For example, for RAPL ENERGY_STATUS MSR, only the lower 32 bits represents the energy counter, and the higher 32bits are reserved. Introduce a new mask field in struct perf_msr to allow probing certain bits of a MSR. This change is transparent to the current perf_msr_probe() users. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210204161816.12649-1-rui.zhang@intel.com
2021-02-10	perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]	Jim Mattson
	Cascade Lake Xeon parts have the same model number as Skylake Xeon parts, so they are tagged with the intel_pebs_isolation quirk. However, as with Skylake Xeon H0 stepping parts, the PEBS isolation issue is fixed in all microcode versions. Add the Cascade Lake Xeon steppings (5, 6, and 7) to the isolation_ucodes[] table so that these parts benefit from Andi's optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary work in guest filtering"). Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lkml.kernel.org/r/20210205191324.2889006-1-jmattson@google.com
2021-02-01	perf/x86/intel: Support CPUID 10.ECX to disable fixed counters	Kan Liang
	With Architectural Performance Monitoring Version 5, CPUID 10.ECX cpu leaf indicates the fixed counter enumeration. This extends the previous count to a bitmap which allows disabling even lower fixed counters. It could be used by a Hypervisor. The existing intel_ctrl variable is used to remember the bitmask of the counters. All code that reads all counters is fixed to check this extra bitmask. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Originally-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-6-git-send-email-kan.liang@linux.intel.com
2021-02-01	perf/x86/intel: Add perf core PMU support for Sapphire Rapids	Kan Liang
	Add perf core PMU support for the Intel Sapphire Rapids server, which is the successor of the Intel Ice Lake server. The enabling code is based on Ice Lake, but there are several new features introduced. The event encoding is changed and simplified, e.g., the event codes which are below 0x90 are restricted to counters 0-3. The event codes which above 0x90 are likely to have no restrictions. The event constraints, extra_regs(), and hardware cache events table are changed accordingly. A new Precise Distribution (PDist) facility is introduced, which further minimizes the skid when a precise event is programmed on the GP counter 0. Enable the Precise Distribution (PDist) facility with :ppp event. For this facility to work, the period must be initialized with a value larger than 127. Add spr_limit_period() to apply the limit for :ppp event. Two new data source fields, data block & address block, are added in the PEBS Memory Info Record for the load latency event. To enable the feature, - An auxiliary event has to be enabled together with the load latency event on Sapphire Rapids. A new flag PMU_FL_MEM_LOADS_AUX is introduced to indicate the case. A new event, mem-loads-aux, is exposed to sysfs for the user tool. Add a check in hw_config(). If the auxiliary event is not detected, return an unique error -ENODATA. - The union perf_mem_data_src is extended to support the new fields. - Ice Lake and earlier models do not support block information, but the fields may be set by HW on some machines. Add pebs_no_block to explicitly indicate the previous platforms which don't support the new block fields. Accessing the new block fields are ignored on those platforms. A new store Latency facility is introduced, which leverages the PEBS facility where it can provide additional information about sampled stores. The additional information includes the data address, memory auxiliary info (e.g. Data Source, STLB miss) and the latency of the store access. To enable the facility, the new event (0x02cd) has to be programed on the GP counter 0. A new flag PERF_X86_EVENT_PEBS_STLAT is introduced to indicate the event. The store_latency_data() is introduced to parse the memory auxiliary info. The layout of access latency field of PEBS Memory Info Record has been changed. Two latency, instruction latency (bit 15:0) and cache access latency (bit 47:32) are recorded. - The cache access latency is similar to previous memory access latency. For loads, the latency starts by the actual cache access until the data is returned by the memory subsystem. For stores, the latency starts when the demand write accesses the L1 data cache and lasts until the cacheline write is completed in the memory subsystem. The cache access latency is stored in low 32bits of the sample type PERF_SAMPLE_WEIGHT_STRUCT. - The instruction latency starts by the dispatch of the load operation for execution and lasts until completion of the instruction it belongs to. Add a new flag PMU_FL_INSTR_LATENCY to indicate the instruction latency support. The instruction latency is stored in the bit 47:32 of the sample type PERF_SAMPLE_WEIGHT_STRUCT. Extends the PERF_METRICS MSR to feature TMA method level 2 metrics. The lower half of the register is the TMA level 1 metrics (legacy). The upper half is also divided into four 8-bit fields for the new level 2 metrics. Expose all eight Topdown metrics events to user space. The full description for the SPR features can be found at Intel Architecture Instruction Set Extensions and Future Features Programming Reference, 319433-041. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-5-git-send-email-kan.liang@linux.intel.com
2021-02-01	perf/x86/intel: Filter unsupported Topdown metrics event	Kan Liang
	Intel Sapphire Rapids server will introduce 8 metrics events. Intel Ice Lake only supports 4 metrics events. A perf tool user may mistakenly use the unsupported events via RAW format on Ice Lake. The user can still get a value from the unsupported Topdown metrics event once the following Sapphire Rapids enabling patch is applied. To enable the 8 metrics events on Intel Sapphire Rapids, the INTEL_TD_METRIC_MAX has to be updated, which impacts the is_metric_event(). The is_metric_event() is a generic function. On Ice Lake, the newly added SPR metrics events will be mistakenly accepted as metric events on creation. At runtime, the unsupported Topdown metrics events will be updated. Add a variable num_topdown_events in x86_pmu to indicate the available number of the Topdown metrics event on the platform. Apply the number into is_metric_event(). Only the supported Topdown metrics events should be created as metrics events. Apply the num_topdown_events in icl_update_topdown_event() as well. The function can be reused by the following patch. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-4-git-send-email-kan.liang@linux.intel.com
2021-02-01	perf/x86/intel: Factor out intel_update_topdown_event()	Kan Liang
	Similar to Ice Lake, Intel Sapphire Rapids server also supports the topdown performance metrics feature. The difference is that Intel Sapphire Rapids server extends the PERF_METRICS MSR to feature TMA method level two metrics, which will introduce 8 metrics events. Current icl_update_topdown_event() only check 4 level one metrics events. Factor out intel_update_topdown_event() to facilitate the code sharing between Ice Lake and Sapphire Rapids. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-3-git-send-email-kan.liang@linux.intel.com
2021-02-01	perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT	Kan Liang
	Current PERF_SAMPLE_WEIGHT sample type is very useful to expresses the cost of an action represented by the sample. This allows the profiler to scale the samples to be more informative to the programmer. It could also help to locate a hotspot, e.g., when profiling by memory latencies, the expensive load appear higher up in the histograms. But current PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This could be a problem, if users want two or more factors to contribute to the weight. For example, Golden Cove core PMU can provide both the instruction latency and the cache Latency information as factors for the memory profiling. For current X86 platforms, although meminfo::latency is defined as a u64, only the lower 32 bits include the valid data in practice (No memory access could last than 4G cycles). The higher 32 bits can be used to store new factors. Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new sample weight structure. It shares the same space as the PERF_SAMPLE_WEIGHT sample type. Users can apply either the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but they cannot apply both sample types simultaneously. Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type. - For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT sample type. There is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT sample type. PowerPC can re-struct the weight field similarly later. - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now. The following patches will apply the new factors for the PERF_SAMPLE_WEIGHT_STRUCT sample type. The field in the union perf_sample_weight should be shared among different architectures. A generic name is required, but it's hard to abstract a name that applies to all architectures. For example, on X86, the fields are to store all kinds of latency. While on PowerPC, it stores MMCRA[TECX/TECM], which should not be latency. So a general name prefix 'var$NUM' is used here. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
2021-01-27	perf/intel: Remove Perfmon-v4 counter_freezing support	Peter Zijlstra
	Perfmon-v4 counter freezing is fundamentally broken; remove this default disabled code to make sure nobody uses it. The feature is called Freeze-on-PMI in the SDM, and if it would do that, there wouldn't actually be a problem, however it does something subtly different. It globally disables the whole PMU when it raises the PMI, not when the PMI hits. This means there's a window between the PMI getting raised and the PMI actually getting served where we loose events and this violates the perf counter independence. That is, a counting event should not result in a different event count when there is a sampling event co-scheduled. This is known to break existing software (RR). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2021-01-27	x86/perf: Use static_call for x86_pmu.guest_get_msrs	Like Xu
	Clean up that CONFIG_RETPOLINE crud and replace the indirect call x86_pmu.guest_get_msrs with static_call(). Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Like Xu <like.xu@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210125121458.181635-1-like.xu@linux.intel.com
2021-01-14	perf/x86/intel/uncore: With > 8 nodes, get pci bus die id from NUMA info	Steve Wahl
	The registers used to determine which die a pci bus belongs to don't contain enough information to uniquely specify more than 8 dies, so when more than 8 dies are present, use NUMA information instead. Continue to use the previous method for 8 or fewer because it works there, and covers cases of NUMA being disabled. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lkml.kernel.org/r/20210108153549.108989-3-steve.wahl@hpe.com
2021-01-14	perf/x86/intel/uncore: Store the logical die id instead of the physical die id.	Steve Wahl
	The phys_id isn't really used other than to map to a logical die id. Calculate the logical die id earlier, and store that instead of the phys_id. Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Link: https://lkml.kernel.org/r/20210108153549.108989-2-steve.wahl@hpe.com
2020-12-14	Merge tag 'perf-core-2020-12-14' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Thomas Gleixner: "Core: - Better handling of page table leaves on archictectures which have architectures have non-pagetable aligned huge/large pages. For such architectures a leaf can actually be part of a larger entry. - Prevent a deadlock vs exec_update_mutex Architectures: - The related updates for page size calculation of leaf entries - The usual churn to support new CPUs - Small fixes and improvements all over the place" * tag 'perf-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) perf/x86/intel: Add Tremont Topdown support uprobes/x86: Fix fall-through warnings for Clang perf/x86: Fix fall-through warnings for Clang kprobes/x86: Fix fall-through warnings for Clang perf/x86/intel/lbr: Fix the return type of get_lbr_cycles() perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake x86/kprobes: Restore BTF if the single-stepping is cancelled perf: Break deadlock involving exec_update_mutex sparc64/mm: Implement pXX_leaf_size() support powerpc/8xx: Implement pXX_leaf_size() support arm64/mm: Implement pXX_leaf_size() support perf/core: Fix arch_perf_get_page_size() mm: Introduce pXX_leaf_size() mm/gup: Provide gup_get_pte() more generic perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY perf/x86/intel/uncore: Add Rocket Lake support perf/x86/msr: Add Rocket Lake CPU support perf/x86/cstate: Add Rocket Lake CPU support perf/x86/intel: Add Rocket Lake CPU support perf,mm: Handle non-page-table-aligned hugetlbfs ...
2020-12-14	Merge tag 'x86_cleanups_for_v5.11' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: "Another branch with a nicely negative diffstat, just the way I like 'em: - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in the end (Gabriel Krisman Bertazi) - All kinds of minor cleanups all over the tree" * tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/ia32_signal: Propagate __user annotation properly x86/alternative: Update text_poke_bp() kernel-doc comment x86/PCI: Make a kernel-doc comment a normal one x86/asm: Drop unused RDPID macro x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg x86/head64: Remove duplicate include x86/mm: Declare 'start' variable where it is used x86/head/64: Remove unused GET_CR2_INTO() macro x86/boot: Remove unused finalize_identity_maps() x86/uaccess: Document copy_from_user_nmi() x86/dumpstack: Make show_trace_log_lvl() static x86/mtrr: Fix a kernel-doc markup x86/setup: Remove unused MCA variables x86, libnvdimm/test: Remove COPY_MC_TEST x86: Reclaim TIF_IA32 and TIF_X32 x86/mm: Convert mmu context ia32_compat into a proper flags field x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages() elf: Expose ELF header on arch_setup_additional_pages() x86/elf: Use e_machine to select start_thread for x32 elf: Expose ELF header in compat_start_thread() ...
2020-12-14	Merge tag 'x86_cpu_for_v5.11' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: "Only AMD-specific changes this time: - Save the AMD physical die ID into cpuinfo_x86.cpu_die_id and convert all code to use it (Yazen Ghannam) - Remove a dead and unused TSEG region remapping workaround on AMD (Arvind Sankar)" * tag 'x86_cpu_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu/amd: Remove dead code for TSEG region remapping x86/topology: Set cpu_die_id only if DIE_TYPE found EDAC/mce_amd: Use struct cpuinfo_x86.cpu_die_id for AMD NodeId x86/CPU/AMD: Remove amd_get_nb_id() x86/CPU/AMD: Save AMD NodeId as cpu_die_id
2020-12-09	perf/x86/intel: Add Tremont Topdown support	Kan Liang
	Tremont has four L1 Topdown events, TOPDOWN_FE_BOUND.ALL, TOPDOWN_BAD_SPECULATION.ALL, TOPDOWN_BE_BOUND.ALL and TOPDOWN_RETIRING.ALL. They are available on GP counters. Export them to sysfs and facilitate the perf stat tool. $perf stat --topdown -- sleep 1 Performance counter stats for 'sleep 1': retiring bad speculation frontend bound backend bound 24.9% 16.8% 31.7% 26.6% 1.001224610 seconds time elapsed 0.001150000 seconds user 0.000000000 seconds sys Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1607457952-3519-1-git-send-email-kan.liang@linux.intel.com
2020-12-09	perf/x86: Fix fall-through warnings for Clang	Gustavo A. R. Silva
	In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning by explicitly adding a fallthrough pseudo-keyword as a replacement for a /* fall through / comment, instead of letting the code fall through to the next case. Notice that Clang doesn't recognize / fall through */ comments as implicit fall-through markings. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://github.com/KSPP/linux/issues/115
2020-12-09	perf/x86/intel/lbr: Fix the return type of get_lbr_cycles()	Kan Liang
	The cycle count of a timed LBR is always 1 in perf record -D. The cycle count is stored in the first 16 bits of the IA32_LBR_x_INFO register, but the get_lbr_cycles() return Boolean type. Use u16 to replace the Boolean type. Fixes: 47125db27e47 ("perf/x86/intel/lbr: Support Architectural LBR") Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201125213720.15692-2-kan.liang@linux.intel.com
2020-12-09	perf/x86/intel: Fix rtm_abort_event encoding on Ice Lake	Kan Liang
	According to the event list from icelake_core_v1.09.json, the encoding of the RTM_RETIRED.ABORTED event on Ice Lake should be, "EventCode": "0xc9", "UMask": "0x04", "EventName": "RTM_RETIRED.ABORTED", Correct the wrong encoding. Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201125213720.15692-1-kan.liang@linux.intel.com
2020-12-03	perf/x86/intel: Check PEBS status correctly	Stephane Eranian
	The kernel cannot disambiguate when 2+ PEBS counters overflow at the same time. This is what the comment for this code suggests. However, I see the comparison is done with the unfiltered p->status which is a copy of IA32_PERF_GLOBAL_STATUS at the time of the sample. This register contains more than the PEBS counter overflow bits. It also includes many other bits which could also be set. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201126110922.317681-2-namhyung@kernel.org
2020-12-03	perf/x86/intel: Fix a warning on x86_pmu_stop() with large PEBS	Namhyung Kim
	The commit 3966c3feca3f ("x86/perf/amd: Remove need to check "running" bit in NMI handler") introduced this. It seems x86_pmu_stop can be called recursively (like when it losts some samples) like below: x86_pmu_stop intel_pmu_disable_event (x86_pmu_disable) intel_pmu_pebs_disable intel_pmu_drain_pebs_nhm (x86_pmu_drain_pebs_buffer) x86_pmu_stop While commit 35d1ce6bec13 ("perf/x86/intel/ds: Fix x86_pmu_stop warning for large PEBS") fixed it for the normal cases, there's another path to call x86_pmu_stop() recursively when a PEBS error was detected (like two or more counters overflowed at the same time). Like in the Kan's previous fix, we can skip the interrupt accounting for large PEBS, so check the iregs which is set for PMI only. Fixes: 3966c3feca3f ("x86/perf/amd: Remove need to check "running" bit in NMI handler") Reported-by: John Sperbeck <jsperbeck@google.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201126110922.317681-1-namhyung@kernel.org
2020-11-26	Merge remote-tracking branch 'origin/master' into perf/core	Peter Zijlstra
	Further perf/core patches will depend on: d3f7b1bb2040 ("mm/gup: fix gup_fast with dynamic page table folding") which is already in Linus' tree.