summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2023-01-12x86/hyperv: Add support for detecting nested hypervisorJinank Jain
Detect if Linux is running as a nested hypervisor in the root partition for Microsoft Hypervisor, using flags provided by MSHV. Expose a new variable hv_nested that is used later for decisions specific to the nested use case. Signed-off-by: Jinank Jain <jinankjain@linux.microsoft.com> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Link: https://lore.kernel.org/r/8e3e7112806e81d2292a66a56fe547162754ecea.1672639707.git.jinankjain@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org>
2023-01-12x86/gsseg: Move load_gs_index() to its own new header fileH. Peter Anvin (Intel)
GS is a special segment on x86_64, move load_gs_index() to its own new header file to simplify header inclusion. No change in functionality. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230112072032.35626-5-xin3.li@intel.com
2023-01-12x86/gsseg: Make asm_load_gs_index() take an u16H. Peter Anvin (Intel)
Let GCC know that only the low 16 bits of load_gs_index() argument actually matter. It might allow it to create slightly better code. However, do not propagate this into the prototypes of functions that end up being paravirtualized, to avoid unnecessary changes. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230112072032.35626-4-xin3.li@intel.com
2023-01-12x86/opcode: Add the LKGS instruction to x86-opcode-mapH. Peter Anvin (Intel)
Add the instruction opcode used by LKGS to x86-opcode-map. Opcode number is per public FRED draft spec v3.0. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230112072032.35626-3-xin3.li@intel.com
2023-01-12x86/cpufeature: Add the CPU feature bit for LKGSH. Peter Anvin (Intel)
Add the CPU feature bit for LKGS (Load "Kernel" GS). LKGS instruction is introduced with Intel FRED (flexible return and event delivery) specification. Search for the latest FRED spec in most search engines with this search pattern: site:intel.com FRED (flexible return and event delivery) specification LKGS behaves like the MOV to GS instruction except that it loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor cache, which is exactly what Linux kernel does to load a user level GS base. Thus, with LKGS, there is no need to SWAPGS away from the kernel GS base. [ mingo: Minor tweaks to the description. ] Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li <xin3.li@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230112072032.35626-2-xin3.li@intel.com
2023-01-12x86/bugs: Reset speculation control settings on initBreno Leitao
Currently, x86_spec_ctrl_base is read at boot time and speculative bits are set if Kconfig items are enabled. For example, IBRS is enabled if CONFIG_CPU_IBRS_ENTRY is configured, etc. These MSR bits are not cleared if the mitigations are disabled. This is a problem when kexec-ing a kernel that has the mitigation disabled from a kernel that has the mitigation enabled. In this case, the MSR bits are not cleared during the new kernel boot. As a result, this might have some performance degradation that is hard to pinpoint. This problem does not happen if the machine is (hard) rebooted because the bit will be cleared by default. [ bp: Massage. ] Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20221128153148.1129350-1-leitao@debian.org
2023-01-12x86/mm: fix poking_init() for Xen PV guestsJuergen Gross
Commit 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") broke the kernel for running as Xen PV guest. It seems as if the new address space is never activated before being used, resulting in Xen rejecting to accept the new CR3 value (the PGD isn't pinned). Fix that by adding the now missing call of paravirt_arch_dup_mmap() to poking_init(). That call was previously done by dup_mm()->dup_mmap() and it is a NOP for all cases but for Xen PV, where it is just doing the pinning of the PGD. Fixes: 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230109150922.10578-1-jgross@suse.com
2023-01-11KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lockDavid Woodhouse
In commit 14243b387137a ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery") the clever version of me left some helpful notes for those who would come after him: /* * For the irqfd workqueue, using the main kvm->lock mutex is * fine since this function is invoked from kvm_set_irq() with * no other lock held, no srcu. In future if it will be called * directly from a vCPU thread (e.g. on hypercall for an IPI) * then it may need to switch to using a leaf-node mutex for * serializing the shared_info mapping. */ mutex_lock(&kvm->lock); In commit 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") the other version of me ran straight past that comment without reading it, and introduced a potential deadlock by taking vcpu->mutex and kvm->lock in the wrong order. Solve this as originally suggested, by adding a leaf-node lock in the Xen state rather than using kvm->lock for it. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-4-dwmw2@infradead.org> [Rebase, add docs. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-11x86/pci: Simplify is_mmconf_reserved() messagesBjorn Helgaas
is_mmconf_reserved() takes a "with_e820" parameter that only determines the message logged if it finds the MMCONFIG region is reserved. Pass the message directly, which will simplify a future patch that adds a new way of looking for that reservation. No functional change intended. Link: https://lore.kernel.org/r/20230110180243.1590045-2-helgaas@kernel.org Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
2023-01-11KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()David Woodhouse
The kvm_xen_update_runstate_guest() function can be called when the vCPU is being scheduled out, from a preempt notifier. It *opportunistically* updates the runstate area in the guest memory, if the gfn_to_pfn_cache which caches the appropriate address is still valid. If there is *contention* when it attempts to obtain gpc->lock, then locking inside the priority inheritance checks may cause a deadlock. Lockdep reports: [13890.148997] Chain exists of: &gpc->lock --> &p->pi_lock --> &rq->__lock [13890.149002] Possible unsafe locking scenario: [13890.149003] CPU0 CPU1 [13890.149004] ---- ---- [13890.149005] lock(&rq->__lock); [13890.149007] lock(&p->pi_lock); [13890.149009] lock(&rq->__lock); [13890.149011] lock(&gpc->lock); [13890.149013] *** DEADLOCK *** In the general case, if there's contention for a read lock on gpc->lock, that's going to be because something else is either invalidating or revalidating the cache. Either way, we've raced with seeing it in an invalid state, in which case we would have aborted the opportunistic update anyway. So in the 'atomic' case when called from the preempt notifier, just switch to using read_trylock() and avoid the PI handling altogether. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-11KVM: x86/xen: Fix lockdep warning on "recursive" gpc lockingDavid Woodhouse
In commit 5ec3289b31 ("KVM: x86/xen: Compatibility fixes for shared runstate area") we declared it safe to obtain two gfn_to_pfn_cache locks at the same time: /* * The guest's runstate_info is split across two pages and we * need to hold and validate both GPCs simultaneously. We can * declare a lock ordering GPC1 > GPC2 because nothing else * takes them more than one at a time. */ However, we forgot to tell lockdep. Do so, by setting a subclass on the first lock before taking the second. Fixes: 5ec3289b31 ("KVM: x86/xen: Compatibility fixes for shared runstate area") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-11x86/setup: Move duplicate boot_cpu_data definition out of the ifdefferyYuntao Wang
Both the if and else blocks define an exact same boot_cpu_data variable, move the duplicate variable definition out of the if/else block. In addition, do some other minor cleanups. [ bp: Massage. ] Signed-off-by: Yuntao Wang <ytcoode@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20220601122914.820890-1-ytcoode@gmail.com
2023-01-11x86/boot/e820: Fix typo in e820.c commentWang Yong
change "itsmain" to "its main". Fixes: 544a0f47e780 ("x86/boot/e820: Rename e820_table_saved to e820_table_firmware and improve the description") Signed-off-by: Wang Yong <yongw.kernel@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20221211103849.173870-1-yongw.kernel@gmail.com
2023-01-10bpf, x86: Simplify the parsing logic of structure parametersPu Lehui
Extra_nregs of structure parameters and nr_args can be added directly at the beginning, and using a flip flag to identifiy structure parameters. Meantime, renaming some variables to make them more sense. Signed-off-by: Pu Lehui <pulehui@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/r/20230105035026.3091988-1-pulehui@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-01-10x86/resctrl: Fix event counts regression in reused RMIDsPeter Newman
When creating a new monitoring group, the RMID allocated for it may have been used by a group which was previously removed. In this case, the hardware counters will have non-zero values which should be deducted from what is reported in the new group's counts. resctrl_arch_reset_rmid() initializes the prev_msr value for counters to 0, causing the initial count to be charged to the new group. Resurrect __rmid_read() and use it to initialize prev_msr correctly. Unlike before, __rmid_read() checks for error bits in the MSR read so that callers don't need to. Fixes: 1d81d15db39c ("x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221220164132.443083-1-peternewman@google.com
2023-01-10x86/resctrl: Fix task CLOSID/RMID update racePeter Newman
When the user moves a running task to a new rdtgroup using the task's file interface or by deleting its rdtgroup, the resulting change in CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the task(s) CPUs. x86 allows reordering loads with prior stores, so if the task starts running between a task_curr() check that the CPU hoisted before the stores in the CLOSID/RMID update then it can start running with the old CLOSID/RMID until it is switched again because __rdtgroup_move_task() failed to determine that it needs to be interrupted to obtain the new CLOSID/RMID. Refer to the diagram below: CPU 0 CPU 1 ----- ----- __rdtgroup_move_task(): curr <- t1->cpu->rq->curr __schedule(): rq->curr <- t1 resctrl_sched_in(): t1->{closid,rmid} -> {1,1} t1->{closid,rmid} <- {2,2} if (curr == t1) // false IPI(t1->cpu) A similar race impacts rdt_move_group_tasks(), which updates tasks in a deleted rdtgroup. In both cases, use smp_mb() to order the task_struct::{closid,rmid} stores before the loads in task_curr(). In particular, in the rdt_move_group_tasks() case, simply execute an smp_mb() on every iteration with a matching task. It is possible to use a single smp_mb() in rdt_move_group_tasks(), but this would require two passes and a means of remembering which task_structs were updated in the first loop. However, benchmarking results below showed too little performance impact in the simple approach to justify implementing the two-pass approach. Times below were collected using `perf stat` to measure the time to remove a group containing a 1600-task, parallel workload. CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads) # mkdir /sys/fs/resctrl/test # echo $$ > /sys/fs/resctrl/test/tasks # perf bench sched messaging -g 40 -l 100000 task-clock time ranges collected using: # perf stat rmdir /sys/fs/resctrl/test Baseline: 1.54 - 1.60 ms smp_mb() every matching task: 1.57 - 1.67 ms [ bp: Massage commit message. ] Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR") Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
2023-01-10x86/acpi/boot: Do not register processors that cannot be onlined for x2APICKishon Vijay Abraham I
Section 5.2.12.12 Processor Local x2APIC Structure in the ACPI v6.5 spec mandates that both "enabled" and "online capable" Local APIC Flags should be used to determine if the processor is usable or not. However, Linux doesn't use the "online capable" flag for x2APIC to determine if the processor is usable. As a result, cpu_possible_mask has incorrect value and results in more memory getting allocated for per_cpu variables than it is going to be used. Make sure Linux parses both "enabled" and "online capable" flags for x2APIC to correctly determine if the processor is usable. Fixes: aa06e20f1be6 ("x86/ACPI: Don't add CPUs that are not online capable") Reported-by: Leo Duran <leo.duran@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Zhang Rui <rui.zhang@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20230105041059.39366-1-kvijayab@amd.com
2023-01-10x86/pat: Fix pat_x_mtrr_type() for MTRR disabled caseJuergen Gross
Since 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") PAT can be enabled without MTRR. This has resulted in problems e.g. for a SEV-SNP guest running under Hyper-V, when trying to establish a new mapping via memremap() with WB caching mode, as pat_x_mtrr_type() will call mtrr_type_lookup(), which in turn is returning MTRR_TYPE_INVALID due to MTRR being disabled in this configuration. The result is a mapping with UC- caching, leading to severe performance degradation. Fix that by handling MTRR_TYPE_INVALID the same way as MTRR_TYPE_WRBACK in pat_x_mtrr_type() because MTRR_TYPE_INVALID means MTRRs are disabled. [ bp: Massage commit message. ] Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") Reported-by: Michael Kelley (LINUX) <mikelley@microsoft.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230110065427.20767-1-jgross@suse.com
2023-01-10x86/boot: Avoid using Intel mnemonics in AT&T syntax asmPeter Zijlstra
With 'GNU assembler (GNU Binutils for Debian) 2.39.90.20221231' the build now reports: arch/x86/realmode/rm/../../boot/bioscall.S: Assembler messages: arch/x86/realmode/rm/../../boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/realmode/rm/../../boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S: Assembler messages: arch/x86/boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant Which is due to: PR gas/29525 Note that with the dropped CMPSD and MOVSD Intel Syntax string insn templates taking operands, mixed IsString/non-IsString template groups (with memory operands) cannot occur anymore. With that maybe_adjust_templates() becomes unnecessary (and is hence being removed). More details: https://sourceware.org/bugzilla/show_bug.cgi?id=29525 Borislav Petkov further explains: " the particular problem here is is that the 'd' suffix is "conflicting" in the sense that you can have SSE mnemonics like movsD %xmm... and the same thing also for string ops (which is the case here) so apparently the agreement in binutils land is to use the always accepted suffixes 'l' or 'q' and phase out 'd' slowly... " Fixes: 7a734e7dd93b ("x86, setup: "glove box" BIOS calls -- infrastructure") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/Y71I3Ex2pvIxMpsP@hirez.programming.kicks-ass.net
2023-01-10x86/cpu: Remove redundant extern x86_read_arch_cap_msr()Ashok Raj
The prototype for the x86_read_arch_cap_msr() function has moved to arch/x86/include/asm/cpu.h - kill the redundant definition in arch/x86/kernel/cpu.h and include the header. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Link: https://lore.kernel.org/r/20221128172451.792595-1-ashok.raj@intel.com
2023-01-10x86/kprobes: Use switch-case for 0xFF opcodes in prepare_emulationChuang Wang
For the `FF /digit` opcodes in prepare_emulation, use switch-case instead of hand-written code to make the logic easier to understand. Signed-off-by: Chuang Wang <nashuiliang@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20221129084022.718355-1-nashuiliang@gmail.com
2023-01-10x86/mce: Mask out non-address bits from machine check bankTony Luck
Systems that support various memory encryption schemes (MKTME, TDX, SEV) use high order physical address bits to indicate which key should be used for a specific memory location. When a memory error is reported, some systems may report those key bits in the IA32_MCi_ADDR machine check MSR. The Intel SDM has a footnote for the contents of the address register that says: "Useful bits in this field depend on the address methodology in use when the register state is saved." AMD Processor Programming Reference has a more explicit description of the MCA_ADDR register: "For physical addresses, the most significant bit is given by Core::X86::Cpuid::LongModeInfo[PhysAddrSize]." Add a new #define MCI_ADDR_PHYSADDR for the mask of valid physical address bits within the machine check bank address register. Use this mask for recoverable machine check handling and in the EDAC driver to ignore any key bits that may be present. [ Tony: Based on independent fixes proposed by Fan Du and Isaku Yamahata ] Reported-by: Isaku Yamahata <isaku.yamahata@intel.com> Reported-by: Fan Du <fan.du@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20230109152936.397862-1-tony.luck@intel.com
2023-01-09x86/boot: Robustify calling startup_{32,64}() from the decompressor codeAlexander Lobakin
After commit ce697ccee1a8 ("kbuild: remove head-y syntax"), I started digging whether x86 is ready for removing this old cruft. Removing its objects from the list makes the kernel unbootable. This applies only to bzImage, vmlinux still works correctly. The reason is that with no strict object order determined by the linker arguments, not the linker script, startup_64 can be placed not right at the beginning of the kernel. Here's vmlinux.map's beginning before removing: ffffffff81000000 vmlinux.o:(.head.text) ffffffff81000000 startup_64 ffffffff81000070 secondary_startup_64 ffffffff81000075 secondary_startup_64_no_verify ffffffff81000160 verify_cpu and after: ffffffff81000000 vmlinux.o:(.head.text) ffffffff81000000 pvh_start_xen ffffffff81000080 startup_64 ffffffff810000f0 secondary_startup_64 ffffffff810000f5 secondary_startup_64_no_verify Not a problem itself, but the self-extractor code has the address of that function hardcoded the beginning, not looking onto the ELF header, which always contains the address of startup_{32,64}(). So, instead of doing an "act of blind faith", just take the address from the ELF header and extract a relative offset to the entry point. The decompressor function already returns a pointer to the beginning of the kernel to the Asm code, which then jumps to it, so add that offset to the return value. This doesn't change anything for now, but allows to resign from the "head object list" for x86 and makes sure valid Kbuild or any other improvements won't break anything here in general. Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Jiri Slaby <jirislaby@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20230109170403.4117105-2-alexandr.lobakin@intel.com
2023-01-09perf/x86/msr: Add Meteor Lake supportKan Liang
Meteor Lake is Intel's successor to Raptor lake. PPERF and SMI_COUNT MSRs are also supported. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-7-kan.liang@linux.intel.com
2023-01-09perf/x86/cstate: Add Meteor Lake supportKan Liang
Meteor Lake is Intel's successor to Raptor lake. From the perspective of Intel cstate residency counters, there is nothing changed compared with Raptor lake. Share adl_cstates with Raptor lake. Update the comments for Meteor Lake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-6-kan.liang@linux.intel.com
2023-01-09perf/x86/intel: Support Architectural PerfMon Extension leafKan Liang
The new CPUID leaf 0x23 reports the "true view" of PMU resources. The sub-leaf 1 reports the available general-purpose counters and fixed counters. Update the number of counters and fixed counters when the sub-leaf is detected. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-5-kan.liang@linux.intel.com
2023-01-09x86/cpufeatures: Add Architectural PerfMon Extension bitKan Liang
CPUID.(EAX=07H, ECX=1):EAX[8] indicates whether the Architectural PerfMon Extension leaf (CPUID leaf 23) is supported. The "X86_FEATURE_..., word 12" is already mirrored from CPUID "0x00000007:1 (EAX)". Add X86_FEATURE_ARCH_PERFMON_EXT under the "word 12" section. The new Architectural PerfMon Extension leaf (CPUID leaf 23) will be supported in the perf_events subsystem later. The feature will not appear in /proc/cpuinfo. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-4-kan.liang@linux.intel.com
2023-01-09perf/x86: Support Retire LatencyKan Liang
Retire Latency reports the number of elapsed core clocks between the retirement of the instruction indicated by the Instruction Pointer field of the PEBS record and the retirement of the prior instruction. It's enumerated by the IA32_PERF_CAPABILITIES.PEBS_TIMING_INFO[17]. Add flag PMU_FL_RETIRE_LATENCY to indicate the availability of the feature. The Retire Latency is not supported by the fixed counter 0 on p-core of MTL. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-3-kan.liang@linux.intel.com
2023-01-09perf/x86: Add Meteor Lake supportKan Liang
From PMU's perspective, Meteor Lake is similar to Alder Lake. Both are hybrid platforms, with e-core and p-core. The key differences include: - The e-core supports 2 PDIST GP counters (GP0 & GP1) - New MSRs for the Module Snoop Response Events on the e-core. - New Data Source fields are introduced for the e-core. - There are 8 GP counters for the e-core. - The load latency AUX event is not required for the p-core anymore. - Retire Latency (Support in a separate patch) for both cores. Since most of the code in the intel_pmu_init() should be the same as Alder Lake, to avoid code duplication, share the path with Alder Lake. Add new specific functions of extra_regs, and get_event_constraints to support the OCR events, Module Snoop Response Events and 2 PDIST GP counters on e-core. Add new MTL specific mem_attrs which drops the load latency AUX event. The Data Source field is extended to 4:0, which can contains max 32 sources. The Retire Latency is implemented with a separate patch. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20230104201349.1451191-2-kan.liang@linux.intel.com
2023-01-09perf/x86/intel/uncore: Add Emerald RapidsKan Liang
From the perspective of the uncore PMU, the new Emerald Rapids is the same as the Sapphire Rapids. The only difference is the event list, which will be supported in the perf tool later. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230106160449.3566477-4-kan.liang@linux.intel.com
2023-01-09perf/x86/msr: Add Emerald RapidsKan Liang
The same as Sapphire Rapids, the SMI_COUNT MSR is also supported on Emerald Rapids. Add Emerald Rapids model. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230106160449.3566477-3-kan.liang@linux.intel.com
2023-01-09perf/x86/msr: Add Meteor Lake supportKan Liang
Meteor Lake is Intel's successor to Raptor lake. PPERF and SMI_COUNT MSRs are also supported. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/20230104201349.1451191-7-kan.liang@linux.intel.com
2023-01-09perf/x86/cstate: Add Meteor Lake supportKan Liang
Meteor Lake is Intel's successor to Raptor lake. From the perspective of Intel cstate residency counters, there is nothing changed compared with Raptor lake. Share adl_cstates with Raptor lake. Update the comments for Meteor Lake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/20230104201349.1451191-6-kan.liang@linux.intel.com
2023-01-09KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUIDPaolo Bonzini
Passing the host topology to the guest is almost certainly wrong and will confuse the scheduler. In addition, several fields of these CPUID leaves vary on each processor; it is simply impossible to return the right values from KVM_GET_SUPPORTED_CPUID in such a way that they can be passed to KVM_SET_CPUID2. The values that will most likely prevent confusion are all zeroes. Userspace will have to override it anyway if it wishes to present a specific topology to the guest. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-09KVM: nSVM: clarify recalc_intercepts() wrt CR8Paolo Bonzini
The mysterious comment "We only want the cr8 intercept bits of L1" dates back to basically the introduction of nested SVM, back when the handling of "less typical" hypervisors was very haphazard. With the development of kvm-unit-tests for interrupt handling, the same code grew another vmcb_clr_intercept for the interrupt window (VINTR) vmexit, this time with a comment that is at least decent. It turns out however that the same comment applies to the CR8 write intercept, which is also a "recheck if an interrupt should be injected" intercept. The CR8 read intercept instead has not been used by KVM for 14 years (commit 649d68643ebf, "KVM: SVM: sync TPR value to V_TPR field in the VMCB"), so do not bother clearing it and let one comment describe both CR8 write and VINTR handling. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-09x86/xen: Remove the unused function p2m_index()Jiapeng Chong
The function p2m_index is defined in the p2m.c file, but not called elsewhere, so remove this unused function. arch/x86/xen/p2m.c:137:24: warning: unused function 'p2m_index'. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3557 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230105090141.36248-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-01-07x86/kprobes: Drop removed INT3 handling codeMasami Hiramatsu (Google)
Drop removed INT3 handling code from kprobe_int3_handler() because this case (get_kprobe() doesn't return corresponding kprobe AND the INT3 is removed) must not happen with the kprobe managed INT3, but can happen with the non-kprobe INT3, which should be handled by other callbacks. For the kprobe managed INT3, it is already safe. The commit 5c02ece81848d ("x86/kprobes: Fix ordering while text-patching") introduced text_poke_sync() to the arch_disarm_kprobe() right after removing INT3. Since this text_poke_sync() uses IPI to call sync_core() on all online cpus, that ensures that all running INT3 exception handlers have done. And, the unregister_kprobe() will remove the kprobe from the hash table after arch_disarm_kprobe(). Thus, when the kprobe managed INT3 hits, kprobe_int3_handler() should be able to find corresponding kprobe always by get_kprobe(). If it can not find any kprobe, this means that is NOT a kprobe managed INT3. Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/166981518895.1131462.4693062055762912734.stgit@devnote3
2023-01-07x86/mce/dev-mcelog: use strscpy() to instead of strncpy()Xu Panda
The implementation of strscpy() is more robust and safer. That's now the recommended way to copy NUL terminated strings. Signed-off-by: Xu Panda <xu.panda@zte.com.cn> Signed-off-by: Yang Yang <yang.yang29@zte.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/202212031419324523731@zte.com.cn
2023-01-06Merge tag 'perf-urgent-2023-01-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "Intel RAPL updates for new model IDs" * tag 'perf-urgent-2023-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Add support for Intel Emerald Rapids perf/x86/rapl: Add support for Intel Meteor Lake perf/x86/rapl: Treat Tigerlake like Icelake
2023-01-06crypto: x86/aria - implement aria-avx512Taehee Yoo
aria-avx512 implementation uses AVX512 and GFNI. It supports 64way parallel processing. So, byteslicing code is changed to support 64way parallel. And it exports some aria-avx2 functions such as encrypt() and decrypt(). AVX and AVX2 have 16 registers. They should use memory to store/load state because of lack of registers. But AVX512 supports 32 registers. So, it doesn't require store/load in the s-box layer. It means that it can reduce overhead of store/load in the s-box layer. Also code become much simpler. Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100: ARIA-AVX512(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx512) encryption tcrypt: 1 operation in 1504 cycles (1024 bytes) tcrypt: 1 operation in 4595 cycles (4096 bytes) tcrypt: 1 operation in 1763 cycles (1024 bytes) tcrypt: 1 operation in 5540 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx512) decryption tcrypt: 1 operation in 1502 cycles (1024 bytes) tcrypt: 1 operation in 4615 cycles (4096 bytes) tcrypt: 1 operation in 1759 cycles (1024 bytes) tcrypt: 1 operation in 5554 cycles (4096 bytes) ARIA-AVX2 with GFNI(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption tcrypt: 1 operation in 2003 cycles (1024 bytes) tcrypt: 1 operation in 5867 cycles (4096 bytes) tcrypt: 1 operation in 2358 cycles (1024 bytes) tcrypt: 1 operation in 7295 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption tcrypt: 1 operation in 2004 cycles (1024 bytes) tcrypt: 1 operation in 5956 cycles (4096 bytes) tcrypt: 1 operation in 2409 cycles (1024 bytes) tcrypt: 1 operation in 7564 cycles (4096 bytes) Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-06crypto: x86/aria - implement aria-avx2Taehee Yoo
aria-avx2 implementation uses AVX2, AES-NI, and GFNI. It supports 32way parallel processing. So, byteslicing code is changed to support 32way parallel. And it exports some aria-avx functions such as encrypt() and decrypt(). There are two main logics, s-box layer and diffusion layer. These codes are the same as aria-avx implementation. But some instruction are exchanged because they don't support 256bit registers. Also, AES-NI doesn't support 256bit register. So, aesenclast and aesdeclast are used twice like below: vextracti128 $1, ymm0, xmm6; vaesenclast xmm7, xmm0, xmm0; vaesenclast xmm7, xmm6, xmm6; vinserti128 $1, xmm6, ymm0, ymm0; Benchmark with modprobe tcrypt mode=610 num_mb=8192, i3-12100: ARIA-AVX2 with GFNI(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) encryption tcrypt: 1 operation in 2003 cycles (1024 bytes) tcrypt: 1 operation in 5867 cycles (4096 bytes) tcrypt: 1 operation in 2358 cycles (1024 bytes) tcrypt: 1 operation in 7295 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx2) decryption tcrypt: 1 operation in 2004 cycles (1024 bytes) tcrypt: 1 operation in 5956 cycles (4096 bytes) tcrypt: 1 operation in 2409 cycles (1024 bytes) tcrypt: 1 operation in 7564 cycles (4096 bytes) ARIA-AVX with GFNI(128bit and 256bit) testing speed of multibuffer ecb(aria) (ecb-aria-avx) encryption tcrypt: 1 operation in 2761 cycles (1024 bytes) tcrypt: 1 operation in 9390 cycles (4096 bytes) tcrypt: 1 operation in 3401 cycles (1024 bytes) tcrypt: 1 operation in 11876 cycles (4096 bytes) testing speed of multibuffer ecb(aria) (ecb-aria-avx) decryption tcrypt: 1 operation in 2735 cycles (1024 bytes) tcrypt: 1 operation in 9424 cycles (4096 bytes) tcrypt: 1 operation in 3369 cycles (1024 bytes) tcrypt: 1 operation in 11954 cycles (4096 bytes) Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-06crypto: x86/aria - do not use magic number offsets of aria_ctxTaehee Yoo
aria-avx assembly code accesses members of aria_ctx with magic number offset. If the shape of struct aria_ctx is changed carelessly, aria-avx will not work. So, we need to ensure accessing members of aria_ctx with correct offset values, not with magic numbers. It adds ARIA_CTX_enc_key, ARIA_CTX_dec_key, and ARIA_CTX_rounds in the asm-offsets.c So, correct offset definitions will be generated. aria-avx assembly code can access members of aria_ctx safely with these definitions. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-06crypto: x86/aria - add keystream array into request ctxTaehee Yoo
avx accelerated aria module used local keystream array. But, keystream array size is too big. So, it puts the keystream array into request ctx. Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2023-01-06x86/rtc: Simplify PNP ids checkHans de Goede
compare_pnp_id() already iterates over the single linked pnp_ids list starting with the id past to it. So there is no need for add_rtc_cmos() to call compare_pnp_id() for each id on the list. No change in functionality intended. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: linux-kernel@vger.kernel.org
2023-01-06x86/signal/compat: Move sigaction_compat_abi() to signal_64.cBrian Gerst
Also remove the now-empty signal_compat.c. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221219193904.190220-3-brgerst@gmail.com Cc: Al Viro <viro@zeniv.linux.org.uk>
2023-01-06x86/signal: Move siginfo field testsBrian Gerst
Move the tests to the appropriate signal_$(BITS).c file. Convert them to use static_assert(), removing the need for a dummy function. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20221219193904.190220-2-brgerst@gmail.com Cc: Al Viro <viro@zeniv.linux.org.uk>
2023-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-05elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}Catalin Marinas
A subsequent fix for arm64 will use this parameter to parse the vma information from the snapshot created by dump_vma_snapshot() rather than traversing the vma list without the mmap_lock. Fixes: 6dd8b1a0b6cb ("arm64: mte: Dump the MTE tags in the core file") Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Seth Jenkins <sethjenkins@google.com> Suggested-by: Seth Jenkins <sethjenkins@google.com> Cc: Will Deacon <will@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221222181251.1345752-3-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2023-01-05x86/alternatives: Add alt_instr.flagsBorislav Petkov (AMD)
Add a struct alt_instr.flags field which will contain different flags controlling alternatives patching behavior. The initial idea was to be able to specify it as a separate macro parameter but that would mean touching all possible invocations of the alternatives macros and thus a lot of churn. What is more, as PeterZ suggested, being able to say ALT_NOT(feature) is very readable and explains exactly what is meant. So make the feature field a u32 where the patching flags are the upper u16 part of the dword quantity while the lower u16 word is the feature. The highest feature number currently is 0x26a (i.e., word 19) so there is plenty of space. If that becomes insufficient, the field can be extended to u64 which will then make struct alt_instr of the nice size of 16 bytes (14 bytes currently). There should be no functional changes resulting from this. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/Y6RCoJEtxxZWwotd@zn.tnic
2023-01-04Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== bpf-next 2023-01-04 We've added 45 non-merge commits during the last 21 day(s) which contain a total of 50 files changed, 1454 insertions(+), 375 deletions(-). The main changes are: 1) Fixes, improvements and refactoring of parts of BPF verifier's state equivalence checks, from Andrii Nakryiko. 2) Fix a few corner cases in libbpf's BTF-to-C converter in particular around padding handling and enums, also from Andrii Nakryiko. 3) Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata, from Christian Ehrig. 4) Improve x86 JIT's codegen for PROBE_MEM runtime error checks, from Dave Marchevsky. 5) Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers, from Jiri Olsa. 6) Add proper documentation for BPF_MAP_TYPE_SOCK{MAP,HASH} maps, from Maryam Tahhan. 7) Improvements in libbpf's btf_parse_elf error handling, from Changbin Du. 8) Bigger batch of improvements to BPF tracing code samples, from Daniel T. Lee. 9) Add LoongArch support to libbpf's bpf_tracing helper header, from Hengqi Chen. 10) Fix a libbpf compiler warning in perf_event_open_probe on arm32, from Khem Raj. 11) Optimize bpf_local_storage_elem by removing 56 bytes of padding, from Martin KaFai Lau. 12) Use pkg-config to locate libelf for resolve_btfids build, from Shen Jiamin. 13) Various libbpf improvements around API documentation and errno handling, from Xin Liu. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (45 commits) libbpf: Return -ENODATA for missing btf section libbpf: Add LoongArch support to bpf_tracing.h libbpf: Restore errno after pr_warn. libbpf: Added the description of some API functions libbpf: Fix invalid return address register in s390 samples/bpf: Use BPF_KSYSCALL macro in syscall tracing programs samples/bpf: Fix tracex2 by using BPF_KSYSCALL macro samples/bpf: Change _kern suffix to .bpf with syscall tracing program samples/bpf: Use vmlinux.h instead of implicit headers in syscall tracing program samples/bpf: Use kyscall instead of kprobe in syscall tracing program bpf: rename list_head -> graph_root in field info types libbpf: fix errno is overwritten after being closed. bpf: fix regs_exact() logic in regsafe() to remap IDs correctly bpf: perform byte-by-byte comparison only when necessary in regsafe() bpf: reject non-exact register type matches in regsafe() bpf: generalize MAYBE_NULL vs non-MAYBE_NULL rule bpf: reorganize struct bpf_reg_state fields bpf: teach refsafe() to take into account ID remapping bpf: Remove unused field initialization in bpf's ctl_table selftests/bpf: Add jit probe_mem corner case tests to s390x denylist ... ==================== Link: https://lore.kernel.org/r/20230105000926.31350-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>