path: root/arch/x86
Age  Commit message  Author
2023-12-05  x86/pci: Rename 'MMCONFIG' to 'ECAM', use pr_fmt  [Bjorn Helgaas]
The "MMCONFIG" term is not used in PCI/PCIe specs. Replace it with "ECAM", the term used in PCIe r6.0, sec 7.2.2. Define pr_fmt() instead of repeating PREFIX in every log message. Link: https://lore.kernel.org/r/20231121183643.249006-5-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05  x86/pci: Add MCFG debug logging  [Bjorn Helgaas]
MCFG handling is a frequent source of problems. Add more logging to aid in debugging. Enable the logging with CONFIG_DYNAMIC_DEBUG=y and the kernel boot parameter 'dyndbg="file arch/x86/pci +p"'. Link: https://lore.kernel.org/r/20231121183643.249006-4-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05  x86/pci: Reword ECAM EfiMemoryMappedIO logging to avoid 'reserved'  [Bjorn Helgaas]
fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space") added the concept of using the EFI memory map to help decide whether ECAM space mentioned in the MCFG table is valid. Unfortunately it described that EfiMemoryMappedIO space as "reserved", but it is actually not *reserved* by the EFI memory map. EfiMemoryMappedIO only means the firmware requested that the OS map this space for use by firmware runtime services. Change the dmesg logging to describe it as simply "EfiMemoryMappedIO", not as "reserved as EfiMemoryMappedIO". A previous commit actually *does* reserve the space if ACPI PNP0C01/02 devices haven't done so: - PCI: ECAM at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO + PCI: ECAM at [mem 0xe0000000-0xefffffff] is EfiMemoryMappedIO; assuming valid PCI: ECAM [mem 0xe0000000-0xefffffff] reserved to work around lack of ACPI motherboard _CRS Link: https://lore.kernel.org/r/20231121183643.249006-3-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2023-12-05  x86/pci: Reserve ECAM if BIOS didn't include it in PNP0C02 _CRS  [Bjorn Helgaas]
Tomasz, Sebastian, and some Proxmox users reported problems initializing ixgbe NICs. I think the problem is that ECAM space described in the ACPI MCFG table is not reserved via a PNP0C02 _CRS method as required by the PCI Firmware spec (r3.3, sec 4.1.2), but it *is* included in the PNP0A03 host bridge _CRS as part of the MMIO aperture.

If we allocate space for a PCI BAR, we're likely to allocate it from that ECAM space, which obviously cannot work. This could happen for any device, but in the ixgbe case it happens because it's an SR-IOV device and the BIOS didn't allocate space for the VF BARs, so Linux reallocated the bridge window leading to ixgbe and put it on top of the ECAM space.

From Tomasz' system:

  PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
  PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] not reserved in ACPI motherboard resources
  pci_bus 0000:00: root bus resource [mem 0x80000000-0xfbffffff window]
  pci 0000:00:01.1: PCI bridge to [bus 02-03]
  pci 0000:00:01.1:   bridge window [mem 0xfb900000-0xfbbfffff]
  pci 0000:02:00.0: [8086:10fb] type 00 class 0x020000              # ixgbe
  pci 0000:02:00.0: reg 0x10: [mem 0xfba80000-0xfbafffff 64bit]
  pci 0000:02:00.0: VF(n) BAR0 space: [mem 0x00000000-0x000fffff 64bit] (contains BAR0 for 64 VFs)
  pci 0000:02:00.0: BAR 7: no space for [mem size 0x00100000 64bit]  # VF BAR 0
  pci_bus 0000:00: No. 2 try to assign unassigned res
  pci 0000:00:01.1: resource 14 [mem 0xfb900000-0xfbbfffff] released
  pci 0000:00:01.1: BAR 14: assigned [mem 0x80000000-0x806fffff]
  pci 0000:02:00.0: BAR 0: assigned [mem 0x80000000-0x8007ffff 64bit]
  pci 0000:02:00.0: BAR 7: assigned [mem 0x80204000-0x80303fff 64bit]  # VF BAR 0

Fixes: 07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map")
Fixes: fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space")
Reported-by: Tomasz Pala <gotar@polanet.pl>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218050
Reported-by: Sebastian Manciulea <manciuleas@protonmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=218107
Link: https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/
Link: https://lore.kernel.org/r/20231121183643.249006-2-helgaas@kernel.org
Tested-by: Tomasz Pala <gotar@polanet.pl>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # v6.2+
2023-12-05  mm/slab: remove CONFIG_SLAB from all Kconfig and Makefile  [Vlastimil Babka]
Remove CONFIG_SLAB, CONFIG_DEBUG_SLAB, CONFIG_SLAB_DEPRECATED and everything in Kconfig files and mm/Makefile that depends on those.

Since SLUB is the only remaining allocator, remove the allocator choice, make CONFIG_SLUB a "def_bool y" for now and remove all explicit dependencies on SLUB or SLAB as it's now always enabled. Make every option's verbose name and description refer to "the slab allocator" without referring to the specific implementation. Do not rename the CONFIG_ option names yet.

Everything under #ifdef CONFIG_SLAB, and mm/slab.c is now dead code, all code under #ifdef CONFIG_SLUB is now always compiled.

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Tested-by: David Rientjes <rientjes@google.com>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2023-12-03  x86/microcode/intel: Set new revision only after a successful update  [Borislav Petkov (AMD)]
This was meant to be done only when early microcode got updated successfully. Move it into the if-branch. Also, make sure the current revision is read unconditionally and only once. Fixes: 080990aa3344 ("x86/microcode: Rework early revisions reporting") Reported-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Ashok Raj <ashok.raj@intel.com> Link: https://lore.kernel.org/r/ZWjVt5dNRjbcvlzR@a4bf019067fa.jf.intel.com
2023-12-03  Merge tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip  [Linus Torvalds]
Pull xen fixes from Juergen Gross:

 - A fix for the Xen event driver setting the correct return value when
   experiencing an allocation failure

 - A fix for allocating space for a struct in the percpu area to not
   cross page boundaries (this one is for x86, a similar one for Arm was
   already in the pull request for rc3)

* tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen/events: fix error code in xen_bind_pirq_msi_to_irq()
  x86/xen: fix percpu vcpu_info allocation
2023-12-02  x86/CPU/AMD: Check vendor in the AMD microcode callback  [Borislav Petkov (AMD)]
Commit in Fixes added an AMD-specific microcode callback. However, it didn't check the CPU vendor the kernel runs on explicitly.

The only reason the Zenbleed check in it didn't run on other x86 vendors' hardware was pure coincidental luck:

  if (!cpu_has_amd_erratum(c, amd_zenbleed))
          return;

gives true on other vendors because they don't have those families and models. However, with the removal of the cpu_has_amd_erratum() in 05f5f73936fa ("x86/CPU/AMD: Drop now unused CPU erratum checking function") that coincidental condition is gone, leading to the zenbleed check getting executed on other vendors too.

Add the explicit vendor check for the whole callback as it should've been done in the first place.

Fixes: 522b1d69219d ("x86/cpu/amd: Add a Zenbleed fix")
Cc: <stable@kernel.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20231201184226.16749-1-bp@alien8.de
2023-12-01  x86/pci: Clean up open-coded PCIBIOS return code mangling  [Ilpo Järvinen]
Per PCI Firmware spec r3.3, sec 2.5.2, 2.6.2, and 2.7, the return code for these PCI BIOS interfaces is in 8 bits of the EAX register. Previously it was extracted by open-coded masks and shifting. Name the return code bits with a #define and add pcibios_get_return_code() to extract the return code to improve code readability. In addition, replace zero test with PCIBIOS_SUCCESSFUL. No functional changes intended. Link: https://lore.kernel.org/r/20231124085924.13830-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
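As a rough illustration of the helper described above (this is only a sketch, not the actual patch; the define and the exact bit positions are assumptions based on the PCI BIOS convention of returning status in %ah, i.e. EAX bits 15:8):

  #include <linux/bitfield.h>
  #include <linux/bits.h>

  #define PCIBIOS_RETURN_CODE	GENMASK(15, 8)	/* status byte (%ah) of EAX */

  static inline int pcibios_get_return_code(u32 eax)
  {
  	return FIELD_GET(PCIBIOS_RETURN_CODE, eax);
  }

  /*
   * Call sites then read as:
   *	if (pcibios_get_return_code(result) != PCIBIOS_SUCCESSFUL)
   *		...
   * instead of open-coding a mask-and-shift and comparing against zero.
   */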
2023-12-01  x86/pci: Use PCI_HEADER_TYPE_* instead of literals  [Ilpo Järvinen]
Replace 0x7f and 0x80 literals with PCI_HEADER_TYPE_* defines. Link: https://lore.kernel.org/r/20231124090919.23687-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
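Roughly, the change swaps the magic masks for symbolic names when inspecting the header type byte. A sketch under the assumption that the PCI_HEADER_TYPE_MASK and PCI_HEADER_TYPE_MFD defines from pci_regs.h are what the patch uses; the two helper calls are placeholders, not real functions:

  u8 hdr_type = read_pci_config_byte(bus, slot, func, PCI_HEADER_TYPE);

  /* was: (hdr_type & 0x7f) == 1 */
  if ((hdr_type & PCI_HEADER_TYPE_MASK) == PCI_HEADER_TYPE_BRIDGE)
  	scan_behind_bridge();		/* placeholder */

  /* was: hdr_type & 0x80 */
  if (hdr_type & PCI_HEADER_TYPE_MFD)
  	scan_other_functions();		/* placeholder */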
2023-12-01  x86/microcode/intel: Remove redundant microcode late updated message  [Ashok Raj]
After a successful update, the late loading routine prints an update summary similar to:

  microcode: load: updated on 128 primary CPUs with 128 siblings
  microcode: revision: 0x21000170 -> 0x21000190

Remove the redundant message in the Intel side of the driver.

[ bp: Massage commit message. ]

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/ZWjYhedNfhAUmt0k@a4bf019067fa.jf.intel.com
2023-12-01  KVM: x86: Remove 'return void' expression for 'void function'  [Like Xu]
The requested info will be stored in 'guest_xsave->region' referenced by the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need to explicitly use return void expression for a void function "static void kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic]. Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer") Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01  KVM: Set file_operations.owner appropriately for all such structures  [Sean Christopherson]
Set .owner for all KVM-owned file types so that the KVM module is pinned until any files with callbacks back into KVM are completely freed. Using "struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive while there are active VMs, doesn't provide full protection.

Userspace can invoke delete_module() the instant the last reference to KVM is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(), then it's possible for KVM to be preempted and deleted/unloaded before KVM fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back in, it will jump to a code page that is no longer mapped.

Note, file types that can call into sub-module code, e.g. kvm-intel.ko or kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is reachable, e.g. VMs are active, then the vendor module is loaded.

To reduce the probability of forgetting to set .owner entirely, use THIS_MODULE for stats files where KVM does not call back into vendor code.

This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes several other file types that have been buggy since their introduction.

Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
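The general shape of the fix, shown on a made-up stats file (the struct and handler names below are illustrative, not the actual KVM ones): setting .owner makes the VFS take a module reference for as long as the file is open, so delete_module() cannot yank the code out from under a callback.

  static const struct file_operations kvm_example_stats_fops = {
  	.owner		= THIS_MODULE,	/* pin kvm.ko while the fd is open */
  	.read		= kvm_example_stats_read,
  	.release	= kvm_example_stats_release,
  	.llseek		= noop_llseek,
  };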
2023-12-01  KVM: x86/mmu: fix comment about mmu_unsync_pages_lock  [Paolo Bonzini]
Fix the comment about what can and cannot happen when mmu_unsync_pages_lock is not held. The comment correctly mentions "clearing sp->unsync", but then it talks about unsync going from 0 to 1.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20231125083400.1399197-5-pbonzini@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01  KVM: x86/mmu: always take tdp_mmu_pages_lock  [Paolo Bonzini]
It is cheap to take tdp_mmu_pages_lock in all write-side critical sections. We already do it all the time when zapping with read_lock(), so it is not a problem to do it from the kvm_tdp_mmu_zap_all() path (aka kvm_arch_flush_shadow_all(), aka VM destruction and MMU notifier release). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-4-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01  KVM: x86/mmu: remove unnecessary "bool shared" argument from iterators  [Paolo Bonzini]
The "bool shared" argument is more or less unnecessary in the for_each_*_tdp_mmu_root_yield_safe() macros. Many users check for the lock before calling it; all of them either call small functions that do the check, or end up calling tdp_mmu_set_spte_atomic() and tdp_mmu_iter_set_spte(). Add a few assertions to make up for the lost check in for_each_*_tdp_mmu_root_yield_safe(), but even this is probably overkill and mostly for documentation reasons. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-3-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01  KVM: x86/mmu: remove unnecessary "bool shared" argument from functions  [Paolo Bonzini]
Neither tdp_mmu_next_root nor kvm_tdp_mmu_put_root need to know if the lock is taken for read or write. Either way, protection is achieved via RCU and tdp_mmu_pages_lock. Remove the argument and just assert that the lock is taken. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20231125083400.1399197-2-pbonzini@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01  KVM: x86/mmu: Check for leaf SPTE when clearing dirty bit in the TDP MMU  [David Matlack]
Re-check that the given SPTE is still a leaf and present SPTE after a failed cmpxchg in clear_dirty_gfn_range(). clear_dirty_gfn_range() intends to only operate on present leaf SPTEs, but that could change after a failed cmpxchg. A check for present was added in commit 3354ef5a592d ("KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU") but the check for leaf is still buried in tdp_root_for_each_leaf_pte() and does not get rechecked on retry. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20231027172640.2335197-3-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
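Schematically, the fix makes the retry path re-validate the SPTE before acting on it. This is only an illustrative sketch of the pattern, not the actual clear_dirty_gfn_range() code; the variable names and dirty_mask are assumptions:

  retry:
  	old_spte = READ_ONCE(*sptep);

  	/*
  	 * A racing update may have zapped or demoted the SPTE, so re-check
  	 * that it is still a present leaf before touching the dirty bit.
  	 */
  	if (!is_shadow_present_pte(old_spte) || !is_last_spte(old_spte, level))
  		return;

  	if (cmpxchg64(sptep, old_spte, old_spte & ~dirty_mask) != old_spte)
  		goto retry;	/* lost the race: re-read and re-validate */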
2023-12-01  KVM: x86/mmu: Fix off-by-1 when splitting huge pages during CLEAR  [David Matlack]
Fix an off-by-1 error when passing in the range of pages to kvm_mmu_try_split_huge_pages() during CLEAR_DIRTY_LOG. Specifically, end is the last page that needs to be split (inclusive) so pass in `end + 1` since kvm_mmu_try_split_huge_pages() expects the `end` to be non-inclusive.

At worst this will cause a huge page to be write-protected instead of eagerly split, which is purely a performance issue, not a correctness issue. But even that is unlikely as it would require userspace to pass in a bitmap where the last page is the only 4K page on a huge page that needs to be split.

Reported-by: Vipin Sharma <vipinsh@google.com>
Fixes: cb00a70bd4b7 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG")
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20231027172640.2335197-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
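In other words, the call site goes from passing the inclusive last page to passing one past it. A sketch with illustrative variable names, not the exact CLEAR_DIRTY_LOG code:

  /*
   * 'end' is the last gfn that needs splitting (inclusive), but the helper
   * treats its 'end' parameter as exclusive, hence the + 1.
   */
  kvm_mmu_try_split_huge_pages(kvm, slot, start, end + 1, PG_LEVEL_4K);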
2023-11-30  KVM: x86: Harden copying of userspace-array against overflow  [Philipp Stanner]
cpuid.c utilizes vmemdup_user() and array_size() to copy two userspace arrays. This, currently, does not check for an overflow. Use the new wrapper vmemdup_array_user() to copy the arrays more safely, as vmemdup_user() doesn't check for overflow. Note, KVM explicitly checks the number of entries before duplicating the array, i.e. adding the overflow check should be a glorified nop. Suggested-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Philipp Stanner <pstanner@redhat.com> Link: https://lore.kernel.org/r/20231102181526.43279-2-pstanner@redhat.com [sean: call out that KVM pre-checks the number of entries] Signed-off-by: Sean Christopherson <seanjc@google.com>
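The change amounts to swapping the open-coded size computation for the array-aware helper. A before/after sketch with illustrative variable names (not the exact cpuid.c ones):

  /* Before: the element-count multiplication happens outside the copy helper. */
  e2 = vmemdup_user(entries, array_size(nent, sizeof(*e2)));

  /* After: vmemdup_array_user() performs an overflow-checked n * size internally. */
  e2 = vmemdup_array_user(entries, nent, sizeof(*e2));

  if (IS_ERR(e2))
  	return PTR_ERR(e2);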
2023-11-30  KVM: x86/pmu: Track emulated counter events instead of previous counter  [Sean Christopherson]
Explicitly track emulated counter events instead of using the common counter value that's shared with the hardware counter owned by perf. Bumping the common counter requires snapshotting the pre-increment value in order to detect overflow from emulation, and the snapshot approach is inherently flawed.

Snapshotting the previous counter at every increment assumes that there is at most one emulated counter event per emulated instruction (or rather, between checks for KVM_REQ_PMU). That mostly holds true today because KVM only emulates (branch) instructions retired, but the approach will fall apart if KVM ever supports event types that don't have a 1:1 relationship with instructions. And KVM already has a relevant bug, as handle_invalid_guest_state() emulates multiple instructions without checking KVM_REQ_PMU, i.e. could miss an overflow event due to clobbering pmc->prev_counter. Not checking KVM_REQ_PMU is problematic in both cases, but at least with the emulated counter approach, the resulting behavior is delayed overflow detection, as opposed to completely lost detection.

Tracking the emulated count fixes another bug where the snapshot approach can signal spurious overflow due to incorporating both the emulated count and perf's count in the check, i.e. if overflow is detected by perf, then KVM's emulation will also incorrectly signal overflow. Add a comment in the related code to call out the need to process emulated events *after* pausing the perf event (big kudos to Mingwei for figuring out that particular wrinkle).

Cc: Mingwei Zhang <mizhang@google.com>
Cc: Roman Kagan <rkagan@amazon.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Dapeng Mi <dapeng1.mi@linux.intel.com>
Cc: Like Xu <like.xu.linux@gmail.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20231103230541.352265-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86/pmu: Update sample period in pmc_write_counter()  [Sean Christopherson]
Update a PMC's sample period in pmc_write_counter() to deduplicate code across all callers of pmc_write_counter(). Opportunistically move pmc_write_counter() into pmc.c now that it's doing more work. WRMSR isn't such a hot path that an extra CALL+RET pair will be problematic, and the order of function definitions needs to be changed anyways, i.e. now is a convenient time to eat the churn. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86/pmu: Remove manual clearing of fields in kvm_pmu_init()  [Sean Christopherson]
Remove code that unnecessarily clears event_count and need_cleanup in kvm_pmu_init(), the entire kvm_pmu is zeroed just a few lines earlier. Vendor code doesn't set event_count or need_cleanup during .init(), and if either VMX or SVM did set those fields it would be a flagrant bug. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86/pmu: Stop calling kvm_pmu_reset() at RESET (it's redundant)  [Sean Christopherson]
Drop kvm_vcpu_reset()'s call to kvm_pmu_reset(), the call is performed only for RESET, which is really just the same thing as vCPU creation, and kvm_arch_vcpu_create() *just* called kvm_pmu_init(), i.e. there can't possibly be any work to do. Unlike Intel, AMD's amd_pmu_refresh() does fill all_valid_pmc_idx even if guest CPUID is empty, but everything that is at all dynamic is guaranteed to be '0'/NULL, e.g. it should be impossible for KVM to have already created a perf event. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86/pmu: Reset the PMU, i.e. stop counters, before refreshing  [Sean Christopherson]
Stop all counters and release all perf events before refreshing the vPMU, i.e. before reconfiguring the vPMU to respond to changes in the vCPU model. Clear need_cleanup in kvm_pmu_reset() as well so that KVM doesn't prematurely stop counters, e.g. if KVM enters the guest and enables counters before the vCPU is scheduled out. Cc: stable@vger.kernel.org Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20231103230541.352265-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86/pmu: Move PMU reset logic to common x86 code  [Sean Christopherson]
Move the common (or at least "ignored") aspects of resetting the vPMU to common x86 code, along with the stop/release helpers that are now used only by the common pmu.c.

There is no need to manually handle fixed counters as all_valid_pmc_idx tracks both fixed and general purpose counters, and resetting the vPMU is far from a hot path, i.e. the extra bit of overhead to get the PMC from the index is a non-issue.

Zero fixed_ctr_ctrl in common code even though it's Intel specific. Ensuring it's zero doesn't harm AMD/SVM in any way, and stopping the fixed counters via all_valid_pmc_idx, but not clearing the associated control bits, would be odd/confusing.

Make the .reset() hook optional as SVM no longer needs vendor specific handling.

Cc: stable@vger.kernel.org
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20231103230541.352265-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: SVM,VMX: Use %rip-relative addressing to access kvm_rebooting  [Uros Bizjak]
Instruction with %rip-relative address operand is one byte shorter than its absolute address counterpart and is also compatible with position independent executable (-fpie) build. No functional changes intended. Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20231031075312.47525-1-ubizjak@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: SVM: Don't intercept IRET when injecting NMI and vNMI is enabled  [Sean Christopherson]
When vNMI is enabled, rely entirely on hardware to correctly handle NMI blocking, i.e. don't intercept IRET to detect when NMIs are no longer blocked. KVM already correctly ignores svm->nmi_masked when vNMI is enabled, so the effect of the bug is essentially an unnecessary VM-Exit.

KVM intercepts IRET for two reasons:

 - To track NMI masking to be able to know at any point of time if NMI is masked.
 - To track NMI windows (to inject another NMI after the guest executes IRET, i.e. unblocks NMIs)

When vNMI is enabled, both cases are handled by hardware:

 - NMI masking state resides in int_ctl.V_NMI_BLOCKING and can be read by KVM at will.
 - Hardware automatically "injects" pending virtual NMIs when virtual NMIs become unblocked.

However, even though pending a virtual NMI for hardware to handle is the most common way to synthesize a guest NMI, KVM may still directly inject an NMI via event injection when KVM is handling two "simultaneous" NMIs (see comments in process_nmi() for details on KVM's simultaneous NMI handling). Per AMD's APM, hardware sets the BLOCKING flag when software directly injects an NMI as well, i.e. KVM doesn't need to manually mark vNMIs as blocked:

  If Event Injection is used to inject an NMI when NMI Virtualization is enabled, VMRUN sets V_NMI_MASK in the guest state.

Note, it's still possible that KVM could trigger a spurious IRET VM-Exit. When running a nested guest, KVM disables vNMI for L2 and thus will enable IRET interception (in both vmcb01 and vmcb02) while running L2 for this reason. If a nested VM-Exit happens before L2 executes IRET, KVM can end up running L1 with vNMI enabled and IRET intercepted. This is also a benign bug, and even less likely to happen, i.e. can be safely punted to a future fix.

Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
Link: https://lore.kernel.org/all/ZOdnuDZUd4mevCqe@google.como
Cc: Santosh Shukla <santosh.shukla@amd.com>
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Santosh Shukla <santosh.shukla@amd.com>
Link: https://lore.kernel.org/r/20231018192021.1893261-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: SVM: Explicitly require FLUSHBYASID to enable SEV support  [Sean Christopherson]
Add a sanity check that FLUSHBYASID is available if SEV is supported in hardware, as SEV (and beyond) guests are bound to a single ASID, i.e. KVM can't "flush" by assigning a new, fresh ASID to the guest. If FLUSHBYASID isn't supported for some bizarre reason, KVM would completely fail to do TLB flushes for SEV+ guests (see pre_svm_run() and pre_sev_run()). Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20231018193617.1895752-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: nSVM: Advertise support for flush-by-ASID  [Sean Christopherson]
Advertise support for FLUSHBYASID when nested SVM is enabled, as KVM can always emulate flushing TLB entries for a vmcb12 ASID, e.g. by running L2 with a new, fresh ASID in vmcb02. Some modern hypervisors, e.g. VMWare Workstation 17, require FLUSHBYASID support and will refuse to run if it's not present. Punt on proper support, as "Honor L1's request to flush an ASID on nested VMRUN" is one of the TODO items in the (incomplete) list of issues that need to be addressed in order for KVM to NOT do a full TLB flush on every nested SVM transition (see nested_svm_transition_tlb_flush()). Reported-by: Stefan Sterz <s.sterz@proxmox.com> Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231018194104.1896415-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30Revert "nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"Sean Christopherson
Revert KVM's made-up consistency check on SVM's TLB control. The APM says that unsupported encodings are reserved, but the APM doesn't state that VMRUN checks for a supported encoding. Unless something is called out in "Canonicalization and Consistency Checks" or listed as MBZ (Must Be Zero), AMD behavior is typically to let software shoot itself in the foot. This reverts commit 174a921b6975ef959dd82ee9e8844067a62e3ec1. Fixes: 174a921b6975 ("nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB") Reported-by: Stefan Sterz <s.sterz@proxmox.com> Closes: https://lkml.kernel.org/r/b9915c9c-4cf6-051a-2d91-44cc6380f455%40proxmox.com Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231018194104.1896415-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86: Don't unnecessarily force masterclock update on vCPU hotplug  [Sean Christopherson]
Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, e.g. when userspace hotplugs a pre-created vCPU into the VM.

Unnecessarily updating the masterclock is undesirable as it can cause kvmclock's time to jump, which is particularly painful on systems with a stable TSC as kvmclock _should_ be fully reliable on such systems.

The unexpected time jumps are due to differences in the TSC=>nanoseconds conversion algorithms between kvmclock and the host's CLOCK_MONOTONIC_RAW (the pvclock algorithm is inherently lossy). When updating the masterclock, KVM refreshes the "base", i.e. moves the elapsed time since the last update from the kvmclock/pvclock algorithm to the CLOCK_MONOTONIC_RAW algorithm. Synchronizing kvmclock with CLOCK_MONOTONIC_RAW is the lesser of evils when the TSC is unstable, but adds no real value when the TSC is stable.

Prior to commit 7f187922ddf6 ("KVM: x86: update masterclock values on TSC writes"), KVM did NOT force an update when synchronizing a vCPU to the current generation.

  commit 7f187922ddf6b67f2999a76dcb71663097b75497
  Author: Marcelo Tosatti <mtosatti@redhat.com>
  Date:   Tue Nov 4 21:30:44 2014 -0200

    KVM: x86: update masterclock values on TSC writes

    When the guest writes to the TSC, the masterclock TSC copy must be
    updated as well along with the TSC_OFFSET update, otherwise a negative
    tsc_timestamp is calculated at kvm_guest_time_update.

    Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
    "if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
    becomes redundant, so remove the do_request boolean and collapse
    everything into a single condition.

Before that, KVM only re-synced the masterclock if the masterclock was enabled or disabled. Note, at the time of the above commit, VMX synchronized TSC on *guest* writes to MSR_IA32_TSC:

  case MSR_IA32_TSC:
          kvm_write_tsc(vcpu, msr_info);
          break;

which is why the changelog specifically says "guest writes", but the bug that was being fixed wasn't unique to guest write, i.e. a TSC write from the host would suffer the same problem. So even though KVM stopped synchronizing on guest writes as of commit 0c899c25d754 ("KVM: x86: do not attempt TSC synchronization on guest writes"), simply reverting commit 7f187922ddf6 is not an option.

Figuring out how a negative tsc_timestamp could be computed requires a bit more sleuthing. In kvm_write_tsc() (at the time), except for KVM's "less than 1 second" hack, KVM snapshotted the vCPU's current TSC *and* the current time in nanoseconds, where kvm->arch.cur_tsc_nsec is the current host kernel time in nanoseconds:

  ns = get_kernel_ns();
  ...
  if (usdiff < USEC_PER_SEC &&
      vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
          ...
  } else {
          /*
           * We split periods of matched TSC writes into generations.
           * For each generation, we track the original measured
           * nanosecond time, offset, and write, so if TSCs are in
           * sync, we can match exact offset, and if not, we can match
           * exact software computation in compute_guest_tsc()
           *
           * These values are tracked in kvm->arch.cur_xxx variables.
           */
          kvm->arch.cur_tsc_generation++;
          kvm->arch.cur_tsc_nsec = ns;
          kvm->arch.cur_tsc_write = data;
          kvm->arch.cur_tsc_offset = offset;
          matched = false;
          pr_debug("kvm: new tsc generation %llu, clock %llu\n",
                   kvm->arch.cur_tsc_generation, data);
  }
  ...
  /* Keep track of which generation this VCPU has synchronized to */
  vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
  vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
  vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

Note that the above creates a new generation and sets "matched" to false! But because kvm_track_tsc_matching() looks for matched+1, i.e. doesn't require the vCPU that creates the new generation to match itself, KVM would immediately compute vcpus_matched as true for VMs with a single vCPU. As a result, KVM would skip the masterclock update, even though a new TSC generation was created:

  vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
                   atomic_read(&vcpu->kvm->online_vcpus));

  if (vcpus_matched && gtod->clock.vclock_mode == VCLOCK_TSC)
          if (!ka->use_master_clock)
                  do_request = 1;

  if (!vcpus_matched && ka->use_master_clock)
          do_request = 1;

  if (do_request)
          kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

On hardware without TSC scaling support, vcpu->tsc_catchup is set to true if the guest TSC frequency is faster than the host TSC frequency, even if the TSC is otherwise stable. And for that mode, kvm_guest_time_update(), by way of compute_guest_tsc(), uses vcpu->arch.this_tsc_nsec, a.k.a. the kernel time at the last TSC write, to compute the guest TSC relative to kernel time:

  static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
  {
          u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
                                        vcpu->arch.virtual_tsc_mult,
                                        vcpu->arch.virtual_tsc_shift);
          tsc += vcpu->arch.this_tsc_write;
          return tsc;
  }

Except the "kernel_ns" passed to compute_guest_tsc() isn't the current kernel time, it's the masterclock snapshot!

  spin_lock(&ka->pvclock_gtod_sync_lock);
  use_master_clock = ka->use_master_clock;
  if (use_master_clock) {
          host_tsc = ka->master_cycle_now;
          kernel_ns = ka->master_kernel_ns;
  }
  spin_unlock(&ka->pvclock_gtod_sync_lock);

  if (vcpu->tsc_catchup) {
          u64 tsc = compute_guest_tsc(v, kernel_ns);

          if (tsc > tsc_timestamp) {
                  adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
                  tsc_timestamp = tsc;
          }
  }

And so when KVM skips the masterclock update after a TSC write, i.e. after a new TSC generation is started, the "kernel_ns-vcpu->arch.this_tsc_nsec" is *guaranteed* to generate a negative value, because this_tsc_nsec was captured after ka->master_kernel_ns.

Forcing a masterclock update essentially fudged around that problem, but in a heavy handed way that introduced undesirable side effects, i.e. unnecessarily forces a masterclock update when a new vCPU joins the party via hotplug.

Note, KVM forces masterclock updates in other weird ways that are also likely unnecessary, e.g. when establishing a new Xen shared info page and when userspace creates a brand new vCPU. But the Xen thing is firmly a separate mess, and there are no known userspace VMMs that utilize kvmclock *and* create new vCPUs after the VM is up and running. I.e. the other issues are future problems.

Reported-by: Dongli Zhang <dongli.zhang@oracle.com>
Closes: https://lore.kernel.org/all/20230926230649.67852-1-dongli.zhang@oracle.com
Fixes: 7f187922ddf6 ("KVM: x86: update masterclock values on TSC writes")
Cc: David Woodhouse <dwmw2@infradead.org>
Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com>
Tested-by: Dongli Zhang <dongli.zhang@oracle.com>
Link: https://lore.kernel.org/r/20231018195638.1898375-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86: Use a switch statement and macros in __feature_translate()  [Jim Mattson]
Use a switch statement with macro-generated case statements to handle translating feature flags in order to reduce the probability of runtime errors due to copy+paste goofs, to make compile-time errors easier to debug, and to make the code more readable.

E.g. the compiler won't directly generate an error for duplicate if statements

  if (x86_feature == X86_FEATURE_SGX1)
          return KVM_X86_FEATURE_SGX1;
  else if (x86_feature == X86_FEATURE_SGX2)
          return KVM_X86_FEATURE_SGX1;

and so instead reverse_cpuid_check() will fail due to the untranslated entry pointing at a Linux-defined leaf, which provides practically no hint as to what is broken

  arch/x86/kvm/reverse_cpuid.h:108:2: error: call to __compiletime_assert_450 declared with 'error' attribute:
                                      BUILD_BUG_ON failed: x86_leaf == CPUID_LNX_4
          BUILD_BUG_ON(x86_leaf == CPUID_LNX_4);
          ^

whereas duplicate case statements very explicitly point at the offending code:

  arch/x86/kvm/reverse_cpuid.h:125:2: error: duplicate case value '361'
          KVM_X86_TRANSLATE_FEATURE(SGX2);
          ^
  arch/x86/kvm/reverse_cpuid.h:124:2: error: duplicate case value '360'
          KVM_X86_TRANSLATE_FEATURE(SGX1);
          ^

And without macros, the opposite type of copy+paste goof doesn't generate any error at compile-time, e.g. this yields no complaints:

  case X86_FEATURE_SGX1:
          return KVM_X86_FEATURE_SGX1;
  case X86_FEATURE_SGX2:
          return KVM_X86_FEATURE_SGX1;

Note, __feature_translate() is forcibly inlined and the feature is known at compile-time, so the code generation between an if-elif sequence and a switch statement should be identical.

Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20231024001636.890236-2-jmattson@google.com
[sean: use a macro, rewrite changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86: Advertise CPUID.(EAX=7,ECX=2):EDX[5:0] to userspace  [Jim Mattson]
The low five bits {INTEL_PSFD, IPRED_CTRL, RRSBA_CTRL, DDPD_U, BHI_CTRL} advertise the availability of specific bits in IA32_SPEC_CTRL. Since KVM dynamically determines the legal IA32_SPEC_CTRL bits for the underlying hardware, the hard work has already been done. Just let userspace know that a guest can use these IA32_SPEC_CTRL bits. The sixth bit (MCDT_NO) states that the processor does not exhibit MXCSR Configuration Dependent Timing (MCDT) behavior. This is an inherent property of the physical processor that is inherited by the virtual CPU. Pass that information on to userspace. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20231024001636.890236-1-jmattson@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  KVM: x86: Turn off KVM_WERROR by default for all configs  [Sean Christopherson]
Don't enable KVM_WERROR by default for x86-64 builds as KVM's one-off -Werror enabling is *mostly* superseded by the kernel-wide WERROR, and enabling KVM_WERROR by default can cause problems for developers working on other subsystems. E.g. subsystems that have a "zero W=1 regressions" rule can inadvertently build KVM with -Werror and W=1, and end up with build failures that are completely uninteresting to the developer (W=1 is prone to false positives, especially on older compilers).

Keep KVM_WERROR as there are combinations where enabling WERROR isn't feasible, e.g. the default FRAME_WARN=1024 on i386 builds generates a non-zero number of warnings and thus errors, and there are far too many warnings throughout the kernel to enable WERROR with W=1 (building KVM with -Werror is desirable (with a sane compiler) as W=1 does generate useful warnings).

Opportunistically drop the dependency on !COMPILE_TEST as it's completely meaningless (it was copied from i915's -Werror Kconfig), as the kernel's WERROR is explicitly *enabled* for COMPILE_TEST=y kernels, i.e. enabling -Werror is obviously not dependent on COMPILE_TEST=n.

Reported-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/all/20231006205415.3501535-1-kuba@kernel.org
Link: https://lore.kernel.org/r/20231018151906.1841689-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-30  perf/x86/intel/uncore: Factor out topology_gidnid_map()  [Alexander Antonov]
The same procedure is used in multiple places to retrieve the package ID from the GIDNIDMAP register. Factor out topology_gidnid_map() to avoid code duplication.

Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20231127185246.2371939-3-alexander.antonov@linux.intel.com
2023-11-30  perf/x86/intel/uncore: Fix NULL pointer dereference issue in upi_fill_topology()  [Alexander Antonov]
Get the logical socket id instead of the physical id in discover_upi_topology() to avoid an out-of-bounds access on the 'upi = &type->topology[nid][idx];' line that leads to a NULL pointer dereference in upi_fill_topology().

Fixes: f680b6e6062e ("perf/x86/intel/uncore: Enable UPI topology discovery for Icelake Server")
Reported-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Alexander Antonov <alexander.antonov@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Tested-by: Kyle Meyer <kyle.meyer@hpe.com>
Link: https://lore.kernel.org/r/20231127185246.2371939-2-alexander.antonov@linux.intel.com
2023-11-30  x86/sev: Fix kernel crash due to late update to read-only ghcb_version  [Ashwin Dayanand Kamat]
A write-access violation page fault kernel crash was observed while running cpuhotplug LTP testcases on SEV-ES enabled systems. The crash was observed during hotplug, after the CPU was offlined and the process was migrated to a different CPU. setup_ghcb() is called again which tries to update ghcb_version in sev_es_negotiate_protocol(). Ideally this is a read-only variable which is initialised during booting.

Trying to write it results in a pagefault:

  BUG: unable to handle page fault for address: ffffffffba556e70
  #PF: supervisor write access in kernel mode
  #PF: error_code(0x0003) - permissions violation
  [ ...]
  Call Trace:
   <TASK>
   ? __die_body.cold+0x1a/0x1f
   ? __die+0x2a/0x35
   ? page_fault_oops+0x10c/0x270
   ? setup_ghcb+0x71/0x100
   ? __x86_return_thunk+0x5/0x6
   ? search_exception_tables+0x60/0x70
   ? __x86_return_thunk+0x5/0x6
   ? fixup_exception+0x27/0x320
   ? kernelmode_fixup_or_oops+0xa2/0x120
   ? __bad_area_nosemaphore+0x16a/0x1b0
   ? kernel_exc_vmm_communication+0x60/0xb0
   ? bad_area_nosemaphore+0x16/0x20
   ? do_kern_addr_fault+0x7a/0x90
   ? exc_page_fault+0xbd/0x160
   ? asm_exc_page_fault+0x27/0x30
   ? setup_ghcb+0x71/0x100
   ? setup_ghcb+0xe/0x100
   cpu_init_exception_handling+0x1b9/0x1f0

The fix is to call sev_es_negotiate_protocol() only in the BSP boot phase, and it only needs to be done once in any case.

[ mingo: Refined the changelog. ]

Fixes: 95d33bfaa3e1 ("x86/sev: Register GHCB memory when SEV-SNP is active")
Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Bo Gan <bo.gan@broadcom.com>
Signed-off-by: Ashwin Dayanand Kamat <ashwin.kamat@broadcom.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/1701254429-18250-1-git-send-email-kashwindayan@vmware.com
2023-11-30  x86/boot: Ignore NMIs during very early boot  [Jun'ichi Nomura]
When there are two racing NMIs on x86, the first NMI invokes NMI handler and the 2nd NMI is latched until IRET is executed. If panic on NMI and panic kexec are enabled, the first NMI triggers panic and starts booting the next kernel via kexec. Note that the 2nd NMI is still latched. During the early boot of the next kernel, once an IRET is executed as a result of a page fault, then the 2nd NMI is unlatched and invokes the NMI handler. However, NMI handler is not set up at the early stage of boot, which results in a boot failure. Avoid such problems by setting up a NOP handler for early NMIs. [ mingo: Refined the changelog. ] Signed-off-by: Jun'ichi Nomura <junichi.nomura@nec.com> Signed-off-by: Derek Barbosa <debarbos@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org>
2023-11-30  x86/tools: Remove chkobjdump.awk  [Nathan Chancellor]
This check is superfluous now that the minimum version of binutils to build the kernel is 2.25. This also fixes an error seen with llvm-objdump because it does not support '-v' prior to LLVM 13: llvm-objdump: error: unknown argument '-v' Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://github.com/llvm/llvm-project/commit/dde24a87c55f82d8c7b3bf3eafb10f2b9b2b9a01 Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-3-0d855e79314d@kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/1362
2023-11-30  x86/tools: objdump_reformat.awk: Allow for spaces  [Samuel Zeter]
GNU objdump and LLVM objdump have differing output formats. Specifically, GNU objdump will format its output as address:<tab>hex, whereas LLVM objdump displays its output as address:<space>hex. objdump_reformat.awk incorrectly handles this discrepancy due to the unexpected space and as a result insn_decoder_test fails, as its input is garbled.

The instruction line being tokenized now handles a space and colon, or tab delimiter.

Signed-off-by: Samuel Zeter <samuelzeter@gmail.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-2-0d855e79314d@kernel.org
Closes: https://github.com/ClangBuiltLinux/linux/issues/1364
2023-11-30  x86/tools: objdump_reformat.awk: Ensure regex matches fwait  [Samuel Zeter]
If there is "wait" mnemonic in the line being parsed, it is incorrectly handled by the script, and an extra line of "fwait" in objdump_reformat's output is inserted. As insn_decoder_test relies upon the formatted output, the test fails. This is reproducible when disassembling with llvm-objdump: Pre-processed lines: ffffffff81033e72: 9b wait ffffffff81033e73: 48 c7 c7 89 50 42 82 movq After objdump_reformat.awk: ffffffff81033e72: 9b fwait ffffffff81033e72: wait ffffffff81033e73: 48 c7 c7 89 50 42 82 movq The regex match now accepts spaces or tabs, along with the "fwait" instruction. Signed-off-by: Samuel Zeter <samuelzeter@gmail.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Tested-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20231129-objdump-reformat-llvm-v3-1-0d855e79314d@kernel.org Closes: https://github.com/ClangBuiltLinux/linux/issues/1364
2023-11-30  perf/x86/amd: Reject branch stack for IBS events  [Namhyung Kim]
The AMD IBS PMU doesn't handle branch stacks, so it should not accept events with brstack. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20231130062246.290-1-ravi.bangoria@amd.com
2023-11-29  KVM: x86/mmu: Declare flush_remote_tlbs{_range}() hooks iff HYPERV!=n  [Sean Christopherson]
Declare the kvm_x86_ops hooks used to wire up paravirt TLB flushes when running under Hyper-V if and only if CONFIG_HYPERV!=n. Wrapping yet more code with IS_ENABLED(CONFIG_HYPERV) eliminates a handful of conditional branches, and makes it super obvious why the hooks *might* be valid. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20231018192325.1893896-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
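Conceptually the hooks end up wrapped like the sketch below; the member names follow the commit subject, but treat the exact signatures as assumptions:

  struct kvm_x86_ops {
  	...
  #if IS_ENABLED(CONFIG_HYPERV)
  	int (*flush_remote_tlbs)(struct kvm *kvm);
  	int (*flush_remote_tlbs_range)(struct kvm *kvm, gfn_t gfn, gfn_t nr_pages);
  #endif
  	...
  };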
2023-11-29  KVM: x86: Get CPL directly when checking if loaded vCPU is in kernel mode  [Like Xu]
When querying whether or not a vCPU "is" running in kernel mode, directly get the CPL if the vCPU is the currently loaded vCPU. In scenarios where a guest is profiled via perf-kvm, querying vcpu->arch.preempted_in_kernel from kvm_guest_state() is wrong if the vCPU is actively running, i.e. isn't scheduled out due to being preempted and so preempted_in_kernel is stale.

This affects perf/core's ability to accurately tag guest RIP with PERF_RECORD_MISC_GUEST_{KERNEL|USER} and record it in the sample. This causes perf/tool to fail to connect the vCPU RIPs to the guest kernel space symbols when parsing these samples due to incorrect PERF_RECORD_MISC flags:

  Before (perf-report of a cpu-cycles sample):
    1.23%  :58945  [unknown]  [u] 0xffffffff818012e0

  After:
    1.35%  :60703  [kernel.vmlinux]  [g] asm_exc_page_fault

Note, checking preempted_in_kernel in kvm_arch_vcpu_in_kernel() is awful as nothing in the API suggests that it's safe to use if and only if the vCPU was preempted. That can be cleaned up in the future; for now just fix the glaring correctness bug.

Note #2, checking vcpu->preempted is NOT safe, as getting the CPL on VMX requires VMREAD, i.e. is correct if and only if the vCPU is loaded. If the target vCPU *was* preempted, then it can be scheduled back in after the check on vcpu->preempted in kvm_vcpu_on_spin(), i.e. KVM could end up trying to do VMREAD on a VMCS that isn't loaded on the current pCPU.

Signed-off-by: Like Xu <likexu@tencent.com>
Fixes: e1bfc24577cc ("KVM: Move x86's perf guest info callbacks to generic KVM")
Link: https://lore.kernel.org/r/20231123075818.12521-1-likexu@tencent.com
[sean: massage changelog, add Fixes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-29  x86: Fix CPUIDLE_FLAG_IRQ_ENABLE leaking timer reprogram  [Peter Zijlstra]
intel_idle_irq() re-enables IRQs very early. As a result, an interrupt may fire before mwait() is eventually called. If such an interrupt queues a timer, it may go unnoticed until mwait returns and the idle loop handles the tick re-evaluation. And monitoring TIF_NEED_RESCHED doesn't help because a local timer enqueue doesn't set that flag.

The issue is mitigated by the fact that this idle handler is only invoked for shallow C-states when, presumably, the next tick is supposed to be close enough. There may still be rare cases though when the next tick is far away and the selected C-state is shallow, resulting in a timer getting ignored for a while.

Fix this by using sti_mwait() whose IRQ-reenablement only triggers upon calling mwait(), dealing with the race while keeping the interrupt latency within acceptable bounds.

Fixes: c227233ad64c ("intel_idle: enable interrupts before C1 on Xeons")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lkml.kernel.org/r/20231115151325.6262-3-frederic@kernel.org
2023-11-29  x86: Add a comment about the "magic" behind shadow sti before mwait  [Frederic Weisbecker]
Add a note to make sure we never miss and break the requirements behind it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lkml.kernel.org/r/20231115151325.6262-2-frederic@kernel.org
2023-11-29  x86/CPU/AMD: Drop now unused CPU erratum checking function  [Borislav Petkov (AMD)]
Bye bye. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: http://lore.kernel.org/r/20231120104152.13740-14-bp@alien8.de
2023-11-29  x86/CPU/AMD: Get rid of amd_erratum_1485[]  [Borislav Petkov (AMD)]
No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: http://lore.kernel.org/r/20231120104152.13740-13-bp@alien8.de
2023-11-29  x86/CPU/AMD: Get rid of amd_erratum_400[]  [Borislav Petkov (AMD)]
Setting X86_BUG_AMD_E400 in init_amd() is early enough. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Link: http://lore.kernel.org/r/20231120104152.13740-12-bp@alien8.de