summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2023-04-05KVM: x86/pmu: Zero out pmu->all_valid_pmc_idx each time it's refreshedLike Xu
The kvm_pmu_refresh() may be called repeatedly (e.g. configure guest CPUID repeatedly or update MSR_IA32_PERF_CAPABILITIES) and each call will use the last pmu->all_valid_pmc_idx value, with the residual bits introducing additional overhead later in the vPMU emulation. Fixes: b35e5548b411 ("KVM: x86/vPMU: Add lazy mechanism to release perf_event per vPMC") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230404071759.75376-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05Kconfig: introduce HAS_IOPORT option and select it as necessaryNiklas Schnelle
We introduce a new HAS_IOPORT Kconfig option to indicate support for I/O Port access. In a future patch HAS_IOPORT=n will disable compilation of the I/O accessor functions inb()/outb() and friends on architectures which can not meaningfully support legacy I/O spaces such as s390. The following architectures do not select HAS_IOPORT: * ARC * C-SKY * Hexagon * Nios II * OpenRISC * s390 * User-Mode Linux * Xtensa All other architectures select HAS_IOPORT at least conditionally. The "depends on" relations on HAS_IOPORT in drivers as well as ifdefs for HAS_IOPORT specific sections will be added in subsequent patches on a per subsystem basis. Co-developed-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Arnd Bergmann <arnd@kernel.org> Acked-by: Johannes Berg <johannes@sipsolutions.net> # for ARCH=um Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2023-04-05KVM: VMX: Use is_64_bit_mode() to check 64-bit mode in SGX handlerBinbin Wu
sgx_get_encls_gva() uses is_long_mode() to check 64-bit mode, however, SGX system leaf instructions are valid in compatibility mode, should use is_64_bit_mode() instead. Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions") Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20230404032502.27798-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-05x86/cpu: Add Xeon Emerald Rapids to list of CPUs that support PPINTony Luck
This should be the last addition to this table. Future CPUs will enumerate PPIN support using CPUID. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230404212124.428118-1-tony.luck@intel.com
2023-04-05arch/x86: Remove "select SRCU"Paul E. McKenney
Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05kvm: Remove "select SRCU"Paul E. McKenney
Now that the SRCU Kconfig option is unconditionally selected, there is no longer any point in selecting it. Therefore, remove the "select SRCU" Kconfig statements from the various KVM Kconfig files. Acked-by: Sean Christopherson <seanjc@google.com> (x86) Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <kvm@vger.kernel.org> Acked-by: Marc Zyngier <maz@kernel.org> (arm64) Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Acked-by: Anup Patel <anup@brainfault.org> (riscv) Acked-by: Heiko Carstens <hca@linux.ibm.com> (s390) Reviewed-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
2023-04-05x86/cpu: Add model number for Intel Arrow Lake processorTony Luck
Successor to Lunar Lake. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230404174641.426593-1-tony.luck@intel.com
2023-04-05KVM: x86: Redefine 'longmode' as a flag for KVM_EXIT_HYPERCALLOliver Upton
The 'longmode' field is a bit annoying as it blows an entire __u32 to represent a boolean value. Since other architectures are looking to add support for KVM_EXIT_HYPERCALL, now is probably a good time to clean it up. Redefine the field (and the remaining padding) as a set of flags. Preserve the existing ABI by using bit 0 to indicate if the guest was in long mode and requiring that the remaining 31 bits must be zero. Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230404154050.2270077-2-oliver.upton@linux.dev
2023-04-04x86/boot: Centralize __pa()/__va() definitionsKirill A. Shutemov
Replace multiple __pa()/__va() definitions with a single one in misc.h. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20230330114956.20342-2-kirill.shutemov%40linux.intel.com
2023-04-04KVM: x86/mmu: Merge all handle_changed_pte*() functionsVipin Sharma
Merge __handle_changed_pte() and handle_changed_spte_acc_track() into a single function, handle_changed_pte(), as the two are always used together. Remove the existing handle_changed_pte(), as it's just a wrapper that calls __handle_changed_pte() and handle_changed_spte_acc_track(). Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Remove handle_changed_spte_dirty_log()Vipin Sharma
Remove handle_changed_spte_dirty_log() as there is no code flow which sets 4KiB SPTE writable and hit this path. This function marks the page dirty in a memslot only if new SPTE is 4KiB in size and writable. Current users of handle_changed_spte_dirty_log() are: 1. set_spte_gfn() - Create only non writable SPTEs. 2. write_protect_gfn() - Change an SPTE to non writable. 3. zap leaf and roots APIs - Everything is 0. 4. handle_removed_pt() - Sets SPTEs to REMOVED_SPTE 5. tdp_mmu_link_sp() - Makes non leaf SPTEs. There is also no path which creates a writable 4KiB without going through make_spte() and this functions takes care of marking SPTE dirty in the memslot if it is PT_WRITABLE. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: add blurb to __handle_changed_spte()'s comment] Link: https://lore.kernel.org/r/20230321220021.2119033-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Remove "record_acc_track" in __tdp_mmu_set_spte()Vipin Sharma
Remove bool parameter "record_acc_track" from __tdp_mmu_set_spte() and refactor the code. This variable is always set to true by its caller. Remove single and double underscore prefix from tdp_mmu_set_spte() related APIs: 1. Change __tdp_mmu_set_spte() to tdp_mmu_set_spte() 2. Change _tdp_mmu_set_spte() to tdp_mmu_iter_set_spte() Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Bypass __handle_changed_spte() when aging TDP MMU SPTEsVipin Sharma
Drop everything except the "tdp_mmu_spte_changed" tracepoint part of __handle_changed_spte() when aging SPTEs in the TDP MMU, as clearing the accessed status doesn't affect the SPTE's shadow-present status, whether or not the SPTE is a leaf, or change the PFN. I.e. none of the functional updates handled by __handle_changed_spte() are relevant. Losing __handle_changed_spte()'s sanity checks does mean that a bug could theoretical go unnoticed, but that scenario is extremely unlikely, e.g. would effectively require a misconfigured MMU or a locking bug elsewhere. Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Drop unnecessary dirty log checks when aging TDP MMU SPTEsVipin Sharma
Drop the unnecessary call to handle dirty log updates when aging TDP MMU SPTEs, as neither clearing the Accessed bit nor marking a SPTE for access tracking can _set_ the Writable bit, i.e. can't trigger marking a gfn dirty in its memslot. The access tracking path can _clear_ the Writable bit, e.g. if the XCHG races with fast_page_fault() and writes the stale value without the Writable bit set, but clearing the Writable bit outside of mmu_lock is not allowed, i.e. access tracking can't spuriously set the Writable bit. Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty path, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Clear only A-bit (if enabled) when aging TDP MMU SPTEsVipin Sharma
Use tdp_mmu_clear_spte_bits() when clearing the Accessed bit in TDP MMU SPTEs so as to use an atomic-AND instead of XCHG to clear the A-bit. Similar to the D-bit story, this will allow KVM to bypass __handle_changed_spte() by ensuring only the A-bit is modified. Link: https://lore.kernel.org/all/Y9HcHRBShQgjxsQb@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: massage changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Remove "record_dirty_log" in __tdp_mmu_set_spte()Vipin Sharma
Remove bool parameter "record_dirty_log" from __tdp_mmu_set_spte() and refactor the code as this variable is always set to true by its caller. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Bypass __handle_changed_spte() when clearing TDP MMU dirty bitsVipin Sharma
Drop everything except marking the PFN dirty and the relevant tracepoint parts of __handle_changed_spte() when clearing the dirty status of gfns in the TDP MMU. Clearing only the Dirty (or Writable) bit doesn't affect the SPTEs shadow-present status, whether or not the SPTE is a leaf, or change the SPTE's PFN. I.e. other than marking the PFN dirty, none of the functional updates handled by __handle_changed_spte() are relevant. Losing __handle_changed_spte()'s sanity checks does mean that a bug could theoretical go unnoticed, but that scenario is extremely unlikely, e.g. would effectively require a misconfigured or a locking bug elsewhere. Opportunistically remove a comment blurb from __handle_changed_spte() about all modifications to TDP MMU SPTEs needing to invoke said function, that "rule" hasn't been true since fast page fault support was added for the TDP MMU (and perhaps even before). Tested on a VM (160 vCPUs, 160 GB memory) and found that performance of clear dirty log stage improved by ~40% in dirty_log_perf_test (with the full optimization applied). Before optimization: -------------------- Iteration 1 clear dirty log time: 3.638543593s Iteration 2 clear dirty log time: 3.145032742s Iteration 3 clear dirty log time: 3.142340358s Clear dirty log over 3 iterations took 9.925916693s. (Avg 3.308638897s/iteration) After optimization: ------------------- Iteration 1 clear dirty log time: 2.318988110s Iteration 2 clear dirty log time: 1.794470164s Iteration 3 clear dirty log time: 1.791668628s Clear dirty log over 3 iterations took 5.905126902s. (Avg 1.968375634s/iteration) Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: split the switch to atomic-AND to a separate patch] Link: https://lore.kernel.org/r/20230321220021.2119033-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Drop access tracking checks when clearing TDP MMU dirty bitsVipin Sharma
Drop the unnecessary call to handle access-tracking changes when clearing the dirty status of TDP MMU SPTEs. Neither the Dirty bit nor the Writable bit has any impact on the accessed state of a page, i.e. clearing only the aforementioned bits doesn't make an accessed SPTE suddently not accessed. Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Atomically clear SPTE dirty state in the clear-dirty-log flowVipin Sharma
Optimize the clearing of dirty state in TDP MMU SPTEs by doing an atomic-AND (on SPTEs that have volatile bits) instead of the full XCHG that currently ends up being invoked (see kvm_tdp_mmu_write_spte()). Clearing _only_ the bit in question will allow KVM to skip the many irrelevant checks in __handle_changed_spte() by avoiding any collateral damage due to the XCHG writing all SPTE bits, e.g. the XCHG could race with fast_page_fault() setting the W-bit and the CPU setting the D-bit, and thus incorrectly drop the CPU's D-bit update. Link: https://lore.kernel.org/all/Y9hXmz%2FnDOr1hQal@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> [sean: split the switch to atomic-AND to a separate patch] Link: https://lore.kernel.org/r/20230321220021.2119033-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Consolidate Dirty vs. Writable clearing logic in TDP MMUVipin Sharma
Deduplicate the guts of the TDP MMU's clearing of dirty status by snapshotting whether to check+clear the Dirty bit vs. the Writable bit, which is the only difference between the two flavors of dirty tracking. Note, kvm_ad_enabled() is just a wrapper for shadow_accessed_mask, i.e. is constant after kvm-{intel,amd}.ko is loaded. Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty log, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprotVipin Sharma
Use the constant-after-module-load kvm_ad_enabled() to check if SPTEs in the TDP MMU need to be write-protected when clearing accessed/dirty status instead of manually checking every SPTE. The per-SPTE A/D enabling is specific to nested EPT MMUs, i.e. when KVM is using EPT A/D bits but L1 is not, and so cannot happen in the TDP MMU (which is non-nested only). Keep the original code as sanity checks buried under MMU_WARN_ON(). MMU_WARN_ON() is more or less useless at the moment, but there are plans to change that. Link: https://lore.kernel.org/all/Yz4Qi7cn7TWTWQjj@google.com Signed-off-by: Vipin Sharma <vipinsh@google.com> [sean: split to separate patch, apply to dirty path, write changelog] Link: https://lore.kernel.org/r/20230321220021.2119033-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04KVM: x86/mmu: Add a helper function to check if an SPTE needs atomic writeVipin Sharma
Move conditions in kvm_tdp_mmu_write_spte() to check if an SPTE should be written atomically or not to a separate function. This new function, kvm_tdp_mmu_spte_need_atomic_write(), will be used in future commits to optimize clearing bits in SPTEs. Signed-off-by: Vipin Sharma <vipinsh@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Link: https://lore.kernel.org/r/20230321220021.2119033-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "PPC: - Hide KVM_CAP_IRQFD_RESAMPLE if XIVE is enabled s390: - Fix handling of external interrupts in protected guests x86: - Resample the pending state of IOAPIC interrupts when unmasking them - Fix usage of Hyper-V "enlightened TLB" on AMD - Small fixes to real mode exceptions - Suppress pending MMIO write exits if emulator detects exception Documentation: - Fix rST syntax" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: docs: kvm: x86: Fix broken field list KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent KVM: s390: pv: fix external interruption loop not always detected KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real Mode KVM: x86: Clear "has_error_code", not "error_code", for RM exception injection KVM: x86: Suppress pending MMIO write exits if emulator detects exception KVM: x86/ioapic: Resample the pending state of an IRQ when unmasking KVM: irqfd: Make resampler_list an RCU list KVM: SVM: Flush Hyper-V TLB when required
2023-04-04KVM: SVM: Remove a duplicate definition of VMCB_AVIC_APIC_BAR_MASKXinghui Li
VMCB_AVIC_APIC_BAR_MASK is defined twice with the same value in svm.h, which is meaningless. Delete the duplicate one. Fixes: 391503528257 ("KVM: x86: SVM: move avic definitions from AMD's spec to svm.h") Signed-off-by: Xinghui Li <korantli@tencent.com> Reviewed-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230403095200.1391782-1-korantwork@gmail.com [sean: tweak shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-04-04um: Only disable SSE on clang to work around old GCC bugsDavid Gow
As part of the Rust support for UML, we disable SSE (and similar flags) to match the normal x86 builds. This both makes sense (we ideally want a similar configuration to x86), and works around a crash bug with SSE generation under Rust with LLVM. However, this breaks compiling stdlib.h under gcc < 11, as the x86_64 ABI requires floating-point return values be stored in an SSE register. gcc 11 fixes this by only doing register allocation when a function is actually used, and since we never use atof(), it shouldn't be a problem: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99652 Nevertheless, only disable SSE on clang setups, as that's a simple way of working around everyone's bugs. Fixes: 884981867947 ("rust: arch/um: Disable FP/SIMD instruction to match x86") Reported-by: Roberto Sassu <roberto.sassu@huaweicloud.com> Link: https://lore.kernel.org/linux-um/6df2ecef9011d85654a82acd607fdcbc93ad593c.camel@huaweicloud.com/ Tested-by: Roberto Sassu <roberto.sassu@huaweicloud.com> Tested-by: SeongJae Park <sj@kernel.org> Signed-off-by: David Gow <davidgow@google.com> Reviewed-by: Vincenzo Palazzo <vincenzopalazzodev@gmail.com> Tested-by: Arthur Grillo <arthurgrillo@riseup.net> Signed-off-by: Richard Weinberger <richard@nod.at>
2023-04-03Merge tag 'hyperv-fixes-signed-20230402' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv fixes from Wei Liu: - Fix a bug in channel allocation for VMbus (Mohammed Gamal) - Do not allow root partition functionality in CVM (Michael Kelley) * tag 'hyperv-fixes-signed-20230402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Block root partition functionality in a Confidential VM Drivers: vmbus: Check for channel allocation before looking up relids
2023-04-03Merge 6.3-rc5 into driver-core-nextGreg Kroah-Hartman
We need the fixes in here for testing, as well as the driver core changes for documentation updates to build on. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-31KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependentAlexey Kardashevskiy
When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all. This moved the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-31Merge tag 'kvm-s390-master-6.3-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD A small fix that repairs the external loop detection code for PV guests.
2023-03-31iommu/ioasid: Rename INVALID_IOASIDJacob Pan
INVALID_IOASID and IOMMU_PASID_INVALID are duplicated. Rename INVALID_IOASID and consolidate since we are moving away from IOASID infrastructure. Reviewed-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Link: https://lore.kernel.org/r/20230322200803.869130-7-jacob.jun.pan@linux.intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
2023-03-30docs: move x86 documentation into Documentation/arch/Jonathan Corbet
Move the x86 documentation under Documentation/arch/ as a way of cleaning up the top-level directory and making the structure of our docs more closely match the structure of the source directories it describes. All in-kernel references to the old paths have been updated. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-arch@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20230315211523.108836-1-corbet@lwn.net/ Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2023-03-30Merge branch 'x86/cc' into x86/apicThomas Gleixner
Pick up the cc_vendor changes.
2023-03-30Merge branch 'x86/cc' into x86/sevThomas Gleixner
Pick up the cc_vendor changes.
2023-03-30x86/coco: Export cc_vendorBorislav Petkov (AMD)
It will be used in different checks in future changes. Export it directly and provide accessor functions and stubs so this can be used in general code when CONFIG_ARCH_HAS_CC_PLATFORM is not set. No functional changes. [ tglx: Add accessor functions ] Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20230318115634.9392-2-bp@alien8.de
2023-03-30x86/acpi/boot: Correct acpi_is_processor_usable() checkEric DeVolder
The logic in acpi_is_processor_usable() requires the online capable bit be set for hotpluggable CPUs. The online capable bit has been introduced in ACPI 6.3. However, for ACPI revisions < 6.3 which do not support that bit, CPUs should be reported as usable, not the other way around. Reverse the check. [ bp: Rewrite commit message. ] Fixes: e2869bd7af60 ("x86/acpi/boot: Do not register processors that cannot be onlined for x2APIC") Suggested-by: Miguel Luis <miguel.luis@oracle.com> Suggested-by: Boris Ostrovsky <boris.ovstrosky@oracle.com> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: David R <david@unsolicited.net> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230327191026.3454-2-eric.devolder@oracle.com
2023-03-30x86/ACPI/boot: Use FADT version to check support for online capableMario Limonciello
ACPI 6.3 introduced the online capable bit, and also introduced MADT version 5. Latter was used to distinguish whether the offset storing online capable could be used. However ACPI 6.2b has MADT version "45" which is for an errata version of the ACPI 6.2 spec. This means that the Linux code for detecting availability of MADT will mistakenly flag ACPI 6.2b as supporting online capable which is inaccurate as it's an ACPI 6.3 feature. Instead use the FADT major and minor revision fields to distinguish this. [ bp: Massage. ] Fixes: aa06e20f1be6 ("x86/ACPI: Don't add CPUs that are not online capable") Reported-by: Eric DeVolder <eric.devolder@oracle.com> Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/943d2445-84df-d939-f578-5d8240d342cc@unsolicited.net
2023-03-28mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault()Gerald Schaefer
s390 can do more fine-grained handling of spurious TLB protection faults, when there also is the PTE pointer available. Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an additional parameter. This will add no functional change to other architectures, but those with private flush_tlb_fix_spurious_fault() implementations need to be made aware of the new parameter. Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28x86: kmsan: use C versions of memset16/memset32/memset64Alexander Potapenko
KMSAN must see as many memory accesses as possible to prevent false positive reports. Fall back to versions of memset16()/memset32()/memset64() implemented in lib/string.c instead of those written in assembly. Link: https://lkml.kernel.org/r/20230303141433.3422671-3-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28x86: kmsan: don't rename memintrinsics in uninstrumented filesAlexander Potapenko
clang -fsanitize=kernel-memory already replaces calls to memset/memcpy/memmove and their __builtin_ versions with __msan_memset/__msan_memcpy/__msan_memmove in instrumented files, so there is no need to override them. In non-instrumented versions we are now required to leave memset() and friends intact, so we cannot replace them with __msan_XXX() functions. Link: https://lkml.kernel.org/r/20230303141433.3422671-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-28x86/mm/pat: clear VM_PAT if copy_p4d_range failedMa Wupeng
Syzbot reports a warning in untrack_pfn(). Digging into the root we found that this is due to memory allocation failure in pmd_alloc_one. And this failure is produced due to failslab. In copy_page_range(), memory alloaction for pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untrack_pfn this empty pfn, warning happens. Here's a simplified flow: dup_mm dup_mmap copy_page_range copy_p4d_range copy_pud_range copy_pmd_range pmd_alloc __pmd_alloc pmd_alloc_one page = alloc_pages(gfp, 0); if (!page) return NULL; mmput exit_mmap unmap_vmas unmap_single_vma untrack_pfn follow_phys WARN_ON_ONCE(1); Since this vma is not generate successfully, we can clear flag VM_PAT. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-03-27KVM: nVMX: Do not report error code when synthesizing VM-Exit from Real ModeSean Christopherson
Don't report an error code to L1 when synthesizing a nested VM-Exit and L2 is in Real Mode. Per Intel's SDM, regarding the error code valid bit: This bit is always 0 if the VM exit occurred while the logical processor was in real-address mode (CR0.PE=0). The bug was introduced by a recent fix for AMD's Paged Real Mode, which moved the error code suppression from the common "queue exception" path to the "inject exception" path, but missed VMX's "synthesize VM-Exit" path. Fixes: b97f07458373 ("KVM: x86: determine if an exception has an error code only when injecting it.") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322143300.2209476-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-27KVM: x86: Clear "has_error_code", not "error_code", for RM exception injectionSean Christopherson
When injecting an exception into a vCPU in Real Mode, suppress the error code by clearing the flag that tracks whether the error code is valid, not by clearing the error code itself. The "typo" was introduced by recent fix for SVM's funky Paged Real Mode. Opportunistically hoist the logic above the tracepoint so that the trace is coherent with respect to what is actually injected (this was also the behavior prior to the buggy commit). Fixes: b97f07458373 ("KVM: x86: determine if an exception has an error code only when injecting it.") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322143300.2209476-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-27KVM: x86: Suppress pending MMIO write exits if emulator detects exceptionSean Christopherson
Clear vcpu->mmio_needed when injecting an exception from the emulator to squash a (legitimate) warning about vcpu->mmio_needed being true at the start of KVM_RUN without a callback being registered to complete the userspace MMIO exit. Suppressing the MMIO write exit is inarguably wrong from an architectural perspective, but it is the least awful hack-a-fix due to shortcomings in KVM's uAPI, not to mention that KVM already suppresses MMIO writes in this scenario. Outside of REP string instructions, KVM doesn't provide a way to resume an instruction at the exact point where it was "interrupted" if said instruction partially completed before encountering an MMIO access. For MMIO reads, KVM immediately exits to userspace upon detecting MMIO as userspace provides the to-be-read value in a buffer, and so KVM can safely (more or less) restart the instruction from the beginning. When the emulator re-encounters the MMIO read, KVM will service the MMIO by getting the value from the buffer instead of exiting to userspace, i.e. KVM won't put the vCPU into an infinite loop. On an emulated MMIO write, KVM finishes the instruction before exiting to userspace, as exiting immediately would ultimately hang the vCPU due to the aforementioned shortcoming of KVM not being able to resume emulation in the middle of an instruction. For the vast majority of _emulated_ instructions, deferring the userspace exit doesn't cause problems as very few x86 instructions (again ignoring string operations) generate multiple writes. But for instructions that generate multiple writes, e.g. PUSHA (multiple pushes onto the stack), deferring the exit effectively results in only the final write triggering an exit to userspace. KVM does support multiple MMIO "fragments", but only for page splits; if an instruction performs multiple distinct MMIO writes, the number of fragments gets reset when the next MMIO write comes along and any previous MMIO writes are dropped. Circling back to the warning, if a deferred MMIO write coincides with an exception, e.g. in this case a #SS due to PUSHA underflowing the stack after queueing a write to an MMIO page on a previous push, KVM injects the exceptions and leaves the deferred MMIO pending without registering a callback, thus triggering the splat. Sweep the problem under the proverbial rug as dropping MMIO writes is not unique to the exception scenario (see above), i.e. instructions like PUSHA are fundamentally broken with respect to MMIO, and have been since KVM's inception. Reported-by: zhangjianguo <zhangjianguo18@huawei.com> Reported-by: syzbot+760a73552f47a8cd0fd9@syzkaller.appspotmail.com Reported-by: syzbot+8accb43ddc6bd1f5713a@syzkaller.appspotmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20230322141220.2206241-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-27KVM: x86/ioapic: Resample the pending state of an IRQ when unmaskingDmytro Maluka
KVM irqfd based emulation of level-triggered interrupts doesn't work quite correctly in some cases, particularly in the case of interrupts that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device in its threaded irq handler, i.e. later than it is acked to the interrupt controller (EOI at the end of hardirq), not earlier. Linux keeps such interrupt masked until its threaded handler finishes, to prevent the EOI from re-asserting an unacknowledged interrupt. However, with KVM + vfio (or whatever is listening on the resamplefd) we always notify resamplefd at the EOI, so vfio prematurely unmasks the host physical IRQ, thus a new physical interrupt is fired in the host. This extra interrupt in the host is not a problem per se. The problem is that it is unconditionally queued for injection into the guest, so the guest sees an extra bogus interrupt. [*] There are observed at least 2 user-visible issues caused by those extra erroneous interrupts for a oneshot irq in the guest: 1. System suspend aborted due to a pending wakeup interrupt from ChromeOS EC (drivers/platform/chrome/cros_ec.c). 2. Annoying "invalid report id data" errors from ELAN0000 touchpad (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg every time the touchpad is touched. The core issue here is that by the time when the guest unmasks the IRQ, the physical IRQ line is no longer asserted (since the guest has acked the interrupt to the device in the meantime), yet we unconditionally inject the interrupt queued into the guest by the previous resampling. So to fix the issue, we need a way to detect that the IRQ is no longer pending, and cancel the queued interrupt in this case. With IOAPIC we are not able to probe the physical IRQ line state directly (at least not if the underlying physical interrupt controller is an IOAPIC too), so in this patch we use irqfd resampler for that. Namely, instead of injecting the queued interrupt, we just notify the resampler that this interrupt is done. If the IRQ line is actually already deasserted, we are done. If it is still asserted, a new interrupt will be shortly triggered through irqfd and injected into the guest. In the case if there is no irqfd resampler registered for this IRQ, we cannot fix the issue, so we keep the existing behavior: immediately unconditionally inject the queued interrupt. This patch fixes the issue for x86 IOAPIC only. In the long run, we can fix it for other irqchips and other architectures too, possibly taking advantage of reading the physical state of the IRQ line, which is possible with some other irqchips (e.g. with arm64 GIC, maybe even with the legacy x86 PIC). [*] In this description we assume that the interrupt is a physical host interrupt forwarded to the guest e.g. by vfio. Potentially the same issue may occur also with a purely virtual interrupt from an emulated device, e.g. if the guest handles this interrupt, again, as a oneshot interrupt. Signed-off-by: Dmytro Maluka <dmy@semihalf.com> Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/ Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/ Message-Id: <20230322204344.50138-3-dmy@semihalf.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-27KVM: SVM: Flush Hyper-V TLB when requiredJeremi Piotrowski
The Hyper-V "EnlightenedNptTlb" enlightenment is always enabled when KVM is running on top of Hyper-V and Hyper-V exposes support for it (which is always). On AMD CPUs this enlightenment results in ASID invalidations not flushing TLB entries derived from the NPT. To force the underlying (L0) hypervisor to rebuild its shadow page tables, an explicit hypercall is needed. The original KVM implementation of Hyper-V's "EnlightenedNptTlb" on SVM only added remote TLB flush hooks. This worked out fine for a while, as sufficient remote TLB flushes where being issued in KVM to mask the problem. Since v5.17, changes in the TDP code reduced the number of flushes and the out-of-sync TLB prevents guests from booting successfully. Split svm_flush_tlb_current() into separate callbacks for the 3 cases (guest/all/current), and issue the required Hyper-V hypercall when a Hyper-V TLB flush is needed. The most important case where the TLB flush was missing is when loading a new PGD, which is followed by what is now svm_flush_tlb_current(). Cc: stable@vger.kernel.org # v5.17+ Fixes: 1e0c7d40758b ("KVM: SVM: hyper-v: Remote TLB flush for SVM") Link: https://lore.kernel.org/lkml/43980946-7bbf-dcef-7e40-af904c456250@linux.microsoft.com/ Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20230324145233.4585-1-jpiotrowski@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-27x86/include/asm/msr-index.h: Add IFS Array test bitsJithu Joseph
Define MSR bitfields for enumerating support for Array BIST test. Signed-off-by: Jithu Joseph <jithu.joseph@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20230322003359.213046-5-jithu.joseph@intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
2023-03-27x86/hyperv: Change vTOM handling to use standard coco mechanismsMichael Kelley
Hyper-V guests on AMD SEV-SNP hardware have the option of using the "virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP architecture. With vTOM, shared vs. private memory accesses are controlled by splitting the guest physical address space into two halves. vTOM is the dividing line where the uppermost bit of the physical address space is set; e.g., with 47 bits of guest physical address space, vTOM is 0x400000000000 (bit 46 is set). Guest physical memory is accessible at two parallel physical addresses -- one below vTOM and one above vTOM. Accesses below vTOM are private (encrypted) while accesses above vTOM are shared (decrypted). In this sense, vTOM is like the GPA.SHARED bit in Intel TDX. Support for Hyper-V guests using vTOM was added to the Linux kernel in two patch sets[1][2]. This support treats the vTOM bit as part of the physical address. For accessing shared (decrypted) memory, these patch sets create a second kernel virtual mapping that maps to physical addresses above vTOM. A better approach is to treat the vTOM bit as a protection flag, not as part of the physical address. This new approach is like the approach for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel virtual mapping, the existing mapping is updated using recently added coco mechanisms. When memory is changed between private and shared using set_memory_decrypted() and set_memory_encrypted(), the PTEs for the existing kernel mapping are changed to add or remove the vTOM bit in the guest physical address, just as with TDX. The hypercalls to change the memory status on the host side are made using the existing callback mechanism. Everything just works, with a minor tweak to map the IO-APIC to use private accesses. To accomplish the switch in approach, the following must be done: * Update Hyper-V initialization to set the cc_mask based on vTOM and do other coco initialization. * Update physical_mask so the vTOM bit is no longer treated as part of the physical address * Remove CC_VENDOR_HYPERV and merge the associated vTOM functionality under CC_VENDOR_AMD. Update cc_mkenc() and cc_mkdec() to set/clear the vTOM bit as a protection flag. * Code already exists to make hypercalls to inform Hyper-V about pages changing between shared and private. Update this code to run as a callback from __set_memory_enc_pgtable(). * Remove the Hyper-V special case from __set_memory_enc_dec() * Remove the Hyper-V specific call to swiotlb_update_mem_attributes() since mem_encrypt_init() will now do it. * Add a Hyper-V specific implementation of the is_private_mmio() callback that returns true for the IO-APIC and vTPM MMIO addresses [1] https://lore.kernel.org/all/20211025122116.264793-1-ltykernel@gmail.com/ [2] https://lore.kernel.org/all/20211213071407.314309-1-ltykernel@gmail.com/ [ bp: Touchups. ] Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/1679838727-87310-7-git-send-email-mikelley@microsoft.com
2023-03-27x86/mm: Handle decryption/re-encryption of bss_decrypted consistentlyMichael Kelley
sme_postprocess_startup() decrypts the bss_decrypted section when sme_me_mask is non-zero. mem_encrypt_free_decrypted_mem() re-encrypts the unused portion based on CC_ATTR_MEM_ENCRYPT. In a Hyper-V guest VM using vTOM, these conditions are not equivalent as sme_me_mask is always zero when using vTOM. Consequently, mem_encrypt_free_decrypted_mem() attempts to re-encrypt memory that was never decrypted. So check sme_me_mask in mem_encrypt_free_decrypted_mem() too. Hyper-V guests using vTOM don't need the bss_decrypted section to be decrypted, so skipping the decryption/re-encryption doesn't cause a problem. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/1678329614-3482-5-git-send-email-mikelley@microsoft.com
2023-03-27Drivers: hv: Explicitly request decrypted in vmap_pfn() callsMichael Kelley
Update vmap_pfn() calls to explicitly request that the mapping be for decrypted access to the memory. There's no change in functionality since the PFNs passed to vmap_pfn() are above the shared_gpa_boundary, implicitly producing a decrypted mapping. But explicitly requesting "decrypted" allows the code to work before and after changes that cause vmap_pfn() to mask the PFNs to being below the shared_gpa_boundary. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Link: https://lore.kernel.org/r/1679838727-87310-4-git-send-email-mikelley@microsoft.com
2023-03-27x86/hyperv: Reorder code to facilitate future workMichael Kelley
Reorder some code to facilitate future work. No functional change. Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com> Link: https://lore.kernel.org/r/1679838727-87310-3-git-send-email-mikelley@microsoft.com