summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx
AgeCommit message (Collapse)Author
2024-06-11KVM: x86/pmu: Add a helper to enable bits in FIXED_CTR_CTRLSean Christopherson
Add a helper, intel_pmu_enable_fixed_counter_bits(), to dedup code that enables fixed counter bits, i.e. when KVM clears bits in the reserved mask used to detect invalid MSR_CORE_PERF_FIXED_CTR_CTRL values. No functional change intended. Cc: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240608000819.3296176-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10KVM: VMX: Remove unused declaration of vmx_request_immediate_exit()Binbin Wu
After commit 0ec3d6d1f169 "KVM: x86: Fully defer to vendor code to decide how to force immediate exit", vmx_request_immediate_exit() was removed. Commit 5f18c642ff7e "KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDX" added its declaration by accident. Remove it. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20240506075025.2251131-1-binbin.wu@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-10KVM: x86: Drop unused check_apicv_inhibit_reasons() callback definitionHou Wenlong
The check_apicv_inhibit_reasons() callback implementation was dropped in the commit b3f257a84696 ("KVM: x86: Track required APICv inhibits with variable, not callback"), but the definition removal was missed in the final version patch (it was removed in the v4). Therefore, it should be dropped, and the vmx_check_apicv_inhibit_reasons() function declaration should also be removed. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/54abd1d0ccaba4d532f81df61259b9c0e021fbde.1714977229.git.houwenlong.hwl@antgroup.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-07KVM: VMX: Always honor guest PAT on CPUs that support self-snoopSean Christopherson
Unconditionally honor guest PAT on CPUs that support self-snoop, as Intel has confirmed that CPUs that support self-snoop always snoop caches and store buffers. I.e. CPUs with self-snoop maintain cache coherency even in the presence of aliased memtypes, thus there is no need to trust the guest behaves and only honor PAT as a last resort, as KVM does today. Honoring guest PAT is desirable for use cases where the guest has access to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual (mediated, for all intents and purposes) GPU is exposed to the guest, along with buffers that are consumed directly by the physical GPU, i.e. which can't be proxied by the host to ensure writes from the guest are performed with the correct memory type for the GPU. Cc: Yiwei Zhang <zzyiwei@google.com> Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Suggested-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: VMX: Drop support for forcing UC memory when guest CR0.CD=1Sean Christopherson
Drop KVM's emulation of CR0.CD=1 on Intel CPUs now that KVM no longer honors guest MTRR memtypes, as forcing UC memory for VMs with non-coherent DMA only makes sense if the guest is using something other than PAT to configure the memtype for the DMA region. Furthermore, KVM has forced WB memory for CR0.CD=1 since commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED"), and no known VMM in existence disables KVM_X86_QUIRK_CD_NW_CLEARED, let alone does so with non-coherent DMA. Lastly, commit fb279950ba02 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") was from the same author as commit b18d5431acc7 ("KVM: x86: fix CR0.CD virtualization"), and followed by a mere month. I.e. forcing UC memory was likely the result of code inspection or perhaps misdiagnosed failures, and not the necessitate by a concrete use case. Update KVM's documentation to note that KVM_X86_QUIRK_CD_NW_CLEARED is now AMD-only, and to take an erratum for lack of CR0.CD virtualization on Intel. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Remove VMX support for virtualizing guest MTRR memtypesSean Christopherson
Remove KVM's support for virtualizing guest MTRR memtypes, as full MTRR adds no value, negatively impacts guest performance, and is a maintenance burden due to it's complexity and oddities. KVM's approach to virtualizating MTRRs make no sense, at all. KVM *only* honors guest MTRR memtypes if EPT is enabled *and* the guest has a device that may perform non-coherent DMA access. From a hardware virtualization perspective of guest MTRRs, there is _nothing_ special about EPT. Legacy shadowing paging doesn't magically account for guest MTRRs, nor does NPT. Unwinding and deciphering KVM's murky history, the MTRR virtualization code appears to be the result of misdiagnosed issues when EPT + VT-d with passthrough devices was enabled years and years ago. And importantly, the underlying bugs that were fudged around by honoring guest MTRR memtypes have since been fixed (though rather poorly in some cases). The zapping GFNs logic in the MTRR virtualization code came from: commit efdfe536d8c643391e19d5726b072f82964bfbdb Author: Xiao Guangrong <guangrong.xiao@linux.intel.com> Date: Wed May 13 14:42:27 2015 +0800 KVM: MMU: fix MTRR update Currently, whenever guest MTRR registers are changed kvm_mmu_reset_context is called to switch to the new root shadow page table, however, it's useless since: 1) the cache type is not cached into shadow page's attribute so that the original root shadow page will be reused 2) the cache type is set on the last spte, that means we should sync the last sptes when MTRR is changed This patch fixs this issue by drop all the spte in the gfn range which is being updated by MTRR which was a fix for: commit 0bed3b568b68e5835ef5da888a372b9beabf7544 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Oct 9 16:01:54 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Wed Dec 31 16:51:44 2008 +0200 KVM: Improve MTRR structure As well as reset mmu context when set MTRR. which was part of a "MTRR/PAT support for EPT" series that also added: + if (mt_mask) { + mt_mask = get_memory_type(vcpu, gfn) << + kvm_x86_ops->get_mt_mask_shift(); + spte |= mt_mask; + } where get_memory_type() was a truly gnarly helper to retrieve the guest MTRR memtype for a given memtype. And *very* subtly, at the time of that change, KVM *always* set VMX_EPT_IGMT_BIT, kvm_mmu_set_base_ptes(VMX_EPT_READABLE_MASK | VMX_EPT_WRITABLE_MASK | VMX_EPT_DEFAULT_MT << VMX_EPT_MT_EPTE_SHIFT | VMX_EPT_IGMT_BIT); which came in via: commit 928d4bf747e9c290b690ff515d8f81e8ee226d97 Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Thu Nov 6 14:55:45 2008 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Tue Nov 11 21:00:37 2008 +0200 KVM: VMX: Set IGMT bit in EPT entry There is a potential issue that, when guest using pagetable without vmexit when EPT enabled, guest would use PAT/PCD/PWT bits to index PAT msr for it's memory, which would be inconsistent with host side and would cause host MCE due to inconsistent cache attribute. The patch set IGMT bit in EPT entry to ignore guest PAT and use WB as default memory type to protect host (notice that all memory mapped by KVM should be WB). Note the CommitDates! The AuthorDates strongly suggests Sheng Yang added the whole "ignoreIGMT things as a bug fix for issues that were detected during EPT + VT-d + passthrough enabling, but it was applied earlier because it was a generic fix. Jumping back to 0bed3b568b68 ("KVM: Improve MTRR structure"), the other relevant code, or rather lack thereof, is the handling of *host* MMIO. That fix came in a bit later, but given the author and timing, it's safe to say it was all part of the same EPT+VT-d enabling mess. commit 2aaf69dcee864f4fb6402638dd2f263324ac839f Author: Sheng Yang <sheng@linux.intel.com> AuthorDate: Wed Jan 21 16:52:16 2009 +0800 Commit: Avi Kivity <avi@redhat.com> CommitDate: Sun Feb 15 02:47:37 2009 +0200 KVM: MMU: Map device MMIO as UC in EPT Software are not allow to access device MMIO using cacheable memory type, the patch limit MMIO region with UC and WC(guest can select WC using PAT and PCD/PWT). In addition to the host MMIO and IGMT issues, KVM's MTRR virtualization was obviously never tested on NPT until much later, which lends further credence to the theory/argument that this was all the result of misdiagnosed issues. Discussion from the EPT+MTRR enabling thread[*] more or less confirms that Sheng Yang was trying to resolve issues with passthrough MMIO. * Sheng Yang : Do you mean host(qemu) would access this memory and if we set it to guest : MTRR, host access would be broken? We would cover this in our shadow MTRR : patch, for we encountered this in video ram when doing some experiment with : VGA assignment. And in the same thread, there's also what appears to be confirmation of Intel running into issues with Windows XP related to a guest device driver mapping DMA with WC in the PAT. * Avi Kavity : Sheng Yang wrote: : > Yes... But it's easy to do with assigned devices' mmio, but what if guest : > specific some non-mmio memory's memory type? E.g. we have met one issue in : > Xen, that a assigned-device's XP driver specific one memory region as buffer, : > and modify the memory type then do DMA. : > : > Only map MMIO space can be first step, but I guess we can modify assigned : > memory region memory type follow guest's? : > : : With ept/npt, we can't, since the memory type is in the guest's : pagetable entries, and these are not accessible. [*] https://lore.kernel.org/all/1223539317-32379-1-git-send-email-sheng@linux.intel.com So, for the most part, what likely happened is that 15 years ago, a few engineers (a) fixed a #MC problem by ignoring guest PAT and (b) initially "fixed" passthrough device MMIO by emulating *guest* MTRRs. Except for the below case, everything since then has been a result of those two intertwined changes. The one exception, which is actually yet more confirmation of all of the above, is the revert of Paolo's attempt at "full" virtualization of guest MTRRs: commit 606decd67049217684e3cb5a54104d51ddd4ef35 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:12:47 2015 +0200 Revert "KVM: x86: apply guest MTRR virtualization on host reserved pages" This reverts commit fd717f11015f673487ffc826e59b2bad69d20fe5. It was reported to cause Machine Check Exceptions (bug 104091). ... commit fd717f11015f673487ffc826e59b2bad69d20fe5 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:38:13 2015 +0200 KVM: x86: apply guest MTRR virtualization on host reserved pages Currently guest MTRR is avoided if kvm_is_reserved_pfn returns true. However, the guest could prefer a different page type than UC for such pages. A good example is that pass-throughed VGA frame buffer is not always UC as host expected. This patch enables full use of virtual guest MTRRs. I.e. Paolo tried to add back KVM's behavior before "Map device MMIO as UC in EPT" and got the same result: machine checks, likely due to the guest MTRRs not being trustworthy/sane at all times. Note, Paolo also tried to enable MTRR virtualization on SVM+NPT, but that too got reverted. Unfortunately, it doesn't appear that anyone ever found a smoking gun, i.e. exactly why emulating guest MTRRs via NPT PAT caused extremely slow boot times doesn't appear to have a definitive root cause. commit fc07e76ac7ffa3afd621a1c3858a503386a14281 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Thu Oct 1 13:20:22 2015 +0200 Revert "KVM: SVM: use NPT page attributes" This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22. Initializing the mapping from MTRR to PAT values was reported to fail nondeterministically, and it also caused extremely slow boot (due to caching getting disabled---bug 103321) with assigned devices. ... commit 3c2e7f7de3240216042b61073803b61b9b3cfb22 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Tue Jul 7 14:32:17 2015 +0200 KVM: SVM: use NPT page attributes Right now, NPT page attributes are not used, and the final page attribute depends solely on gPAT (which however is not synced correctly), the guest MTRRs and the guest page attributes. However, we can do better by mimicking what is done for VMX. In the absence of PCI passthrough, the guest PAT can be ignored and the page attributes can be just WB. If passthrough is being used, instead, keep respecting the guest PAT, and emulate the guest MTRRs through the PAT field of the nested page tables. The only snag is that WP memory cannot be emulated correctly, because Linux's default PAT setting only includes the other types. In short, honoring guest MTRRs for VMX was initially a workaround of sorts for KVM ignoring guest PAT *and* for KVM not forcing UC for host MMIO. And while there *are* known cases where honoring guest MTRRs is desirable, e.g. passthrough VGA frame buffers, the desired behavior in that case is to get WC instead of UC, i.e. at this point it's for performance, not correctness. Furthermore, the complete absence of MTRR virtualization on NPT and shadow paging proves that, while KVM theoretically can do better, it's by no means necessary for correctnesss. Lastly, since kernels mostly rely on firmware to do MTRR setup, and the host typically provides guest firmware, honoring guest MTRRs is effectively honoring *host* userspace memtypes, which is also backwards. I.e. it would be far better for host userspace to communicate its desired memtype directly to KVM (or perhaps indirectly via VMAs in the host kernel), not through guest MTRRs. Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20240309010929.1403984-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-05KVM: x86: Keep consistent naming for APICv/AVIC inhibit reasonsAlejandro Jimenez
Keep kvm_apicv_inhibit enum naming consistent with the current pattern by renaming the reason/enumerator defined as APICV_INHIBIT_REASON_DISABLE to APICV_INHIBIT_REASON_DISABLED. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Link: https://lore.kernel.org/r/20240506225321.3440701-3-alejandro.j.jimenez@oracle.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: x86/pmu: Manipulate FIXED_CTR_CTRL MSR with macrosDapeng Mi
Magic numbers are used to manipulate the bit fields of FIXED_CTR_CTRL MSR. This makes reading code become difficult, so use pre-defined macros to replace these magic numbers. Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240430005239.13527-3-dapeng1.mi@linux.intel.com [sean: drop unnecessary curly braces] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: x86/pmu: Change ambiguous _mask suffix to _rsvd in kvm_pmuDapeng Mi
Several '_mask' suffixed variables such as, global_ctrl_mask, are defined in kvm_pmu structure. However the _mask suffix is ambiguous and misleading since it's not a real mask with positive logic. On the contrary it represents the reserved bits of corresponding MSRs and these bits should not be accessed. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240430005239.13527-2-dapeng1.mi@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03Merge branch 'kvm-fixes-6.10-1' into HEADPaolo Bonzini
* Fixes and debugging help for the #VE sanity check. Also disable it by default, even for CONFIG_DEBUG_KERNEL, because it was found to trigger spuriously (most likely a processor erratum as the exact symptoms vary by generation). * Avoid WARN() when two NMIs arrive simultaneously during an NMI-disabled situation (GIF=0 or interrupt shadow) when the processor supports virtual NMI. While generally KVM will not request an NMI window when virtual NMIs are supported, in this case it *does* have to single-step over the interrupt shadow or enable the STGI intercept, in order to deliver the latched second NMI. * Drop support for hand tuning APIC timer advancement from userspace. Since we have adaptive tuning, and it has proved to work well, drop the module parameter for manual configuration and with it a few stupid bugs that it had.
2024-06-03KVM: VMX: Switch to new Intel CPU model infrastructureTony Luck
Use x86_vfm (vendor, family, module) to detect CPUs that are affected by PERF_GLOBAL_CTRL bugs instead of manually checking the family and model. The new VFM infrastructure encodes all information in one handy location. No functional change intended. Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20240520224620.9480-10-tony.luck@intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: x86: Move shadow_phys_bits into "kvm_host", as "maxphyaddr"Sean Christopherson
Move shadow_phys_bits into "struct kvm_host_values", i.e. into KVM's global "kvm_host" variable, so that it is automatically exported for use in vendor modules. Rename the variable/field to maxphyaddr to more clearly capture what value it holds, now that it's used outside of the MMU (and because the "shadow" part is more than a bit misleading as the variable is not at all unique to shadow paging). Recomputing the raw/true host.MAXPHYADDR on every use can be subtly expensive, e.g. it will incur a VM-Exit on the CPUID if KVM is running as a nested hypervisor. Vendor code already has access to the information, e.g. by directly doing CPUID or by invoking kvm_get_shadow_phys_bits(), so there's no tangible benefit to making it MMU-only. Link: https://lore.kernel.org/r/20240423221521.2923759-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-06-03KVM: x86: Add a struct to consolidate host values, e.g. EFER, XCR0, etc...Sean Christopherson
Add "struct kvm_host_values kvm_host" to hold the various host values that KVM snapshots during initialization. Bundling the host values into a single struct simplifies adding new MSRs and other features with host state/values that KVM cares about, and provides a one-stop shop. E.g. adding a new value requires one line, whereas tracking each value individual often requires three: declaration, definition, and export. No functional change intended. Link: https://lore.kernel.org/r/20240423221521.2923759-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-05-23KVM: x86/mmu: Print SPTEs on unexpected #VESean Christopherson
Print the SPTEs that correspond to the faulting GPA on an unexpected EPT Violation #VE to help the user debug failures, e.g. to pinpoint which SPTE didn't have SUPPRESS_VE set. Opportunistically assert that the underlying exit reason was indeed an EPT Violation, as the CPU has *really* gone off the rails if a #VE occurs due to a completely unexpected exit reason. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Dump VMCS on unexpected #VESean Christopherson
Dump the VMCS on an unexpected #VE, otherwise it's practically impossible to figure out why the #VE occurred. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: nVMX: Always handle #VEs in L0 (never forward #VEs from L2 to L1)Sean Christopherson
Always handle #VEs, e.g. due to prove EPT Violation #VE failures, in L0, as KVM does not expose any #VE capabilities to L1, i.e. any and all #VEs are KVM's responsibility. Fixes: 8131cf5b4fd8 ("KVM: VMX: Introduce test mode related to EPT violation VE") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: nVMX: Initialize #VE info page for vmcs02 when proving #VE supportSean Christopherson
Point vmcs02.VE_INFORMATION_ADDRESS at the vCPU's #VE info page when initializing vmcs02, otherwise KVM will run L2 with EPT Violation #VE enabled and a VE info address pointing at pfn 0. Fixes: 8131cf5b4fd8 ("KVM: VMX: Introduce test mode related to EPT violation VE") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-23KVM: VMX: Don't kill the VM on an unexpected #VESean Christopherson
Don't terminate the VM on an unexpected #VE, as it's extremely unlikely the #VE is fatal to the guest, and even less likely that it presents a danger to the host. Simply resume the guest on "failure", as the #VE info page's BUSY field will prevent converting any more EPT Violations to #VEs for the vCPU (at least, that's what the BUSY field is supposed to do). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240518000430.1118488-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-05-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini: "ARM: - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greatly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements. LoongArch: - Add ParaVirt IPI support - Add software breakpoint support - Add mmio trace events support RISC-V: - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak - Some preparatory work for both TDX and SNP page fault handling. This also cleans up the page fault path, so that the priorities of various kinds of fauls (private page, no memory, write to read-only slot, etc.) are easier to follow. x86: - Minimize amount of time that shadow PTEs remain in the special REMOVED_SPTE state. This is a state where the mmu_lock is held for reading but concurrent accesses to the PTE have to spin; shortening its use allows other vCPUs to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is defined by hardware but left for software use. This lets KVM communicate its inability to map GPAs that set bits 51:48 on hosts without 5-level nested page tables. Guest firmware is expected to use the information when mapping BARs; this avoids that they end up at a legal, but unmappable, GPA. - Fixed a bug where KVM would not reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - As usual, a bunch of code cleanups. x86 (AMD): - Implement a new and improved API to initialize SEV and SEV-ES VMs, which will also be extendable to SEV-SNP. The new API specifies the desired encryption in KVM_CREATE_VM and then separately initializes the VM. The new API also allows customizing the desired set of VMSA features; the features affect the measurement of the VM's initial state, and therefore enabling them cannot be done tout court by the hypervisor. While at it, the new API includes two bugfixes that couldn't be applied to the old one without a flag day in userspace or without affecting the initial measurement. When a SEV-ES VM is created with the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected once the VMSA has been encrypted. Also, the FPU and AVX state will be synchronized and encrypted too. - Support for GHCB version 2 as applicable to SEV-ES guests. This, once more, is only accessible when using the new KVM_SEV_INIT2 flow for initialization of SEV-ES VMs. x86 (Intel): - An initial bunch of prerequisite patches for Intel TDX were merged. They generally don't do anything interesting. The only somewhat user visible change is a new debugging mode that checks that KVM's MMU never triggers a #VE virtualization exception in the guest. - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. Generic: - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Remove .change_pte() MMU notifier - the changes to non-KVM code are small and Andrew Morton asked that I also take those through the KVM tree. The callback was only ever implemented by KVM (which was also the original user of MMU notifiers) but it had been nonfunctional ever since calls to set_pte_at_notify were wrapped with invalidate_range_start and invalidate_range_end... in 2012. Selftests: - Enhance the demand paging test to allow for better reporting and stressing of UFFD performance. - Convert the steal time test to generate TAP-friendly output. - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed time across two different clock domains. - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT. - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell script, to play nice with running in a minimal userspace environment. - Allow skipping the RSEQ test's sanity check that the vCPU was able to complete a reasonable number of KVM_RUNs, as the assert can fail on a completely valid setup. If the test is run on a large-ish system that is otherwise idle, and the test isn't affined to a low-ish number of CPUs, the vCPU task can be repeatedly migrated to CPUs that are in deep sleep states, which results in the vCPU having very little net runtime before the next migration due to high wakeup latencies. - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by a change to kselftest_harness.h late in the 6.9 cycle, and because forcing every test to #define _GNU_SOURCE is painful. - Provide a global pseudo-RNG instance for all tests, so that library code can generate random, but determinstic numbers. - Use the global pRNG to randomly force emulation of select writes from guest code on x86, e.g. to help validate KVM's emulation of locked accesses. - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception handlers at VM creation, instead of forcing tests to manually trigger the related setup. Documentation: - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits) selftests/kvm: remove dead file KVM: selftests: arm64: Test vCPU-scoped feature ID registers KVM: selftests: arm64: Test that feature ID regs survive a reset KVM: selftests: arm64: Store expected register value in set_id_regs KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope KVM: arm64: Only reset vCPU-scoped feature ID regs once KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs() KVM: arm64: Rename is_id_reg() to imply VM scope KVM: arm64: Destroy mpidr_data for 'late' vCPU creation KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support KVM: arm64: Fix hvhe/nvhe early alias parsing KVM: SEV: Allow per-guest configuration of GHCB protocol version KVM: SEV: Add GHCB handling for termination requests KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests KVM: SEV: Add support to handle AP reset MSR protocol KVM: x86: Explicitly zero kvm_caps during vendor module load KVM: x86: Fully re-initialize supported_mce_cap on vendor module load KVM: x86: Fully re-initialize supported_vm_types on vendor module load KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values ...
2024-05-12Merge tag 'kvm-x86-vmx-6.10' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.10: - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. - Move kvm_vcpu_arch's exit_qualification into x86_exception, as the field is used only when synthesizing nested EPT violation, i.e. it's not the vCPU's "real" exit_qualification, which is tracked elsewhere. - Add a sanity check to assert that EPT Violations are the only sources of nested PML Full VM-Exits.
2024-05-10Merge tag 'loongarch-kvm-6.10' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.10 1. Add ParaVirt IPI support. 2. Add software breakpoint support. 3. Add mmio trace events support.
2024-04-30x86/irq: Remove bitfields in posted interrupt descriptorJacob Pan
Mixture of bitfields and types is weird and really not intuitive, remove bitfields and use typed data exclusively. Bitfields often result in inferior machine code. Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240423174114.526704-4-jacob.jun.pan@linux.intel.com Link: https://lore.kernel.org/all/20240404101735.402feec8@jacob-builder/T/#mf66e34a82a48f4d8e2926b5581eff59a122de53a
2024-04-30KVM: VMX: Move posted interrupt descriptor out of VMX codeJacob Pan
To prepare native usage of posted interrupts, move the PID declarations out of VMX code such that they can be shared. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20240423174114.526704-2-jacob.jun.pan@linux.intel.com
2024-04-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "This is a bit on the large side, mostly due to two changes: - Changes to disable some broken PMU virtualization (see below for details under "x86 PMU") - Clean up SVM's enter/exit assembly code so that it can be compiled without OBJECT_FILES_NON_STANDARD. This fixes a warning "Unpatched return thunk in use. This should not happen!" when running KVM selftests. Everything else is small bugfixes and selftest changes: - Fix a mostly benign bug in the gfn_to_pfn_cache infrastructure where KVM would allow userspace to refresh the cache with a bogus GPA. The bug has existed for quite some time, but was exposed by a new sanity check added in 6.9 (to ensure a cache is either GPA-based or HVA-based). - Drop an unused param from gfn_to_pfn_cache_invalidate_start() that got left behind during a 6.9 cleanup. - Fix a math goof in x86's hugepage logic for KVM_SET_MEMORY_ATTRIBUTES that results in an array overflow (detected by KASAN). - Fix a bug where KVM incorrectly clears root_role.direct when userspace sets guest CPUID. - Fix a dirty logging bug in the where KVM fails to write-protect SPTEs used by a nested guest, if KVM is using Page-Modification Logging and the nested hypervisor is NOT using EPT. x86 PMU: - Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. - Set the enable bits for general purpose counters in PERF_GLOBAL_CTRL at RESET time, as done by both Intel and AMD processors. - Disable LBR virtualization on CPUs that don't support LBR callstacks, as KVM unconditionally uses PERF_SAMPLE_BRANCH_CALL_STACK when creating the perf event, and would fail on such CPUs. Tests: - Fix a flaw in the max_guest_memory selftest that results in it exhausting the supply of ucall structures when run with more than 256 vCPUs. - Mark KVM_MEM_READONLY as supported for RISC-V in set_memory_region_test" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (30 commits) KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start() KVM: selftests: Add coverage of EPT-disabled to vmx_dirty_log_test KVM: x86/mmu: Fix and clarify comments about clearing D-bit vs. write-protecting KVM: x86/mmu: Remove function comments above clear_dirty_{gfn_range,pt_masked}() KVM: x86/mmu: Write-protect L2 SPTEs in TDP MMU when clearing dirty status KVM: x86/mmu: Precisely invalidate MMU root_role during CPUID update KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacks perf/x86/intel: Expose existence of callback support to KVM KVM: VMX: Snapshot LBR capabilities during module initialization KVM: x86/pmu: Do not mask LVTPC when handling a PMI on AMD platforms KVM: x86: Snapshot if a vCPU's vendor model is AMD vs. Intel compatible KVM: x86: Stop compiling vmenter.S with OBJECT_FILES_NON_STANDARD KVM: SVM: Create a stack frame in __svm_sev_es_vcpu_run() KVM: SVM: Save/restore args across SEV-ES VMRUN via host save area KVM: SVM: Save/restore non-volatile GPRs in SEV-ES VMRUN via host save area KVM: SVM: Clobber RAX instead of RBX when discarding spec_ctrl_intercepted KVM: SVM: Drop 32-bit "support" from __svm_sev_es_vcpu_run() KVM: SVM: Wrap __svm_sev_es_vcpu_run() with #ifdef CONFIG_KVM_AMD_SEV KVM: SVM: Create a stack frame in __svm_vcpu_run() for unwinding KVM: SVM: Remove a useless zeroing of allocated memory ...
2024-04-19KVM: VMX: Introduce test mode related to EPT violation VEIsaku Yamahata
To support TDX, KVM is enhanced to operate with #VE. For TDX, KVM uses the suppress #VE bit in EPT entries selectively, in order to be able to trap non-present conditions. However, #VE isn't used for VMX and it's a bug if it happens. To be defensive and test that VMX case isn't broken introduce an option ept_violation_ve_test and when it's set, BUG the vm. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Message-Id: <d6db6ba836605c0412e166359ba5c46a63c22f86.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-19KVM, x86: add architectural support code for #VEPaolo Bonzini
Dump the contents of the #VE info data structure and assert that #VE does not happen, but do not yet do anything with it. No functional change intended, separated for clarity only. Extracted from a patch by Isaku Yamahata <isaku.yamahata@intel.com>. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argumentSean Christopherson
TDX uses different ABI to get information about VM exit. Pass intr_info to the NMI and INTR handlers instead of pulling it from vcpu_vmx in preparation for sharing the bulk of the handlers with TDX. When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds exit qualification etc rather than the VMCS fields because VMM doesn't have access to the VMCS. The eventual code will be VMX: - get exit reason, intr_info, exit_qualification, and etc from VMCS - call NMI/INTR handlers (common code) TDX: - get exit reason, intr_info, exit_qualification, and etc from guest registers - call NMI/INTR handlers (common code) Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <0396a9ae70d293c9d0b060349dae385a8a4fbcec.1705965635.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-12KVM: VMX: Move out vmx_x86_ops to 'main.c' to dispatch VMX and TDXPaolo Bonzini
KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions to operate on VM. TDX doesn't allow VMM to operate VMCS directly. Instead, TDX has its own data structures, and TDX SEAMCALL APIs for VMM to indirectly operate those data structures. This means we must have a TDX version of kvm_x86_ops. The existing global struct kvm_x86_ops already defines an interface which can be adapted to TDX, but kvm_x86_ops is a system-wide, not per-VM structure. To allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers "if (tdx) tdx_op() else vmx_op()" to pick VMX or TDX at run time. To split the runtime switch, the VMX implementation, and the TDX implementation, add main.c, and move out the vmx_x86_ops hooks in preparation for adding TDX. Use 'vt' for the naming scheme as a nod to VT-x and as a concatenation of VmxTdx. The eventually converted code will look like this: vmx.c: vmx_op() { ... } VMX initialization tdx.c: tdx_op() { ... } TDX initialization x86_ops.h: vmx_op(); tdx_op(); main.c: static vt_op() { if (tdx) tdx_op() else vmx_op() } static struct kvm_x86_ops vt_x86_ops = { .op = vt_op, initialization functions call both VMX and TDX initialization Opportunistically, fix the name inconsistency from vmx_create_vcpu() and vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free(). Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <e603c317587f933a9d1bee8728c84e4935849c16.1705965634.git.isaku.yamahata@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-04-11KVM: VMX: Disable LBR virtualization if the CPU doesn't support LBR callstacksSean Christopherson
Disable LBR virtualization if the CPU doesn't support callstacks, which were introduced in HSW (see commit e9d7f7cd97c4 ("perf/x86/intel: Add basic Haswell LBR call stack support"), as KVM unconditionally configures the perf LBR event with PERF_SAMPLE_BRANCH_CALL_STACK, i.e. LBR virtualization always fails on pre-HSW CPUs. Simply disable LBR support on such CPUs, as it has never worked, i.e. there is no risk of breaking an existing setup, and figuring out a way to performantly context switch LBRs on old CPUs is not worth the effort. Fixes: be635e34c284 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES") Cc: Mingwei Zhang <mizhang@google.com> Cc: Jim Mattson <jmattson@google.com> Tested-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240307011344.835640-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-11KVM: VMX: Snapshot LBR capabilities during module initializationSean Christopherson
Snapshot VMX's LBR capabilities once during module initialization instead of calling into perf every time a vCPU reconfigures its vPMU. This will allow massaging the LBR capabilities, e.g. if the CPU doesn't support callstacks, without having to remember to update multiple locations. Opportunistically tag vmx_get_perf_capabilities() with __init, as it's only called from vmx_set_cpu_caps(). Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20240307011344.835640-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09KVM: nVMX: Add a sanity check that nested PML Full stems from EPT ViolationsSean Christopherson
Add a WARN_ON_ONCE() sanity check to verify that a nested PML Full VM-Exit is only synthesized when the original VM-Exit from L2 was an EPT Violation. While KVM can fallthrough to kvm_mmu_do_page_fault() if an EPT Misconfig occurs on a stale MMIO SPTE, KVM should not treat the access as a write (there isn't enough information to know *what* the access was), i.e. KVM should never try to insert a PML entry in that case. Link: https://lore.kernel.org/r/20240209221700.393189-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09KVM: x86: Move nEPT exit_qualification field from kvm_vcpu_arch to x86_exceptionSean Christopherson
Move the exit_qualification field that is used to track information about in-flight nEPT violations from "struct kvm_vcpu_arch" to "x86_exception", i.e. associate the information with the actual nEPT violation instead of the vCPU. To handle bits that are pulled from vmcs.EXIT_QUALIFICATION, i.e. that are propagated from the "original" EPT violation VM-Exit, simply grab them from the VMCS on-demand when injecting a nEPT Violation or a PML Full VM-exit. Aside from being ugly, having an exit_qualification field in kvm_vcpu_arch is outright dangerous, e.g. see commit d7f0a00e438d ("KVM: VMX: Report up-to-date exit qualification to userspace"). Opportunstically add a comment to call out that PML Full and EPT Violation VM-Exits use the same bit to report NMI blocking information. Link: https://lore.kernel.org/r/20240209221700.393189-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-09KVM: nVMX: Clear EXIT_QUALIFICATION when injecting an EPT MisconfigSean Christopherson
Explicitly clear the EXIT_QUALIFCATION field when injecting an EPT misconfig into L1, as required by the VMX architecture. Per the SDM: This field is saved for VM exits due to the following causes: debug exceptions; page-fault exceptions; start-up IPIs (SIPIs); system-management interrupts (SMIs) that arrive immediately after the execution of I/O instructions; task switches; INVEPT; INVLPG; INVPCID; INVVPID; LGDT; LIDT; LLDT; LTR; SGDT; SIDT; SLDT; STR; VMCLEAR; VMPTRLD; VMPTRST; VMREAD; VMWRITE; VMXON; WBINVD; WBNOINVD; XRSTORS; XSAVES; control-register accesses; MOV DR; I/O instructions; MWAIT; accesses to the APIC-access page; EPT violations; EOI virtualization; APIC-write emulation; page-modification log full; SPP-related events; and instruction timeout. For all other VM exits, this field is cleared. Generating EXIT_QUALIFICATION from vcpu->arch.exit_qualification is wrong for all (two) paths that lead to nested_ept_inject_page_fault(). For EPT violations (the common case), vcpu->arch.exit_qualification will have been set by handle_ept_violation() to vmcs02.EXIT_QUALIFICATION, i.e. contains the information of a EPT violation and thus is likely non-zero. For an EPT misconfig, which can reach FNAME(walk_addr_generic) and thus inject a nEPT misconfig if KVM created an MMIO SPTE that became stale, vcpu->arch.exit_qualification will hold the information from the last EPT violation VM-Exit, as vcpu->arch.exit_qualification is _only_ written by handle_ept_violation(). Fixes: 4704d0befb07 ("KVM: nVMX: Exiting from L2 to L1") Link: https://lore.kernel.org/r/20240209221700.393189-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08KVM: VMX: Ignore MKTME KeyID bits when intercepting #PF for ↵Tao Su
allow_smaller_maxphyaddr Use the raw/true host.MAXPHYADDR when deciding whether or not KVM must intercept #PFs when allow_smaller_maxphyaddr is enabled, as any adjustments the kernel makes to boot_cpu_data.x86_phys_bits to account for MKTME KeyID bits do not apply to the guest physical address space. I.e. the KeyID are off-limits for host physical addresses, but are not reserved for GPAs as far as hardware is concerned. Signed-off-by: Tao Su <tao1.su@linux.intel.com> Link: https://lore.kernel.org/r/20240319031111.495006-1-tao1.su@linux.intel.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08KVM: x86/pmu: Disable support for adaptive PEBSSean Christopherson
Drop support for virtualizing adaptive PEBS, as KVM's implementation is architecturally broken without an obvious/easy path forward, and because exposing adaptive PEBS can leak host LBRs to the guest, i.e. can leak host kernel addresses to the guest. Bug #1 is that KVM doesn't account for the upper 32 bits of IA32_FIXED_CTR_CTRL when (re)programming fixed counters, e.g fixed_ctrl_field() drops the upper bits, reprogram_fixed_counters() stores local variables as u8s and truncates the upper bits too, etc. Bug #2 is that, because KVM _always_ sets precise_ip to a non-zero value for PEBS events, perf will _always_ generate an adaptive record, even if the guest requested a basic record. Note, KVM will also enable adaptive PEBS in individual *counter*, even if adaptive PEBS isn't exposed to the guest, but this is benign as MSR_PEBS_DATA_CFG is guaranteed to be zero, i.e. the guest will only ever see Basic records. Bug #3 is in perf. intel_pmu_disable_fixed() doesn't clear the upper bits either, i.e. leaves ICL_FIXED_0_ADAPTIVE set, and intel_pmu_enable_fixed() effectively doesn't clear ICL_FIXED_0_ADAPTIVE either. I.e. perf _always_ enables ADAPTIVE counters, regardless of what KVM requests. Bug #4 is that adaptive PEBS *might* effectively bypass event filters set by the host, as "Updated Memory Access Info Group" records information that might be disallowed by userspace via KVM_SET_PMU_EVENT_FILTER. Bug #5 is that KVM doesn't ensure LBR MSRs hold guest values (or at least zeros) when entering a vCPU with adaptive PEBS, which allows the guest to read host LBRs, i.e. host RIPs/addresses, by enabling "LBR Entries" records. Disable adaptive PEBS support as an immediate fix due to the severity of the LBR leak in particular, and because fixing all of the bugs will be non-trivial, e.g. not suitable for backporting to stable kernels. Note! This will break live migration, but trying to make KVM play nice with live migration would be quite complicated, wouldn't be guaranteed to work (i.e. KVM might still kill/confuse the guest), and it's not clear that there are any publicly available VMMs that support adaptive PEBS, let alone live migrate VMs that support adaptive PEBS, e.g. QEMU doesn't support PEBS in any capacity. Link: https://lore.kernel.org/all/20240306230153.786365-1-seanjc@google.com Link: https://lore.kernel.org/all/ZeepGjHCeSfadANM@google.com Fixes: c59a1f106f5c ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS") Cc: stable@vger.kernel.org Cc: Like Xu <like.xu.linux@gmail.com> Cc: Mingwei Zhang <mizhang@google.com> Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhang Xiong <xiong.y.zhang@intel.com> Cc: Lv Zhiyuan <zhiyuan.lv@intel.com> Cc: Dapeng Mi <dapeng1.mi@intel.com> Cc: Jim Mattson <jmattson@google.com> Acked-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20240307005833.827147-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08x86/bhi: Mitigate KVM by defaultPawan Gupta
BHI mitigation mode spectre_bhi=auto does not deploy the software mitigation by default. In a cloud environment, it is a likely scenario where userspace is trusted but the guests are not trusted. Deploying system wide mitigation in such cases is not desirable. Update the auto mode to unconditionally mitigate against malicious guests. Deploy the software sequence at VMexit in auto mode also, when hardware mitigation is not available. Unlike the force =on mode, software sequence is not deployed at syscalls in auto mode. Suggested-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-04-08x86/bhi: Add support for clearing branch history at syscall entryPawan Gupta
Branch History Injection (BHI) attacks may allow a malicious application to influence indirect branch prediction in kernel by poisoning the branch history. eIBRS isolates indirect branch targets in ring0. The BHB can still influence the choice of indirect branch predictor entry, and although branch predictor entries are isolated between modes when eIBRS is enabled, the BHB itself is not isolated between modes. Alder Lake and new processors supports a hardware control BHI_DIS_S to mitigate BHI. For older processors Intel has released a software sequence to clear the branch history on parts that don't support BHI_DIS_S. Add support to execute the software sequence at syscall entry and VMexit to overwrite the branch history. For now, branch history is not cleared at interrupt entry, as malicious applications are not believed to have sufficient control over the registers, since previous register state is cleared at interrupt entry. Researchers continue to poke at this area and it may become necessary to clear at interrupt entry as well in the future. This mitigation is only defined here. It is enabled later. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Co-developed-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-03-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ...
2024-03-11Merge tag 'x86-core-2024-03-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core x86 updates from Ingo Molnar: - The biggest change is the rework of the percpu code, to support the 'Named Address Spaces' GCC feature, by Uros Bizjak: - This allows C code to access GS and FS segment relative memory via variables declared with such attributes, which allows the compiler to better optimize those accesses than the previous inline assembly code. - The series also includes a number of micro-optimizations for various percpu access methods, plus a number of cleanups of %gs accesses in assembly code. - These changes have been exposed to linux-next testing for the last ~5 months, with no known regressions in this area. - Fix/clean up __switch_to()'s broken but accidentally working handling of FPU switching - which also generates better code - Propagate more RIP-relative addressing in assembly code, to generate slightly better code - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to make it easier for distros to follow & maintain these options - Rework the x86 idle code to cure RCU violations and to clean up the logic - Clean up the vDSO Makefile logic - Misc cleanups and fixes * tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits) x86/idle: Select idle routine only once x86/idle: Let prefer_mwait_c1_over_halt() return bool x86/idle: Cleanup idle_setup() x86/idle: Clean up idle selection x86/idle: Sanitize X86_BUG_AMD_E400 handling sched/idle: Conditionally handle tick broadcast in default_idle_call() x86: Increase brk randomness entropy for 64-bit systems x86/vdso: Move vDSO to mmap region x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o x86/retpoline: Ensure default return thunk isn't used at runtime x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32 x86/vdso: Use $(addprefix ) instead of $(foreach ) x86/vdso: Simplify obj-y addition x86/vdso: Consolidate targets and clean-files x86/bugs: Rename CONFIG_RETHUNK => CONFIG_MITIGATION_RETHUNK x86/bugs: Rename CONFIG_CPU_SRSO => CONFIG_MITIGATION_SRSO x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY => CONFIG_MITIGATION_IBRS_ENTRY x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY => CONFIG_MITIGATION_UNRET_ENTRY x86/bugs: Rename CONFIG_SLS => CONFIG_MITIGATION_SLS ...
2024-03-11Merge tag 'x86-fred-2024-03-10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FRED support from Thomas Gleixner: "Support for x86 Fast Return and Event Delivery (FRED). FRED is a replacement for IDT event delivery on x86 and addresses most of the technical nightmares which IDT exposes: 1) Exception cause registers like CR2 need to be manually preserved in nested exception scenarios. 2) Hardware interrupt stack switching is suboptimal for nested exceptions as the interrupt stack mechanism rewinds the stack on each entry which requires a massive effort in the low level entry of #NMI code to handle this. 3) No hardware distinction between entry from kernel or from user which makes establishing kernel context more complex than it needs to be especially for unconditionally nestable exceptions like NMI. 4) NMI nesting caused by IRET unconditionally reenabling NMIs, which is a problem when the perf NMI takes a fault when collecting a stack trace. 5) Partial restore of ESP when returning to a 16-bit segment 6) Limitation of the vector space which can cause vector exhaustion on large systems. 7) Inability to differentiate NMI sources FRED addresses these shortcomings by: 1) An extended exception stack frame which the CPU uses to save exception cause registers. This ensures that the meta information for each exception is preserved on stack and avoids the extra complexity of preserving it in software. 2) Hardware interrupt stack switching is non-rewinding if a nested exception uses the currently interrupt stack. 3) The entry points for kernel and user context are separate and GS BASE handling which is required to establish kernel context for per CPU variable access is done in hardware. 4) NMIs are now nesting protected. They are only reenabled on the return from NMI. 5) FRED guarantees full restore of ESP 6) FRED does not put a limitation on the vector space by design because it uses a central entry points for kernel and user space and the CPUstores the entry type (exception, trap, interrupt, syscall) on the entry stack along with the vector number. The entry code has to demultiplex this information, but this removes the vector space restriction. The first hardware implementations will still have the current restricted vector space because lifting this limitation requires further changes to the local APIC. 7) FRED stores the vector number and meta information on stack which allows having more than one NMI vector in future hardware when the required local APIC changes are in place. The series implements the initial FRED support by: - Reworking the existing entry and IDT handling infrastructure to accomodate for the alternative entry mechanism. - Expanding the stack frame to accomodate for the extra 16 bytes FRED requires to store context and meta information - Providing FRED specific C entry points for events which have information pushed to the extended stack frame, e.g. #PF and #DB. - Providing FRED specific C entry points for #NMI and #MCE - Implementing the FRED specific ASM entry points and the C code to demultiplex the events - Providing detection and initialization mechanisms and the necessary tweaks in context switching, GS BASE handling etc. The FRED integration aims for maximum code reuse vs the existing IDT implementation to the extent possible and the deviation in hot paths like context switching are handled with alternatives to minimalize the impact. The low level entry and exit paths are seperate due to the extended stack frame and the hardware based GS BASE swichting and therefore have no impact on IDT based systems. It has been extensively tested on existing systems and on the FRED simulation and as of now there are no outstanding problems" * tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/fred: Fix init_task thread stack pointer initialization MAINTAINERS: Add a maintainer entry for FRED x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly x86/fred: Invoke FRED initialization code to enable FRED x86/fred: Add FRED initialization functions x86/syscall: Split IDT syscall setup code into idt_syscall_init() KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled x86/traps: Add sysvec_install() to install a system interrupt handler x86/fred: FRED entry/exit and dispatch code x86/fred: Add a machine check entry stub for FRED x86/fred: Add a NMI entry stub for FRED x86/fred: Add a debug fault entry stub for FRED x86/idtentry: Incorporate definitions/declarations of the FRED entries x86/fred: Make exc_page_fault() work for FRED x86/fred: Allow single-step trap and NMI when starting a new task x86/fred: No ESPFIX needed when FRED is enabled ...
2024-03-11Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 PMU changes for 6.9: - Fix several bugs where KVM speciously prevents the guest from utilizing fixed counters and architectural event encodings based on whether or not guest CPUID reports support for the _architectural_ encoding. - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads, priority of VMX interception vs #GP, PMC types in architectural PMUs, etc. - Add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID, i.e. are difficult to validate via KVM-Unit-Tests. - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting cycles, e.g. when checking if a PMC event needs to be synthesized when skipping an instruction. - Optimize triggering of emulated events, e.g. for "count instructions" events when skipping an instruction, which yields a ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit.
2024-03-11Merge tag 'kvm-x86-vmx-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.9: - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace due to an unexpected VM-Exit while the CPU was vectoring an exception. - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support. - Clean up the logic for massaging the passthrough MSR bitmaps when userspace changes its MSR filter.
2024-03-11Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.9: - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness in KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the deubgregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes.
2024-02-27KVM: VMX: Combine "check" and "get" APIs for passthrough MSR lookupsSean Christopherson
Combine possible_passthrough_msr_slot() and is_valid_passthrough_msr() into a single function, vmx_get_passthrough_msr_slot(), and have the combined helper return the slot on success, using a negative value to indicate "failure". Combining the operations avoids iterating over the array of passthrough MSRs twice for relevant MSRs. Suggested-by: Dongli Zhang <dongli.zhang@oracle.com> Reviewed-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27KVM: VMX: return early if msr_bitmap is not supportedDongli Zhang
The vmx_msr_filter_changed() may directly/indirectly calls only vmx_enable_intercept_for_msr() or vmx_disable_intercept_for_msr(). Those two functions may exit immediately if !cpu_has_vmx_msr_bitmap(). vmx_msr_filter_changed() -> vmx_disable_intercept_for_msr() -> pt_update_intercept_for_msr() -> vmx_set_intercept_for_msr() -> vmx_enable_intercept_for_msr() -> vmx_disable_intercept_for_msr() Therefore, we exit early if !cpu_has_vmx_msr_bitmap(). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-27KVM: VMX: fix comment to add LBR to passthrough MSRsDongli Zhang
According to the is_valid_passthrough_msr(), the LBR MSRs are also passthrough MSRs, since the commit 1b5ac3226a1a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE"). Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Link: https://lore.kernel.org/r/20240223202104.3330974-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: x86: Fully defer to vendor code to decide how to force immediate exitSean Christopherson
Now that vmx->req_immediate_exit is used only in the scope of vmx_vcpu_run(), use force_immediate_exit to detect that KVM should usurp the VMX preemption to force a VM-Exit and let vendor code fully handle forcing a VM-Exit. Opportunsitically drop __kvm_request_immediate_exit() and just have vendor code call smp_send_reschedule() directly. SVM already does this when injecting an event while also trying to single-step an IRET, i.e. it's not exactly secret knowledge that KVM uses a reschedule IPI to force an exit. Link: https://lore.kernel.org/r/20240110012705.506918-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: VMX: Handle KVM-induced preemption timer exits in fastpath for L2Sean Christopherson
Eat VMX treemption timer exits in the fastpath regardless of whether L1 or L2 is active. The VM-Exit is 100% KVM-induced, i.e. there is nothing directly related to the exit that KVM needs to do on behalf of the guest, thus there is no reason to wait until the slow path to do nothing. Opportunistically add comments explaining why preemption timer exits for emulating the guest's APIC timer need to go down the slow path. Link: https://lore.kernel.org/r/20240110012705.506918-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: x86: Move handling of is_guest_mode() into fastpath exit handlersSean Christopherson
Let the fastpath code decide which exits can/can't be handled in the fastpath when L2 is active, e.g. when KVM generates a VMX preemption timer exit to forcefully regain control, there is no "work" to be done and so such exits can be handled in the fastpath regardless of whether L1 or L2 is active. Moving the is_guest_mode() check into the fastpath code also makes it easier to see that L2 isn't allowed to use the fastpath in most cases, e.g. it's not immediately obvious why handle_fastpath_preemption_timer() is called from the fastpath and the normal path. Link: https://lore.kernel.org/r/20240110012705.506918-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: VMX: Handle forced exit due to preemption timer in fastpathSean Christopherson
Handle VMX preemption timer VM-Exits due to KVM forcing an exit in the exit fastpath, i.e. avoid calling back into handle_preemption_timer() for the same exit. There is no work to be done for forced exits, as the name suggests the goal is purely to get control back in KVM. In addition to shaving a few cycles, this will allow cleanly separating handle_fastpath_preemption_timer() from handle_preemption_timer(), e.g. it's not immediately obvious why _apparently_ calling handle_fastpath_preemption_timer() twice on a "slow" exit is necessary: the "slow" call is necessary to handle exits from L2, which are excluded from the fastpath by vmx_vcpu_run(). Link: https://lore.kernel.org/r/20240110012705.506918-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>