summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/vmx
AgeCommit message (Collapse)Author
2025-05-13Merge branch 'x86/msr' into x86/core, to resolve conflictsIngo Molnar
Conflicts: arch/x86/boot/startup/sme.c arch/x86/coco/sev/core.c arch/x86/kernel/fpu/core.c arch/x86/kernel/fpu/xstate.c Semantic conflict: arch/x86/include/asm/sev-internal.h Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-13Merge branch 'x86/bugs' into x86/core, to merge dependent commitsIngo Molnar
Prepare to resolve conflicts with an upstream series of fixes that conflict with pending x86 changes: 6f5bf947bab0 Merge tag 'its-for-linus-20250509' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-05-02KVM: VMX: Clean up and macrofy x86_opsVishal Verma
Eliminate a lot of stub definitions by using macros to define the TDX vs non-TDX versions of various x86_ops. Moving the x86_ops wrappers under CONFIG_KVM_INTEL_TDX also allows nearly all of vmx/main.c to go under a single #ifdef, eliminating trampolines in the generated code, and almost all of the stubs. For example, with CONFIG_KVM_INTEL_TDX=n, before this cleanup, vt_refresh_apicv_exec_ctrl() would produce: 0000000000036490 <vt_refresh_apicv_exec_ctrl>: 36490: f3 0f 1e fa endbr64 36494: e8 00 00 00 00 call 36499 <vt_refresh_apicv_exec_ctrl+0x9> 36495: R_X86_64_PLT32 __fentry__-0x4 36499: e9 00 00 00 00 jmp 3649e <vt_refresh_apicv_exec_ctrl+0xe> 3649a: R_X86_64_PLT32 vmx_refresh_apicv_exec_ctrl-0x4 3649e: 66 90 xchg %ax,%ax After this patch, this is completely eliminated. Based on a patch by Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/ Cc: Sean Christopherson <seanjc@google.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-4-701e82d6b779@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02KVM: VMX: Define a VMX glue macro for kvm_complete_insn_gp()Vishal Verma
Define kvm_complete_insn_gp() as vmx_complete_emulated_msr() and use the glue wrapper in vt_complete_emulated_msr() so that VT's .complete_emulated_msr() implementation follows the soon-to-be-standard pattern of: vt_abc: if (is_td()) return tdx_abc(); return vmx_abc(); This will allow generating such wrappers via a macro, which in turn will make it trivially easy to skip the wrappers entirely when KVM_INTEL_TDX=n. Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/ Cc: Sean Christopherson <seanjc@google.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-3-701e82d6b779@intel.com [sean: massage shortlog+changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02KVM: VMX: Move vt_apicv_pre_state_restore() to posted_intr.c and tweak nameVishal Verma
In preparation for a cleanup of the kvm_x86_ops struct for TDX, all vt_* functions are expected to act as glue functions that route to either tdx_* or vmx_* based on the VM type. Specifically, the pattern is: vt_abc: if (is_td()) return tdx_abc(); return vmx_abc(); But vt_apicv_pre_state_restore() does not follow this pattern. To facilitate that cleanup, rename and move vt_apicv_pre_state_restore() into posted_intr.c. Opportunistically turn vcpu_to_pi_desc() back into a static function, as the only reason it was exposed outside of posted_intr.c was for vt_apicv_pre_state_restore(). No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/kvm/Z6v9yjWLNTU6X90d@google.com/ Cc: Sean Christopherson <seanjc@google.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Link: https://lore.kernel.org/r/20250318-vverma7-cleanup_x86_ops-v2-2-701e82d6b779@intel.com [sean: apply Chao's suggestions, massage shortlog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-05-02x86/msr: Convert __rdmsr() uses to native_rdmsrq() usesXin Li (Intel)
__rdmsr() is the lowest level MSR write API, with native_rdmsr() and native_rdmsrq() serving as higher-level wrappers around it. #define native_rdmsr(msr, val1, val2) \ do { \ u64 __val = __rdmsr((msr)); \ (void)((val1) = (u32)__val); \ (void)((val2) = (u32)(__val >> 32)); \ } while (0) static __always_inline u64 native_rdmsrq(u32 msr) { return __rdmsr(msr); } However, __rdmsr() continues to be utilized in various locations. MSR APIs are designed for different scenarios, such as native or pvops, with or without trace, and safe or non-safe. Unfortunately, the current MSR API names do not adequately reflect these factors, making it challenging to select the most appropriate API for various situations. To pave the way for improving MSR API names, convert __rdmsr() uses to native_rdmsrq() to ensure consistent usage. Later, these APIs can be renamed to better reflect their implications, such as native or pvops, with or without trace, and safe or non-safe. No functional change intended. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20250427092027.1598740-10-xin@zytor.com
2025-05-02x86/msr: Add explicit includes of <asm/msr.h>Xin Li (Intel)
For historic reasons there are some TSC-related functions in the <asm/msr.h> header, even though there's an <asm/tsc.h> header. To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h> to <asm/tsc.h> and to eventually eliminate the inclusion of <asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency to the source files that reference definitions from <asm/msr.h>. [ mingo: Clarified the changelog. ] Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Brian Gerst <brgerst@gmail.com> Cc: Juergen Gross <jgross@suse.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Kees Cook <keescook@chromium.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
2025-05-02Merge tag 'v6.15-rc4' into x86/msr, to pick up fixes and resolve conflictsIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-29x86/bugs: Restructure L1TF mitigationDavid Kaplan
Restructure L1TF to use select/apply functions to create consistent vulnerability handling. Define new AUTO mitigation for L1TF. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/20250418161721.1855190-16-david.kaplan@amd.com
2025-04-29KVM: x86: Unify cross-vCPU IBPBSean Christopherson
Both SVM and VMX have similar implementation for executing an IBPB between running different vCPUs on the same CPU to create separate prediction domains for different vCPUs. For VMX, when the currently loaded VMCS is changed in vmx_vcpu_load_vmcs(), an IBPB is executed if there is no 'buddy', which is the case on vCPU load. The intention is to execute an IBPB when switching vCPUs, but not when switching the VMCS within the same vCPU. Executing an IBPB on nested transitions within the same vCPU is handled separately and conditionally in nested_vmx_vmexit(). For SVM, the current VMCB is tracked on vCPU load and an IBPB is executed when it is changed. The intention is also to execute an IBPB when switching vCPUs, although it is possible that in some cases an IBBP is executed when switching VMCBs for the same vCPU. Executing an IBPB on nested transitions should be handled separately, and is proposed at [1]. Unify the logic by tracking the last loaded vCPU and execuintg the IBPB on vCPU change in kvm_arch_vcpu_load() instead. When a vCPU is destroyed, make sure all references to it are removed from any CPU. This is similar to how SVM clears the current_vmcb tracking on vCPU destruction. Remove the current VMCB tracking in SVM as it is no longer required, as well as the 'buddy' parameter to vmx_vcpu_load_vmcs(). [1] https://lore.kernel.org/lkml/20250221163352.3818347-4-yosry.ahmed@linux.dev Link: https://lore.kernel.org/all/20250320013759.3965869-1-yosry.ahmed@linux.dev Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> [sean: tweak comment to stay at/under 80 columns] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-28KVM: VMX: Flush shadow VMCS on emergency rebootChao Gao
Ensure the shadow VMCS cache is evicted during an emergency reboot to prevent potential memory corruption if the cache is evicted after reboot. This issue was identified through code inspection, as __loaded_vmcs_clear() flushes both the normal VMCS and the shadow VMCS. Avoid checking the "launched" state during an emergency reboot, unlike the behavior in __loaded_vmcs_clear(). This is important because reboot NMIs can interfere with operations like copy_shadow_to_vmcs12(), where shadow VMCSes are loaded directly using VMPTRLD. In such cases, if NMIs occur right after the VMCS load, the shadow VMCSes will be active but the "launched" state may not be set. Fixes: 16f5b9034b69 ("KVM: nVMX: Copy processor-specific shadow-vmcs to VMCS12") Cc: stable@vger.kernel.org Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250324140849.2099723-1-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Use LEAVE in vmx_do_interrupt_irqoff()Uros Bizjak
Micro-optimize vmx_do_interrupt_irqoff() by substituting MOV %RBP,%RSP; POP %RBP instruction sequence with equivalent LEAVE instruction. GCC compiler does this by default for a generic tuning and for all modern processors: DEF_TUNE (X86_TUNE_USE_LEAVE, "use_leave", m_386 | m_CORE_ALL | m_K6_GEODE | m_AMD_MULTIPLE | m_ZHAOXIN | m_TREMONT | m_CORE_HYBRID | m_CORE_ATOM | m_GENERIC) The new code also saves a couple of bytes, from: 27: 48 89 ec mov %rbp,%rsp 2a: 5d pop %rbp to: 27: c9 leave No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lore.kernel.org/r/20250414081131.97374-2-ubizjak@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: nVMX: Check MSR load/store list counts during VM-Enter consistency checksSean Christopherson
Explicitly verify the MSR load/store list counts are below the advertised limit as part of the initial consistency checks on the lists, so that code that consumes the count doesn't need to worry about extreme edge cases. Enforcing the limit during the initial checks fixes a flaw on 32-bit KVM where a sufficiently high @count could lead to overflow: arch/x86/kvm/vmx/nested.c:834 nested_vmx_check_msr_switch() warn: potential user controlled sizeof overflow 'addr + count * 16' '0-u64max + 16-68719476720' arch/x86/kvm/vmx/nested.c 827 static int nested_vmx_check_msr_switch(struct kvm_vcpu *vcpu, 828 u32 count, u64 addr) 829 { 830 if (count == 0) 831 return 0; 832 833 if (!kvm_vcpu_is_legal_aligned_gpa(vcpu, addr, 16) || --> 834 !kvm_vcpu_is_legal_gpa(vcpu, (addr + count * sizeof(struct vmx_msr_entry) - 1))) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ While the SDM doesn't explicitly state an illegal count results in VM-Fail, the SDM states that exceeding the limit may result in undefined behavior. I.e. the SDM gives hardware, and thus KVM, carte blanche to do literally anything in response to a count that exceeds the "recommended" limit. If the limit is exceeded, undefined processor behavior may result (including a machine check during the VMX transition). KVM already enforces the limit when processing the MSRs, i.e. already signals a late VM-Exit Consistency Check for VM-Enter, and generates a VMX Abort for VM-Exit. I.e. explicitly checking the limits simply means KVM will signal VM-Fail instead of VM-Exit or VMX Abort. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/44961459-2759-4164-b604-f6bd43da8ce9@stanley.mountain Link: https://lore.kernel.org/r/20250315024402.2363098-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24x86/irq: KVM: Track PIR bitmap as an "unsigned long" arraySean Christopherson
Track the PIR bitmap in posted interrupt descriptor structures as an array of unsigned longs instead of using unionized arrays for KVM (u32s) versus IRQ management (u64s). In practice, because the non-KVM usage is (sanely) restricted to 64-bit kernels, all existing usage of the u64 variant is already working with unsigned longs. Using "unsigned long" for the array will allow reworking KVM's processing of the bitmap to read/write in 64-bit chunks on 64-bit kernels, i.e. will allow optimizing KVM by reducing the number of atomic accesses to PIR. Opportunstically replace the open coded literals in the posted MSIs code with the appropriate macro. Deliberately don't use ARRAY_SIZE() in the for-loops, even though it would be cleaner from a certain perspective, in anticipation of decoupling the processing from the array declaration. No functional change intended. Link: https://lore.kernel.org/r/20250401163447.846608-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Add module param to control and enumerate device posted IRQsSean Christopherson
Add a module param to each KVM vendor module to allow disabling device posted interrupts without having to sacrifice all of APICv/AVIC, and to also effectively enumerate to userspace whether or not KVM may be utilizing device posted IRQs. Disabling device posted interrupts is very desirable for testing, and can even be desirable for production environments, e.g. if the host kernel wants to interpose on device interrupts. Put the module param in kvm-{amd,intel}.ko instead of kvm.ko to match the overall APICv/AVIC controls, and to avoid complications with said controls. E.g. if the param is in kvm.ko, KVM needs to be snapshot the original user-defined value to play nice with a vendor module being reloaded with different enable_apicv settings. Link: https://lore.kernel.org/r/20250401161804.842968-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: VMX: Don't send UNBLOCK when starting device assignment without APICvSean Christopherson
When starting device assignment, i.e. potential IRQ bypass, don't blast KVM_REQ_UNBLOCK if APICv is disabled/unsupported. There is no need to wake vCPUs if they can never use VT-d posted IRQs (sending UNBLOCK guards against races being vCPUs blocking and devices starting IRQ bypass). Opportunistically use kvm_arch_has_irq_bypass() for all relevant checks in the VMX Posted Interrupt code so that all checks in KVM x86 incorporate the same information (once AMD/AVIC is given similar treatment). Cc: Yosry Ahmed <yosry.ahmed@linux.dev> Link: https://lore.kernel.org/r/20250401161804.842968-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24KVM: x86: Generalize IBRS virtualization on emulated VM-exitYosry Ahmed
Commit 2e7eab81425a ("KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS") added an IBPB in the emulated VM-exit path on Intel to properly virtualize IBRS by providing separate predictor modes for L1 and L2. AMD requires similar handling, except when IbrsSameMode is enumerated by the host CPU (which is the case on most/all AMD CPUs). With IbrsSameMode, hardware IBRS is sufficient and no extra handling is needed from KVM. Generalize the handling in nested_vmx_vmexit() by moving it into a generic function, add the AMD handling, and use it in nested_svm_vmexit() too. The main reason for using a generic function is to have a single place to park the huge comment about virtualizing IBRS. Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reviewed-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250221163352.3818347-4-yosry.ahmed@linux.dev [sean: use kvm_nested_vmexit_handle_spec_ctrl() for the helper] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-04-24Merge branch 'kvm-fixes-6.15-rc4' into HEADPaolo Bonzini
* Single fix for broken usage of 'multi-MIDR' infrastructure in PI code, adding an open-coded erratum check for Cavium ThunderX * Bugfixes from a planned posted interrupt rework * Do not use kvm_rip_read() unconditionally to cater for guests with inaccessible register state.
2025-04-24KVM: x86: Reset IRTE to host control if *new* route isn't postableSean Christopherson
Restore an IRTE back to host control (remapped or posted MSI mode) if the *new* GSI route prevents posting the IRQ directly to a vCPU, regardless of the GSI routing type. Updating the IRTE if and only if the new GSI is an MSI results in KVM leaving an IRTE posting to a vCPU. The dangling IRTE can result in interrupts being incorrectly delivered to the guest, and in the worst case scenario can result in use-after-free, e.g. if the VM is torn down, but the underlying host IRQ isn't freed. Fixes: efc644048ecd ("KVM: x86: Update IRTE for posted-interrupts") Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250404193923.1413163-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-16x86/bugs: Rename mmio_stale_data_clear to cpu_buf_vm_clearPawan Gupta
The static key mmio_stale_data_clear controls the KVM-only mitigation for MMIO Stale Data vulnerability. Rename it to reflect its purpose. No functional change. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250416-mmio-rename-v2-1-ad1f5488767c@linux.intel.com
2025-04-10x86/msr: Rename 'native_wrmsrl()' to 'native_wrmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl_safe()' to 'rdmsrq_safe()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'wrmsrl()' to 'wrmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-10x86/msr: Rename 'rdmsrl()' to 'rdmsrq()'Ingo Molnar
Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Juergen Gross <jgross@suse.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Xin Li <xin@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Rework heuristics for resolving the fault IPA (HPFAR_EL2 v. re-walk stage-1 page tables) to align with the architecture. This avoids possibly taking an SEA at EL2 on the page table walk or using an architecturally UNKNOWN fault IPA - Use acquire/release semantics in the KVM FF-A proxy to avoid reading a stale value for the FF-A version - Fix KVM guest driver to match PV CPUID hypercall ABI - Use Inner Shareable Normal Write-Back mappings at stage-1 in KVM selftests, which is the only memory type for which atomic instructions are architecturally guaranteed to work s390: - Don't use %pK for debug printing and tracepoints x86: - Use a separate subclass when acquiring KVM's per-CPU posted interrupts wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list of vCPUs to wake, to workaround a false positive deadlock. The schedule out code runs with a scheduler lock that the wakeup handler takes in the opposite order; but it does so with IRQs disabled and cannot run concurrently with a wakeup - Explicitly zero-initialize on-stack CPUID unions - Allow building irqbypass.ko as as module when kvm.ko is a module - Wrap relatively expensive sanity check with KVM_PROVE_MMU - Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses selftests: - Add more scenarios to the MONITOR/MWAIT test - Add option to rseq test to override /dev/cpu_dma_latency - Bring list of exit reasons up to date - Cleanup Makefile to list once tests that are valid on all architectures Other: - Documentation fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (26 commits) KVM: arm64: Use acquire/release to communicate FF-A version negotiation KVM: arm64: selftests: Explicitly set the page attrs to Inner-Shareable KVM: arm64: selftests: Introduce and use hardware-definition macros KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positive KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup list KVM: x86: Explicitly zero-initialize on-stack CPUID unions KVM: Allow building irqbypass.ko as as module when kvm.ko is a module KVM: x86/mmu: Wrap sanity check on number of TDP MMU pages with KVM_PROVE_MMU KVM: selftests: Add option to rseq test to override /dev/cpu_dma_latency KVM: x86: Acquire SRCU in KVM_GET_MP_STATE to protect guest memory accesses Documentation: kvm: remove KVM_CAP_MIPS_TE Documentation: kvm: organize capabilities in the right section Documentation: kvm: fix some definition lists Documentation: kvm: drop "Capability" heading from capabilities Documentation: kvm: give correct name for KVM_CAP_SPAPR_MULTITCE Documentation: KVM: KVM_GET_SUPPORTED_CPUID now exposes TSC_DEADLINE selftests: kvm: list once tests that are valid on all architectures selftests: kvm: bring list of exit reasons up to date selftests: kvm: revamp MONITOR/MWAIT tests KVM: arm64: Don't translate FAR if invalid/unsafe ...
2025-04-07Merge branch 'kvm-tdx-initial' into HEADPaolo Bonzini
This large commit contains the initial support for TDX in KVM. All x86 parts enable the host-side hypercalls that KVM uses to talk to the TDX module, a software component that runs in a special CPU mode called SEAM (Secure Arbitration Mode). The series is in turn split into multiple sub-series, each with a separate merge commit: - Initialization: basic setup for using the TDX module from KVM, plus ioctls to create TDX VMs and vCPUs. - MMU: in TDX, private and shared halves of the address space are mapped by different EPT roots, and the private half is managed by the TDX module. Using the support that was added to the generic MMU code in 6.14, add support for TDX's secure page tables to the Intel side of KVM. Generic KVM code takes care of maintaining a mirror of the secure page tables so that they can be queried efficiently, and ensuring that changes are applied to both the mirror and the secure EPT. - vCPU enter/exit: implement the callbacks that handle the entry of a TDX vCPU (via the SEAMCALL TDH.VP.ENTER) and the corresponding save/restore of host state. - Userspace exits: introduce support for guest TDVMCALLs that KVM forwards to userspace. These correspond to the usual KVM_EXIT_* "heavyweight vmexits" but are triggered through a different mechanism, similar to VMGEXIT for SEV-ES and SEV-SNP. - Interrupt handling: support for virtual interrupt injection as well as handling VM-Exits that are caused by vectored events. Exclusive to TDX are machine-check SMIs, which the kernel already knows how to handle through the kernel machine check handler (commit 7911f145de5f, "x86/mce: Implement recovery for errors in TDX/SEAM non-root mode") - Loose ends: handling of the remaining exits from the TDX module, including EPT violation/misconfig and several TDVMCALL leaves that are handled in the kernel (CPUID, HLT, RDMSR/WRMSR, GetTdVmCallInfo); plus returning an error or ignoring operations that are not supported by TDX guests Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-07Merge branch 'kvm-pi-fix-lockdep' into HEADPaolo Bonzini
2025-04-04KVM: VMX: Use separate subclasses for PI wakeup lock to squash false positiveYan Zhao
Use a separate subclass when acquiring KVM's per-CPU posted interrupts wakeup lock in the scheduled out path, i.e. when adding a vCPU on the list of vCPUs to wake, to workaround a false positive deadlock. Chain exists of: &p->pi_lock --> &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); lock(&rq->__lock); lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu)); lock(&p->pi_lock); *** DEADLOCK *** In the wakeup handler, the callchain is *always*: sysvec_kvm_posted_intr_wakeup_ipi() | --> pi_wakeup_handler() | --> kvm_vcpu_wake_up() | --> try_to_wake_up(), and the lock order is: &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) --> &p->pi_lock. For the schedule out path, the callchain is always (for all intents and purposes; if the kernel is preemptible, kvm_sched_out() can be called from something other than schedule(), but the beginning of the callchain will be the same point in vcpu_block()): vcpu_block() | --> schedule() | --> kvm_sched_out() | --> vmx_vcpu_put() | --> vmx_vcpu_pi_put() | --> pi_enable_wakeup_handler() and the lock order is: &rq->__lock --> &per_cpu(wakeup_vcpus_on_cpu_lock, cpu) I.e. lockdep sees AB+BC ordering for schedule out, and CA ordering for wakeup, and complains about the A=>C versus C=>A inversion. In practice, deadlock can't occur between schedule out and the wakeup handler as they are mutually exclusive. The entirely of the schedule out code that runs with the problematic scheduler locks held, does so with IRQs disabled, i.e. can't run concurrently with the wakeup handler. Use a subclass instead disabling lockdep entirely, and tell lockdep that both subclasses are being acquired when loading a vCPU, as the sched_out and sched_in paths are NOT mutually exclusive, e.g. CPU 0 CPU 1 --------------- --------------- vCPU0 sched_out vCPU1 sched_in vCPU1 sched_out vCPU 0 sched_in where vCPU0's sched_in may race with vCPU1's sched_out, on CPU 0's wakeup list+lock. Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-ID: <20250401154727.835231-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-04-04KVM: VMX: Assert that IRQs are disabled when putting vCPU on PI wakeup listSean Christopherson
Assert that IRQs are already disabled when putting a vCPU on a CPU's PI wakeup list, as opposed to saving/disabling+restoring IRQs. KVM relies on IRQs being disabled until the vCPU task is fully scheduled out, i.e. until the scheduler has dropped all of its per-CPU locks (e.g. for the runqueue), as attempting to wake the task while it's being scheduled out could lead to deadlock. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250401154727.835231-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-25Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "ARM: - Nested virtualization support for VGICv3, giving the nested hypervisor control of the VGIC hardware when running an L2 VM - Removal of 'late' nested virtualization feature register masking, making the supported feature set directly visible to userspace - Support for emulating FEAT_PMUv3 on Apple silicon, taking advantage of an IMPLEMENTATION DEFINED trap that covers all PMUv3 registers - Paravirtual interface for discovering the set of CPU implementations where a VM may run, addressing a longstanding issue of guest CPU errata awareness in big-little systems and cross-implementation VM migration - Userspace control of the registers responsible for identifying a particular CPU implementation (MIDR_EL1, REVIDR_EL1, AIDR_EL1), allowing VMs to be migrated cross-implementation - pKVM updates, including support for tracking stage-2 page table allocations in the protected hypervisor in the 'SecPageTable' stat - Fixes to vPMU, ensuring that userspace updates to the vPMU after KVM_RUN are reflected into the backing perf events LoongArch: - Remove unnecessary header include path - Assume constant PGD during VM context switch - Add perf events support for guest VM RISC-V: - Disable the kernel perf counter during configure - KVM selftests improvements for PMU - Fix warning at the time of KVM module removal x86: - Add support for aging of SPTEs without holding mmu_lock. Not taking mmu_lock allows multiple aging actions to run in parallel, and more importantly avoids stalling vCPUs. This includes an implementation of per-rmap-entry locking; aging the gfn is done with only a per-rmap single-bin spinlock taken, whereas locking an rmap for write requires taking both the per-rmap spinlock and the mmu_lock. Note that this decreases slightly the accuracy of accessed-page information, because changes to the SPTE outside aging might not use atomic operations even if they could race against a clear of the Accessed bit. This is deliberate because KVM and mm/ tolerate false positives/negatives for accessed information, and testing has shown that reducing the latency of aging is far more beneficial to overall system performance than providing "perfect" young/old information. - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to coalesce updates when multiple pieces of vCPU state are changing, e.g. as part of a nested transition - Fix a variety of nested emulation bugs, and add VMX support for synthesizing nested VM-Exit on interception (instead of injecting #UD into L2) - Drop "support" for async page faults for protected guests that do not set SEND_ALWAYS (i.e. that only want async page faults at CPL3) - Bring a bit of sanity to x86's VM teardown code, which has accumulated a lot of cruft over the years. Particularly, destroy vCPUs before the MMU, despite the latter being a VM-wide operation - Add common secure TSC infrastructure for use within SNP and in the future TDX - Block KVM_CAP_SYNC_REGS if guest state is protected. It does not make sense to use the capability if the relevant registers are not available for reading or writing - Don't take kvm->lock when iterating over vCPUs in the suspend notifier to fix a largely theoretical deadlock - Use the vCPU's actual Xen PV clock information when starting the Xen timer, as the cached state in arch.hv_clock can be stale/bogus - Fix a bug where KVM could bleed PVCLOCK_GUEST_STOPPED across different PV clocks; restrict PVCLOCK_GUEST_STOPPED to kvmclock, as KVM's suspend notifier only accounts for kvmclock, and there's no evidence that the flag is actually supported by Xen guests - Clean up the per-vCPU "cache" of its reference pvclock, and instead only track the vCPU's TSC scaling (multipler+shift) metadata (which is moderately expensive to compute, and rarely changes for modern setups) - Don't write to the Xen hypercall page on MSR writes that are initiated by the host (userspace or KVM) to fix a class of bugs where KVM can write to guest memory at unexpected times, e.g. during vCPU creation if userspace has set the Xen hypercall MSR index to collide with an MSR that KVM emulates - Restrict the Xen hypercall MSR index to the unofficial synthetic range to reduce the set of possible collisions with MSRs that are emulated by KVM (collisions can still happen as KVM emulates Hyper-V MSRs, which also reside in the synthetic range) - Clean up and optimize KVM's handling of Xen MSR writes and xen_hvm_config - Update Xen TSC leaves during CPUID emulation instead of modifying the CPUID entries when updating PV clocks; there is no guarantee PV clocks will be updated between TSC frequency changes and CPUID emulation, and guest reads of the TSC leaves should be rare, i.e. are not a hot path x86 (Intel): - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1 - Pass XFD_ERR as the payload when injecting #NM, as a preparatory step for upcoming FRED virtualization support - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits, both as a general cleanup and in anticipation of adding support for emulating Mode-Based Execution Control (MBEC) - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest state while KVM is in the middle of emulating nested VM-Enter - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs in anticipation of adding sanity checks for secondary exit controls (the primary field is out of bits) x86 (AMD): - Ensure the PSP driver is initialized when both the PSP and KVM modules are built-in (the initcall framework doesn't handle dependencies) - Use long-term pins when registering encrypted memory regions, so that the pages are migrated out of MIGRATE_CMA/ZONE_MOVABLE and don't lead to excessive fragmentation - Add macros and helpers for setting GHCB return/error codes - Add support for Idle HLT interception, which elides interception if the vCPU has a pending, unmasked virtual IRQ when HLT is executed - Fix a bug in INVPCID emulation where KVM fails to check for a non-canonical address - Don't attempt VMRUN for SEV-ES+ guests if the vCPU's VMSA is invalid, e.g. because the vCPU was "destroyed" via SNP's AP Creation hypercall - Reject SNP AP Creation if the requested SEV features for the vCPU don't match the VM's configured set of features Selftests: - Fix again the Intel PMU counters test; add a data load and do CLFLUSH{OPT} on the data instead of executing code. The theory is that modern Intel CPUs have learned new code prefetching tricks that bypass the PMU counters - Fix a flaw in the Intel PMU counters test where it asserts that an event is counting correctly without actually knowing what the event counts on the underlying hardware - Fix a variety of flaws, bugs, and false failures/passes dirty_log_test, and improve its coverage by collecting all dirty entries on each iteration - Fix a few minor bugs related to handling of stats FDs - Add infrastructure to make vCPU and VM stats FDs available to tests by default (open the FDs during VM/vCPU creation) - Relax an assertion on the number of HLT exits in the xAPIC IPI test when running on a CPU that supports AMD's Idle HLT (which elides interception of HLT if a virtual IRQ is pending and unmasked)" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (216 commits) RISC-V: KVM: Optimize comments in kvm_riscv_vcpu_isa_disable_allowed RISC-V: KVM: Teardown riscv specific bits after kvm_exit LoongArch: KVM: Register perf callbacks for guest LoongArch: KVM: Implement arch-specific functions for guest perf LoongArch: KVM: Add stub for kvm_arch_vcpu_preempted_in_kernel() LoongArch: KVM: Remove PGD saving during VM context switch LoongArch: KVM: Remove unnecessary header include path KVM: arm64: Tear down vGIC on failed vCPU creation KVM: arm64: PMU: Reload when resetting KVM: arm64: PMU: Reload when user modifies registers KVM: arm64: PMU: Fix SET_ONE_REG for vPMC regs KVM: arm64: PMU: Assume PMU presence in pmu-emul.c KVM: arm64: PMU: Set raw values from user to PM{C,I}NTEN{SET,CLR}, PMOVS{SET,CLR} KVM: arm64: Create each pKVM hyp vcpu after its corresponding host vcpu KVM: arm64: Factor out pKVM hyp vcpu creation to separate function KVM: arm64: Initialize HCRX_EL2 traps in pKVM KVM: arm64: Factor out setting HCRX_EL2 traps into separate function KVM: x86: block KVM_CAP_SYNC_REGS if guest state is protected KVM: x86: Add infrastructure for secure TSC KVM: x86: Push down setting vcpu.arch.user_set_tsc ...
2025-03-25Merge tag 'x86_bugs_for_v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 speculation mitigation updates from Borislav Petkov: - Some preparatory work to convert the mitigations machinery to mitigating attack vectors instead of single vulnerabilities - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag - Add support for a Zen5-specific SRSO mitigation - Cleanups and minor improvements * tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2 x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds x86/bugs: Relocate mds/taa/mmio/rfds defines x86/bugs: Add X86_BUG_SPECTRE_V2_USER x86/bugs: Remove X86_FEATURE_USE_IBPB KVM: nVMX: Always use IBPB to properly virtualize IBRS x86/bugs: Use a static branch to guard IBPB on vCPU switch x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set() x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation() x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers x86/bugs: KVM: Add support for SRSO_MSR_FIX
2025-03-25Merge tag 'timers-cleanups-2025-03-23' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer cleanups from Thomas Gleixner: "A treewide hrtimer timer cleanup hrtimers are initialized with hrtimer_init() and a subsequent store to the callback pointer. This turned out to be suboptimal for the upcoming Rust integration and is obviously a silly implementation to begin with. This cleanup replaces the hrtimer_init(T); T->function = cb; sequence with hrtimer_setup(T, cb); The conversion was done with Coccinelle and a few manual fixups. Once the conversion has completely landed in mainline, hrtimer_init() will be removed and the hrtimer::function becomes a private member" * tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits) wifi: rt2x00: Switch to use hrtimer_update_function() io_uring: Use helper function hrtimer_update_function() serial: xilinx_uartps: Use helper function hrtimer_update_function() ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup() RDMA: Switch to use hrtimer_setup() virtio: mem: Switch to use hrtimer_setup() drm/vmwgfx: Switch to use hrtimer_setup() drm/xe/oa: Switch to use hrtimer_setup() drm/vkms: Switch to use hrtimer_setup() drm/msm: Switch to use hrtimer_setup() drm/i915/request: Switch to use hrtimer_setup() drm/i915/uncore: Switch to use hrtimer_setup() drm/i915/pmu: Switch to use hrtimer_setup() drm/i915/perf: Switch to use hrtimer_setup() drm/i915/gvt: Switch to use hrtimer_setup() drm/i915/huc: Switch to use hrtimer_setup() drm/amdgpu: Switch to use hrtimer_setup() stm class: heartbeat: Switch to use hrtimer_setup() i2c: Switch to use hrtimer_setup() iio: Switch to use hrtimer_setup() ...
2025-03-19Merge tag 'kvm-x86-vmx-6.15' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX changes for 6.15 - Fix a bug where KVM unnecessarily reads XFD_ERR from hardware and thus modifies the vCPU's XFD_ERR on a #NM due to CR0.TS=1. - Pass XFD_ERR as a psueo-payload when injecting #NM as a preparatory step for upcoming FRED virtualization support. - Decouple the EPT entry RWX protection bit macros from the EPT Violation bits as a general cleanup, and in anticipation of adding support for emulating Mode-Based Execution (MBEC). - Reject KVM_RUN if userspace manages to gain control and stuff invalid guest state while KVM is in the middle of emulating nested VM-Enter. - Add a macro to handle KVM's sanity checks on entry/exit VMCS control pairs in anticipation of adding sanity checks for secondary exit controls (the primary field is out of bits).
2025-03-19Merge tag 'kvm-x86-misc-6.15' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.15: - Fix a bug in PIC emulation that caused KVM to emit a spurious KVM_REQ_EVENT. - Add a helper to consolidate handling of mp_state transitions, and use it to clear pv_unhalted whenever a vCPU is made RUNNABLE. - Defer runtime CPUID updates until KVM emulates a CPUID instruction, to coalesce updates when multiple pieces of vCPU state are changing, e.g. as part of a nested transition. - Fix a variety of nested emulation bugs, and add VMX support for synthesizing nested VM-Exit on interception (instead of injecting #UD into L2). - Drop "support" for PV Async #PF with proctected guests without SEND_ALWAYS, as KVM can't get the current CPL. - Misc cleanups
2025-03-19Merge tag 'v6.14-rc7' into x86/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-14KVM: TDX: Make TDX VM type supportedIsaku Yamahata
Now all the necessary code for TDX is in place, it's ready to run TDX guest. Advertise the VM type of KVM_X86_TDX_VM so that the user space VMM like QEMU can start to use it. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> --- TDX "the rest" v2: - No change. TDX "the rest" v1: - Move down to the end of patch series. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: KVM: TDX: Always honor guest PAT on TDX enabled guestsYan Zhao
Always honor guest PAT in KVM-managed EPTs on TDX enabled guests by making self-snoop feature a hard dependency for TDX and making quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT not a valid quirk once TDX is enabled. The quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT only affects memory type of KVM-managed EPTs. For the TDX-module-managed private EPT, memory type is always forced to WB now. Honoring guest PAT in KVM-managed EPTs ensures KVM does not invoke kvm_zap_gfn_range() when attaching/detaching non-coherent DMA devices, which would cause mirrored EPTs for TDs to be zapped, leading to the TDX-module-managed private EPT being incorrectly zapped. As a new feature, TDX always comes with support for self-snoop, and does not have to worry about unmodifiable but buggy guests. So, simply ignore KVM_X86_QUIRK_IGNORE_GUEST_PAT on TDX guests just like kvm-amd.ko already does. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20250224071039.31511-1-yan.y.zhao@intel.com> [Only apply to TDX guests. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: remove shadow_memtype_maskPaolo Bonzini
The IGNORE_GUEST_PAT quirk is inapplicable, and thus always-disabled, if shadow_memtype_mask is zero. As long as vmx_get_mt_mask is not called for the shadow paging case, there is no need to consult shadow_memtype_mask and it can be removed altogether. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: x86: Introduce Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PATYan Zhao
Introduce an Intel specific quirk KVM_X86_QUIRK_IGNORE_GUEST_PAT to have KVM ignore guest PAT when this quirk is enabled. On AMD platforms, KVM always honors guest PAT. On Intel however there are two issues. First, KVM *cannot* honor guest PAT if CPU feature self-snoop is not supported. Second, UC access on certain Intel platforms can be very slow[1] and honoring guest PAT on those platforms may break some old guests that accidentally specify video RAM as UC. Those old guests may never expect the slowness since KVM always forces WB previously. See [2]. So, introduce a quirk that KVM can enable by default on all Intel platforms to avoid breaking old unmodifiable guests. Newer userspace can disable this quirk if it wishes KVM to honor guest PAT; disabling the quirk will fail if self-snoop is not supported, i.e. if KVM cannot obey the wish. The quirk is a no-op on AMD and also if any assigned devices have non-coherent DMA. This is not an issue, as KVM_X86_QUIRK_CD_NW_CLEARED is another example of a quirk that is sometimes automatically disabled. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/all/Ztl9NWCOupNfVaCA@yzhao56-desk.sh.intel.com # [1] Link: https://lore.kernel.org/all/87jzfutmfc.fsf@redhat.com # [2] Message-ID: <20250224070946.31482-1-yan.y.zhao@intel.com> [Use supported_quirks/inapplicable_quirks to support both AMD and no-self-snoop cases, as well as to remove the shadow_memtype_mask check from kvm_mmu_may_ignore_guest_pat(). - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Enable guest access to MTRR MSRsBinbin Wu
Allow TDX guests to access MTRR MSRs as what KVM does for normal VMs, i.e., KVM emulates accesses to MTRR MSRs, but doesn't virtualize guest MTRR memory types. TDX module exposes MTRR feature to TDX guests unconditionally. KVM needs to support MTRR MSRs accesses for TDX guests to match the architectural behavior. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-19-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add a method to ignore hypercall patchingIsaku Yamahata
Because guest TD memory is protected, VMM patching guest binary for hypercall instruction isn't possible. Add a method to ignore hypercall patching. Note: guest TD kernel needs to be modified to use TDG.VP.VMCALL for hypercall. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-18-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Ignore setting up mceIsaku Yamahata
Because vmx_set_mce function is VMX specific and it cannot be used for TDX. Add vt stub to ignore setting up mce for TDX. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-17-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add methods to ignore accesses to TSCIsaku Yamahata
TDX protects TDX guest TSC state from VMM. Implement access methods to ignore guest TSC. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-16-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add methods to ignore VMX preemption timerIsaku Yamahata
TDX doesn't support VMX preemption timer. Implement access methods for VMM to ignore VMX preemption timer. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-15-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add method to ignore guest instruction emulationIsaku Yamahata
Skip instruction emulation and let the TDX guest retry for MMIO emulation after installing the MMIO SPTE with suppress #VE bit cleared. TDX protects TDX guest state from VMM, instructions in guest memory cannot be emulated. MMIO emulation is the only case that triggers the instruction emulation code path for TDX guest. The MMIO emulation handling flow as following: - The TDX guest issues a vMMIO instruction. (The GPA must be shared and is not covered by KVM memory slot.) - The default SPTE entry for shared-EPT by KVM has suppress #VE bit set. So EPT violation causes TD exit to KVM. - Trigger KVM page fault handler and install a new SPTE with suppress #VE bit cleared. - Skip instruction emulation and return X86EMU_RETRY_INSTR to let the vCPU retry. - TDX guest re-executes the vMMIO instruction. - TDX guest gets #VE because KVM has cleared #VE suppress bit. - TDX guest #VE handler converts MMIO into TDG.VP.VMCALL<MMIO> Return X86EMU_RETRY_INSTR in the callback check_emulate_instruction() for TDX guests to retry the MMIO instruction. Also, the instruction emulation handling will be skipped, so that the callback check_intercept() will never be called for TDX guest. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-14-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Add methods to ignore accesses to CPU stateIsaku Yamahata
TDX protects TDX guest state from VMM. Implement access methods for TDX guest state to ignore them or return zero. Because those methods can be called by kvm ioctls to set/get cpu registers, they don't have KVM_BUG_ON. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-13-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercallIsaku Yamahata
Implement TDG.VP.VMCALL<GetTdVmCallInfo> hypercall. If the input value is zero, return success code and zero in output registers. TDG.VP.VMCALL<GetTdVmCallInfo> hypercall is a subleaf of TDG.VP.VMCALL to enumerate which TDG.VP.VMCALL sub leaves are supported. This hypercall is for future enhancement of the Guest-Host-Communication Interface (GHCI) specification. The GHCI version of 344426-001US defines it to require input R12 to be zero and to return zero in output registers, R11, R12, R13, and R14 so that guest TD enumerates no enhancement. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-12-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Enable guest access to LMCE related MSRsIsaku Yamahata
Allow TDX guest to configure LMCE (Local Machine Check Event) by handling MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL. MCE and MCA are advertised via cpuid based on the TDX module spec. Guest kernel can access IA32_FEAT_CTL to check whether LMCE is opted-in by the platform or not. If LMCE is opted-in by the platform, guest kernel can access IA32_MCG_EXT_CTL to enable/disable LMCE. Handle MSR IA32_FEAT_CTL and IA32_MCG_EXT_CTL for TDX guests to avoid failure when a guest accesses them with TDG.VP.VMCALL<MSR> on #VE. E.g., Linux guest will treat the failure as a #GP(0). Userspace VMM may not opt-in LMCE by default, e.g., QEMU disables it by default, "-cpu lmce=on" is needed in QEMU command line to opt-in it. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [binbin: rework changelog] Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Message-ID: <20250227012021.1778144-11-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercallIsaku Yamahata
Morph PV RDMSR/WRMSR hypercall to EXIT_REASON_MSR_{READ,WRITE} and wire up KVM backend functions. For complete_emulated_msr() callback, instead of injecting #GP on error, implement tdx_complete_emulated_msr() to set return code on error. Also set return value on MSR read according to the values from kvm x86 registers. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250227012021.1778144-10-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-03-14KVM: TDX: Implement callbacks for MSR operationsIsaku Yamahata
Add functions to implement MSR related callbacks, .set_msr(), .get_msr(), and .has_emulated_msr(), for preparation of handling hypercalls from TDX guest for PV RDMSR and WRMSR. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX. There are three classes of MSR virtualization for TDX. - Non-configurable: TDX module directly virtualizes it. VMM can't configure it, the value set by KVM_SET_MSRS is ignored. - Configurable: TDX module directly virtualizes it. VMM can configure it at VM creation time. The value set by KVM_SET_MSRS is used. - #VE case: TDX guest would issue TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> and VMM handles the MSR hypercall. The value set by KVM_SET_MSRS is used. For the MSRs belonging to the #VE case, the TDX module injects #VE to the TDX guest upon RDMSR or WRMSR. The exact list of such MSRs is defined in TDX Module ABI Spec. Upon #VE, the TDX guest may call TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>, which are defined in GHCI (Guest-Host Communication Interface) so that the host VMM (e.g. KVM) can virtualize the MSRs. TDX doesn't allow VMM to configure interception of MSR accesses. Ignore KVM_REQ_MSR_FILTER_CHANGED for TDX guest. If the userspace has set any MSR filters, it will be applied when handling TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}> in a later patch. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Binbin Wu <binbin.wu@linux.intel.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20250227012021.1778144-9-binbin.wu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>