path: root/arch/x86/kvm/x86.c
2024-11-01  KVM: x86: Disallow changing MSR_PLATFORM_INFO after vCPU has run  (Sean Christopherson)
Tag MSR_PLATFORM_INFO as a feature MSR (because it is), i.e. disallow it from being modified after the vCPU has run. To make KVM's selftest compliant, simply delete the userspace MSR write that restores KVM's original value at the end of the test. Verifying that userspace can write back what it originally read is uninteresting in this particular case, because KVM doesn't enforce _any_ bits in the MSR, i.e. userspace should be able to write any arbitrary value. Link: https://lore.kernel.org/r/20240802185511.305849-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01  KVM: x86: Co-locate initialization of feature MSRs in kvm_arch_vcpu_create()  (Sean Christopherson)
Bunch all of the feature MSR initialization in kvm_arch_vcpu_create() so that it can be easily quirked in a future patch. No functional change intended. Link: https://lore.kernel.org/r/20240802185511.305849-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01  KVM: x86: model canonical checks more precisely  (Maxim Levitsky)
As a result of a recent investigation, it was determined that x86 CPUs which support 5-level paging don't always respect CR4.LA57 when doing canonical checks. In particular:
1. MSRs which contain a linear address allow a full 57-bit canonical address regardless of the CR4.LA57 state. For example: MSR_KERNEL_GS_BASE.
2. All hidden segment bases and GDT/IDT bases also behave like MSRs. This means that a full 57-bit canonical address can be loaded into them regardless of CR4.LA57, both using MSRs (e.g. GS_BASE) and instructions (e.g. LGDT).
3. TLB invalidation instructions also allow the user to use a full 57-bit address regardless of CR4.LA57.
Finally, it must be noted that the CPU doesn't prevent the user from disabling 5-level paging, even when a full 57-bit canonical address is present in one of the registers mentioned above (e.g. the GDT base). In fact, this can happen without any userspace help, when the CPU enters SMM mode - some MSRs, for example MSR_KERNEL_GS_BASE, are left containing an address that is non-canonical with regard to the new mode. Since most of the affected MSRs and all segment bases can be read and written freely by the guest without any KVM intervention, this patch makes the emulator closely follow hardware behavior, which means that the emulator doesn't take into account the guest's CPUID support for 5-level paging, and only takes into account the host CPU's support. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-4-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
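For illustration, a minimal sketch of the rule described above, not KVM's actual code: the X86EMUL_F_* flag names come from the accompanying patches, __is_canonical_address(), cpu_feature_enabled() and kvm_is_cr4_bit_set() are existing kernel helpers, and everything else is a simplified assumption.

    /*
     * Sketch: MSR writes and descriptor table loads are checked against the
     * host's maximum linear-address width, while "normal" linear addresses
     * still honor the guest's CR4.LA57.
     */
    static bool sketch_is_noncanonical(struct kvm_vcpu *vcpu, u64 la,
                                       unsigned int flags)
    {
        u8 virt_bits;

        if (flags & (X86EMUL_F_MSR | X86EMUL_F_DT_LOAD))
            /* Hardware ignores CR4.LA57 for these accesses. */
            virt_bits = cpu_feature_enabled(X86_FEATURE_LA57) ? 57 : 48;
        else
            virt_bits = kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48;

        return !__is_canonical_address(la, virt_bits);
    }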
2024-11-01  KVM: x86: Add X86EMUL_F_MSR and X86EMUL_F_DT_LOAD to aid canonical checks  (Maxim Levitsky)
Add emulation flags for MSR accesses and Descriptor Tables loads, and pass the new flags as appropriate to emul_is_noncanonical_address(). The flags will be used to perform the correct canonical check, as the type of access affects whether or not CR4.LA57 is consulted when determining the canonical bit. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com [sean: split to separate patch, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01  KVM: x86: Route non-canonical checks in emulator through emulate_ops  (Maxim Levitsky)
Add emulate_ops.is_canonical_addr() to perform (non-)canonical checks in the emulator, which will allow extending is_noncanonical_address() to support different flavors of canonical checks, e.g. for descriptor table bases vs. MSRs, without needing duplicate logic in the emulator. No functional change is intended. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906221824.491834-3-mlevitsk@redhat.com [sean: separate from addition of flags, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01  KVM: x86: Use '0' for guest RIP if PMI encounters protected guest state  (Sean Christopherson)
Explicitly return '0' for guest RIP when handling a PMI VM-Exit for a vCPU with protected guest state, i.e. when KVM can't read the real RIP. While there is no "right" value, and profiling a protected guest is rather futile, returning the last known RIP is worse than returning obviously "bad" data. E.g. for SEV-ES+, the last known RIP will often point somewhere in the guest's boot flow. Opportunistically add WARNs to effectively assert that the in_kernel() and get_ip() callbacks are restricted to the common PMI handler, as the return values for the protected guest state case are largely arbitrary, i.e. only make any sense whatsoever for PMIs, where the returned values have no functional impact and thus don't truly matter. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
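As a sketch of the described behavior (an approximation of the perf callback, not necessarily the literal diff; kvm_get_running_vcpu(), kvm_rip_read() and guest_state_protected are existing KVM symbols):

    unsigned long kvm_arch_vcpu_get_ip(struct kvm_vcpu *vcpu)
    {
        /* Assert this is called for the currently running vCPU, i.e. from the PMI path. */
        WARN_ON_ONCE(vcpu != kvm_get_running_vcpu());

        /* KVM can't read the real RIP of a protected guest; return '0'. */
        if (vcpu->arch.guest_state_protected)
            return 0;

        return kvm_rip_read(vcpu);
    }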
2024-11-01  KVM: x86: Bypass register cache when querying CPL from kvm_sched_out()  (Sean Christopherson)
When querying guest CPL to determine if a vCPU was preempted while in kernel mode, bypass the register cache, i.e. always read SS.AR_BYTES from the VMCS on Intel CPUs. If the kernel is running with full preemption enabled, using the register cache in the preemption path can result in stale and/or uninitialized data being cached in the segment cache. In particular the following scenario is currently possible:
- vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset().
- When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES.
- vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache.
As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g. KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Note, the VM-Enter failure can also be avoided by moving the call to vmx_segment_cache_clear() until after the vmx_vcpu_reset() initializes all segments. However, while that change is correct and desirable (and will come along shortly), it does not address the underlying problem that accessing KVM's register caches from !task context is generally unsafe. In addition to fixing the immediate bug, bypassing the cache for this particular case will allow hardening KVM's register caching logic to assert that the caches are accessed only when KVM _knows_ it is safe to do so. Fixes: de63ad4cf497 ("KVM: X86: implement the logic for spinlock optimization") Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Closes: https://lore.kernel.org/all/20240716022014.240960-3-mlevitsk@redhat.com Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20241009175002.1118178-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
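A rough sketch of a cache-bypassing CPL read on Intel; the helper name and the omission of real-mode handling are assumptions, while vmcs_read32(), GUEST_SS_AR_BYTES and VMX_AR_DPL() are existing VMX helpers:

    static int vmx_get_cpl_no_cache(struct kvm_vcpu *vcpu)
    {
        /* Read SS.AR_BYTES straight from the VMCS, skipping the segment cache. */
        u32 ar = vmcs_read32(GUEST_SS_AR_BYTES);

        return VMX_AR_DPL(ar);
    }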
2024-11-01  KVM: x86: Ensure vcpu->mode is loaded from memory in kvm_vcpu_exit_request()  (Sean Christopherson)
Wrap kvm_vcpu_exit_request()'s load of vcpu->mode with READ_ONCE() to ensure the variable is re-loaded from memory, as there is no guarantee the caller provides the necessary annotations to ensure KVM sees a fresh value, e.g. the VM-Exit fastpath could theoretically reuse the pre-VM-Enter value. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240828232013.768446-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
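The post-patch shape is roughly the following sketch (EXITING_GUEST_MODE, kvm_request_pending() and xfer_to_guest_mode_work_pending() are existing kernel symbols):

    static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
    {
        /* READ_ONCE() forces a fresh load of vcpu->mode on every invocation. */
        return READ_ONCE(vcpu->mode) == EXITING_GUEST_MODE ||
               kvm_request_pending(vcpu) ||
               xfer_to_guest_mode_work_pending();
    }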
2024-11-01  KVM: x86: Fix a comment inside __kvm_set_or_clear_apicv_inhibit()  (Kai Huang)
Change svm_vcpu_run() to vcpu_enter_guest() in the comment of __kvm_set_or_clear_apicv_inhibit() so that it reflects reality. When one thread updates the VM's APICv state due to updating the APICv inhibit reasons, it kicks all vCPUs and makes them wait until the new reason has been updated and can be seen by all vCPUs. There was a WARN() in svm_vcpu_run() to make sure the VM's APICv state is consistent with the vCPU's APICv state. Commit ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86") moved that WARN() to the x86 common code vcpu_enter_guest() because the logic is not unique to SVM, and added comments to both __kvm_set_or_clear_apicv_inhibit() and vcpu_enter_guest() to explain this. However, although the comment in __kvm_set_or_clear_apicv_inhibit() mentioned the WARN(), it was not updated to reflect that the WARN() had been moved to x86 common code, i.e. it still mentioned svm_vcpu_run() and not vcpu_enter_guest(). Fix it. Note, after the change the first line that contains vcpu_enter_guest() exceeds 80 characters, but leave it as is to make the diff clean. Fixes: ee49a8932971 ("KVM: x86: Move SVM's APICv sanity check to common x86") Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/e462e7001b8668649347f879c66597d3327dbac2.1728383775.git.kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-01  KVM: x86: Fix a comment inside kvm_vcpu_update_apicv()  (Kai Huang)
The sentence "... so that KVM can the AVIC doorbell to ..." doesn't have a verb. Fix it. After adding the verb 'use', that line exceeds 80 characters. Thus wrap the 'to' to the next line. Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/666e991edf81e1fccfba9466f3fe65965fcba897.1728383775.git.kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-17  Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM VMX changes for 6.12:
- Set FINAL/PAGE in the page fault error code for EPT Violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical.
- Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2.
- Add a lockdep assertion to detect unsafe accesses of vmcs12 structures.
- Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible).
- Minor SGX fix and a cleanup.
2024-09-17  Merge tag 'kvm-x86-pat_vmx_msrs-6.12' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM VMX and x86 PAT MSR macro cleanup for 6.12:
- Add common defines for the x86 architectural memory types, i.e. the types that are shared across PAT, MTRRs, VMCSes, and EPTPs.
- Clean up the various VMX MSR macros to make the code self-documenting (inasmuch as possible), and to make it less painful to add new macros.
2024-09-17  Merge tag 'kvm-x86-mmu-6.12' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM x86 MMU changes for 6.12:
- Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop.
- Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU.
- Refactor pieces of the shadow MMU related to aging SPTEs in preparation for adding MGLRU support in KVM.
- Misc cleanups
2024-09-17  Merge tag 'kvm-x86-misc-6.12' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini)
KVM x86 misc changes for 6.12:
- Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon).
- Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs. This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work).
- Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset.
- Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler.
- Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit.
- Finally fix the RSM vs. nested VM-Enter WARN by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (architecturally, the SHUTDOWN is supposed to hit L1, not L2).
2024-09-17  Merge branch 'kvm-redo-enable-virt' into HEAD  (Paolo Bonzini)
Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed. The primary motivation for this series is to simplify dealing with enabling virtualization for Intel's TDX, which needs to enable virtualization when kvm-intel.ko is loaded, i.e. long before the first VM is created. That said, this is a nice cleanup on its own. By registering the callbacks on-demand, the callbacks themselves don't need to check kvm_usage_count, because their very existence implies a non-zero count. Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a lock ordering issue between cpus_read_lock() and kvm_lock. The lock ordering issue still exists in very rare cases, and will be fixed for good by switching vm_list to an (S)RCU-protected list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-09  KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()  (Sean Christopherson)
Rename reexecute_instruction() to kvm_unprotect_and_retry_on_failure() to make the intent and purpose of the helper much more obvious. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Update retry protection fields when forcing retry on emulation failure  (Sean Christopherson)
When retrying the faulting instruction after emulation failure, refresh the infinite loop protection fields even if no shadow pages were zapped, i.e. avoid hitting an infinite loop even when retrying the instruction as a last-ditch effort to avoid terminating the guest. Link: https://lore.kernel.org/r/20240831001538.336683-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Apply retry protection to "unprotect on failure" path  (Sean Christopherson)
Use kvm_mmu_unprotect_gfn_and_retry() in reexecute_instruction() to pick up protection against infinite loops, e.g. if KVM somehow manages to encounter an unsupported instruction and unprotecting the gfn doesn't allow the vCPU to make forward progress. Other than that, the retry-on-failure logic is a functionally equivalent, open coded version of kvm_mmu_unprotect_gfn_and_retry(). Note, the emulation failure path still isn't fully protected, as KVM won't update the retry protection fields if no shadow pages are zapped (but this change is still a step forward). That flaw will be addressed in a future patch. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn  (Sean Christopherson)
Don't bother unprotecting the target gfn if EMULTYPE_WRITE_PF_TO_SP is set, as KVM will simply report the emulation failure to userspace. This will allow converting reexecute_instruction() to use kvm_mmu_unprotect_gfn_and_retry() instead of kvm_mmu_unprotect_page(). Link: https://lore.kernel.org/r/20240831001538.336683-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Remove manual pfn lookup when retrying #PF after failed emulation  (Sean Christopherson)
Drop the manual pfn lookup when retrying an instruction that KVM failed to emulate in response to a #PF due to a write-protected gfn. Now that KVM sets EMULTYPE_ALLOW_RETRY_PF if and only if the page fault hit a write-protected gfn, i.e. if and only if there's a writable memslot, there's no need to redo the lookup to avoid retrying an instruction that failed on emulated MMIO (no slot, or a write to a read-only slot). I.e. KVM will never attempt to retry an instruction that failed on emulated MMIO, whereas that was not the case prior to the introduction of RET_PF_WRITE_PROTECTED. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Fold retry_instruction() into x86_emulate_instruction()  (Sean Christopherson)
Now that retry_instruction() is reasonably tiny, fold it into its sole caller, x86_emulate_instruction(). In addition to getting rid of the absurdly confusing retry_instruction() name, handling the retry in x86_emulate_instruction() pairs it back up with the code that resets last_retry_{eip,address}. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Move EMULTYPE_ALLOW_RETRY_PF to x86_emulate_instruction()  (Sean Christopherson)
Move the sanity checks for EMULTYPE_ALLOW_RETRY_PF to the top of x86_emulate_instruction(). In addition to deduplicating a small amount of code, this makes the connection between EMULTYPE_ALLOW_RETRY_PF and EMULTYPE_PF even more explicit, and will allow dropping retry_instruction() entirely. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86/mmu: Apply retry protection to "fast nTDP unprotect" path  (Sean Christopherson)
Move the anti-infinite-loop protection provided by last_retry_{eip,addr} into kvm_mmu_write_protect_fault() so that it guards unprotect+retry that never hits the emulator, as well as reexecute_instruction(), which is the last ditch "might as well try it" logic that kicks in when emulation fails on an instruction that faulted on a write-protected gfn. Add a new helper, kvm_mmu_unprotect_gfn_and_retry(), to set the retry fields and deduplicate other code (with more to come). Link: https://lore.kernel.org/r/20240831001538.336683-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Store gpa as gpa_t, not unsigned long, when unprotecting for retry  (Sean Christopherson)
Store the gpa used to unprotect the faulting gfn for retry as a gpa_t, not an unsigned long. This fixes a bug where 32-bit KVM would unprotect and retry the wrong gfn if the gpa had bits 63:32!=0. In practice, this bug is functionally benign, as unprotecting the wrong gfn is purely a performance issue (thanks to the anti-infinite-loop logic). And of course, almost no one runs 32-bit KVM these days. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
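To make the truncation concrete (illustrative values only, not code from the patch):

    u64 cr2_or_gpa = 0x00000001aabbc000ULL;   /* bits 63:32 != 0 */

    unsigned long bad  = cr2_or_gpa;  /* 32-bit kernel: silently truncated to 0xaabbc000 */
    gpa_t         good = cr2_or_gpa;  /* gpa_t is always u64, so the gfn stays correct   */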
2024-09-09  KVM: x86: Get RIP from vCPU state when storing it to last_retry_eip  (Sean Christopherson)
Read RIP from vCPU state instead of pulling it from the emulation context when filling last_retry_eip, which is part of the anti-infinite-loop protection used when unprotecting and retrying instructions that hit a write-protected gfn. This will allow reusing the anti-infinite-loop protection in flows that never make it into the emulator. No functional change intended, as ctxt->eip is set to kvm_rip_read() in init_emulate_ctxt(), and EMULTYPE_PF emulation is mutually exclusive with EMULTYPE_NO_DECODE and EMULTYPE_SKIP, i.e. always goes through x86_decode_emulated_instruction() and hasn't advanced ctxt->eip (yet). Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-09  KVM: x86: Retry to-be-emulated insn in "slow" unprotect path iff sp is zapped  (Sean Christopherson)
Resume the guest and thus skip emulation of a non-PTE-writing instruction if and only if unprotecting the gfn actually zapped at least one shadow page. If the gfn is write-protected for some reason other than shadow paging, attempting to unprotect the gfn will effectively fail, and thus retrying the instruction is all but guaranteed to be pointless. This bug has existed for a long time, but was effectively fudged around by the retry RIP+address anti-loop detection. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240831001538.336683-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
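A sketch of the intent, not the literal diff (kvm_mmu_unprotect_page() and gpa_to_gfn() are existing helpers):

    /* Resume the guest only if unprotecting actually zapped a shadow page. */
    if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
        return 1;

    /*
     * Nothing was zapped, i.e. the gfn is write-protected for some reason
     * other than shadow paging; retrying the instruction would be pointless,
     * so fall through to emulation instead.
     */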
2024-09-09  KVM: x86: Forcibly leave nested if RSM to L2 hits shutdown  (Sean Christopherson)
Leave nested mode before synthesizing shutdown (a.k.a. TRIPLE_FAULT) if RSM fails when resuming L2 (a.k.a. guest mode). Architecturally, shutdown on RSM occurs _before_ the transition back to guest mode on both Intel and AMD. On Intel, per the SDM pseudocode, SMRAM state is loaded before critical VMX state:

  restore state normally from SMRAM;
  ...
  CR4.VMXE := value stored internally;
  IF internal storage indicates that the logical processor had been in VMX operation (root or non-root)
  THEN
      enter VMX operation (root or non-root);
      restore VMX-critical state as defined in Section 32.14.1;
      ...
      restore current VMCS pointer;
  FI;

AMD's APM is both less clearcut and more explicit. Because AMD CPUs save VMCB and guest state in SMRAM itself, given the lack of anything in the APM to indicate a shutdown in guest mode is possible, a straightforward reading of the clause on invalid state is that _what_ state is invalid is irrelevant, i.e. all roads lead to shutdown.

  An RSM causes a processor shutdown if an invalid-state condition is found in the SMRAM state-save area.

This fixes a bug found by syzkaller where synthesizing shutdown for L2 led to a nested VM-Exit (if L1 is intercepting shutdown), which in turn caused KVM to complain about trying to cancel a nested VM-Enter (see commit 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM"). Note, Paolo pointed out that KVM shouldn't set nested_run_pending until after loading SMRAM state. But as above, that's only half the story, KVM shouldn't transition to guest mode either. Unfortunately, fixing that mess requires rewriting the nVMX and nSVM RSM flows to not piggyback their nested VM-Enter flows, as executing the nested VM-Enter flows after loading state from SMRAM would clobber much of said state. For now, add a FIXME to call out that transitioning to guest mode before loading state from SMRAM is wrong. Link: https://lore.kernel.org/all/CABgObfYaUHXyRmsmg8UjRomnpQ0Jnaog9-L2gMjsjkqChjDYUQ@mail.gmail.com Reported-by: syzbot+988d9efcdf137bc05f66@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@google.com Reported-by: Zheyu Ma <zheyuma97@gmail.com> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com Analyzed-by: Michal Wilczynski <michal.wilczynski@intel.com> Cc: Kishen Maloor <kishen.maloor@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20240906161337.1118412-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-09-06  KVM: x86: don't fall through case statements without annotations  (Linus Torvalds)
clang warns on this because it has an unannotated fall-through between cases: arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough] and while we could annotate it as a fallthrough, the proper fix is to just add the break for this case, instead of falling through to the default case and the break there. gcc also has that warning, but it looks like gcc only warns for the cases where they fall through to "real code", rather than to just a break. Odd. Fixes: d30d9ee94cc0 ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM") Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Tom Dohrmann <erbse.13@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
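A generic illustration of the two patterns the warning accepts (hypothetical cap names, not KVM's actual switch):

    switch (cap) {
    case SOME_CAP:
        r = some_cap_value();
        break;              /* explicit break: no warning */
    case OTHER_CAP:
        prep_other_cap();
        fallthrough;        /* annotated fall-through: also no warning */
    default:
        r = default_cap_value();
        break;
    }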
2024-09-04KVM: x86: Register "emergency disable" callbacks when virt is enabledSean Christopherson
Register the "disable virtualization in an emergency" callback just before KVM enables virtualization in hardware, as there is no functional need to keep the callbacks registered while KVM happens to be loaded, but is inactive, i.e. if KVM hasn't enabled virtualization. Note, unregistering the callback every time the last VM is destroyed could have measurable latency due to the synchronize_rcu() needed to ensure all references to the callback are dropped before KVM is unloaded. But the latency should be a small fraction of the total latency of disabling virtualization across all CPUs, and userspace can set enable_virt_at_load to completely eliminate the runtime overhead. Add a pointer in kvm_x86_ops to allow vendor code to provide its callback. There is no reason to force vendor code to do the registration, and either way KVM would need a new kvm_x86_ops hook. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04  KVM: x86: Rename virtualization {en,dis}abling APIs to match common KVM  (Sean Christopherson)
Rename x86's per-CPU vendor hooks used to enable virtualization in hardware to align with the recently renamed arch hooks. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-04  KVM: Rename arch hooks related to per-CPU virtualization enabling  (Sean Christopherson)
Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-02  KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM  (Tom Dohrmann)
Until recently, KVM_CAP_READONLY_MEM was unconditionally supported on x86, but this is no longer the case for SEV-ES and SEV-SNP VMs. When KVM_CHECK_EXTENSION is invoked on a VM, only advertise KVM_CAP_READONLY_MEM when it's actually supported. Fixes: 66155de93bcf ("KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)") Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michael Roth <michael.roth@amd.com> Signed-off-by: Tom Dohrmann <erbse.13@gmx.de> Message-ID: <20240902144219.3716974-1-erbse.13@gmx.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
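A sketch of the check in kvm_vm_ioctl_check_extension(), under the assumption that the per-VM restriction is keyed off the SEV-ES/SNP "protected state" flag; the exact condition in the real patch may differ:

    case KVM_CAP_READONLY_MEM:
        /* System-level query: report 1; VM-level query: only if this VM supports it. */
        r = kvm ? !kvm->arch.has_protected_state : 1;
        break;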
2024-08-29  KVM: x86: Add fastpath handling of HLT VM-Exits  (Sean Christopherson)
Add a fastpath for HLT VM-Exits by immediately re-entering the guest if it has a pending wake event. When virtual interrupt delivery is enabled, i.e. when KVM doesn't need to manually inject interrupts, this allows KVM to stay in the fastpath run loop when a vIRQ arrives between the guest doing CLI and STI;HLT. Without AMD's Idle HLT-intercept support, the CPU generates a HLT VM-Exit even though KVM will immediately resume the guest. Note, on bare metal, it's relatively uncommon for a modern guest kernel to actually trigger this scenario, as the window between the guest checking for a wake event and committing to HLT is quite small. But in a nested environment, the timings change significantly, e.g. rudimentary testing showed that ~50% of HLT exits where HLT-polling was successful would be serviced by this fastpath, i.e. ~50% of the time that a nested vCPU gets a wake event before KVM schedules out the vCPU, the wake event was pending even before the VM-Exit. Link: https://lore.kernel.org/all/20240528041926.3989-3-manali.shukla@amd.com Link: https://lore.kernel.org/r/20240802195120.325560-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
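The fastpath described above looks roughly like the sketch below; the enum names follow the series and the SRCU/halt helpers are existing KVM symbols, but the details may differ from the actual code:

    static fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu)
    {
        int ret;

        kvm_vcpu_srcu_read_lock(vcpu);
        ret = kvm_emulate_halt(vcpu);
        kvm_vcpu_srcu_read_unlock(vcpu);

        if (!ret)
            return EXIT_FASTPATH_EXIT_USERSPACE;

        /* A wake event is already pending; go straight back into the guest. */
        if (kvm_vcpu_has_events(vcpu))
            return EXIT_FASTPATH_REENTER_GUEST;

        return EXIT_FASTPATH_EXIT_HANDLED;
    }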
2024-08-29  KVM: x86: Reorganize code in x86.c to co-locate vCPU blocking/running helpers  (Sean Christopherson)
Shuffle code around in x86.c so that the various helpers related to vCPU blocking/running logic are (a) located near each other and (b) ordered so that HLT emulation can use kvm_vcpu_has_events() in a future path. No functional change intended. Link: https://lore.kernel.org/r/20240802195120.325560-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29  KVM: x86: Exit to userspace if fastpath triggers one on instruction skip  (Sean Christopherson)
Exit to userspace if a fastpath handler triggers such an exit, which can happen when skipping the instruction, e.g. due to userspace single-stepping the guest via KVM_GUESTDBG_SINGLESTEP or because of an emulation failure. Fixes: 404d5d7bff0d ("KVM: X86: Introduce more exit_fastpath_completion enum values") Link: https://lore.kernel.org/r/20240802195120.325560-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29  KVM: x86: Dedup fastpath MSR post-handling logic  (Sean Christopherson)
Now that the WRMSR fastpaths for x2APIC_ICR and TSC_DEADLINE are identical, ignoring the backend MSR handling, consolidate the common bits of skipping the instruction and setting the return value. No functional change intended. Link: https://lore.kernel.org/r/20240802195120.325560-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-29  KVM: x86: Re-enter guest if WRMSR(X2APIC_ICR) fastpath is successful  (Sean Christopherson)
Re-enter the guest in the fastpath if WRMSR emulation for x2APIC's ICR is successful, as no additional work is needed, i.e. there is no code unique for WRMSR exits between the fastpath and the "!= EXIT_FASTPATH_NONE" check in __vmx_handle_exit(). Link: https://lore.kernel.org/r/20240802195120.325560-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Suppress userspace access failures on unsupported, "emulated" MSRs  (Sean Christopherson)
Extend KVM's suppression of userspace MSR access failures to MSRs that KVM reports as emulated, but are ultimately unsupported, e.g. if the VMX MSRs are emulated by KVM, but are unsupported given the vCPU model. Suggested-by: Weijiang Yang <weijiang.yang@intel.com> Reviewed-by: Weijiang Yang <weijiang.yang@intel.com> Link: https://lore.kernel.org/r/20240802181935.292540-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Suppress failures on userspace access to advertised, unsupported MSRs  (Sean Christopherson)
Extend KVM's suppression of failures due to a userspace access to an unsupported, but advertised as a "to save" MSR to all MSRs, not just those that happen to reach the default case statements in kvm_get_msr_common() and kvm_set_msr_common(). KVM's soon-to-be-established ABI is that if an MSR is advertised to userspace, then userspace is allowed to read the MSR, and write back the value that was read, i.e. why an MSR is unsupported doesn't change KVM's ABI. Practically speaking, this is very nearly a nop, as the only other paths that return KVM_MSR_RET_UNSUPPORTED are {svm,vmx}_get_feature_msr(), and it's unlikely, though not impossible, that userspace is using KVM_GET_MSRS on unsupported MSRs. The primary goal of moving the suppression to common code is to allow returning KVM_MSR_RET_UNSUPPORTED as appropriate throughout KVM, without having to manually handle the "is userspace accessing an advertised" waiver. I.e. this will allow formalizing KVM's ABI without incurring a high maintenance cost. Link: https://lore.kernel.org/r/20240802181935.292540-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Hoist x86.c's global msr_* variables up above kvm_do_msr_access()  (Sean Christopherson)
Move the definitions of the various MSR arrays above kvm_do_msr_access() so that kvm_do_msr_access() can query the arrays when handling failures, e.g. to squash errors if userspace tries to read an MSR that isn't fully supported, but that KVM advertised as being an MSR-to-save. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Funnel all fancy MSR return value handling into a common helper  (Sean Christopherson)
Add a common helper, kvm_do_msr_access(), to invoke the "leaf" APIs that are type and access specific, and more importantly to handle errors that are returned from the leaf APIs. I.e. turn kvm_msr_ignored_check() from a helper that is called on an error, into a trampoline that detects errors *and* applies relevant side effects, e.g. logging unimplemented accesses. Because the leaf APIs are used for guest accesses, userspace accesses, and KVM accesses, and because KVM supports restricting access to MSRs from userspace via filters, the error handling is subtly non-trivial. E.g. KVM has had at least one bug escape due to making each "outer" function handle errors. See commit 3376ca3f1a20 ("KVM: x86: Fix KVM_GET_MSRS stack info leak"). Link: https://lore.kernel.org/r/20240802181935.292540-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
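A sketch of the funnel pattern: the KVM_MSR_RET_UNSUPPORTED code is from this series, while the signature and the advertised-MSR helper are simplified assumptions rather than KVM's actual interface.

    static int kvm_do_msr_access(struct kvm_vcpu *vcpu, u32 msr, u64 *data,
                                 bool host_initiated, bool write,
                                 int (*leaf_op)(struct kvm_vcpu *, u32, u64 *, bool))
    {
        int ret = leaf_op(vcpu, msr, data, host_initiated);

        if (!ret)
            return 0;

        /*
         * Single choke point for side effects: squash errors for userspace
         * accesses to advertised-but-unsupported MSRs, log the rest.
         */
        if (ret == KVM_MSR_RET_UNSUPPORTED &&
            host_initiated && msr_is_advertised(msr))  /* assumed helper */
            return 0;

        return ret;
    }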
2024-08-22  KVM: x86: Refactor kvm_get_feature_msr() to avoid struct kvm_msr_entry  (Sean Christopherson)
Refactor kvm_get_feature_msr() to take the components of kvm_msr_entry as separate parameters, along with a vCPU pointer, i.e. to give it the same prototype as kvm_{g,s}et_msr_ignored_check(). This will allow using a common inner helper for handling accesses to "regular" and feature MSRs. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Rename get_msr_feature() APIs to get_feature_msr()  (Sean Christopherson)
Rename all APIs related to feature MSRs from get_msr_feature() to get_feature_msr(). The APIs get "feature MSRs", not "MSR features". And unlike kvm_{g,s}et_msr_common(), the "feature" adjective doesn't describe the helper itself. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Refactor kvm_x86_ops.get_msr_feature() to avoid kvm_msr_entry  (Sean Christopherson)
Refactor get_msr_feature() to take the index and data pointer as distinct parameters in anticipation of eliminating "struct kvm_msr_entry" usage further up the primary callchain. No functional change intended. Link: https://lore.kernel.org/r/20240802181935.292540-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Rename KVM_MSR_RET_INVALID to KVM_MSR_RET_UNSUPPORTED  (Sean Christopherson)
Rename the "INVALID" internal MSR error return code to "UNSUPPORTED" to try and make it more clear that access was denied because the MSR itself is unsupported/unknown. "INVALID" is too ambiguous, as it could just as easily mean the value for WRMSR as invalid. Avoid UNKNOWN and UNIMPLEMENTED, as the error code is used for MSRs that _are_ actually implemented by KVM, e.g. if the MSR is unsupported because an associated feature flag is not present in guest CPUID. Opportunistically beef up the comments for the internal MSR error codes. Link: https://lore.kernel.org/r/20240802181935.292540-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: x86: Use this_cpu_ptr() in kvm_user_return_msr_cpu_online  (Li Chen)
Use this_cpu_ptr() instead of open coding the equivalent in kvm_user_return_msr_cpu_online. Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Link: https://lore.kernel.org/r/87zfp96ojk.wl-me@linux.beauty Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-08-22  KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-Exit  (Sean Christopherson)
Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent. More importantly, the behavior is well-defined _and_ can be communicated to the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior as the access will hit KVM's internal state, which is likely not up-to-date. Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace *really* wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access. This partially reverts commit ac8d6cad3c7b ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr"). Cc: Hou Wenlong <houwenlong.hwl@antgroup.com> Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
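A sketch of how the filter could be consulted for a VMCS list entry; kvm_msr_allowed() and the KVM_MSR_FILTER_* types are existing KVM symbols, while the wrapper itself is purely illustrative:

    static bool nested_msr_list_entry_allowed(struct kvm_vcpu *vcpu, u32 index,
                                              bool write)
    {
        u32 type = write ? KVM_MSR_FILTER_WRITE : KVM_MSR_FILTER_READ;

        /* Denied => fail the MSR load/store architecturally (consistency check / VM-Abort). */
        return kvm_msr_allowed(vcpu, index, type);
    }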
2024-08-22  KVM: x86: Stuff vCPU's PAT with default value at RESET, not creation  (Sean Christopherson)
Move the stuffing of the vCPU's PAT to the architectural "default" value from kvm_arch_vcpu_create() to kvm_vcpu_reset(), guarded by !init_event, to better capture that the default value is the value "Following Power-up or Reset". E.g. setting PAT only during creation would break if KVM were to expose a RESET ioctl() to userspace (which is unlikely, but that's not a good reason to have unintuitive code). No functional change. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Zhao Liu <zhao1.liu@intel.com> Link: https://lore.kernel.org/r/20240605231918.2915961-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
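The gist, as a sketch inside kvm_vcpu_reset(); the literal value is the SDM's power-up PAT encoding (WB, WT, UC-, UC repeated), for which KVM uses a named macro:

    if (!init_event)
        /* PAT is architecturally reset only on RESET/power-up, not on INIT. */
        vcpu->arch.pat = 0x0007040600070406ULL;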
2024-08-22  KVM: x86: Acquire kvm->srcu when handling KVM_SET_VCPU_EVENTS  (Sean Christopherson)
Grab kvm->srcu when processing KVM_SET_VCPU_EVENTS, as KVM will forcibly leave nested VMX/SVM if SMM mode is being toggled, and leaving nested VMX reads guest memory. Note, kvm_vcpu_ioctl_x86_set_vcpu_events() can also be called from KVM_RUN via sync_regs(), which already holds SRCU. I.e. trying to precisely use kvm_vcpu_srcu_read_lock() around the problematic SMM code would cause problems. Acquiring SRCU isn't all that expensive, so for simplicity, grab it unconditionally for KVM_SET_VCPU_EVENTS.

  =============================
  WARNING: suspicious RCU usage
  6.10.0-rc7-332d2c1d713e-next-vm #552 Not tainted
  -----------------------------
  include/linux/kvm_host.h:1027 suspicious rcu_dereference_check() usage!

  other info that might help us debug this:
  rcu_scheduler_active = 2, debug_locks = 1
  1 lock held by repro/1071:
   #0: ffff88811e424430 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x7d/0x970 [kvm]

  stack backtrace:
  CPU: 15 PID: 1071 Comm: repro Not tainted 6.10.0-rc7-332d2c1d713e-next-vm #552
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  Call Trace:
   <TASK>
   dump_stack_lvl+0x7f/0x90
   lockdep_rcu_suspicious+0x13f/0x1a0
   kvm_vcpu_gfn_to_memslot+0x168/0x190 [kvm]
   kvm_vcpu_read_guest+0x3e/0x90 [kvm]
   nested_vmx_load_msr+0x6b/0x1d0 [kvm_intel]
   load_vmcs12_host_state+0x432/0xb40 [kvm_intel]
   vmx_leave_nested+0x30/0x40 [kvm_intel]
   kvm_vcpu_ioctl_x86_set_vcpu_events+0x15d/0x2b0 [kvm]
   kvm_arch_vcpu_ioctl+0x1107/0x1750 [kvm]
   ? mark_held_locks+0x49/0x70
   ? kvm_vcpu_ioctl+0x7d/0x970 [kvm]
   ? kvm_vcpu_ioctl+0x497/0x970 [kvm]
   kvm_vcpu_ioctl+0x497/0x970 [kvm]
   ? lock_acquire+0xba/0x2d0
   ? find_held_lock+0x2b/0x80
   ? do_user_addr_fault+0x40c/0x6f0
   ? lock_release+0xb7/0x270
   __x64_sys_ioctl+0x82/0xb0
   do_syscall_64+0x6c/0x170
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7ff11eb1b539
   </TASK>

Fixes: f7e570780efc ("KVM: x86: Forcibly leave nested virt when SMM state is toggled") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240723232055.3643811-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
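The fix amounts to wrapping the ioctl handler in an SRCU read-side critical section, roughly:

    case KVM_SET_VCPU_EVENTS: {
        struct kvm_vcpu_events events;

        r = -EFAULT;
        if (copy_from_user(&events, argp, sizeof(events)))
            break;

        /* Leaving nested VMX/SVM reads guest memory, which requires SRCU. */
        kvm_vcpu_srcu_read_lock(vcpu);
        r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);
        kvm_vcpu_srcu_read_unlock(vcpu);
        break;
    }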
2024-08-13  KVM: x86: Use this_cpu_ptr() instead of per_cpu_ptr(smp_processor_id())  (Isaku Yamahata)
Use this_cpu_ptr() instead of open coding the equivalent in various user return MSR helpers. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Message-ID: <20240802201630.339306-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
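In other words, a sketch (user_return_msrs is the per-CPU variable in x86.c):

    /* Before: open coded equivalent */
    msrs = per_cpu_ptr(user_return_msrs, smp_processor_id());

    /* After: idiomatic and identical in behavior */
    msrs = this_cpu_ptr(user_return_msrs);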