summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2023-08-31KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callersSean Christopherson
Get/put references to KVM when a page-track notifier is (un)registered instead of relying on the caller to do so. Forcing the caller to do the bookkeeping is unnecessary and adds one more thing for users to get wrong, e.g. see commit 9ed1fdee9ee3 ("drm/i915/gvt: Get reference to KVM iff attachment to VM is successful"). Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Drop @slot param from exported/external page-track APIsSean Christopherson
Refactor KVM's exported/external page-track, a.k.a. write-track, APIs to take only the gfn and do the required memslot lookup in KVM proper. Forcing users of the APIs to get the memslot unnecessarily bleeds KVM internals into KVMGT and complicates usage of the APIs. No functional change intended. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Bug the VM if write-tracking is used but not enabledSean Christopherson
Bug the VM if something attempts to write-track a gfn, but write-tracking isn't enabled. The VM is doomed (and KVM has an egregious bug) if KVM or KVMGT wants to shadow guest page tables but can't because write-tracking isn't enabled. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Assert that correct locks are held for page write-trackingSean Christopherson
When adding/removing gfns to/from write-tracking, assert that mmu_lock is held for write, and that either slots_lock or kvm->srcu is held. mmu_lock must be held for write to protect gfn_write_track's refcount, and SRCU or slots_lock must be held to protect the memslot itself. Tested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Rename page-track APIs to reflect the new realitySean Christopherson
Rename the page-track APIs to capture that they're all about tracking writes, now that the facade of supporting multiple modes is gone. Opportunstically replace "slot" with "gfn" in anticipation of removing the @slot param from the external APIs. No functional change intended. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Drop infrastructure for multiple page-track modesSean Christopherson
Drop "support" for multiple page-track modes, as there is no evidence that array-based and refcounted metadata is the optimal solution for other modes, nor is there any evidence that other use cases, e.g. for access-tracking, will be a good fit for the page-track machinery in general. E.g. one potential use case of access-tracking would be to prevent guest access to poisoned memory (from the guest's perspective). In that case, the number of poisoned pages is likely to be a very small percentage of the guest memory, and there is no need to reference count the number of access-tracking users, i.e. expanding gfn_track[] for a new mode would be grossly inefficient. And for poisoned memory, host userspace would also likely want to trap accesses, e.g. to inject #MC into the guest, and that isn't currently supported by the page-track framework. A better alternative for that poisoned page use case is likely a variation of the proposed per-gfn attributes overlay (linked), which would allow efficiently tracking the sparse set of poisoned pages, and by default would exit to userspace on access. Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Ben Gardon <bgardon@google.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Use page-track notifiers iff there are external usersSean Christopherson
Disable the page-track notifier code at compile time if there are no external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n. KVM itself now hooks emulated writes directly instead of relying on the page-track mechanism. Provide a stub for "struct kvm_page_track_notifier_node" so that including headers directly from the command line, e.g. for testing include guards, doesn't fail due to a struct having an incomplete type. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Move KVM-only page-track declarations to internal headerSean Christopherson
Bury the declaration of the page-track helpers that are intended only for internal KVM use in a "private" header. In addition to guarding against unwanted usage of the internal-only helpers, dropping their definitions avoids exposing other structures that should be KVM-internal, e.g. for memslots. This is a baby step toward making kvm_host.h a KVM-internal header in the very distant future. Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86: Remove the unused page-track hook track_flush_slot()Yan Zhao
Remove ->track_remove_slot(), there are no longer any users and it's unlikely a "flush" hook will ever be the correct API to provide to an external page-track user. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86: Add a new page-track hook to handle memslot deletionYan Zhao
Add a new page-track hook, track_remove_region(), that is called when a memslot DELETE operation is about to be committed. The "remove" hook will be used by KVMGT and will effectively replace the existing track_flush_slot() altogether now that KVM itself doesn't rely on the "flush" hook either. The "flush" hook is flawed as it's invoked before the memslot operation is guaranteed to succeed, i.e. KVM might ultimately keep the existing memslot without notifying external page track users, a.k.a. KVMGT. In practice, this can't currently happen on x86, but there are no guarantees that won't change in the future, not to mention that "flush" does a very poor job of describing what is happening. Pass in the gfn+nr_pages instead of the slot itself so external users, i.e. KVMGT, don't need to exposed to KVM internals (memslots). This will help set the stage for additional cleanups to the page-track APIs. Opportunistically align the existing srcu_read_lock_held() usage so that the new case doesn't stand out like a sore thumb (and not aligning the new code makes bots unhappy). Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230729013535.1070024-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86: Reject memslot MOVE operations if KVMGT is attachedSean Christopherson
Disallow moving memslots if the VM has external page-track users, i.e. if KVMGT is being used to expose a virtual GPU to the guest, as KVMGT doesn't correctly handle moving memory regions. Note, this is potential ABI breakage! E.g. userspace could move regions that aren't shadowed by KVMGT without harming the guest. However, the only known user of KVMGT is QEMU, and QEMU doesn't move generic memory regions. KVM's own support for moving memory regions was also broken for multiple years (albeit for an edge case, but arguably moving RAM is itself an edge case), e.g. see commit edd4fa37baa6 ("KVM: x86: Allocate new rmap and large page tracking when moving memslot"). Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: drm/i915/gvt: Drop @vcpu from KVM's ->track_write() hookSean Christopherson
Drop @vcpu from KVM's ->track_write() hook provided for external users of the page-track APIs now that KVM itself doesn't use the page-track mechanism. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Reviewed-by: Zhi Wang <zhi.a.wang@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Don't bounce through page-track mechanism for guest PTEsSean Christopherson
Don't use the generic page-track mechanism to handle writes to guest PTEs in KVM's MMU. KVM's MMU needs access to information that should not be exposed to external page-track users, e.g. KVM needs (for some definitions of "need") the vCPU to query the current paging mode, whereas external users, i.e. KVMGT, have no ties to the current vCPU and so should never need the vCPU. Moving away from the page-track mechanism will allow dropping use of the page-track mechanism for KVM's own MMU, and will also allow simplifying and cleaning up the page-track APIs. Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Don't rely on page-track mechanism to flush on memslot changeSean Christopherson
Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of bouncing through the page-track mechanism. KVM (unfortunately) needs to zap and flush all page tables on memslot DELETE/MOVE irrespective of whether KVM is shadowing guest page tables. This will allow changing KVM to register a page-track notifier on the first shadow root allocation, and will also allow deleting the misguided kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a different method for reacting to memslot changes. No functional change intended. Cc: Yan Zhao <yan.y.zhao@intel.com> Link: https://lore.kernel.org/r/20221110014821.1548347-2-seanjc@google.com Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Move kvm_arch_flush_shadow_{all,memslot}() to mmu.cSean Christopherson
Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only for kvm_arch_flush_shadow_all(). This will allow refactoring kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping everything in mmu.c will also likely simplify supporting TDX, which intends to do zap only relevant SPTEs on memslot updates. No functional change intended. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yongwei Ma <yongwei.ma@intel.com> Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: BUG() in rmap helpers iff CONFIG_BUG_ON_DATA_CORRUPTION=ySean Christopherson
Introduce KVM_BUG_ON_DATA_CORRUPTION() and use it in the low-level rmap helpers to convert the existing BUG()s to WARN_ON_ONCE() when the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() on corruption of host kernel data structures. Environments that don't have infrastructure to automatically capture crash dumps, i.e. aren't likely to enable CONFIG_BUG_ON_DATA_CORRUPTION=y, are typically better served overall by WARN-and-continue behavior (for the kernel, the VM is dead regardless), as a BUG() while holding mmu_lock all but guarantees the _best_ case scenario is a panic(). Make the BUG()s conditional instead of removing/replacing them entirely as there's a non-zero chance (though by no means a guarantee) that the damage isn't contained to the target VM, e.g. if no rmap is found for a SPTE then KVM may be double-zapping the SPTE, i.e. has already freed the memory the SPTE pointed at and thus KVM is reading/writing memory that KVM no longer owns. Link: https://lore.kernel.org/all/20221129191237.31447-1-mizhang@google.com Suggested-by: Mingwei Zhang <mizhang@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Mingwei Zhang <mizhang@google.com> Link: https://lore.kernel.org/r/20230729004722.1056172-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Plumb "struct kvm" all the way to pte_list_remove()Mingwei Zhang
Plumb "struct kvm" all the way to pte_list_remove() to allow the usage of KVM_BUG() and/or KVM_BUG_ON(). This will allow killing only the offending VM instead of doing BUG() if the kernel is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() if KVM's data structures (rmaps) appear to be corrupted. Signed-off-by: Mingwei Zhang <mizhang@google.com> [sean: tweak changelog] Link: https://lore.kernel.org/r/20230729004722.1056172-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Use BUILD_BUG_ON_INVALID() for KVM_MMU_WARN_ON() stubSean Christopherson
Use BUILD_BUG_ON_INVALID() instead of an empty do-while loop to stub out KVM_MMU_WARN_ON() when CONFIG_KVM_PROVE_MMU=n, that way _some_ build issues with the usage of KVM_MMU_WARN_ON() will be dected even if the kernel is using the stubs, e.g. basic syntax errors will be detected. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Replace MMU_DEBUG with proper KVM_PROVE_MMU KconfigSean Christopherson
Replace MMU_DEBUG, which requires manually modifying KVM to enable the macro, with a proper Kconfig, KVM_PROVE_MMU. Now that pgprintk() and rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and won't flood the kernel logs, enabling the option for debug kernels is both desirable and feasible. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Bug the VM if a vCPU ends up in long mode without PAE enabledSean Christopherson
Promote the ASSERT(), which is quite dead code in KVM, into a KVM_BUG_ON() for KVM's sanity check that CR4.PAE=1 if the vCPU is in long mode when performing a walk of guest page tables. The sanity is quite cheap since neither EFER nor CR4.PAE requires a VMREAD, especially relative to the cost of walking the guest page tables. More importantly, the sanity check would have prevented the true badness fixed by commit 112e66017bff ("KVM: nVMX: add missing consistency checks for CR0 and CR4"). The missed consistency check resulted in some versions of KVM corrupting the on-stack guest_walker structure due to KVM thinking there are 4/5 levels of page tables, but wiring up the MMU hooks to point at the paging32 implementation, which only allocates space for two levels of page tables in "struct guest_walker32". Queue a page fault for injection if the assertion fails, as both callers, FNAME(gva_to_gpa) and FNAME(walk_addr_generic), assume that walker.fault contains sane info on a walk failure. E.g. not populating the fault info could result in KVM consuming and/or exposing uninitialized stack data before the vCPU is kicked out to userspace, which doesn't happen until KVM checks for KVM_REQ_VM_DEAD on the next enter. Move the check below the initialization of "pte_access" so that the aforementioned to-be-injected page fault doesn't consume uninitialized stack data. The information _shouldn't_ reach the guest or userspace, but there's zero downside to being paranoid in this case. Link: https://lore.kernel.org/r/20230729004722.1056172-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Convert "runtime" WARN_ON() assertions to WARN_ON_ONCE()Sean Christopherson
Convert all "runtime" assertions, i.e. assertions that can be triggered while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the MMU that is tied to running vCPUs, i.e. not contained to loading and initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if KVM ends up with a bug that causes a root to be invalidated before the page fault handler is invoked, pretty much _every_ page fault VM-Exit triggers the WARN. If a WARN is triggered frequently, the resulting spam usually causes a lot of damage of its own, e.g. consumes resources to log the WARN and pollutes the kernel log, often to the point where other useful information can be lost. In many case, the damage caused by the spam is actually worse than the bug itself, e.g. KVM can almost always recover from an unexpectedly invalid root. On the flip side, warning every time is rarely helpful for debug and triage, i.e. a single splat is usually sufficient to point a debugger in the right direction, and automated testing, e.g. syzkaller, typically runs with warn_on_panic=1, i.e. will never get past the first WARN anyways. Lastly, when an assertions fails multiple times, the stack traces in KVM are almost always identical, i.e. the full splat only needs to be captured once. And _if_ there is value in captruing information about the failed assert, a ratelimited printk() is sufficient and less likely to rack up a large amount of collateral damage. Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Rename MMU_WARN_ON() to KVM_MMU_WARN_ON()Sean Christopherson
Rename MMU_WARN_ON() to make it super obvious that the assertions are all about KVM's MMU, not the primary MMU. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Cleanup sanity check of SPTEs at SP freeSean Christopherson
Massage the error message for the sanity check on SPTEs when freeing a shadow page to be more verbose, and to print out all shadow-present SPTEs, not just the first SPTE encountered. Printing all SPTEs can be quite valuable for debug, e.g. highlights whether the leak is a one-off or widepsread, or possibly the result of memory corruption (something else in the kernel stomping on KVM's SPTEs). Opportunistically move the MMU_WARN_ON() into the helper itself, which will allow a future cleanup to use BUILD_BUG_ON_INVALID() as the stub for MMU_WARN_ON(). BUILD_BUG_ON_INVALID() works as intended and results in the compiler complaining about is_empty_shadow_page() not being declared. Link: https://lore.kernel.org/r/20230729004722.1056172-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Avoid pointer arithmetic when iterating over SPTEsSean Christopherson
Replace the pointer arithmetic used to iterate over SPTEs in is_empty_shadow_page() with more standard interger-based iteration. No functional change intended. Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230729004722.1056172-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Delete the "dbg" module paramSean Christopherson
Delete KVM's "dbg" module param now that its usage in KVM is gone (it used to guard pgprintk() and rmap_printk()). Link: https://lore.kernel.org/r/20230729004722.1056172-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Delete rmap_printk() and all its usageSean Christopherson
Delete rmap_printk() so that MMU_WARN_ON() and MMU_DEBUG can be morphed into something that can be regularly enabled for debug kernels. The information provided by rmap_printk() isn't all that useful now that the rmap and unsync code is mature, as the prints are simultaneously too verbose (_lots_ of message) and yet not verbose enough to be helpful for debug (most instances print just the SPTE pointer/value, which is rarely sufficient to root cause anything but trivial bugs). Alternatively, rmap_printk() could be reworked to into tracepoints, but it's not clear there is a real need as rmap bugs rarely escape initial development, and when bugs do escape to production, they are often edge cases and/or reside in code that isn't directly related to the rmaps. In other words, the problems with rmap_printk() being unhelpful also apply to tracepoints. And deleting rmap_printk() doesn't preclude adding tracepoints in the future. Link: https://lore.kernel.org/r/20230729004722.1056172-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Delete pgprintk() and all its usageSean Christopherson
Delete KVM's pgprintk() and all its usage, as the code is very prone to bitrot due to being buried behind MMU_DEBUG, and the functionality has been rendered almost entirely obsolete by the tracepoints KVM has gained over the years. And for the situations where the information provided by KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps, and is almost always far too noisy, i.e. developers end up implementing custom prints anyways. Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Guard against collision with KVM-defined PFERR_IMPLICIT_ACCESSSean Christopherson
Add an assertion in kvm_mmu_page_fault() to ensure the error code provided by hardware doesn't conflict with KVM's software-defined IMPLICIT_ACCESS flag. In the unlikely scenario that future hardware starts using bit 48 for a hardware-defined flag, preserving the bit could result in KVM incorrectly interpreting the unknown flag as KVM's IMPLICIT_ACCESS flag. WARN so that any such conflict can be surfaced to KVM developers and resolved, but otherwise ignore the bit as KVM can't possibly rely on a flag it knows nothing about. Fixes: 4f4aa80e3b88 ("KVM: X86: Handle implicit supervisor access with SMAP") Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230721223711.2334426-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31KVM: x86/mmu: Move the lockdep_assert of mmu_lock to inside ↵Like Xu
clear_dirty_pt_masked() Move the lockdep_assert_held_write(&kvm->mmu_lock) from the only one caller kvm_tdp_mmu_clear_dirty_pt_masked() to inside clear_dirty_pt_masked(). This change makes it more obvious why it's safe for clear_dirty_pt_masked() to use the non-atomic (for non-volatile SPTEs) tdp_mmu_clear_spte_bits() helper. for_each_tdp_mmu_root() does its own lockdep, so the only "loss" in lockdep coverage is if the list is completely empty. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Link: https://lore.kernel.org/r/20230627042639.12636-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-08-31Merge tag 'kvm-x86-misc-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 changes for 6.6: - Misc cleanups - Retry APIC optimized recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR can diverge from the default iff TSC scaling is enabled, and clean up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID
2023-08-31Merge tag 'kvm-x86-svm-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM: x86: SVM changes for 6.6: - Add support for SEV-ES DebugSwap, i.e. allow SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)
2023-08-31Merge tag 'kvm-x86-vmx-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM: x86: VMX changes for 6.6: - Misc cleanups - Fix a bug where KVM reads a stale vmcs.IDT_VECTORING_INFO_FIELD when trying to handle NMI VM-Exits
2023-08-31Merge tag 'kvm-x86-pmu-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 PMU changes for 6.6: - Clean up KVM's handling of Intel architectural events
2023-08-31Merge tag 'kvm-x86-selftests-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM: x86: Selftests changes for 6.6: - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and covert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry
2023-08-31Merge tag 'kvm-x86-generic-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
Common KVM changes for 6.6: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations
2023-08-31Merge tag 'kvmarm-6.6' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes...
2023-08-30Merge tag 'x86_apic_for_6.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 apic updates from Dave Hansen: "This includes a very thorough rework of the 'struct apic' handlers. Quite a variety of them popped up over the years, especially in the 32-bit days when odd apics were much more in vogue. The end result speaks for itself, which is a removal of a ton of code and static calls to replace indirect calls. If there's any breakage here, it's likely to be around the 32-bit museum pieces that get light to no testing these days. Summary: - Rework apic callbacks, getting rid of unnecessary ones and coalescing lots of silly duplicates. - Use static_calls() instead of indirect calls for apic->foo() - Tons of cleanups an crap removal along the way" * tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) x86/apic: Turn on static calls x86/apic: Provide static call infrastructure for APIC callbacks x86/apic: Wrap IPI calls into helper functions x86/apic: Mark all hotpath APIC callback wrappers __always_inline x86/xen/apic: Mark apic __ro_after_init x86/apic: Convert other overrides to apic_update_callback() x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb() x86/apic: Provide apic_update_callback() x86/xen/apic: Use standard apic driver mechanism for Xen PV x86/apic: Provide common init infrastructure x86/apic: Wrap apic->native_eoi() into a helper x86/apic: Nuke ack_APIC_irq() x86/apic: Remove pointless arguments from [native_]eoi_write() x86/apic/noop: Tidy up the code x86/apic: Remove pointless NULL initializations x86/apic: Sanitize APIC ID range validation x86/apic: Prepare x2APIC for using apic::max_apic_id x86/apic: Simplify X2APIC ID validation x86/apic: Add max_apic_id member x86/apic: Wrap APIC ID validation into an inline ...
2023-08-28KVM: VMX: Refresh available regs and IDT vectoring info before NMI handlingSean Christopherson
Reset the mask of available "registers" and refresh the IDT vectoring info snapshot in vmx_vcpu_enter_exit(), before KVM potentially handles a an NMI VM-Exit. One of the "registers" that KVM VMX lazily loads is the vmcs.VM_EXIT_INTR_INFO field, which is holds the vector+type on "exception or NMI" VM-Exits, i.e. is needed to identify NMIs. Clearing the available registers bitmask after handling NMIs results in KVM querying info from the last VM-Exit that read vmcs.VM_EXIT_INTR_INFO, and leads to both missed NMIs and spurious NMIs in the host. Opportunistically grab vmcs.IDT_VECTORING_INFO_FIELD early in the VM-Exit path too, e.g. to guard against similar consumption of stale data. The field is read on every "normal" VM-Exit, and there's no point in delaying the inevitable. Reported-by: Like Xu <like.xu.linux@gmail.com> Fixes: 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230825014532.2846714-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: VMX: Delete ancient pr_warn() about KVM_SET_TSS_ADDR not being setSean Christopherson
Delete KVM's printk about KVM_SET_TSS_ADDR not being called. When the printk was added by commit 776e58ea3d37 ("KVM: unbreak userspace that does not sets tss address"), KVM also stuffed a "hopefully safe" value, i.e. the message wasn't purely informational. For reasons unknown, ostensibly to try and help people running outdated qemu-kvm versions, the message got left behind when KVM's stuffing was removed by commit 4918c6ca6838 ("KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU"). Today, the message is completely nonsensical, as it has been over a decade since KVM supported userspace running a Real Mode guest, on a CPU without unrestricted guest support, without doing KVM_SET_TSS_ADDR before KVM_RUN. I.e. KVM's ABI has required KVM_SET_TSS_ADDR for 10+ years. To make matters worse, the message is prone to false positives as it triggers when simply *creating* a vCPU due to RESET putting vCPUs into Real Mode, even when the user has no intention of ever *running* the vCPU in a Real Mode. E.g. KVM selftests stuff 64-bit mode and never touch Real Mode, but trigger the message even though they run just fine without doing KVM_SET_TSS_ADDR. Creating "dummy" vCPUs, e.g. to probe features, can also trigger the message. In both scenarios, the message confuses users and falsely implies that they've done something wrong. Reported-by: Thorsten Glaser <t.glaser@tarent.de> Closes: https://lkml.kernel.org/r/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956%40tarent.de Link: https://lore.kernel.org/r/20230815174215.433222-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: SVM: Require nrips support for SEV guests (and beyond)Sean Christopherson
Disallow SEV (and beyond) if nrips is disabled via module param, as KVM can't read guest memory to partially emulate and skip an instruction. All CPUs that support SEV support NRIPS, i.e. this is purely stopping the user from shooting themselves in the foot. Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230825013621.2845700-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: SVM: Don't inject #UD if KVM attempts to skip SEV guest insnSean Christopherson
Don't inject a #UD if KVM attempts to "emulate" to skip an instruction for an SEV guest, and instead resume the guest and hope that it can make forward progress. When commit 04c40f344def ("KVM: SVM: Inject #UD on attempted emulation for SEV guest w/o insn buffer") added the completely arbitrary #UD behavior, there were no known scenarios where a well-behaved guest would induce a VM-Exit that triggered emulation, i.e. it was thought that injecting #UD would be helpful. However, now that KVM (correctly) attempts to re-inject INT3/INTO, e.g. if a #NPF is encountered when attempting to deliver the INT3/INTO, an SEV guest can trigger emulation without a buffer, through no fault of its own. Resuming the guest and retrying the INT3/INTO is architecturally wrong, e.g. the vCPU will incorrectly re-hit code #DBs, but for SEV guests there is literally no other option that has a chance of making forward progress. Drop the #UD injection for all "skip" emulation, not just those related to INT3/INTO, even though that means that the guest will likely end up in an infinite loop instead of getting a #UD (the vCPU may also crash, e.g. if KVM emulated everything about an instruction except for advancing RIP). There's no evidence that suggests that an unexpected #UD is actually better than hanging the vCPU, e.g. a soft-hung vCPU can still respond to IRQs and NMIs to generate a backtrace. Reported-by: Wu Zongyo <wuzongyo@mail.ustc.edu.cn> Closes: https://lore.kernel.org/all/8eb933fd-2cf3-d7a9-32fe-2a1d82eac42a@mail.ustc.edu.cn Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction") Cc: stable@vger.kernel.org Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20230825013621.2845700-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: SVM: Skip VMSA init in sev_es_init_vmcb() if pointer is NULLSean Christopherson
Skip initializing the VMSA physical address in the VMCB if the VMSA is NULL, which occurs during intrahost migration as KVM initializes the VMCB before copying over state from the source to the destination (including the VMSA and its physical address). In normal builds, __pa() is just math, so the bug isn't fatal, but with CONFIG_DEBUG_VIRTUAL=y, the validity of the virtual address is verified and passing in NULL will make the kernel unhappy. Fixes: 6defa24d3b12 ("KVM: SEV: Init target VMCBs in sev_migrate_from") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20230825022357.2852133-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-25KVM: SVM: Get source vCPUs from source VM for SEV-ES intrahost migrationSean Christopherson
Fix a goof where KVM tries to grab source vCPUs from the destination VM when doing intrahost migration. Grabbing the wrong vCPU not only hoses the guest, it also crashes the host due to the VMSA pointer being left NULL. BUG: unable to handle page fault for address: ffffe38687000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 39 PID: 17143 Comm: sev_migrate_tes Tainted: GO 6.5.0-smp--fff2e47e6c3b-next #151 Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.28.0 07/10/2023 RIP: 0010:__free_pages+0x15/0xd0 RSP: 0018:ffff923fcf6e3c78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffe38687000000 RCX: 0000000000000100 RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffffe38687000000 RBP: ffff923fcf6e3c88 R08: ffff923fcafb0000 R09: 0000000000000000 R10: 0000000000000000 R11: ffffffff83619b90 R12: ffff923fa9540000 R13: 0000000000080007 R14: ffff923f6d35d000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff929d0d7c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffe38687000000 CR3: 0000005224c34005 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> sev_free_vcpu+0xcb/0x110 [kvm_amd] svm_vcpu_free+0x75/0xf0 [kvm_amd] kvm_arch_vcpu_destroy+0x36/0x140 [kvm] kvm_destroy_vcpus+0x67/0x100 [kvm] kvm_arch_destroy_vm+0x161/0x1d0 [kvm] kvm_put_kvm+0x276/0x560 [kvm] kvm_vm_release+0x25/0x30 [kvm] __fput+0x106/0x280 ____fput+0x12/0x20 task_work_run+0x86/0xb0 do_exit+0x2e3/0x9c0 do_group_exit+0xb1/0xc0 __x64_sys_exit_group+0x1b/0x20 do_syscall_64+0x41/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> CR2: ffffe38687000000 Fixes: 6defa24d3b12 ("KVM: SEV: Init target VMCBs in sev_migrate_from") Cc: stable@vger.kernel.org Cc: Peter Gonda <pgonda@google.com> Reviewed-by: Peter Gonda <pgonda@google.com> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Link: https://lore.kernel.org/r/20230825022357.2852133-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17KVM: x86: Disallow guest CPUID lookups when IRQs are disabledSean Christopherson
Now that KVM has a framework for caching guest CPUID feature flags, add a "rule" that IRQs must be enabled when doing guest CPUID lookups, and enforce the rule via a lockdep assertion. CPUID lookups are slow, and within KVM, IRQs are only ever disabled in hot paths, e.g. the core run loop, fast page fault handling, etc. I.e. querying guest CPUID with IRQs disabled, especially in the run loop, should be avoided. Link: https://lore.kernel.org/r/20230815203653.519297-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17KVM: nSVM: Use KVM-governed feature framework to track "vNMI enabled"Sean Christopherson
Track "virtual NMI exposed to L1" via a governed feature flag instead of using a dedicated bit/flag in vcpu_svm. Note, checking KVM's capabilities instead of the "vnmi" param means that the code isn't strictly equivalent, as vnmi_enabled could have been set if nested=false where as that the governed feature cannot. But that's a glorified nop as the feature/flag is consumed only by paths that are gated by nSVM being enabled. Link: https://lore.kernel.org/r/20230815203653.519297-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17KVM: nSVM: Use KVM-governed feature framework to track "vGIF enabled"Sean Christopherson
Track "virtual GIF exposed to L1" via a governed feature flag instead of using a dedicated bit/flag in vcpu_svm. Note, checking KVM's capabilities instead of the "vgif" param means that the code isn't strictly equivalent, as vgif_enabled could have been set if nested=false where as that the governed feature cannot. But that's a glorified nop as the feature/flag is consumed only by paths that are Link: https://lore.kernel.org/r/20230815203653.519297-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17KVM: nSVM: Use KVM-governed feature framework to track "Pause Filter enabled"Sean Christopherson
Track "Pause Filtering is exposed to L1" via governed feature flags instead of using dedicated bits/flags in vcpu_svm. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230815203653.519297-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17KVM: nSVM: Use KVM-governed feature framework to track "LBRv enabled"Sean Christopherson
Track "LBR virtualization exposed to L1" via a governed feature flag instead of using a dedicated bit/flag in vcpu_svm. Note, checking KVM's capabilities instead of the "lbrv" param means that the code isn't strictly equivalent, as lbrv_enabled could have been set if nested=false where as that the governed feature cannot. But that's a glorified nop as the feature/flag is consumed only by paths that are gated by nSVM being enabled. Link: https://lore.kernel.org/r/20230815203653.519297-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17KVM: nSVM: Use KVM-governed feature framework to track "vVM{SAVE,LOAD} enabled"Sean Christopherson
Track "virtual VMSAVE/VMLOAD exposed to L1" via a governed feature flag instead of using a dedicated bit/flag in vcpu_svm. Opportunistically add a comment explaining why KVM disallows virtual VMLOAD/VMSAVE when the vCPU model is Intel. No functional change intended. Link: https://lore.kernel.org/r/20230815203653.519297-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-08-17KVM: nSVM: Use KVM-governed feature framework to track "TSC scaling enabled"Sean Christopherson
Track "TSC scaling exposed to L1" via a governed feature flag instead of using a dedicated bit/flag in vcpu_svm. Note, this fixes a benign bug where KVM would mark TSC scaling as exposed to L1 even if overall nested SVM supported is disabled, i.e. KVM would let L1 write MSR_AMD64_TSC_RATIO even when KVM didn't advertise TSCRATEMSR support to userspace. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20230815203653.519297-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>