Age | Commit message (Collapse) | Author |
|
Get/put references to KVM when a page-track notifier is (un)registered
instead of relying on the caller to do so. Forcing the caller to do the
bookkeeping is unnecessary and adds one more thing for users to get
wrong, e.g. see commit 9ed1fdee9ee3 ("drm/i915/gvt: Get reference to KVM
iff attachment to VM is successful").
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Refactor KVM's exported/external page-track, a.k.a. write-track, APIs
to take only the gfn and do the required memslot lookup in KVM proper.
Forcing users of the APIs to get the memslot unnecessarily bleeds
KVM internals into KVMGT and complicates usage of the APIs.
No functional change intended.
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Bug the VM if something attempts to write-track a gfn, but write-tracking
isn't enabled. The VM is doomed (and KVM has an egregious bug) if KVM or
KVMGT wants to shadow guest page tables but can't because write-tracking
isn't enabled.
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When adding/removing gfns to/from write-tracking, assert that mmu_lock
is held for write, and that either slots_lock or kvm->srcu is held.
mmu_lock must be held for write to protect gfn_write_track's refcount,
and SRCU or slots_lock must be held to protect the memslot itself.
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rename the page-track APIs to capture that they're all about tracking
writes, now that the facade of supporting multiple modes is gone.
Opportunstically replace "slot" with "gfn" in anticipation of removing
the @slot param from the external APIs.
No functional change intended.
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Drop "support" for multiple page-track modes, as there is no evidence
that array-based and refcounted metadata is the optimal solution for
other modes, nor is there any evidence that other use cases, e.g. for
access-tracking, will be a good fit for the page-track machinery in
general.
E.g. one potential use case of access-tracking would be to prevent guest
access to poisoned memory (from the guest's perspective). In that case,
the number of poisoned pages is likely to be a very small percentage of
the guest memory, and there is no need to reference count the number of
access-tracking users, i.e. expanding gfn_track[] for a new mode would be
grossly inefficient. And for poisoned memory, host userspace would also
likely want to trap accesses, e.g. to inject #MC into the guest, and that
isn't currently supported by the page-track framework.
A better alternative for that poisoned page use case is likely a
variation of the proposed per-gfn attributes overlay (linked), which
would allow efficiently tracking the sparse set of poisoned pages, and by
default would exit to userspace on access.
Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com
Cc: Ben Gardon <bgardon@google.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Disable the page-track notifier code at compile time if there are no
external users, i.e. if CONFIG_KVM_EXTERNAL_WRITE_TRACKING=n. KVM itself
now hooks emulated writes directly instead of relying on the page-track
mechanism.
Provide a stub for "struct kvm_page_track_notifier_node" so that including
headers directly from the command line, e.g. for testing include guards,
doesn't fail due to a struct having an incomplete type.
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Bury the declaration of the page-track helpers that are intended only for
internal KVM use in a "private" header. In addition to guarding against
unwanted usage of the internal-only helpers, dropping their definitions
avoids exposing other structures that should be KVM-internal, e.g. for
memslots. This is a baby step toward making kvm_host.h a KVM-internal
header in the very distant future.
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Remove ->track_remove_slot(), there are no longer any users and it's
unlikely a "flush" hook will ever be the correct API to provide to an
external page-track user.
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add a new page-track hook, track_remove_region(), that is called when a
memslot DELETE operation is about to be committed. The "remove" hook
will be used by KVMGT and will effectively replace the existing
track_flush_slot() altogether now that KVM itself doesn't rely on the
"flush" hook either.
The "flush" hook is flawed as it's invoked before the memslot operation
is guaranteed to succeed, i.e. KVM might ultimately keep the existing
memslot without notifying external page track users, a.k.a. KVMGT. In
practice, this can't currently happen on x86, but there are no guarantees
that won't change in the future, not to mention that "flush" does a very
poor job of describing what is happening.
Pass in the gfn+nr_pages instead of the slot itself so external users,
i.e. KVMGT, don't need to exposed to KVM internals (memslots). This will
help set the stage for additional cleanups to the page-track APIs.
Opportunistically align the existing srcu_read_lock_held() usage so that
the new case doesn't stand out like a sore thumb (and not aligning the
new code makes bots unhappy).
Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Disallow moving memslots if the VM has external page-track users, i.e. if
KVMGT is being used to expose a virtual GPU to the guest, as KVMGT doesn't
correctly handle moving memory regions.
Note, this is potential ABI breakage! E.g. userspace could move regions
that aren't shadowed by KVMGT without harming the guest. However, the
only known user of KVMGT is QEMU, and QEMU doesn't move generic memory
regions. KVM's own support for moving memory regions was also broken for
multiple years (albeit for an edge case, but arguably moving RAM is
itself an edge case), e.g. see commit edd4fa37baa6 ("KVM: x86: Allocate
new rmap and large page tracking when moving memslot").
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Drop @vcpu from KVM's ->track_write() hook provided for external users of
the page-track APIs now that KVM itself doesn't use the page-track
mechanism.
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Reviewed-by: Zhi Wang <zhi.a.wang@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Don't use the generic page-track mechanism to handle writes to guest PTEs
in KVM's MMU. KVM's MMU needs access to information that should not be
exposed to external page-track users, e.g. KVM needs (for some definitions
of "need") the vCPU to query the current paging mode, whereas external
users, i.e. KVMGT, have no ties to the current vCPU and so should never
need the vCPU.
Moving away from the page-track mechanism will allow dropping use of the
page-track mechanism for KVM's own MMU, and will also allow simplifying
and cleaning up the page-track APIs.
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Call kvm_mmu_zap_all_fast() directly when flushing a memslot instead of
bouncing through the page-track mechanism. KVM (unfortunately) needs to
zap and flush all page tables on memslot DELETE/MOVE irrespective of
whether KVM is shadowing guest page tables.
This will allow changing KVM to register a page-track notifier on the
first shadow root allocation, and will also allow deleting the misguided
kvm_page_track_flush_slot() hook itself once KVM-GT also moves to a
different method for reacting to memslot changes.
No functional change intended.
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20221110014821.1548347-2-seanjc@google.com
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move x86's implementation of kvm_arch_flush_shadow_{all,memslot}() into
mmu.c, and make kvm_mmu_zap_all() static as it was globally visible only
for kvm_arch_flush_shadow_all(). This will allow refactoring
kvm_arch_flush_shadow_memslot() to call kvm_mmu_zap_all() directly without
having to expose kvm_mmu_zap_all_fast() outside of mmu.c. Keeping
everything in mmu.c will also likely simplify supporting TDX, which
intends to do zap only relevant SPTEs on memslot updates.
No functional change intended.
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yongwei Ma <yongwei.ma@intel.com>
Link: https://lore.kernel.org/r/20230729013535.1070024-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Introduce KVM_BUG_ON_DATA_CORRUPTION() and use it in the low-level rmap
helpers to convert the existing BUG()s to WARN_ON_ONCE() when the kernel
is built with CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG()
on corruption of host kernel data structures. Environments that don't
have infrastructure to automatically capture crash dumps, i.e. aren't
likely to enable CONFIG_BUG_ON_DATA_CORRUPTION=y, are typically better
served overall by WARN-and-continue behavior (for the kernel, the VM is
dead regardless), as a BUG() while holding mmu_lock all but guarantees
the _best_ case scenario is a panic().
Make the BUG()s conditional instead of removing/replacing them entirely as
there's a non-zero chance (though by no means a guarantee) that the damage
isn't contained to the target VM, e.g. if no rmap is found for a SPTE then
KVM may be double-zapping the SPTE, i.e. has already freed the memory the
SPTE pointed at and thus KVM is reading/writing memory that KVM no longer
owns.
Link: https://lore.kernel.org/all/20221129191237.31447-1-mizhang@google.com
Suggested-by: Mingwei Zhang <mizhang@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Link: https://lore.kernel.org/r/20230729004722.1056172-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Plumb "struct kvm" all the way to pte_list_remove() to allow the usage of
KVM_BUG() and/or KVM_BUG_ON(). This will allow killing only the offending
VM instead of doing BUG() if the kernel is built with
CONFIG_BUG_ON_DATA_CORRUPTION=n, i.e. does NOT want to BUG() if KVM's data
structures (rmaps) appear to be corrupted.
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: tweak changelog]
Link: https://lore.kernel.org/r/20230729004722.1056172-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use BUILD_BUG_ON_INVALID() instead of an empty do-while loop to stub out
KVM_MMU_WARN_ON() when CONFIG_KVM_PROVE_MMU=n, that way _some_ build
issues with the usage of KVM_MMU_WARN_ON() will be dected even if the
kernel is using the stubs, e.g. basic syntax errors will be detected.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230729004722.1056172-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Replace MMU_DEBUG, which requires manually modifying KVM to enable the
macro, with a proper Kconfig, KVM_PROVE_MMU. Now that pgprintk() and
rmap_printk() are gone, i.e. the macro guards only KVM_MMU_WARN_ON() and
won't flood the kernel logs, enabling the option for debug kernels is both
desirable and feasible.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230729004722.1056172-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Promote the ASSERT(), which is quite dead code in KVM, into a KVM_BUG_ON()
for KVM's sanity check that CR4.PAE=1 if the vCPU is in long mode when
performing a walk of guest page tables. The sanity is quite cheap since
neither EFER nor CR4.PAE requires a VMREAD, especially relative to the
cost of walking the guest page tables.
More importantly, the sanity check would have prevented the true badness
fixed by commit 112e66017bff ("KVM: nVMX: add missing consistency checks
for CR0 and CR4"). The missed consistency check resulted in some versions
of KVM corrupting the on-stack guest_walker structure due to KVM thinking
there are 4/5 levels of page tables, but wiring up the MMU hooks to point
at the paging32 implementation, which only allocates space for two levels
of page tables in "struct guest_walker32".
Queue a page fault for injection if the assertion fails, as both callers,
FNAME(gva_to_gpa) and FNAME(walk_addr_generic), assume that walker.fault
contains sane info on a walk failure. E.g. not populating the fault info
could result in KVM consuming and/or exposing uninitialized stack data
before the vCPU is kicked out to userspace, which doesn't happen until
KVM checks for KVM_REQ_VM_DEAD on the next enter.
Move the check below the initialization of "pte_access" so that the
aforementioned to-be-injected page fault doesn't consume uninitialized
stack data. The information _shouldn't_ reach the guest or userspace,
but there's zero downside to being paranoid in this case.
Link: https://lore.kernel.org/r/20230729004722.1056172-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many case, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with warn_on_panic=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertions fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in captruing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Rename MMU_WARN_ON() to make it super obvious that the assertions are
all about KVM's MMU, not the primary MMU.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230729004722.1056172-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Massage the error message for the sanity check on SPTEs when freeing a
shadow page to be more verbose, and to print out all shadow-present SPTEs,
not just the first SPTE encountered. Printing all SPTEs can be quite
valuable for debug, e.g. highlights whether the leak is a one-off or
widepsread, or possibly the result of memory corruption (something else
in the kernel stomping on KVM's SPTEs).
Opportunistically move the MMU_WARN_ON() into the helper itself, which
will allow a future cleanup to use BUILD_BUG_ON_INVALID() as the stub for
MMU_WARN_ON(). BUILD_BUG_ON_INVALID() works as intended and results in
the compiler complaining about is_empty_shadow_page() not being declared.
Link: https://lore.kernel.org/r/20230729004722.1056172-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Replace the pointer arithmetic used to iterate over SPTEs in
is_empty_shadow_page() with more standard interger-based iteration.
No functional change intended.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Link: https://lore.kernel.org/r/20230729004722.1056172-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Delete KVM's "dbg" module param now that its usage in KVM is gone (it
used to guard pgprintk() and rmap_printk()).
Link: https://lore.kernel.org/r/20230729004722.1056172-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Delete rmap_printk() so that MMU_WARN_ON() and MMU_DEBUG can be morphed
into something that can be regularly enabled for debug kernels. The
information provided by rmap_printk() isn't all that useful now that the
rmap and unsync code is mature, as the prints are simultaneously too
verbose (_lots_ of message) and yet not verbose enough to be helpful for
debug (most instances print just the SPTE pointer/value, which is rarely
sufficient to root cause anything but trivial bugs).
Alternatively, rmap_printk() could be reworked to into tracepoints, but
it's not clear there is a real need as rmap bugs rarely escape initial
development, and when bugs do escape to production, they are often edge
cases and/or reside in code that isn't directly related to the rmaps.
In other words, the problems with rmap_printk() being unhelpful also apply
to tracepoints. And deleting rmap_printk() doesn't preclude adding
tracepoints in the future.
Link: https://lore.kernel.org/r/20230729004722.1056172-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Delete KVM's pgprintk() and all its usage, as the code is very prone
to bitrot due to being buried behind MMU_DEBUG, and the functionality has
been rendered almost entirely obsolete by the tracepoints KVM has gained
over the years. And for the situations where the information provided by
KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps,
and is almost always far too noisy, i.e. developers end up implementing
custom prints anyways.
Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add an assertion in kvm_mmu_page_fault() to ensure the error code provided
by hardware doesn't conflict with KVM's software-defined IMPLICIT_ACCESS
flag. In the unlikely scenario that future hardware starts using bit 48
for a hardware-defined flag, preserving the bit could result in KVM
incorrectly interpreting the unknown flag as KVM's IMPLICIT_ACCESS flag.
WARN so that any such conflict can be surfaced to KVM developers and
resolved, but otherwise ignore the bit as KVM can't possibly rely on a
flag it knows nothing about.
Fixes: 4f4aa80e3b88 ("KVM: X86: Handle implicit supervisor access with SMAP")
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20230721223711.2334426-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
clear_dirty_pt_masked()
Move the lockdep_assert_held_write(&kvm->mmu_lock) from the only one caller
kvm_tdp_mmu_clear_dirty_pt_masked() to inside clear_dirty_pt_masked().
This change makes it more obvious why it's safe for clear_dirty_pt_masked()
to use the non-atomic (for non-volatile SPTEs) tdp_mmu_clear_spte_bits()
helper. for_each_tdp_mmu_root() does its own lockdep, so the only "loss"
in lockdep coverage is if the list is completely empty.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Like Xu <likexu@tencent.com>
Link: https://lore.kernel.org/r/20230627042639.12636-1-likexu@tencent.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86 changes for 6.6:
- Misc cleanups
- Retry APIC optimized recalculation if a vCPU is added/enabled
- Overhaul emergency reboot code to bring SVM up to par with VMX, tie the
"emergency disabling" behavior to KVM actually being loaded, and move all of
the logic within KVM
- Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC
ratio MSR can diverge from the default iff TSC scaling is enabled, and clean
up related code
- Add a framework to allow "caching" feature flags so that KVM can check if
the guest can use a feature without needing to search guest CPUID
|
|
KVM: x86: SVM changes for 6.6:
- Add support for SEV-ES DebugSwap, i.e. allow SEV-ES guests to use debug
registers and generate/handle #DBs
- Clean up LBR virtualization code
- Fix a bug where KVM fails to set the target pCPU during an IRTE update
- Fix fatal bugs in SEV-ES intrahost migration
- Fix a bug where the recent (architecturally correct) change to reinject
#BP and skip INT3 broke SEV guests (can't decode INT3 to skip it)
|
|
KVM: x86: VMX changes for 6.6:
- Misc cleanups
- Fix a bug where KVM reads a stale vmcs.IDT_VECTORING_INFO_FIELD when trying
to handle NMI VM-Exits
|
|
KVM x86 PMU changes for 6.6:
- Clean up KVM's handling of Intel architectural events
|
|
KVM: x86: Selftests changes for 6.6:
- Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs
- Add support for printf() in guest code and covert all guest asserts to use
printf-based reporting
- Clean up the PMU event filter test and add new testcases
- Include x86 selftests in the KVM x86 MAINTAINERS entry
|
|
Common KVM changes for 6.6:
- Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass
action specific data without needing to constantly update the main handlers.
- Drop unused function declarations
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 6.6
- Add support for TLB range invalidation of Stage-2 page tables,
avoiding unnecessary invalidations. Systems that do not implement
range invalidation still rely on a full invalidation when dealing
with large ranges.
- Add infrastructure for forwarding traps taken from a L2 guest to
the L1 guest, with L0 acting as the dispatcher, another baby step
towards the full nested support.
- Simplify the way we deal with the (long deprecated) 'CPU target',
resulting in a much needed cleanup.
- Fix another set of PMU bugs, both on the guest and host sides,
as we seem to never have any shortage of those...
- Relax the alignment requirements of EL2 VA allocations for
non-stack allocations, as we were otherwise wasting a lot of that
precious VA space.
- The usual set of non-functional cleanups, although I note the lack
of spelling fixes...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Dave Hansen:
"This includes a very thorough rework of the 'struct apic' handlers.
Quite a variety of them popped up over the years, especially in the
32-bit days when odd apics were much more in vogue.
The end result speaks for itself, which is a removal of a ton of code
and static calls to replace indirect calls.
If there's any breakage here, it's likely to be around the 32-bit
museum pieces that get light to no testing these days.
Summary:
- Rework apic callbacks, getting rid of unnecessary ones and
coalescing lots of silly duplicates.
- Use static_calls() instead of indirect calls for apic->foo()
- Tons of cleanups an crap removal along the way"
* tag 'x86_apic_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
x86/apic: Turn on static calls
x86/apic: Provide static call infrastructure for APIC callbacks
x86/apic: Wrap IPI calls into helper functions
x86/apic: Mark all hotpath APIC callback wrappers __always_inline
x86/xen/apic: Mark apic __ro_after_init
x86/apic: Convert other overrides to apic_update_callback()
x86/apic: Replace acpi_wake_cpu_handler_update() and apic_set_eoi_cb()
x86/apic: Provide apic_update_callback()
x86/xen/apic: Use standard apic driver mechanism for Xen PV
x86/apic: Provide common init infrastructure
x86/apic: Wrap apic->native_eoi() into a helper
x86/apic: Nuke ack_APIC_irq()
x86/apic: Remove pointless arguments from [native_]eoi_write()
x86/apic/noop: Tidy up the code
x86/apic: Remove pointless NULL initializations
x86/apic: Sanitize APIC ID range validation
x86/apic: Prepare x2APIC for using apic::max_apic_id
x86/apic: Simplify X2APIC ID validation
x86/apic: Add max_apic_id member
x86/apic: Wrap APIC ID validation into an inline
...
|
|
Reset the mask of available "registers" and refresh the IDT vectoring
info snapshot in vmx_vcpu_enter_exit(), before KVM potentially handles a
an NMI VM-Exit. One of the "registers" that KVM VMX lazily loads is the
vmcs.VM_EXIT_INTR_INFO field, which is holds the vector+type on "exception
or NMI" VM-Exits, i.e. is needed to identify NMIs. Clearing the available
registers bitmask after handling NMIs results in KVM querying info from
the last VM-Exit that read vmcs.VM_EXIT_INTR_INFO, and leads to both
missed NMIs and spurious NMIs in the host.
Opportunistically grab vmcs.IDT_VECTORING_INFO_FIELD early in the VM-Exit
path too, e.g. to guard against similar consumption of stale data. The
field is read on every "normal" VM-Exit, and there's no point in delaying
the inevitable.
Reported-by: Like Xu <like.xu.linux@gmail.com>
Fixes: 11df586d774f ("KVM: VMX: Handle NMI VM-Exits in noinstr region")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230825014532.2846714-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete KVM's printk about KVM_SET_TSS_ADDR not being called. When the
printk was added by commit 776e58ea3d37 ("KVM: unbreak userspace that does
not sets tss address"), KVM also stuffed a "hopefully safe" value, i.e.
the message wasn't purely informational. For reasons unknown, ostensibly
to try and help people running outdated qemu-kvm versions, the message got
left behind when KVM's stuffing was removed by commit 4918c6ca6838
("KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU").
Today, the message is completely nonsensical, as it has been over a decade
since KVM supported userspace running a Real Mode guest, on a CPU without
unrestricted guest support, without doing KVM_SET_TSS_ADDR before KVM_RUN.
I.e. KVM's ABI has required KVM_SET_TSS_ADDR for 10+ years.
To make matters worse, the message is prone to false positives as it
triggers when simply *creating* a vCPU due to RESET putting vCPUs into
Real Mode, even when the user has no intention of ever *running* the vCPU
in a Real Mode. E.g. KVM selftests stuff 64-bit mode and never touch Real
Mode, but trigger the message even though they run just fine without
doing KVM_SET_TSS_ADDR. Creating "dummy" vCPUs, e.g. to probe features,
can also trigger the message. In both scenarios, the message confuses
users and falsely implies that they've done something wrong.
Reported-by: Thorsten Glaser <t.glaser@tarent.de>
Closes: https://lkml.kernel.org/r/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956%40tarent.de
Link: https://lore.kernel.org/r/20230815174215.433222-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disallow SEV (and beyond) if nrips is disabled via module param, as KVM
can't read guest memory to partially emulate and skip an instruction. All
CPUs that support SEV support NRIPS, i.e. this is purely stopping the user
from shooting themselves in the foot.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230825013621.2845700-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't inject a #UD if KVM attempts to "emulate" to skip an instruction
for an SEV guest, and instead resume the guest and hope that it can make
forward progress. When commit 04c40f344def ("KVM: SVM: Inject #UD on
attempted emulation for SEV guest w/o insn buffer") added the completely
arbitrary #UD behavior, there were no known scenarios where a well-behaved
guest would induce a VM-Exit that triggered emulation, i.e. it was thought
that injecting #UD would be helpful.
However, now that KVM (correctly) attempts to re-inject INT3/INTO, e.g. if
a #NPF is encountered when attempting to deliver the INT3/INTO, an SEV
guest can trigger emulation without a buffer, through no fault of its own.
Resuming the guest and retrying the INT3/INTO is architecturally wrong,
e.g. the vCPU will incorrectly re-hit code #DBs, but for SEV guests there
is literally no other option that has a chance of making forward progress.
Drop the #UD injection for all "skip" emulation, not just those related to
INT3/INTO, even though that means that the guest will likely end up in an
infinite loop instead of getting a #UD (the vCPU may also crash, e.g. if
KVM emulated everything about an instruction except for advancing RIP).
There's no evidence that suggests that an unexpected #UD is actually
better than hanging the vCPU, e.g. a soft-hung vCPU can still respond to
IRQs and NMIs to generate a backtrace.
Reported-by: Wu Zongyo <wuzongyo@mail.ustc.edu.cn>
Closes: https://lore.kernel.org/all/8eb933fd-2cf3-d7a9-32fe-2a1d82eac42a@mail.ustc.edu.cn
Fixes: 6ef88d6e36c2 ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
Cc: stable@vger.kernel.org
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20230825013621.2845700-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Skip initializing the VMSA physical address in the VMCB if the VMSA is
NULL, which occurs during intrahost migration as KVM initializes the VMCB
before copying over state from the source to the destination (including
the VMSA and its physical address).
In normal builds, __pa() is just math, so the bug isn't fatal, but with
CONFIG_DEBUG_VIRTUAL=y, the validity of the virtual address is verified
and passing in NULL will make the kernel unhappy.
Fixes: 6defa24d3b12 ("KVM: SEV: Init target VMCBs in sev_migrate_from")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20230825022357.2852133-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fix a goof where KVM tries to grab source vCPUs from the destination VM
when doing intrahost migration. Grabbing the wrong vCPU not only hoses
the guest, it also crashes the host due to the VMSA pointer being left
NULL.
BUG: unable to handle page fault for address: ffffe38687000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP NOPTI
CPU: 39 PID: 17143 Comm: sev_migrate_tes Tainted: GO 6.5.0-smp--fff2e47e6c3b-next #151
Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 34.28.0 07/10/2023
RIP: 0010:__free_pages+0x15/0xd0
RSP: 0018:ffff923fcf6e3c78 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffe38687000000 RCX: 0000000000000100
RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffffe38687000000
RBP: ffff923fcf6e3c88 R08: ffff923fcafb0000 R09: 0000000000000000
R10: 0000000000000000 R11: ffffffff83619b90 R12: ffff923fa9540000
R13: 0000000000080007 R14: ffff923f6d35d000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff929d0d7c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffe38687000000 CR3: 0000005224c34005 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
<TASK>
sev_free_vcpu+0xcb/0x110 [kvm_amd]
svm_vcpu_free+0x75/0xf0 [kvm_amd]
kvm_arch_vcpu_destroy+0x36/0x140 [kvm]
kvm_destroy_vcpus+0x67/0x100 [kvm]
kvm_arch_destroy_vm+0x161/0x1d0 [kvm]
kvm_put_kvm+0x276/0x560 [kvm]
kvm_vm_release+0x25/0x30 [kvm]
__fput+0x106/0x280
____fput+0x12/0x20
task_work_run+0x86/0xb0
do_exit+0x2e3/0x9c0
do_group_exit+0xb1/0xc0
__x64_sys_exit_group+0x1b/0x20
do_syscall_64+0x41/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
</TASK>
CR2: ffffe38687000000
Fixes: 6defa24d3b12 ("KVM: SEV: Init target VMCBs in sev_migrate_from")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Reviewed-by: Peter Gonda <pgonda@google.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Link: https://lore.kernel.org/r/20230825022357.2852133-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM has a framework for caching guest CPUID feature flags, add
a "rule" that IRQs must be enabled when doing guest CPUID lookups, and
enforce the rule via a lockdep assertion. CPUID lookups are slow, and
within KVM, IRQs are only ever disabled in hot paths, e.g. the core run
loop, fast page fault handling, etc. I.e. querying guest CPUID with IRQs
disabled, especially in the run loop, should be avoided.
Link: https://lore.kernel.org/r/20230815203653.519297-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track "virtual NMI exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "vnmi" param means that
the code isn't strictly equivalent, as vnmi_enabled could have been set
if nested=false where as that the governed feature cannot. But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.
Link: https://lore.kernel.org/r/20230815203653.519297-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track "virtual GIF exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "vgif" param means that
the code isn't strictly equivalent, as vgif_enabled could have been set
if nested=false where as that the governed feature cannot. But that's a
glorified nop as the feature/flag is consumed only by paths that are
Link: https://lore.kernel.org/r/20230815203653.519297-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track "Pause Filtering is exposed to L1" via governed feature flags
instead of using dedicated bits/flags in vcpu_svm.
No functional change intended.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track "LBR virtualization exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.
Note, checking KVM's capabilities instead of the "lbrv" param means that
the code isn't strictly equivalent, as lbrv_enabled could have been set
if nested=false where as that the governed feature cannot. But that's a
glorified nop as the feature/flag is consumed only by paths that are
gated by nSVM being enabled.
Link: https://lore.kernel.org/r/20230815203653.519297-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track "virtual VMSAVE/VMLOAD exposed to L1" via a governed feature flag
instead of using a dedicated bit/flag in vcpu_svm.
Opportunistically add a comment explaining why KVM disallows virtual
VMLOAD/VMSAVE when the vCPU model is Intel.
No functional change intended.
Link: https://lore.kernel.org/r/20230815203653.519297-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track "TSC scaling exposed to L1" via a governed feature flag instead of
using a dedicated bit/flag in vcpu_svm.
Note, this fixes a benign bug where KVM would mark TSC scaling as exposed
to L1 even if overall nested SVM supported is disabled, i.e. KVM would let
L1 write MSR_AMD64_TSC_RATIO even when KVM didn't advertise TSCRATEMSR
support to userspace.
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Link: https://lore.kernel.org/r/20230815203653.519297-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|