Age | Commit message (Collapse) | Author |
|
Convert the arch timer tests to use __pin_task_to_cpu() and
pin_self_to_cpu().
No functional change intended.
Link: https://lore.kernel.org/r/20250626001225.744268-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
For a VCPU thread pinned to a single LPU, verify that interleaved host
and guest reads of IA32_[AM]PERF return strictly increasing values when
APERFMPERF exiting is disabled.
Run the test in both L1 and L2 to verify that KVM passes through the
APERF and MPERF MSRs when L1 doesn't want to intercept them (or any MSRs).
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-4-jmattson@google.com
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250626001225.744268-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Expand kvm_pin_this_task_to_pcpu() into a set of APIs to allow pinning a
task (or self) to a CPU (any or specific). This will allow deduplicating
code throughout a variety of selftests.
Opportunistically use "self" instead of "this_task" as it is both more
concise and less ambiguous.
Link: https://lore.kernel.org/r/20250626001225.744268-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs
without interception.
The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not
handled at all. The MSR values are not zeroed on vCPU creation, saved
on suspend, or restored on resume. No accommodation is made for
processor migration or for sharing a logical processor with other
tasks. No adjustments are made for non-unit TSC multipliers. The MSRs
do not account for time the same way as the comparable PMU events,
whether the PMU is virtualized by the traditional emulation method or
the new mediated pass-through approach.
Nonetheless, in a properly constrained environment, this capability
can be combined with a guest CPUID table that advertises support for
CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the
effective physical CPU frequency in /proc/cpuinfo. Moreover, there is
no performance cost for this capability.
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Store each "disabled exit" boolean in a single bit rather than a byte.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Advertise support for LKGS (load into IA32_KERNEL_GS_BASE) to userspace
if the instruction is supported by the underlying CPU.
LKGS is introduced with FRED to completely eliminate the need to swapgs
explicilty. It behaves like the MOV to GS instruction except that it
loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the
GS segment’s descriptor cache, which is exactly what Linux kernel does
to load a user level GS base. Thus there is no need to SWAPGS away
from the kernel GS base.
LKGS is an independent CPU feature that works correctly in a KVM guest
without requiring explicit enablement.
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Link: https://lore.kernel.org/r/20250626173521.2301088-1-xin@zytor.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e.
need to be preserved when running the guest, to dedup the logic without
having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL.
No functional change intended.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR
interception, ensuring consistency with other cases where MSRs are
intercepted depending on guest caps and CPUIDs.
No functional change intended.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Extract a common function from MSR interception disabling logic and create
disabling and enabling functions based on it. This removes most of the
duplicated code for MSR interception disabling/enabling.
No functional change intended.
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com
[sean: s/enable/set, inline the wrappers]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Convert the incoming mp_state to INIT_RECIEVED instead of manually calling
kvm_set_mp_state() to make it more obvious that the SIPI_RECEIVED logic is
translating the incoming state to KVM's internal tracking, as opposed to
being some entirely unique flow.
Opportunistically add a comment to explain what the code is doing.
No functional change intended.
Link: https://lore.kernel.org/r/20250605195018.539901-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Check for the should-be-impossible scenario of a vCPU being in
Wait-For-SIPI with INIT/SIPI blocked during KVM_RUN instead of trying to
detect and prevent illegal combinations in every ioctl that sets relevant
state. Attempting to handle every possible "set" path is a losing game of
whack-a-mole, and risks breaking userspace. E.g. INIT/SIPI are blocked on
Intel if the vCPU is in VMX Root mode (post-VMXON), and on AMD if GIF=0.
Handling those scenarios would require potentially breaking changes to
{vmx,svm}_set_nested_state().
Moving the check to KVM_RUN fixes a syzkaller-induced splat due to the
aforementioned VMXON case, and in theory should close the hole once and for
all.
Note, kvm_x86_vcpu_pre_run() already handles SIPI_RECEIVED, only the WFS
case needs additional attention.
Reported-by: syzbot+c1cbaedc2613058d5194@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=490ae63d8d89cb82c5d462d16962cf371df0e476
Link: https://lore.kernel.org/r/20250605195018.539901-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN if KVM_RUN is reached with a vCPU's mp_state set to SIPI_RECEIVED, as
KVM no longer uses SIPI_RECEIVED internally, and should morph SIPI_RECEIVED
into INIT_RECEIVED with a pending SIPI if userspace forces SIPI_RECEIVED.
See commit 66450a21f996 ("KVM: x86: Rework INIT and SIPI handling") for
more history and details.
Link: https://lore.kernel.org/r/20250605195018.539901-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allow userspace to set a vCPU's mp_state to INIT_RECEIVED in conjunction
with a pending SMI, as rejecting that combination could result in KVM
disallowing reflecting the output from KVM_GET_VCPU_EVENTS back into KVM
via KVM_SET_VCPU_EVENTS.
At the time the check was added, smi_pending could only be set in the
context of KVM_RUN, with the vCPU in the RUNNABLE state. I.e. it was
impossible for KVM to save vCPU state such that userspace could see a
pending SMI for a vCPU in WFS.
That no longer holds true now that KVM processes requested SMIs during
KVM_GET_VCPU_EVENTS, e.g. if a vCPU receives an SMI while in WFS, and
then userspace saves vCPU state.
Note, this may partially re-open the user-triggerable WARN that was mostly
closed by commit 28bf28887976 ("KVM: x86: fix user triggerable warning in
kvm_apic_accept_events()"), but that WARN can already be triggered in
several other ways, e.g. if userspace stuffs VMXON=1 after putting the
vCPU into WFS. That issue will be addressed in an upcoming commit, in a
more robust fashion (hopefully).
Fixes: 1f7becf1b7e2 ("KVM: x86: get smi pending status correctly")
Link: https://lore.kernel.org/r/20250605195018.539901-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Re-read MSR_{FS,GS}_BASE after restoring the "allow everything" userspace
MSR filter to verify that KVM stops forwarding exits to userspace. This
can also be used in conjunction with manual verification (e.g. printk) to
ensure KVM is correctly updating the MSR bitmaps consumed by hardware.
Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Tested-by: Manali Shukla <Manali.Shukla@amd.com>
Link: https://lore.kernel.org/r/20250610225737.156318-33-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of
userspace filters that disallow access to an MSR. The more complicated
logic is no longer needed or justified now that KVM recalculates all MSR
intercepts on a userspace MSR filter change, i.e. now that KVM doesn't
need to also update shadow bitmaps.
No functional change intended.
Suggested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to allocate and initialize an MSR or I/O permissions map, as
the logic is identical between the two map types, the only difference is
the size of the bitmap. Opportunistically add a comment to explain why
the bitmaps are initialized with 0xff, e.g. instead of the more common
zero-initialized behavior, which is the main motivation for deduplicating
the code.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access
the bitmaps using "unsigned long" chunks, i.e. use 8-byte access for
64-bit kernels instead of arbitrarily working on 4-byte chunks.
Opportunistically rename local variables in nested_svm_merge_msrpm() to
more precisely/accurately reflect their purpose ("offset" in particular is
extremely ambiguous).
Link: https://lore.kernel.org/r/20250610225737.156318-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate
that the MSR isn't covered by one of the (currently) three MSRPM ranges,
and delete the MSR_INVALID macro now that all users are gone.
Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Access the MSRPM using u32/4-byte chunks (and appropriately adjusted
offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN.
The only reason to batch accesses to MSRPMs is to avoid the overhead of
uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's
bitmap pointed at by vmcb12. For all other uses, either per-bit accesses
are more than fast enough (no uaccess), or KVM is only accessing a single
bit (nested_svm_exit_handled_msr()) and so there's nothing to batch.
In addition to (hopefully) documenting the uniqueness of the merging code,
restricting chunked access to _just_ the merging code will allow for
increasing the chunk size (to unsigned long) with minimal risk.
Link: https://lore.kernel.org/r/20250610225737.156318-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against
directly accessing the bitmaps outside of code that is explicitly written
to access the bitmaps with a specific type.
Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of
open coding an equivalent.
Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the
u32-index offsets is nested virtualization specific.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that msr_write_intercepted() defaults to true, i.e. accurately reflects
hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an
out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset()
check that guarded against calling msr_write_intercepted() with a "bad"
index.
Opportunistically clean up the helper's formatting.
Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Merge svm_recalc_intercepts_after_set_cpuid() and
svm_recalc_instruction_intercepts() such that the "after set CPUID" helper
simply invokes the type-specific helpers (MSRs vs. instructions), i.e.
make svm_recalc_intercepts_after_set_cpuid() a single entry point for all
intercept updates that need to be performed after a CPUID change.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there
is only the one caller (and because the "init" misnomer is even more
misleading than it was in the past).
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid()
to more precisely describe its role. Strictly speaking, the name isn't
perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an
intercept, but practically speaking it's close enough.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename msr_filter_changed() to recalc_msr_intercepts() and drop the
trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc
helper to react to the new userspace filter.
No functional change intended.
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
On a userspace MSR filter change, recalculate all MSR intercepts using the
filter-agnostic logic instead of maintaining a "shadow copy" of KVM's
desired intercepts. The shadow bitmaps add yet another point of failure,
are confusing (e.g. what does "handled specially" mean!?!?), an eyesore,
and a maintenance burden.
Given that KVM *must* be able to recalculate the correct intercepts at any
given time, and that MSR filter updates are not hot paths, there is zero
benefit to maintaining the shadow bitmaps.
Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as
appropriate.
Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
On a userspace MSR filter change, recalculate all MSR intercepts using the
filter-agnostic logic instead of maintaining a "shadow copy" of KVM's
desired intercepts. The shadow bitmaps add yet another point of failure,
are confusing (e.g. what does "handled specially" mean!?!?), an eyesore,
and a maintenance burden.
Given that KVM *must* be able to recalculate the correct intercepts at any
given time, and that MSR filter updates are not hot paths, there is zero
benefit to maintaining the shadow bitmaps.
Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as
appropriate.
Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Borislav Petkov <bp@alien8.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Dedup the definition of X2APIC_MSR and put it in the local APIC code
where it belongs.
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the "always" flag from the array of possible passthrough MSRs, and
instead manually initialize the permissions for the handful of MSRs that
KVM passes through by default. In addition to cutting down on boilerplate
copy+paste code and eliminating a misleading flag (the MSRs aren't always
passed through, e.g. thanks to MSR filters), this will allow for removing
the direct_access_msrs array entirely.
Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable interception of the GHCB MSR if and only if the VM is an SEV-ES
guest. While the exact behavior is completely undocumented in the APM,
common sense and testing on SEV-ES capable CPUs says that accesses to the
GHCB from non-SEV-ES guests will #GP. I.e. from the guest's perspective,
no functional change intended.
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add and use SVM MSR interception APIs (in most paths) to match VMX's
APIs and nomenclature. Specifically, add SVM variants of:
vmx_disable_intercept_for_msr(vcpu, msr, type)
vmx_enable_intercept_for_msr(vcpu, msr, type)
vmx_set_intercept_for_msr(vcpu, msr, type, intercept)
to eventually replace SVM's single helper:
set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write)
which is awkward to use (in all cases, KVM either applies the same logic
for both reads and writes, or intercepts one of read or write), and is
unintuitive due to using '0' to indicate interception should be *set*.
Keep the guts of the old API for the moment to avoid churning the MSR
filter code, as that mess will be overhauled in the near future. Leave
behind a temporary comment to call out that the shadow bitmaps have
inverted polarity relative to the bitmaps consumed by hardware.
No functional change intended.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add macro-built helpers for testing, setting, and clearing MSRPM entries
without relying on precomputed offsets. This sets the stage for eventually
removing general KVM use of precomputed offsets, which are quite confusing
and rather inefficient for the vast majority of KVM's usage.
Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed
offsets and accesses is at best unnecessary, and at worst introduces extra
operations to retrieve the individual bit from within the offset u32 value.
And simply calling them "offsets" is very confusing, as the "unit" of the
offset isn't immediately obvious.
Use the new helpers in set_msr_interception_bitmap() and
msr_write_intercepted() to verify the math and operations, but keep the
existing offset-based logic in set_msr_interception_bitmap() to sanity
check the "clear" and "set" operations. Manipulating MSR interceptions
isn't a hot path and no kernel release is ever expected to contain this
specific version of set_msr_interception_bitmap() (it will be removed
entirely in the near future).
Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough"
MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge
vmcb01 with vmcb12 and write the result to vmcb02. This will eventually
allow for the removal of svm_vcpu_init_msrpm().
Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default
to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the
worst case scenario from dropping the call to svm_vcpu_init_msrpm() should
be that KVM would fail to passthrough MSRs to L2.
Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only
for SEV-ES guests. KVM doesn't support nested virtualization for SEV-ES,
and likely never will.
Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to
merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually
allow for the removal of direct_access_msrs, as the only path where
tracking the offsets is truly justified is the merge for nested SVM, where
merging in chunks is an easy way to batch uaccess reads/writes.
Opportunistically omit the x2APIC MSRs from the merge-specific array
instead of filtering them out at runtime.
Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and
TSC_AUX is mutually exclusive with nested virtualization, as KVM passes
through those MSRs only for SEV-ES guests, and KVM doesn't support nested
virtualization for SEV+ guests. Defer removing those MSRs to a future
cleanup in order to make this refactoring as benign as possible.
Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding
helpers that are available to SVM code, and opportunistically replace a
variety of open-coded literals with (hopefully) informative macros.
Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping
it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified
with "SVM_MSRPM".
Deliberately leave the ranges as open coded literals, as using macros to
define the ranges actually introduces more potential failure points, since
both the definitions and the usage have to be careful to use the correct
index. The lack of clear intent behind the ranges will be addressed in
future patches.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better
capture its role, and opportunistically feed it @vcpu instead of @svm, as
grabbing "svm" only to turn around and grab svm->vcpu is rather silly.
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores),
as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is
held.
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN and kill the VM instead of panicking the host if KVM attempts to set
or query MSR interception for an unsupported MSR. Accessing the MSR
interception bitmaps only meaningfully affects post-VMRUN behavior, and
KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN,
i.e. there is no need to panic the entire host.
Opportunistically move the sanity checks about their use to index into the
MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to
doing that _and_ generating an out-of-bounds load.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the unnecessary and dangerous value-terminated behavior of
direct_access_msrs, and simply iterate over the actual size of the array.
The use in svm_set_x2apic_msr_interception() is especially sketchy, as it
relies on unused capacity being zero-initialized, and '0' being outside
the range of x2APIC MSRs.
To ensure the array and shadow_msr_intercept stay synchronized, simply
assert that their sizes are identical (note the six 64-bit-only MSRs).
Note, direct_access_msrs will soon be removed entirely; keeping the assert
synchronized with the array isn't expected to be along-term maintenance
burden.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used
only during hardware setup to map potential passthrough MSRs to offsets in
the bitmap.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN and reject module loading if there is a problem with KVM's MSR
interception bitmaps. Panicking the host in this situation is inexcusable
since it is trivially easy to propagate the error up the stack.
Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allocate pages for the IOPM after initial setup has been completed in
svm_hardware_setup(), so that sanity checks can be added in the setup flow
without needing to free the IOPM pages. The IOPM is only referenced (via
iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need
to allocate it early on.
No functional change intended (beyond the obvious ordering differences,
e.g. if the allocation fails).
Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context
switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's
CPUID model. Letting the guest access SPEC_CTRL is generally benign, but
the guest would see inconsistent behavior if KVM happened to emulate an
access to the MSR.
Fixes: d00b99c514b3 ("KVM: SVM: Add support for Virtual SPEC_CTRL")
Reported-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the
host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting
while running the guest. When running with the "default treatment of SMIs"
in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that
is visible to host (non-SMM) software, and instead transitions directly
from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched
by hardware on SMI or RSM, i.e. SMM will run with whatever value was
resident in hardware at the time of the SMI.
Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting
events while the CPU is executing in SMM, which can pollute profiling and
potentially leak information into the guest.
Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner
run loop, as the bit can be toggled in IRQ context via IPI callback (SMP
function call), by way of /sys/devices/cpu/freeze_on_smi.
Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be
preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs,
i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and
at worst could lead to undesirable behavior in the future if AMD CPUs ever
happened to pick up a collision with the bit.
Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module
owns and controls GUEST_IA32_DEBUGCTL.
WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the
lack of handling isn't a KVM bug (TDX already WARNs on any run_flag).
Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed
by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state().
Doing so avoids the need to track host_debugctl on a per-VMCS basis, as
GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and
load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't
have actually entered the guest, vcpu_enter_guest() will have run with
vmcs02 active and thus could result in vmcs01 being run with a stale value.
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to
vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into
GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state
into the guest, and without needing to copy+paste the FREEZE_IN_SMM
logic into every patch that accesses GUEST_IA32_DEBUGCTL.
No functional change intended.
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: massage changelog, make inline, use in all prepare_vmcs02() cases]
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports
a subset of hardware functionality, i.e. KVM can't rely on hardware to
detect illegal/unsupported values. Failure to check the vmcs12 value
would allow the guest to load any harware-supported value while running L2.
Take care to exempt BTF and LBR from the validity check in order to match
KVM's behavior for writes via WRMSR, but without clobbering vmcs12. Even
if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect
that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR
are being intercepted.
Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set
*and* writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but
that would incur non-trivial complexity and wouldn't change the fact that
KVM's handling of DEBUGCTL is blatantly broken. I.e. the extra complexity
is not worth carrying.
Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move VMX's logic to check DEBUGCTL values into a standalone helper so that
the code can be used by nested VM-Enter to apply the same logic to the
value being loaded from vmcs12.
KVM needs to explicitly check vmcs12->guest_ia32_debugctl on nested
VM-Enter, as hardware may support features that KVM does not, i.e. relying
on hardware to detect invalid guest state will result in false negatives.
Unfortunately, that means applying KVM's funky suppression of BTF and LBR
to vmcs12 so as not to break existing guests.
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the
guest CPUID model, as debug support is supposed to be available if RTM is
supported, and there are no known downsides to letting the guest debug RTM
aborts.
Note, there are no known bug reports related to RTM_DEBUG, the primary
motivation is to reduce the probability of breaking existing guests when a
future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL
(KVM currently lets L2 run with whatever hardware supports; whoops).
Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to
DR7.RTM.
Fixes: 83c529151ab0 ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|