|
Add a helper to get the physical address of the AVIC backing page, both
to deduplicate code and to prepare for getting the address directly from
apic->regs, at which point it won't be all that obvious that the address
in question is what SVM calls the AVIC backing page.
No functional change intended.
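A minimal sketch of what such a helper could look like at this point in the series, assuming the page is still tracked via svm->avic_backing_page (the helper name and placement are illustrative):

  /* Sketch: centralize the page_to_phys() + SME tagging in one place. */
  static inline u64 avic_get_backing_page_address(struct vcpu_svm *svm)
  {
          return __sme_set(page_to_phys(svm->avic_backing_page));
  }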
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned
maximum theoretical physical address for x86-64 CPUs, as x86-64 is
currently defined (going beyond PA52 would require an entirely new paging
mode, which would arguably create a new, different architecture).
All usage in KVM masks the result of page_to_phys(), which on x86-64 is
guaranteed to be 4KiB aligned and a legal physical address; if either of
those requirements doesn't hold true, KVM has far bigger problems.
Drop masking the avic_backing_page with
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
keep the macro even though it's unused in functional code. It's a
distinct architectural define, and having the definition in software
helps visualize the layout of an entry. And to be hyper-paranoid about
MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
maximum supported physical address stays in bounds.
The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
SME's C-bit resides between bits 51:11; that holds true for current CPUs,
but isn't required by AMD's architecture:
  "In some implementations, the bit used may be a physical address bit"
Key word being "may".
Opportunistically use the GENMASK_ULL() version for
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
than a set of repeating Fs.
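For reference, a self-contained comparison of the two spellings (the GENMASK_ULL stand-in below mirrors the kernel macro; 51:12 is the backing-page field of a physical ID entry):

  #include <assert.h>
  #include <stdint.h>

  /* Local stand-in for the kernel's GENMASK_ULL(h, l): bits h..l set. */
  #define GENMASK_ULL(h, l) \
          (((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

  int main(void)
  {
          /* Readable form vs. the old "repeating Fs" style literal. */
          uint64_t backing_page_mask = GENMASK_ULL(51, 12);

          assert(backing_page_mask == 0x000FFFFFFFFFF000ULL);
          return 0;
  }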
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum
theoretical 4KiB-aligned physical address, i.e. is not novel in any way,
and its only usage is to mask the default APIC base, which is 4KiB aligned
and (obviously) a legal physical address.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete the IRTE link from the previous vCPU irrespective of the new
routing state, i.e. even if the IRTE won't be configured to post IRQs to a
vCPU. Whether or not the new route is postable has no bearing on the *old*
route. Failure to delete the link can result in KVM incorrectly updating
the IRTE, e.g. if the "old" vCPU is scheduled in/out.
Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete the previous per-vCPU IRTE link prior to modifying the IRTE. If
forcing the IRTE back to remapped mode fails, the IRQ is already broken;
keeping stale metadata won't change that, and the IOMMU should be
sufficiently paranoid to sanitize the IRTE when the IRQ is freed and
reallocated.
This will allow hoisting the vCPU tracking to common x86, which in turn
will allow most of the IRTE update code to be deduplicated.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track the IRTEs that are posting to an SVM vCPU via the associated irqfd
structure and GSI routing instead of dynamically allocating a separate
data structure. In addition to eliminating an atomic allocation, this
will allow hoisting much of the IRTE update logic to common x86.
Cc: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.
Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.
Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).
Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM currently returns -EINVAL when it attempts to create an SNP guest if
the SINGLE_SOCKET guest policy bit is set. The reason for this action is
that KVM would need specific support (SNP_ACTIVATE_EX command support) to
achieve this when running on a system with more than one socket. However,
the SEV firmware will make the proper check and return POLICY_FAILURE
during SNP_ACTIVATE if the single socket guest policy bit is set and the
system has more than one socket:
- System with one socket
  - Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
  - Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE succeeds
- System with more than one socket
  - Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
  - Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE fails with POLICY_FAILURE
Remove the check for the SINGLE_SOCKET policy bit from snp_launch_start()
and allow the firmware to perform the proper checking.
This does have the effect of allowing an SNP guest with the SINGLE_SOCKET
policy bit set to run on a single socket system, but fail when run on a
system with more than one socket. However, this should not affect existing
SNP guests as setting the SINGLE_SOCKET policy bit is not allowed today.
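In code terms, the change amounts to dropping a guard of roughly this shape from snp_launch_start(); the policy-mask macro name below is illustrative, and the SNP_ACTIVATE firmware check described above takes over:

  /* Hypothetical sketch of the removed guard; exact names may differ. */
  if (params.policy & SNP_POLICY_MASK_SINGLE_SOCKET)
          return -EINVAL;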
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/4c51018dd3e4f2c543935134d2c4f47076f109f6.1748553480.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM currently returns -EINVAL when it attempts to create an SNP guest if
the SMT guest policy bit is not set. However, there is no reason to check
this, as there is no specific support in KVM that is required to support
this. The SEV firmware will determine if SMT has been enabled or disabled
in the BIOS and process the policy in the proper way:
- SMT enabled in BIOS
  - Guest policy SMT == 0 ==> SNP_LAUNCH_START fails with POLICY_FAILURE
  - Guest policy SMT == 1 ==> SNP_LAUNCH_START succeeds
- SMT disabled in BIOS
  - Guest policy SMT == 0 ==> SNP_LAUNCH_START succeeds
  - Guest policy SMT == 1 ==> SNP_LAUNCH_START succeeds
Remove the check for the SMT policy bit from snp_launch_start() and allow
the firmware to perform the proper checking.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/71043abdd9ef23b6f98fffa9c5c6045ac3a50187.1748553480.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of
userspace filters that disallow access to an MSR. The more complicated
logic is no longer needed or justified now that KVM recalculates all MSR
intercepts on a userspace MSR filter change, i.e. now that KVM doesn't
need to also update shadow bitmaps.
No functional change intended.
Suggested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to allocate and initialize an MSR or I/O permissions map, as
the logic is identical between the two map types, the only difference is
the size of the bitmap. Opportunistically add a comment to explain why
the bitmaps are initialized with 0xff, e.g. instead of the more common
zero-initialized behavior, which is the main motivation for deduplicating
the code.
No functional change intended.
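An illustrative sketch of the shared allocation pattern (name and signature are approximate); the key point is the 0xff fill, i.e. "intercept everything" by default, with KVM then clearing bits only for accesses it wants to pass through:

  static void *svm_alloc_permissions_map(unsigned long size, gfp_t gfp)
  {
          struct page *pages = alloc_pages(gfp, get_order(size));

          if (!pages)
                  return NULL;

          /* Default to intercepting everything, unlike a zeroed allocation. */
          memset(page_address(pages), 0xff, size);
          return page_address(pages);
  }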
Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access
the bitmaps using "unsigned long" chunks, i.e. use 8-byte access for
64-bit kernels instead of arbitrarily working on 4-byte chunks.
Opportunistically rename local variables in nested_svm_merge_msrpm() to
more precisely/accurately reflect their purpose ("offset" in particular is
extremely ambiguous).
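Conceptually, the merge is a bitwise OR of the two intercept bitmaps done one native word at a time: a set bit means "intercept", so an MSR is passed through to L2 only if both L0 and L1 leave its bits clear. A self-contained illustration (sizes and values are toy numbers):

  #include <stddef.h>
  #include <stdio.h>

  #define MSRPM_LONGS 16  /* toy size; the real SVM MSRPM spans two 4KiB pages */

  /* Merge L0's and L1's permission bitmaps in unsigned long chunks. */
  static void merge_msrpm(unsigned long *dst, const unsigned long *l0,
                          const unsigned long *l1)
  {
          for (size_t i = 0; i < MSRPM_LONGS; i++)
                  dst[i] = l0[i] | l1[i];
  }

  int main(void)
  {
          unsigned long l0[MSRPM_LONGS] = { 0x3 };  /* L0 intercepts bits 0-1 */
          unsigned long l1[MSRPM_LONGS] = { 0x4 };  /* L1 intercepts bit 2 */
          unsigned long vmcb02[MSRPM_LONGS];

          merge_msrpm(vmcb02, l0, l1);
          printf("merged word 0: %#lx\n", vmcb02[0]);  /* prints 0x7 */
          return 0;
  }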
Link: https://lore.kernel.org/r/20250610225737.156318-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate
that the MSR isn't covered by one of the (currently) three MSRPM ranges,
and delete the MSR_INVALID macro now that all users are gone.
Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Access the MSRPM using u32/4-byte chunks (and appropriately adjusted
offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN.
The only reason to batch accesses to MSRPMs is to avoid the overhead of
uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's
bitmap pointed at by vmcb12. For all other uses, either per-bit accesses
are more than fast enough (no uaccess), or KVM is only accessing a single
bit (nested_svm_exit_handled_msr()) and so there's nothing to batch.
In addition to (hopefully) documenting the uniqueness of the merging code,
restricting chunked access to _just_ the merging code will allow for
increasing the chunk size (to unsigned long) with minimal risk.
Link: https://lore.kernel.org/r/20250610225737.156318-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against
directly accessing the bitmaps outside of code that is explicitly written
to access the bitmaps with a specific type.
Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of
open coding an equivalent.
Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the
u32-index offsets is nested virtualization specific.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that msr_write_intercepted() defaults to true, i.e. accurately reflects
hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an
out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset()
check that guarded against calling msr_write_intercepted() with a "bad"
index.
Opportunistically clean up the helper's formatting.
Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Merge svm_recalc_intercepts_after_set_cpuid() and
svm_recalc_instruction_intercepts() such that the "after set CPUID" helper
simply invokes the type-specific helpers (MSRs vs. instructions), i.e.
make svm_recalc_intercepts_after_set_cpuid() a single entry point for all
intercept updates that need to be performed after a CPUID change.
No functional change intended.
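The end result is a thin dispatcher along these lines (signatures approximate; treat as a sketch, not the exact final code):

  /* Single entry point for intercept recalculation after a CPUID update. */
  static void svm_recalc_intercepts_after_set_cpuid(struct kvm_vcpu *vcpu)
  {
          svm_recalc_instruction_intercepts(vcpu);
          svm_recalc_msr_intercepts(vcpu);
  }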
Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there
is only the one caller (and because the "init" misnomer is even more
misleading than it was in the past).
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid()
to more precisely describe its role. Strictly speaking, the name isn't
perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an
intercept, but practically speaking it's close enough.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename msr_filter_changed() to recalc_msr_intercepts() and drop the
trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc
helper to react to the new userspace filter.
No functional change intended.
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
On a userspace MSR filter change, recalculate all MSR intercepts using the
filter-agnostic logic instead of maintaining a "shadow copy" of KVM's
desired intercepts. The shadow bitmaps add yet another point of failure,
are confusing (e.g. what does "handled specially" mean!?!?), an eyesore,
and a maintenance burden.
Given that KVM *must* be able to recalculate the correct intercepts at any
given time, and that MSR filter updates are not hot paths, there is zero
benefit to maintaining the shadow bitmaps.
Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as
appropriate.
Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Dedup the definition of X2APIC_MSR and put it in the local APIC code
where it belongs.
No functional change intended.
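The shared macro simply maps an APIC MMIO register offset to its x2APIC MSR index (base 0x800, one MSR per 16-byte register). A self-contained check of the math, using local copies of the constants:

  #include <assert.h>

  #define APIC_BASE_MSR   0x800
  #define X2APIC_MSR(x)   (APIC_BASE_MSR + ((x) >> 4))

  int main(void)
  {
          assert(X2APIC_MSR(0x80) == 0x808);   /* TPR */
          assert(X2APIC_MSR(0x380) == 0x838);  /* TMICT (initial count) */
          return 0;
  }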
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the "always" flag from the array of possible passthrough MSRs, and
instead manually initialize the permissions for the handful of MSRs that
KVM passes through by default. In addition to cutting down on boilerplate
copy+paste code and eliminating a misleading flag (the MSRs aren't always
passed through, e.g. thanks to MSR filters), this will allow for removing
the direct_access_msrs array entirely.
Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable interception of the GHCB MSR if and only if the VM is an SEV-ES
guest. While the exact behavior is completely undocumented in the APM,
common sense and testing on SEV-ES capable CPUs says that accesses to the
GHCB from non-SEV-ES guests will #GP. I.e. from the guest's perspective,
no functional change intended.
Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add and use SVM MSR interception APIs (in most paths) to match VMX's
APIs and nomenclature. Specifically, add SVM variants of:
vmx_disable_intercept_for_msr(vcpu, msr, type)
vmx_enable_intercept_for_msr(vcpu, msr, type)
vmx_set_intercept_for_msr(vcpu, msr, type, intercept)
to eventually replace SVM's single helper:
set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write)
which is awkward to use (in all cases, KVM either applies the same logic
for both reads and writes, or intercepts one of read or write), and is
unintuitive due to using '0' to indicate interception should be *set*.
Keep the guts of the old API for the moment to avoid churning the MSR
filter code, as that mess will be overhauled in the near future. Leave
behind a temporary comment to call out that the shadow bitmaps have
inverted polarity relative to the bitmaps consumed by hardware.
No functional change intended.
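A sketch of how the combined setter plausibly wraps the enable/disable pair, mirroring the VMX helper of the same shape:

  /* Sketch only: dispatch on the desired interception state. */
  static inline void svm_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
                                               int type, bool intercept)
  {
          if (intercept)
                  svm_enable_intercept_for_msr(vcpu, msr, type);
          else
                  svm_disable_intercept_for_msr(vcpu, msr, type);
  }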
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add macro-built helpers for testing, setting, and clearing MSRPM entries
without relying on precomputed offsets. This sets the stage for eventually
removing general KVM use of precomputed offsets, which are quite confusing
and rather inefficient for the vast majority of KVM's usage.
Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed
offsets and accesses is at best unnecessary, and at worst introduces extra
operations to retrieve the individual bit from within the offset u32 value.
And simply calling them "offsets" is very confusing, as the "unit" of the
offset isn't immediately obvious.
Use the new helpers in set_msr_interception_bitmap() and
msr_write_intercepted() to verify the math and operations, but keep the
existing offset-based logic in set_msr_interception_bitmap() to sanity
check the "clear" and "set" operations. Manipulating MSR interceptions
isn't a hot path and no kernel release is ever expected to contain this
specific version of set_msr_interception_bitmap() (it will be removed
entirely in the near future).
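The underlying bit-number math can be illustrated with a self-contained program: each MSRPM range covers 8192 MSRs, every MSR owns two adjacent bits (read, then write), and the three architectural ranges are laid out back to back. Constants follow the APM layout; the helper name is illustrative:

  #include <stdint.h>
  #include <stdio.h>

  /* The three MSR ranges covered by the SVM MSR permissions map. */
  static const uint32_t msrpm_range_base[] = { 0x00000000, 0xc0000000, 0xc0010000 };

  #define MSRS_PER_RANGE  8192u   /* 0x2000 MSRs per range */
  #define BITS_PER_MSR    2u      /* one read bit + one write bit */

  /* Returns the bit number of the read-intercept bit, or -1 if unmapped. */
  static int msrpm_bit_nr(uint32_t msr)
  {
          for (unsigned int i = 0; i < 3; i++) {
                  uint32_t base = msrpm_range_base[i];

                  if (msr >= base && msr < base + MSRS_PER_RANGE)
                          return (i * MSRS_PER_RANGE + (msr - base)) * BITS_PER_MSR;
          }
          return -1;  /* not covered: hardware intercepts the MSR unconditionally */
  }

  int main(void)
  {
          printf("SYSENTER_CS (0x174)  -> read bit %d\n", msrpm_bit_nr(0x174));
          printf("EFER (0xc0000080)    -> read bit %d\n", msrpm_bit_nr(0xc0000080));
          printf("unmapped (0x2345678) -> %d\n", msrpm_bit_nr(0x2345678));
          return 0;
  }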
Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough"
MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge
vmcb01 with vmcb12 and write the result to vmcb02. This will eventually
allow for the removal of svm_vcpu_init_msrpm().
Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default
to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the
worst case scenario from dropping the call to svm_vcpu_init_msrpm() should
be that KVM would fail to passthrough MSRs to L2.
Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only
for SEV-ES guests. KVM doesn't support nested virtualization for SEV-ES,
and likely never will.
Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to
merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually
allow for the removal of direct_access_msrs, as the only path where
tracking the offsets is truly justified is the merge for nested SVM, where
merging in chunks is an easy way to batch uaccess reads/writes.
Opportunistically omit the x2APIC MSRs from the merge-specific array
instead of filtering them out at runtime.
Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and
TSC_AUX is mutually exclusive with nested virtualization, as KVM passes
through those MSRs only for SEV-ES guests, and KVM doesn't support nested
virtualization for SEV+ guests. Defer removing those MSRs to a future
cleanup in order to make this refactoring as benign as possible.
Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding
helpers that are available to SVM code, and opportunistically replace a
variety of open-coded literals with (hopefully) informative macros.
Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping
it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified
with "SVM_MSRPM".
Deliberately leave the ranges as open coded literals, as using macros to
define the ranges actually introduces more potential failure points, since
both the definitions and the usage have to be careful to use the correct
index. The lack of clear intent behind the ranges will be addressed in
future patches.
No functional change intended.
Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better
capture its role, and opportunistically feed it @vcpu instead of @svm, as
grabbing "svm" only to turn around and grab svm->vcpu is rather silly.
No functional change intended.
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores),
as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is
held.
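In practice that means the double-underscore bitop variants, e.g. (bit numbers and bitmap pointer are illustrative):

  /*
   * vcpu->mutex already serializes all writers of this per-vCPU bitmap, so
   * the non-atomic helpers suffice and are cheaper than set_bit()/clear_bit().
   */
  __set_bit(write_bit_nr, msrpm);    /* intercept writes */
  __clear_bit(read_bit_nr, msrpm);   /* pass through reads */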
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN and kill the VM instead of panicking the host if KVM attempts to set
or query MSR interception for an unsupported MSR. Accessing the MSR
interception bitmaps only meaningfully affects post-VMRUN behavior, and
KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN,
i.e. there is no need to panic the entire host.
Opportunistically move the sanity checks above their use to index into the
MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to
doing that _and_ generating an out-of-bounds load.
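A sketch of the resulting guard shape (variable names illustrative), assuming the bit-number helper reports out-of-range MSRs with a negative value:

  if (KVM_BUG_ON(bit_nr < 0, vcpu->kvm))
          return;  /* VM is marked bugged: no out-of-bounds access, no more VMRUN */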
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the unnecessary and dangerous value-terminated behavior of
direct_access_msrs, and simply iterate over the actual size of the array.
The use in svm_set_x2apic_msr_interception() is especially sketchy, as it
relies on unused capacity being zero-initialized, and '0' being outside
the range of x2APIC MSRs.
To ensure the array and shadow_msr_intercept stay synchronized, simply
assert that their sizes are identical (note the six 64-bit-only MSRs).
Note, direct_access_msrs will soon be removed entirely; keeping the assert
synchronized with the array isn't expected to be a long-term maintenance
burden.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used
only during hardware setup to map potential passthrough MSRs to offsets in
the bitmap.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
WARN and reject module loading if there is a problem with KVM's MSR
interception bitmaps. Panicking the host in this situation is inexcusable
since it is trivially easy to propagate the error up the stack.
Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allocate pages for the IOPM after initial setup has been completed in
svm_hardware_setup(), so that sanity checks can be added in the setup flow
without needing to free the IOPM pages. The IOPM is only referenced (via
iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need
to allocate it early on.
No functional change intended (beyond the obvious ordering differences,
e.g. if the allocation fails).
Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context
switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's
CPUID model. Letting the guest access SPEC_CTRL is generally benign, but
the guest would see inconsistent behavior if KVM happened to emulate an
access to the MSR.
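The fix boils down to making the passthrough conditional on the vCPU's CPUID, roughly as below; helper names approximate the per-vCPU API used elsewhere in this queue rather than the exact code in this patch:

  /* Only skip interception if the vCPU model actually has SPEC_CTRL. */
  if (boot_cpu_has(X86_FEATURE_V_SPEC_CTRL) &&
      guest_cpu_cap_has(vcpu, X86_FEATURE_SPEC_CTRL))
          svm_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);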
Fixes: d00b99c514b3 ("KVM: SVM: Add support for Virtual SPEC_CTRL")
Reported-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Instruct vendor code to load the guest's DR6 into hardware via a new
KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to
load vcpu->arch.dr6 into hardware when DR6 can be read/written directly
by the guest.
Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM
thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH
and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6.
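On the vendor side, consuming the flag is conceptually just the following (a sketch of the VMX path, which has no guest DR6 field in the VMCS and therefore loads the hardware register directly):

  if (run_flags & KVM_RUN_LOAD_GUEST_DR6)
          set_debugreg(vcpu->arch.dr6, 6);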
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter
into a generic bitmap so that similar "take action" information can be
passed to vendor code without creating a pile of boolean parameters.
This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and
will also allow for adding similar functionality for re-loading debugctl
in the active VMCS.
Opportunistically massage the TDX WARN and comment to prepare for adding
more run_flags, all of which are expected to be mutually exclusive with
TDX, i.e. should be WARNed on.
No functional change intended.
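The flag container is plausibly just a small set of bit values passed to kvm_x86_ops.vcpu_run(), e.g. (KVM_RUN_LOAD_GUEST_DR6 is the flag added by the follow-up patch described above):

  /* Sketch of the run_flags bits handed to vendor code on each VM-Enter. */
  enum kvm_x86_run_flags {
          KVM_RUN_FORCE_IMMEDIATE_EXIT = BIT(0),
          KVM_RUN_LOAD_GUEST_DR6       = BIT(1),
  };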
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add the required features detection glue to bugs.c et al. in order to
support the TSA mitigation.
Co-developed-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
|
|
During platform init, SNP initialization may fail for several reasons,
such as firmware command failures and incompatible versions. However,
the KVM capability may continue to advertise support for it.
The platform may have SNP enabled but if SNP_INIT fails then SNP is
not supported by KVM.
During KVM module initialization query the SNP platform status to obtain
the SNP initialization state and use it as an additional condition to
determine support for SEV-SNP.
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Pratik R. Sampat <prsampat@amd.com>
Signed-off-by: Pratik R. Sampat <prsampat@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Pavan Kumar Paluri <papaluri@amd.com>
Message-ID: <20250512221634.12045-1-Ashish.Kalra@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Pull more kvm updates from Paolo Bonzini:
"Generic:
- Clean up locking of all vCPUs for a VM by using the *_nest_lock()
family of functions, and move duplicated code to virt/kvm/. kernel/
patches acked by Peter Zijlstra
- Add MGLRU support to the access tracking perf test
ARM fixes:
- Make the irqbypass hooks resilient to changes in the GSI<->MSI
routing, avoiding stale vLPI mappings being left behind. The
fix is to resolve the VGIC IRQ using the host IRQ (which is stable)
and nuking the vLPI mapping upon a routing change
- Close another VGIC race where vCPU creation races with VGIC
creation, leading to in-flight vCPUs entering the kernel w/o
private IRQs allocated
- Fix a build issue triggered by the recently added workaround for
Ampere's AC04_CPU_23 erratum
- Correctly sign-extend the VA when emulating a TLBI instruction
potentially targeting a VNCR mapping
- Avoid dereferencing a NULL pointer in the VGIC debug code, which
can happen if the device doesn't have any mapping yet
s390:
- Fix interaction between some filesystems and Secure Execution
- Some cleanups and refactorings, preparing for an upcoming big
series
x86:
- Wait for target vCPU to ack KVM_REQ_UPDATE_PROTECTED_GUEST_STATE
to fix a race between AP destroy and VMRUN
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for
the VM
- Refine and harden handling of spurious faults
- Add support for ALLOWED_SEV_FEATURES
- Add #VMGEXIT to the set of handlers special cased for
CONFIG_RETPOLINE=y
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing
features that utilize those bits
- Don't account temporary allocations in sev_send_update_data()
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock
Threshold
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU
IBPB, between SVM and VMX
- Advertise support to userspace for WRMSRNS and PREFETCHI
- Rescan I/O APIC routes after handling EOI that needed to be
intercepted due to the old/previous routing, but not the
new/current routing
- Add a module param to control and enumerate support for device
posted interrupts
- Fix a potential overflow with nested virt on Intel systems running
32-bit kernels
- Flush shadow VMCSes on emergency reboot
- Add support for SNP to the various SEV selftests
- Add a selftest to verify fastops instructions via forced emulation
- Refine and optimize KVM's software processing of the posted
interrupt bitmap, and share the harvesting code between KVM and the
kernel's Posted MSI handler"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
rtmutex_api: provide correct extern functions
KVM: arm64: vgic-debug: Avoid dereferencing NULL ITE pointer
KVM: arm64: vgic-init: Plug vCPU vs. VGIC creation race
KVM: arm64: Unmap vLPIs affected by changes to GSI routing information
KVM: arm64: Resolve vLPI by host IRQ in vgic_v4_unset_forwarding()
KVM: arm64: Protect vLPI translation with vgic_irq::irq_lock
KVM: arm64: Use lock guard in vgic_v4_set_forwarding()
KVM: arm64: Mask out non-VA bits from TLBI VA* on VNCR invalidation
arm64: sysreg: Drag linux/kconfig.h to work around vdso build issue
KVM: s390: Simplify and move pv code
KVM: s390: Refactor and split some gmap helpers
KVM: s390: Remove unneeded srcu lock
s390: Remove unneeded includes
s390/uv: Improve splitting of large folios that cannot be split while dirty
s390/uv: Always return 0 from s390_wiggle_split_folio() if successful
s390/uv: Don't return 0 from make_hva_secure() if the operation was not successful
rust: add helper for mutex_trylock
RISC-V: KVM: use kvm_trylock_all_vcpus when locking all vCPUs
KVM: arm64: use kvm_trylock_all_vcpus when locking all vCPUs
x86: KVM: SVM: use kvm_lock_all_vcpus instead of a custom implementation
...
|
|
Pull kvm updates from Paolo Bonzini:
"As far as x86 goes this pull request "only" includes TDX host support.
Quotes are appropriate because (at 6k lines and 100+ commits) it is
much bigger than the rest, which will come later this week and
consists mostly of bugfixes and selftests. s390 changes will also come
in the second batch.
ARM:
- Add large stage-2 mapping (THP) support for non-protected guests
when pKVM is enabled, clawing back some performance.
- Enable nested virtualisation support on systems that support it,
though it is disabled by default.
- Add UBSAN support to the standalone EL2 object used in nVHE/hVHE
and protected modes.
- Large rework of the way KVM tracks architecture features and links
them with the effects of control bits. While this has no functional
impact, it ensures correctness of emulation (the data is
automatically extracted from the published JSON files), and helps
dealing with the evolution of the architecture.
- Significant changes to the way pKVM tracks ownership of pages,
avoiding page table walks by storing the state in the hypervisor's
vmemmap. This in turn enables the THP support described above.
- New selftest checking the pKVM ownership transition rules
- Fixes for FEAT_MTE_ASYNC being accidentally advertised to guests
even if the host didn't have it.
- Fixes for the address translation emulation, which happened to be
rather buggy in some specific contexts.
- Fixes for the PMU emulation in NV contexts, decoupling PMCR_EL0.N
from the number of counters exposed to a guest and addressing a
number of issues in the process.
- Add a new selftest for the SVE host state being corrupted by a
guest.
- Keep HCR_EL2.xMO set at all times for systems running with the
kernel at EL2, ensuring that the window for interrupts is slightly
bigger, and avoiding a pretty bad erratum on the AmpereOne HW.
- Add workaround for AmpereOne's erratum AC04_CPU_23, which suffers
from a pretty bad case of TLB corruption unless accesses to HCR_EL2
are heavily synchronised.
- Add a per-VM, per-ITS debugfs entry to dump the state of the ITS
tables in a human-friendly fashion.
- and the usual random cleanups.
LoongArch:
- Don't flush tlb if the host supports hardware page table walks.
- Add KVM selftests support.
RISC-V:
- Add vector registers to get-reg-list selftest
- VCPU reset related improvements
- Remove scounteren initialization from VCPU reset
- Support VCPU reset from userspace using set_mpstate() ioctl
x86:
- Initial support for TDX in KVM.
This finally makes it possible to use the TDX module to run
confidential guests on Intel processors. This is quite a large
series, including support for private page tables (managed by the
TDX module and mirrored in KVM for efficiency), forwarding some
TDVMCALLs to userspace, and handling several special VM exits from
the TDX module.
This has been in the works for literally years and it's not really
possible to describe everything here, so I'll defer to the various
merge commits up to and including commit 7bcf7246c42a ('Merge
branch 'kvm-tdx-finish-initial' into HEAD')"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (248 commits)
x86/tdx: mark tdh_vp_enter() as __flatten
Documentation: virt/kvm: remove unreferenced footnote
RISC-V: KVM: lock the correct mp_state during reset
KVM: arm64: Fix documentation for vgic_its_iter_next()
KVM: arm64: np-guest CMOs with PMD_SIZE fixmap
KVM: arm64: Stage-2 huge mappings for np-guests
KVM: arm64: Add a range to pkvm_mappings
KVM: arm64: Convert pkvm_mappings to interval tree
KVM: arm64: Add a range to __pkvm_host_test_clear_young_guest()
KVM: arm64: Add a range to __pkvm_host_wrprotect_guest()
KVM: arm64: Add a range to __pkvm_host_unshare_guest()
KVM: arm64: Add a range to __pkvm_host_share_guest()
KVM: arm64: Introduce for_each_hyp_page
KVM: arm64: Handle huge mappings for np-guest CMOs
KVM: arm64: nv: Release faulted-in VNCR page from mmu_lock critical section
KVM: arm64: nv: Handle TLBI S1E2 for VNCR invalidation with mmu_lock held
KVM: arm64: nv: Hold mmu_lock when invalidating VNCR SW-TLB before translating
RISC-V: KVM: add KVM_CAP_RISCV_MP_STATE_RESET
RISC-V: KVM: Remove scounteren initialization
KVM: RISC-V: remove unnecessary SBI reset state
...
|
|
Introduce new mutex locking functions mutex_trylock_nest_lock() and
mutex_lock_killable_nest_lock() and use them to clean up locking
of all vCPUs for a VM.
For x86, this removes some complex code that was used instead
of lockdep's "nest_lock" feature.
For ARM and RISC-V, this removes a lockdep warning when the VM is
configured to have more than MAX_LOCK_DEPTH vCPUs, and removes a fair
amount of duplicate code by sharing the logic across all architectures.
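A condensed sketch of the shared helper this enables, based on the description above (treat it as a sketch, not the exact upstream code):

  /*
   * Take every vcpu->mutex under kvm->lock, telling lockdep that kvm->lock is
   * the "nest lock" so the per-vCPU mutexes neither count against
   * MAX_LOCK_DEPTH nor need per-VM subclasses.
   */
  int kvm_lock_all_vcpus(struct kvm *kvm)
  {
          struct kvm_vcpu *vcpu;
          unsigned long i, j;

          lockdep_assert_held(&kvm->lock);

          kvm_for_each_vcpu(i, vcpu, kvm) {
                  if (mutex_lock_killable_nest_lock(&vcpu->mutex, &kvm->lock))
                          goto out_unlock;
          }
          return 0;

  out_unlock:
          kvm_for_each_vcpu(j, vcpu, kvm) {
                  if (j == i)
                          break;
                  mutex_unlock(&vcpu->mutex);
          }
          return -EINTR;
  }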
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use kvm_lock_all_vcpus instead of sev's own implementation.
Because kvm_lock_all_vcpus uses the _nest_lock feature of lockdep, which
ignores subclasses, there is no longer a need to use separate subclasses
for source and target VMs.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Message-ID: <20250512180407.659015-5-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM SVM changes for 6.16:
- Wait for target vCPU to acknowledge KVM_REQ_UPDATE_PROTECTED_GUEST_STATE to
fix a race between AP destroy and VMRUN.
- Decrypt and dump the VMSA in dump_vmcb() if debugging enabled for the VM.
- Add support for ALLOWED_SEV_FEATURES.
- Add #VMGEXIT to the set of handlers special cased for CONFIG_RETPOLINE=y.
- Treat DEBUGCTL[5:2] as reserved to pave the way for virtualizing features
that utilize those bits.
- Don't account temporary allocations in sev_send_update_data().
- Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM, via Bus Lock Threshold.
|
|
KVM x86 misc changes for 6.16:
- Unify virtualization of IBRS on nested VM-Exit, and cross-vCPU IBPB, between
SVM and VMX.
- Advertise support to userspace for WRMSRNS and PREFETCHI.
- Rescan I/O APIC routes after handling EOI that needed to be intercepted due
to the old/previous routing, but not the new/current routing.
- Add a module param to control and enumerate support for device posted
interrupts.
- Misc cleanups.
|