linux.git - Linus' kernel tree

Age	Commit message	Author
2025-07-09	KVM: selftests: Convert arch_timer tests to common helpers to pin task	Sean Christopherson
	Convert the arch timer tests to use __pin_task_to_cpu() and pin_self_to_cpu(). No functional change intended. Link: https://lore.kernel.org/r/20250626001225.744268-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09	KVM: selftests: Test behavior of KVM_X86_DISABLE_EXITS_APERFMPERF	Jim Mattson
	For a VCPU thread pinned to a single LPU, verify that interleaved host and guest reads of IA32_[AM]PERF return strictly increasing values when APERFMPERF exiting is disabled. Run the test in both L1 and L2 to verify that KVM passes through the APERF and MPERF MSRs when L1 doesn't want to intercept them (or any MSRs). Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-4-jmattson@google.com Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250626001225.744268-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09	KVM: selftests: Expand set of APIs for pinning tasks to a single CPU	Sean Christopherson
	Expand kvm_pin_this_task_to_pcpu() into a set of APIs to allow pinning a task (or self) to a CPU (any or specific). This will allow deduplicating code throughout a variety of selftests. Opportunistically use "self" instead of "this_task" as it is both more concise and less ambiguous. Link: https://lore.kernel.org/r/20250626001225.744268-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09	KVM: x86: Provide a capability to disable APERF/MPERF read intercepts	Jim Mattson
	Allow a guest to read the physical IA32_APERF and IA32_MPERF MSRs without interception. The IA32_APERF and IA32_MPERF MSRs are not virtualized. Writes are not handled at all. The MSR values are not zeroed on vCPU creation, saved on suspend, or restored on resume. No accommodation is made for processor migration or for sharing a logical processor with other tasks. No adjustments are made for non-unit TSC multipliers. The MSRs do not account for time the same way as the comparable PMU events, whether the PMU is virtualized by the traditional emulation method or the new mediated pass-through approach. Nonetheless, in a properly constrained environment, this capability can be combined with a guest CPUID table that advertises support for CPUID.6:ECX.APERFMPERF[bit 0] to induce a Linux guest to report the effective physical CPU frequency in /proc/cpuinfo. Moreover, there is no performance cost for this capability. Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-3-jmattson@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250626001225.744268-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
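For context, the effective frequency derived from these MSRs is conventionally the base frequency scaled by the ratio of the APERF delta to the MPERF delta over a sampling window. The following is a minimal userspace sketch of that arithmetic only. `effective_khz()` and its parameters are hypothetical names, not part of KVM or of this patch, and the counter samples are assumed to have been read elsewhere.

```c
#include <stdint.h>

/*
 * Hypothetical helper: estimate the effective frequency over a sampling
 * window from two APERF/MPERF samples.  Unsigned subtraction yields the
 * correct delta even if a counter wrapped once between the samples.
 * Returns 0 if MPERF did not advance, i.e. the ratio is undefined.
 */
static uint64_t effective_khz(uint64_t base_khz,
			      uint64_t aperf_prev, uint64_t aperf_now,
			      uint64_t mperf_prev, uint64_t mperf_now)
{
	uint64_t d_aperf = aperf_now - aperf_prev;
	uint64_t d_mperf = mperf_now - mperf_prev;

	if (d_mperf == 0)
		return 0;

	/* Floating point avoids overflowing base_khz * d_aperf. */
	return (uint64_t)((double)base_khz *
			  ((double)d_aperf / (double)d_mperf));
}
```

As the changelog notes, the result is only meaningful in a constrained setup, e.g. a vCPU pinned to one logical processor that it does not share, with a unit TSC multiplier.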
2025-07-09	KVM: x86: Replace growing set of *_in_guest bools with a u64	Jim Mattson
	Store each "disabled exit" boolean in a single bit rather than a byte. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250530185239.2335185-2-jmattson@google.com Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250626001225.744268-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
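The bool-to-bit conversion can be illustrated with a small standalone sketch. The flag names, `struct vm_config`, and the helpers below are hypothetical and chosen only for illustration; they are not the identifiers used in KVM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags: one bit per "disabled exit" instead of one bool each. */
#define EXIT_DISABLE_MWAIT	(UINT64_C(1) << 0)
#define EXIT_DISABLE_HLT	(UINT64_C(1) << 1)
#define EXIT_DISABLE_PAUSE	(UINT64_C(1) << 2)
#define EXIT_DISABLE_CSTATE	(UINT64_C(1) << 3)

/* Hypothetical container: a single u64 replaces a growing set of bools. */
struct vm_config {
	uint64_t disabled_exits;
};

/* Record that the exits in @mask are disabled for this VM. */
static inline void vm_disable_exits(struct vm_config *vm, uint64_t mask)
{
	vm->disabled_exits |= mask;
}

/* Query a single flag; replaces reading a dedicated bool field. */
static inline bool vm_exit_disabled(const struct vm_config *vm, uint64_t flag)
{
	return (vm->disabled_exits & flag) != 0;
}
```

Adding a new disabled exit then costs one more bit definition rather than another struct member.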
2025-07-09	KVM: x86: Advertise support for LKGS	Xin Li
	Advertise support for LKGS (load into IA32_KERNEL_GS_BASE) to userspace if the instruction is supported by the underlying CPU. LKGS is introduced with FRED to completely eliminate the need to swapgs explicitly. It behaves like the MOV to GS instruction except that it loads the base address into the IA32_KERNEL_GS_BASE MSR instead of the GS segment’s descriptor cache, which is exactly what the Linux kernel does to load a user level GS base. Thus there is no need to SWAPGS away from the kernel GS base. LKGS is an independent CPU feature that works correctly in a KVM guest without requiring explicit enablement. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Link: https://lore.kernel.org/r/20250626173521.2301088-1-xin@zytor.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-07-09	KVM: VMX: Add a macro to track which DEBUGCTL bits are host-owned	Sean Christopherson
	Add VMX_HOST_OWNED_DEBUGCTL_BITS to track which bits are host-owned, i.e. need to be preserved when running the guest, to dedup the logic without having to incur a memory load to get at kvm_x86_ops.HOST_OWNED_DEBUGCTL. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/all/aF1yni8U6XNkyfRf@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25	KVM: SVM: Simplify MSR interception logic for IA32_XSS MSR	Chao Gao
	Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR interception, ensuring consistency with other cases where MSRs are intercepted depending on guest caps and CPUIDs. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86: Deduplicate MSR interception enabling and disabling	Chao Gao
	Extract a common function from MSR interception disabling logic and create disabling and enabling functions based on it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com [sean: s/enable/set, inline the wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Refactor handling of SIPI_RECEIVED when setting MP_STATE	Sean Christopherson
	Convert the incoming mp_state to INIT_RECEIVED instead of manually calling kvm_set_mp_state() to make it more obvious that the SIPI_RECEIVED logic is translating the incoming state to KVM's internal tracking, as opposed to being some entirely unique flow. Opportunistically add a comment to explain what the code is doing. No functional change intended. Link: https://lore.kernel.org/r/20250605195018.539901-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked check to KVM_RUN	Sean Christopherson
	Check for the should-be-impossible scenario of a vCPU being in Wait-For-SIPI with INIT/SIPI blocked during KVM_RUN instead of trying to detect and prevent illegal combinations in every ioctl that sets relevant state. Attempting to handle every possible "set" path is a losing game of whack-a-mole, and risks breaking userspace. E.g. INIT/SIPI are blocked on Intel if the vCPU is in VMX Root mode (post-VMXON), and on AMD if GIF=0. Handling those scenarios would require potentially breaking changes to {vmx,svm}_set_nested_state(). Moving the check to KVM_RUN fixes a syzkaller-induced splat due to the aforementioned VMXON case, and in theory should close the hole once and for all. Note, kvm_x86_vcpu_pre_run() already handles SIPI_RECEIVED, only the WFS case needs additional attention. Reported-by: syzbot+c1cbaedc2613058d5194@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?id=490ae63d8d89cb82c5d462d16962cf371df0e476 Link: https://lore.kernel.org/r/20250605195018.539901-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: WARN and reject KVM_RUN if vCPU's MP_STATE is SIPI_RECEIVED	Sean Christopherson
	WARN if KVM_RUN is reached with a vCPU's mp_state set to SIPI_RECEIVED, as KVM no longer uses SIPI_RECEIVED internally, and should morph SIPI_RECEIVED into INIT_RECEIVED with a pending SIPI if userspace forces SIPI_RECEIVED. See commit 66450a21f996 ("KVM: x86: Rework INIT and SIPI handling") for more history and details. Link: https://lore.kernel.org/r/20250605195018.539901-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Drop pending_smi vs. INIT_RECEIVED check when setting MP_STATE	Sean Christopherson
	Allow userspace to set a vCPU's mp_state to INIT_RECEIVED in conjunction with a pending SMI, as rejecting that combination could result in KVM disallowing reflecting the output from KVM_GET_VCPU_EVENTS back into KVM via KVM_SET_VCPU_EVENTS. At the time the check was added, smi_pending could only be set in the context of KVM_RUN, with the vCPU in the RUNNABLE state. I.e. it was impossible for KVM to save vCPU state such that userspace could see a pending SMI for a vCPU in WFS. That no longer holds true now that KVM processes requested SMIs during KVM_GET_VCPU_EVENTS, e.g. if a vCPU receives an SMI while in WFS, and then userspace saves vCPU state. Note, this may partially re-open the user-triggerable WARN that was mostly closed by commit 28bf28887976 ("KVM: x86: fix user triggerable warning in kvm_apic_accept_events()"), but that WARN can already be triggered in several other ways, e.g. if userspace stuffs VMXON=1 after putting the vCPU into WFS. That issue will be addressed in an upcoming commit, in a more robust fashion (hopefully). Fixes: 1f7becf1b7e2 ("KVM: x86: get smi pending status correctly") Link: https://lore.kernel.org/r/20250605195018.539901-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: selftests: Verify KVM disables interception (for userspace) on filter change	Sean Christopherson
	Re-read MSR_{FS,GS}_BASE after restoring the "allow everything" userspace MSR filter to verify that KVM stops forwarding exits to userspace. This can also be used in conjunction with manual verification (e.g. printk) to ensure KVM is correctly updating the MSR bitmaps consumed by hardware. Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Manali Shukla <Manali.Shukla@amd.com> Link: https://lore.kernel.org/r/20250610225737.156318-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Simplify userspace filter logic when disabling MSR interception	Sean Christopherson
	Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of userspace filters that disallow access to an MSR. The more complicated logic is no longer needed or justified now that KVM recalculates all MSR intercepts on a userspace MSR filter change, i.e. now that KVM doesn't need to also update shadow bitmaps. No functional change intended. Suggested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-32-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Add a helper to allocate and initialize permissions bitmaps	Sean Christopherson
	Add a helper to allocate and initialize an MSR or I/O permissions map, as the logic is identical between the two map types, the only difference is the size of the bitmap. Opportunistically add a comment to explain why the bitmaps are initialized with 0xff, e.g. instead of the more common zero-initialized behavior, which is the main motivation for deduplicating the code. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
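A minimal userspace sketch of such a shared helper is shown below, assuming that a set bit means "intercept" so that filling the map with 0xff intercepts everything by default. `alloc_permissions_map()` and the two size constants are hypothetical stand-ins for illustration, and `malloc()` substitutes for the kernel's page allocation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed sizes for illustration: 8 KiB MSR map, 12 KiB I/O map. */
#define MSRPM_SIZE	(2 * 4096)
#define IOPM_SIZE	(3 * 4096)

/*
 * Hypothetical helper: allocate a permissions map and set every bit, so
 * that all accesses are intercepted until a caller explicitly clears the
 * bits for whatever should be passed through.  Only the size differs
 * between the MSR and I/O variants.
 */
static void *alloc_permissions_map(size_t size)
{
	void *map = malloc(size);

	if (map)
		memset(map, 0xff, size);
	return map;
}
```

Defaulting to "intercept" means a forgotten initialization fails safe: the worst case is an unnecessary exit rather than an unintended passthrough.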
2025-06-20	KVM: nSVM: Merge MSRPM in 64-bit chunks on 64-bit kernels	Sean Christopherson
	When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access the bitmaps using "unsigned long" chunks, i.e. use 8-byte access for 64-bit kernels instead of arbitrarily working on 4-byte chunks. Opportunistically rename local variables in nested_svm_merge_msrpm() to more precisely/accurately reflect their purpose ("offset" in particular is extremely ambiguous). Link: https://lore.kernel.org/r/20250610225737.156318-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
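The chunked merge can be sketched in plain C as follows. This is an illustration under stated assumptions, not the body of nested_svm_merge_msrpm(): `merge_permission_maps()` is a hypothetical name, both maps are assumed to already reside in ordinary memory (the real code has to read L1's map from guest memory), a set bit is assumed to mean "intercept", and the length is assumed to be a multiple of sizeof(unsigned long).

```c
#include <stddef.h>

/*
 * Hypothetical helper: merge two permission maps one "unsigned long" at a
 * time, i.e. 8 bytes per iteration on a 64-bit build and 4 bytes on a
 * 32-bit build.  An access is intercepted in the result if either input
 * map intercepts it, hence the bitwise OR.
 */
static void merge_permission_maps(unsigned long *dst,
				  const unsigned long *l0,
				  const unsigned long *l1,
				  size_t nr_bytes)
{
	size_t nr_chunks = nr_bytes / sizeof(unsigned long);
	size_t i;

	for (i = 0; i < nr_chunks; i++)
		dst[i] = l0[i] | l1[i];
}
```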
2025-06-20	KVM: SVM: Return -EINVAL instead of MSR_INVALID to signal out-of-range MSR	Sean Christopherson
	Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate that the MSR isn't covered by one of the (currently) three MSRPM ranges, and delete the MSR_INVALID macro now that all users are gone. Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
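The idea of returning a negative errno for an uncovered MSR can be sketched as below. This is not the kernel's svm_msrpm_bit_nr(); `msrpm_bit_nr()` and the constants are hypothetical, and the layout (three ranges of 8192 MSRs starting at 0x0, 0xc0000000 and 0xc0010000, two bits per MSR) is stated here as an assumption about the architectural format.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MSRPM_BITS_PER_MSR	2	/* assumed: one read bit, one write bit */
#define MSRPM_MSRS_PER_RANGE	0x2000u	/* assumed: 8192 MSRs per range */

/* Assumed base MSR index of each covered range. */
static const uint32_t msrpm_range_base[] = {
	0x00000000u, 0xc0000000u, 0xc0010000u,
};

/*
 * Hypothetical helper: return the index of the first (read) bit for @msr,
 * or -EINVAL if no range covers it.  A negative return cannot collide
 * with a valid bit number, so no MSR_INVALID-style sentinel is needed.
 */
static int msrpm_bit_nr(uint32_t msr)
{
	size_t nr_ranges = sizeof(msrpm_range_base) / sizeof(msrpm_range_base[0]);
	size_t i;

	for (i = 0; i < nr_ranges; i++) {
		uint32_t base = msrpm_range_base[i];

		if (msr >= base && msr - base < MSRPM_MSRS_PER_RANGE)
			return (int)((i * MSRPM_MSRS_PER_RANGE + (msr - base)) *
				     MSRPM_BITS_PER_MSR);
	}
	return -EINVAL;
}
```

Callers can then test for `< 0` and treat the MSR as always intercepted.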
2025-06-20	KVM: nSVM: Access MSRPM in 4-byte chunks only for merging L0 and L1 bitmaps	Sean Christopherson
	Access the MSRPM using u32/4-byte chunks (and appropriately adjusted offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN. The only reason to batch accesses to MSRPMs is to avoid the overhead of uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's bitmap pointed at by vmcb12. For all other uses, either per-bit accesses are more than fast enough (no uaccess), or KVM is only accessing a single bit (nested_svm_exit_handled_msr()) and so there's nothing to batch. In addition to (hopefully) documenting the uniqueness of the merging code, restricting chunked access to _just_ the merging code will allow for increasing the chunk size (to unsigned long) with minimal risk. Link: https://lore.kernel.org/r/20250610225737.156318-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Store MSRPM pointer as "void *" instead of "u32 *"	Sean Christopherson
	Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against directly accessing the bitmaps outside of code that is explicitly written to access the bitmaps with a specific type. Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of open coding an equivalent. Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Move svm_msrpm_offset() to nested.c	Sean Christopherson
	Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the u32-index offsets is nested virtualization specific. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Drop explicit check on MSRPM offset when emulating SEV-ES accesses	Sean Christopherson
	Now that msr_write_intercepted() defaults to true, i.e. accurately reflects hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset() check that guarded against calling msr_write_intercepted() with a "bad" index. Opportunistically clean up the helper's formatting. Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Merge "after set CPUID" intercept recalc helpers	Sean Christopherson
	Merge svm_recalc_intercepts_after_set_cpuid() and svm_recalc_instruction_intercepts() such that the "after set CPUID" helper simply invokes the type-specific helpers (MSRs vs. instructions), i.e. make svm_recalc_intercepts_after_set_cpuid() a single entry point for all intercept updates that need to be performed after a CPUID change. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller	Sean Christopherson
	Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there is only the one caller (and because the "init" misnomer is even more misleading than it was in the past). No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific	Sean Christopherson
	Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid() to more precisely describe its role. Strictly speaking, the name isn't perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an intercept, but practically speaking it's close enough. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts()	Sean Christopherson
	Rename msr_filter_changed() to recalc_msr_intercepts() and drop the trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc helper to react to the new userspace filter. No functional change intended. Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change	Sean Christopherson
	On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM must be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Francesco Lavra <francescolavra.fl@gmail.com> Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: VMX: Manually recalc all MSR intercepts on userspace MSR filter change	Sean Christopherson
	On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden. Given that KVM must be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps. Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate. Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local Cc: Borislav Petkov <bp@alien8.de> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Move definition of X2APIC_MSR() to lapic.h	Sean Christopherson
	Dedup the definition of X2APIC_MSR and put it in the local APIC code where it belongs. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Drop "always" flag from list of possible passthrough MSRs	Sean Christopherson
	Drop the "always" flag from the array of possible passthrough MSRs, and instead manually initialize the permissions for the handful of MSRs that KVM passes through by default. In addition to cutting down on boilerplate copy+paste code and eliminating a misleading flag (the MSRs aren't always passed through, e.g. thanks to MSR filters), this will allow for removing the direct_access_msrs array entirely. Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest	Sean Christopherson
	Disable interception of the GHCB MSR if and only if the VM is an SEV-ES guest. While the exact behavior is completely undocumented in the APM, common sense and testing on SEV-ES capable CPUs says that accesses to the GHCB from non-SEV-ES guests will #GP. I.e. from the guest's perspective, no functional change intended. Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading") Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Implement and adopt VMX style MSR intercepts APIs	Sean Christopherson
	Add and use SVM MSR interception APIs (in most paths) to match VMX's APIs and nomenclature. Specifically, add SVM variants of: vmx_disable_intercept_for_msr(vcpu, msr, type) vmx_enable_intercept_for_msr(vcpu, msr, type) vmx_set_intercept_for_msr(vcpu, msr, type, intercept) to eventually replace SVM's single helper: set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write) which is awkward to use (in all cases, KVM either applies the same logic for both reads and writes, or intercepts one of read or write), and is unintuitive due to using '0' to indicate interception should be set. Keep the guts of the old API for the moment to avoid churning the MSR filter code, as that mess will be overhauled in the near future. Leave behind a temporary comment to call out that the shadow bitmaps have inverted polarity relative to the bitmaps consumed by hardware. No functional change intended. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets	Sean Christopherson
	Add macro-built helpers for testing, setting, and clearing MSRPM entries without relying on precomputed offsets. This sets the stage for eventually removing general KVM use of precomputed offsets, which are quite confusing and rather inefficient for the vast majority of KVM's usage. Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed offsets and accesses is at best unnecessary, and at worst introduces extra operations to retrieve the individual bit from within the offset u32 value. And simply calling them "offsets" is very confusing, as the "unit" of the offset isn't immediately obvious. Use the new helpers in set_msr_interception_bitmap() and msr_write_intercepted() to verify the math and operations, but keep the existing offset-based logic in set_msr_interception_bitmap() to sanity check the "clear" and "set" operations. Manipulating MSR interceptions isn't a hot path and no kernel release is ever expected to contain this specific version of set_msr_interception_bitmap() (it will be removed entirely in the near future). Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough"	Sean Christopherson
	Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough" MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge vmcb01 with vmcb12 and write the result to vmcb02. This will eventually allow for the removal of svm_vcpu_init_msrpm(). Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm() (default to intercepting all MSRs), e.g. if there is a bug lurking elsewhere, the worst case scenario from dropping the call to svm_vcpu_init_msrpm() should be that KVM would fail to passthrough MSRs to L2. Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: nSVM: Omit SEV-ES specific passthrough MSRs from L0+L1 bitmap merge	Sean Christopherson
	Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only for SEV-ES guests. KVM doesn't support nested virtualization for SEV-ES, and likely never will. Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps	Sean Christopherson
	Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually allow for the removal of direct_access_msrs, as the only path where tracking the offsets is truly justified is the merge for nested SVM, where merging in chunks is an easy way to batch uaccess reads/writes. Opportunistically omit the x2APIC MSRs from the merge-specific array instead of filtering them out at runtime. Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and TSC_AUX is mutually exclusive with nested virtualization, as KVM passes through those MSRs only for SEV-ES guests, and KVM doesn't support nested virtualization for SEV+ guests. Defer removing those MSRs to a future cleanup in order to make this refactoring as benign as possible. Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Clean up macros related to architectural MSRPM definitions	Sean Christopherson
	Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding helpers that are available to SVM code, and opportunistically replace a variety of open-coded literals with (hopefully) informative macros. Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified with "SVM_MSRPM". Deliberately leave the ranges as open coded literals, as using macros to define the ranges actually introduces more potential failure points, since both the definitions and the usage have to be careful to use the correct index. The lack of clear intent behind the ranges will be addressed in future patches. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs	Sean Christopherson
	Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better capture its role, and opportunistically feed it @vcpu instead of @svm, as grabbing "svm" only to turn around and grab svm->vcpu is rather silly. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts	Sean Christopherson
	Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores), as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is held. Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Kill the VM instead of the host if MSR interception is buggy	Sean Christopherson
	WARN and kill the VM instead of panicking the host if KVM attempts to set or query MSR interception for an unsupported MSR. Accessing the MSR interception bitmaps only meaningfully affects post-VMRUN behavior, and KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN, i.e. there is no need to panic the entire host. Opportunistically move the sanity checks above their use to index into the MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to doing that _and_ generating an out-of-bounds load. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs	Sean Christopherson
	Drop the unnecessary and dangerous value-terminated behavior of direct_access_msrs, and simply iterate over the actual size of the array. The use in svm_set_x2apic_msr_interception() is especially sketchy, as it relies on unused capacity being zero-initialized, and '0' being outside the range of x2APIC MSRs. To ensure the array and shadow_msr_intercept stay synchronized, simply assert that their sizes are identical (note the six 64-bit-only MSRs). Note, direct_access_msrs will soon be removed entirely; keeping the assert synchronized with the array isn't expected to be a long-term maintenance burden. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Tag MSR bitmap initialization helpers with __init	Sean Christopherson
	Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used only during hardware setup to map potential passthrough MSRs to offsets in the bitmap. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails	Sean Christopherson
	WARN and reject module loading if there is a problem with KVM's MSR interception bitmaps. Panicking the host in this situation is inexcusable since it is trivially easy to propagate the error up the stack. Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup()	Sean Christopherson
	Allocate pages for the IOPM after initial setup has been completed in svm_hardware_setup(), so that sanity checks can be added in the setup flow without needing to free the IOPM pages. The IOPM is only referenced (via iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need to allocate it early on. No functional change intended (beyond the obvious ordering differences, e.g. if the allocation fails). Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest	Sean Christopherson
	Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's CPUID model. Letting the guest access SPEC_CTRL is generally benign, but the guest would see inconsistent behavior if KVM happened to emulate an access to the MSR. Fixes: d00b99c514b3 ("KVM: SVM: Add support for Virtual SPEC_CTRL") Reported-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest	Maxim Levitsky
	Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting while running the guest. When running with the "default treatment of SMIs" in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that is visible to host (non-SMM) software, and instead transition directly from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched by hardware on SMI or RSM, i.e. SMM will run with whatever value was resident in hardware at the time of the SMI. Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting events while the CPU is executing in SMM, which can pollute profiling and potentially leak information into the guest. Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner run loop, as the bit can be toggled in IRQ context via IPI callback (SMP function call), by way of /sys/devices/cpu/freeze_on_smi. Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs, i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and at worst could lead to undesirable behavior in the future if AMD CPUs ever happened to pick up a collision with the bit. Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module owns and controls GUEST_IA32_DEBUGCTL. WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the lack of handling isn't a KVM bug (TDX already WARNs on any run_flag). Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state(). Doing so avoids the need to track host_debugctl on a per-VMCS basis, as GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't have actually entered the guest, vcpu_enter_guest() will have run with vmcs02 active and thus could result in vmcs01 being run with a stale value. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
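	The pre-entry check described in the changelog above can be sketched as a small user-space model. This is not the kernel code: struct vcpu_model, pre_entry_run_flags(), the host_owned_mask field, and the numeric value of KVM_RUN_LOAD_DEBUGCTL are assumptions made for illustration. Only the flag name and the FREEZE_IN_SMM bit position (bit 14 of IA32_DEBUGCTL) come from the changelog and the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBUGCTLMSR_FREEZE_IN_SMM (1ULL << 14)

/* Hypothetical value; only the flag's name appears in the changelog. */
#define KVM_RUN_LOAD_DEBUGCTL (1u << 0)

/* Hypothetical stand-in for the per-vCPU state the check needs. */
struct vcpu_model {
	uint64_t host_debugctl;     /* host DEBUGCTL seen on the previous entry */
	uint64_t host_owned_mask;   /* bits to preserve; 0 where none are defined */
	bool guest_state_protected; /* true for protected (TDX-like) guests */
};

/*
 * Model of the check run before each entry into the inner run loop:
 * request a reload of the guest DEBUGCTL field only if a host-owned
 * bit changed since the last entry and the guest is not protected.
 */
static unsigned int pre_entry_run_flags(struct vcpu_model *vcpu,
					uint64_t hw_debugctl)
{
	unsigned int run_flags = 0;

	if (((hw_debugctl ^ vcpu->host_debugctl) & vcpu->host_owned_mask) &&
	    !vcpu->guest_state_protected)
		run_flags |= KVM_RUN_LOAD_DEBUGCTL;

	/* Always remember the latest host value for the next comparison. */
	vcpu->host_debugctl = hw_debugctl;
	return run_flags;
}
```

	Keeping the mask in per-vendor data rather than hard-coding FREEZE_IN_SMM mirrors the changelog's reasoning: common code only compares "host-owned" bits, so a vendor with no such bits supplies a mask of zero and never requests a reload.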
2025-06-20	KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs	Maxim Levitsky
	Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state into the guest, and without needing to copy+paste the FREEZE_IN_SMM logic into every patch that accesses GUEST_IA32_DEBUGCTL. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: massage changelog, make inline, use in all prepare_vmcs02() cases] Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
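	A rough model of why funneling every access through one getter/setter pair helps. The commit itself is a no-op refactor; the sketch shows the behavior the changelog says the wrappers "will allow", namely stuffing the host's FREEZE_IN_SMM into the field on write and hiding it on read. struct vmcs_model and both function names are hypothetical stand-ins, not the kernel's vmx_guest_debugctl_{read,write}().

```c
#include <stdint.h>

#define DEBUGCTLMSR_LBR           (1ULL << 0)
#define DEBUGCTLMSR_FREEZE_IN_SMM (1ULL << 14)

/* Hypothetical stand-in for the VMCS field plus the cached host value. */
struct vmcs_model {
	uint64_t guest_ia32_debugctl; /* models vmcs.GUEST_IA32_DEBUGCTL */
	uint64_t host_debugctl;       /* host DEBUGCTL sampled before VM-Enter */
};

/* Setter: the guest's value never controls FREEZE_IN_SMM, the host's does. */
static void guest_debugctl_write(struct vmcs_model *m, uint64_t val)
{
	val &= ~DEBUGCTLMSR_FREEZE_IN_SMM;
	val |= m->host_debugctl & DEBUGCTLMSR_FREEZE_IN_SMM;
	m->guest_ia32_debugctl = val;
}

/* Getter: strip the host-owned bit so it does not bleed into guest state. */
static uint64_t guest_debugctl_read(const struct vmcs_model *m)
{
	return m->guest_ia32_debugctl & ~DEBUGCTLMSR_FREEZE_IN_SMM;
}
```

	With a single choke point, callers keep passing and receiving the guest's view of the register, and the host-owned bit is handled in exactly two places.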
2025-06-20	KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter	Maxim Levitsky
	Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports a subset of hardware functionality, i.e. KVM can't rely on hardware to detect illegal/unsupported values. Failure to check the vmcs12 value would allow the guest to load any hardware-supported value while running L2. Take care to exempt BTF and LBR from the validity check in order to match KVM's behavior for writes via WRMSR, but without clobbering vmcs12. Even if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR are being intercepted. Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set and writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but that would incur non-trivial complexity and wouldn't change the fact that KVM's handling of DEBUGCTL is blatantly broken. I.e. the extra complexity is not worth carrying. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
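	The shape of such a check, with BTF and LBR exempted, might look like the sketch below. It is a simplified model under stated assumptions: debugctl_is_valid() and debugctl_effective() are hypothetical names, the supported mask is passed in rather than derived from vCPU state, and masking the loaded value (instead of editing the vmcs12 copy) is an assumed reading of "without clobbering vmcs12".

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBUGCTLMSR_LBR (1ULL << 0)
#define DEBUGCTLMSR_BTF (1ULL << 1)

/*
 * Reject any bit outside the supported set, except BTF and LBR, which are
 * tolerated even when unsupported so that existing guests keep working.
 */
static bool debugctl_is_valid(uint64_t val, uint64_t supported)
{
	uint64_t invalid = val & ~supported;

	invalid &= ~(DEBUGCTLMSR_BTF | DEBUGCTLMSR_LBR);
	return invalid == 0;
}

/*
 * Assumption for illustration: tolerated-but-unsupported bits are dropped
 * from the value that is actually loaded, while the caller's vmcs12 copy
 * is passed by value and therefore left untouched.
 */
static uint64_t debugctl_effective(uint64_t vmcs12_val, uint64_t supported)
{
	return vmcs12_val & supported;
}
```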
2025-06-20	KVM: VMX: Extract checking of guest's DEBUGCTL into helper	Sean Christopherson
	Move VMX's logic to check DEBUGCTL values into a standalone helper so that the code can be used by nested VM-Enter to apply the same logic to the value being loaded from vmcs12. KVM needs to explicitly check vmcs12->guest_ia32_debugctl on nested VM-Enter, as hardware may support features that KVM does not, i.e. relying on hardware to detect invalid guest state will result in false negatives. Unfortunately, that means applying KVM's funky suppression of BTF and LBR to vmcs12 so as not to break existing guests. No functional change intended. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250610232010.162191-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supported	Sean Christopherson
	Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the guest CPUID model, as debug support is supposed to be available if RTM is supported, and there are no known downsides to letting the guest debug RTM aborts. Note, there are no known bug reports related to RTM_DEBUG, the primary motivation is to reduce the probability of breaking existing guests when a future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL (KVM currently lets L2 run with whatever hardware supports; whoops). Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to DR7.RTM. Fixes: 83c529151ab0 ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
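	As a minimal illustration of the policy in the changelog above, the set of guest-writable DEBUGCTL bits can be modeled as a mask that gains RTM_DEBUG only when the guest CPUID model reports RTM. Both helper names are hypothetical, guest_has_rtm stands in for the real CPUID lookup, and the bit position (bit 15) reflects the architectural IA32_DEBUGCTL layout as I understand it.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBUGCTLMSR_RTM_DEBUG (1ULL << 15)

/* Extend the supported mask with RTM_DEBUG only if the guest has RTM. */
static uint64_t supported_debugctl(uint64_t base_supported, bool guest_has_rtm)
{
	if (guest_has_rtm)
		base_supported |= DEBUGCTLMSR_RTM_DEBUG;
	return base_supported;
}

/* A write is acceptable only if it sets no bit outside the supported mask. */
static bool guest_may_set(uint64_t val, uint64_t supported)
{
	return (val & ~supported) == 0;
}
```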