path: root/arch

2025-06-20  KVM: x86: Move KVM_{GET,SET}_IRQCHIP ioctl helpers to irq.c  (Sean Christopherson)

Move the ioctl helpers for getting/setting fully in-kernel IRQ chip state to irq.c, partly to trim down x86.c, but mostly in preparation for adding a Kconfig to control support for in-kernel I/O APIC, PIC, and PIT emulation.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Move PIT ioctl helpers to i8254.c  (Sean Christopherson)

Move the PIT ioctl helpers to i8254.c, i.e. to the file that implements PIT emulation. Eliminating PIT code in x86.c will allow adding a Kconfig to control support for in-kernel I/O APIC, PIC, and PIT emulation with minimal #ifdefs.

Opportunistically make kvm_pit_set_reinject() and kvm_pit_load_count() local to i8254.c as they were only publicly visible to make them available to the ioctl helpers.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Drop superfluous kvm_hv_set_sint() => kvm_hv_synic_set_irq() wrapper  (Sean Christopherson)

Drop the superfluous kvm_hv_set_sint() and instead wire up ->set() directly to its final destination, kvm_hv_synic_set_irq().

Keep kvm_hv_synic_set_irq() instead of kvm_hv_set_sint() to provide some amount of consistency in the ->set() helpers, e.g. to match kvm_pic_set_irq() and kvm_ioapic_set_irq(). kvm_set_msi() is arguably the oddball, e.g. kvm_set_msi_irq() should be something like kvm_msi_to_lapic_irq() so that kvm_set_msi() can instead be kvm_set_msi_irq(), but that's a future problem to solve.

No functional change intended.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Drop superfluous kvm_set_ioapic_irq() => kvm_ioapic_set_irq() wrapper  (Sean Christopherson)

Drop the superfluous and confusing kvm_set_ioapic_irq() and instead wire up ->set() directly to its final destination.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Drop superfluous kvm_set_pic_irq() => kvm_pic_set_irq() wrapper  (Sean Christopherson)

Drop the superfluous and confusing kvm_set_pic_irq() => kvm_pic_set_irq() wrapper, and instead wire up ->set() directly to its final destination.

Opportunistically move the declaration of kvm_pic_set_irq() to irq.h to start gathering more of the in-kernel APIC/IO-APIC logic in irq.{c,h}.

No functional change intended.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()  (Sean Christopherson)

Trigger the I/O APIC route rescan that's performed for a split IRQ chip after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e. before dropping kvm->irq_lock. Calling kvm_make_all_cpus_request() under a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair in __kvm_make_request()+kvm_check_request() ensures the new routing is visible to vCPUs prior to the request being visible to vCPUs.

In all likelihood, commit b053b2aef25d ("KVM: x86: Add EOI exit bitmap inference") somewhat arbitrarily made the request outside of irq_lock to avoid holding irq_lock any longer than is strictly necessary. And then commit abdb080f7ac8 ("kvm/irqchip: kvm_arch_irq_routing_update renaming split") took the easy route of adding another arch hook instead of risking a functional change.

Note, the call to synchronize_srcu_expedited() does NOT provide ordering guarantees with respect to vCPUs scanning the new routing; as above, the request infrastructure provides the necessary ordering. I.e. there's no need to wait for kvm_scan_ioapic_routes() to complete if it's actively running, because regardless of whether it grabs the old or new table, the vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again and see the new mappings.

Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

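[Editor's sketch] To make the ordering argument concrete, a simplified view of the writer/reader pairing, condensed from KVM's request infrastructure (not the literal patch):

  /* Writer: kvm_arch_irq_routing_update(), kvm->irq_lock held. */
  rcu_assign_pointer(kvm->irq_routing, new);            /* publish new table */
  kvm_make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);  /* smp_wmb() inside */

  /* Reader: vCPU side, in vcpu_enter_guest(). */
  if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))     /* smp_mb__after_atomic() */
          vcpu_scan_ioapic(vcpu);  /* guaranteed to observe the new table */
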
2025-06-20  irqbypass: Take ownership of producer/consumer token tracking  (Sean Christopherson)

Move ownership of IRQ bypass token tracking into irqbypass.ko, and explicitly require callers to pass an eventfd_ctx structure instead of a completely opaque token. Relying on producers and consumers to set the token appropriately is error prone, and hiding the fact that the token must be an eventfd_ctx pointer (for all intents and purposes) unnecessarily obfuscates the code and makes it more brittle.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: arm64: WARN if unmapping a vLPI fails in any path  (Sean Christopherson)

When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if failure occurs when freeing an ITE. If undoing vCPU affinity fails, then odds are very good that vLPI state tracking has gotten out of whack, i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI. At best, inconsistent state means there is a lurking bug/flaw somewhere. At worst, the inconsistency could eventually be fatal to the host, e.g. if an ITS command fails because KVM's view of things doesn't match reality/hardware.

Note, only the call from kvm_arch_irq_bypass_del_producer() by way of kvm_vgic_v4_unset_forwarding() doesn't already WARN. Common KVM's kvm_irq_routing_update() WARNs if kvm_arch_update_irqfd_routing() fails. For that path, if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(), the only possible causes are that the GIC doesn't have a v4 ITS (from its_irq_set_vcpu_affinity()):

  /* Need a v4 ITS */
  if (!is_v4(its_dev->its))
          return -EINVAL;

  guard(raw_spinlock)(&its_dev->event_map.vlpi_lock);

  /* Unmap request? */
  if (!info)
          return its_vlpi_unmap(d);

or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()):

  if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d))
          return -EINVAL;

All of the above failure scenarios are warnable offences, as they should never occur absent a kernel/KVM bug.

Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/all/aFWY2LTVIxz5rfhh@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Allow SNP guest policy to specify SINGLE_SOCKET  (Tom Lendacky)

KVM currently returns -EINVAL when it attempts to create an SNP guest if the SINGLE_SOCKET guest policy bit is set. The reason for this action is that KVM would need specific support (SNP_ACTIVATE_EX command support) to achieve this when running on a system with more than one socket. However, the SEV firmware will make the proper check and return POLICY_FAILURE during SNP_ACTIVATE if the single socket guest policy bit is set and the system has more than one socket:

  - System with one socket:
      - Guest policy SINGLE_SOCKET == 0  ==>  SNP_ACTIVATE succeeds
      - Guest policy SINGLE_SOCKET == 1  ==>  SNP_ACTIVATE succeeds

  - System with more than one socket:
      - Guest policy SINGLE_SOCKET == 0  ==>  SNP_ACTIVATE succeeds
      - Guest policy SINGLE_SOCKET == 1  ==>  SNP_ACTIVATE fails with POLICY_FAILURE

Remove the check for the SINGLE_SOCKET policy bit from snp_launch_start() and allow the firmware to perform the proper checking. This does have the effect of allowing an SNP guest with the SINGLE_SOCKET policy bit set to run on a single socket system, but fail when run on a system with more than one socket. However, this should not affect existing SNP guests as setting the SINGLE_SOCKET policy bit is not allowed today.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/4c51018dd3e4f2c543935134d2c4f47076f109f6.1748553480.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Allow SNP guest policy to disallow running with SMT enabled  (Tom Lendacky)

KVM currently returns -EINVAL when it attempts to create an SNP guest if the SMT guest policy bit is not set. However, there is no reason to check this, as there is no specific support in KVM that is required to support this. The SEV firmware will determine if SMT has been enabled or disabled in the BIOS and process the policy in the proper way:

  - SMT enabled in BIOS:
      - Guest policy SMT == 0  ==>  SNP_LAUNCH_START fails with POLICY_FAILURE
      - Guest policy SMT == 1  ==>  SNP_LAUNCH_START succeeds

  - SMT disabled in BIOS:
      - Guest policy SMT == 0  ==>  SNP_LAUNCH_START succeeds
      - Guest policy SMT == 1  ==>  SNP_LAUNCH_START succeeds

Remove the check for the SMT policy bit from snp_launch_start() and allow the firmware to perform the proper checking.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/71043abdd9ef23b6f98fffa9c5c6045ac3a50187.1748553480.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: TDX: Move TDX hardware setup from main.c to tdx.c  (Sean Christopherson)

Move TDX hardware setup to tdx.c, as the code is obviously TDX specific, co-locating the setup with tdx_bringup() makes it easier to see and document the success_disable_tdx "error" path, and configuring the TDX specific hooks in tdx.c reduces the number of globally visible TDX symbols.

Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250523001138.3182794-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86/mmu: Exempt nested EPT page tables from !USER, CR0.WP=0 logic  (Sean Christopherson)

Exempt nested EPT shadow page tables from the CR0.WP=0 handling of supervisor writes, as EPT doesn't have a U/S bit and isn't affected by CR0.WP (or CR4.SMEP in the exception to the exception).

Opportunistically refresh the comment to explain what KVM is doing, as the only record of why KVM shoves in WRITE and drops USER is buried in years-old changelogs.

Cc: Jon Kohler <jon@nutanix.com>
Cc: Sergey Dyasli <sergey.dyasli@nutanix.com>
Reviewed-by: Jon Kohler <jon@nutanix.com>
Reviewed-by: Sergey Dyasli <sergey.dyasli@nutanix.com>
Link: https://lore.kernel.org/r/20250602234851.54573-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Refactor handling of SIPI_RECEIVED when setting MP_STATE  (Sean Christopherson)

Convert the incoming mp_state to INIT_RECEIVED instead of manually calling kvm_set_mp_state() to make it more obvious that the SIPI_RECEIVED logic is translating the incoming state to KVM's internal tracking, as opposed to being some entirely unique flow. Opportunistically add a comment to explain what the code is doing.

No functional change intended.

Link: https://lore.kernel.org/r/20250605195018.539901-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Move INIT_RECEIVED vs. INIT/SIPI blocked check to KVM_RUN  (Sean Christopherson)

Check for the should-be-impossible scenario of a vCPU being in Wait-For-SIPI with INIT/SIPI blocked during KVM_RUN instead of trying to detect and prevent illegal combinations in every ioctl that sets relevant state. Attempting to handle every possible "set" path is a losing game of whack-a-mole, and risks breaking userspace. E.g. INIT/SIPI are blocked on Intel if the vCPU is in VMX Root mode (post-VMXON), and on AMD if GIF=0. Handling those scenarios would require potentially breaking changes to {vmx,svm}_set_nested_state().

Moving the check to KVM_RUN fixes a syzkaller-induced splat due to the aforementioned VMXON case, and in theory should close the hole once and for all.

Note, kvm_x86_vcpu_pre_run() already handles SIPI_RECEIVED, only the WFS case needs additional attention.

Reported-by: syzbot+c1cbaedc2613058d5194@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?id=490ae63d8d89cb82c5d462d16962cf371df0e476
Link: https://lore.kernel.org/r/20250605195018.539901-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

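[Editor's sketch] In rough pseudocode, the KVM_RUN-time check amounts to something like the following; the helper name is hypothetical, not the actual patch:

  /* In the KVM_RUN path, before entering the guest: */
  if (WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED &&
                   kvm_vcpu_init_sipi_blocked(vcpu)))  /* hypothetical helper */
          return -EINVAL;  /* WFS with INIT/SIPI blocked is nonsensical */
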
2025-06-20  KVM: x86: WARN and reject KVM_RUN if vCPU's MP_STATE is SIPI_RECEIVED  (Sean Christopherson)

WARN if KVM_RUN is reached with a vCPU's mp_state set to SIPI_RECEIVED, as KVM no longer uses SIPI_RECEIVED internally, and should morph SIPI_RECEIVED into INIT_RECEIVED with a pending SIPI if userspace forces SIPI_RECEIVED. See commit 66450a21f996 ("KVM: x86: Rework INIT and SIPI handling") for more history and details.

Link: https://lore.kernel.org/r/20250605195018.539901-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Drop pending_smi vs. INIT_RECEIVED check when setting MP_STATE  (Sean Christopherson)

Allow userspace to set a vCPU's mp_state to INIT_RECEIVED in conjunction with a pending SMI, as rejecting that combination could prevent userspace from reflecting the output of KVM_GET_VCPU_EVENTS back into KVM via KVM_SET_VCPU_EVENTS.

At the time the check was added, smi_pending could only be set in the context of KVM_RUN, with the vCPU in the RUNNABLE state. I.e. it was impossible for KVM to save vCPU state such that userspace could see a pending SMI for a vCPU in WFS. That no longer holds true now that KVM processes requested SMIs during KVM_GET_VCPU_EVENTS, e.g. if a vCPU receives an SMI while in WFS, and then userspace saves vCPU state.

Note, this may partially re-open the user-triggerable WARN that was mostly closed by commit 28bf28887976 ("KVM: x86: fix user triggerable warning in kvm_apic_accept_events()"), but that WARN can already be triggered in several other ways, e.g. if userspace stuffs VMXON=1 after putting the vCPU into WFS. That issue will be addressed in an upcoming commit, in a more robust fashion (hopefully).

Fixes: 1f7becf1b7e2 ("KVM: x86: get smi pending status correctly")
Link: https://lore.kernel.org/r/20250605195018.539901-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Simplify userspace filter logic when disabling MSR interception  (Sean Christopherson)

Refactor {svm,vmx}_disable_intercept_for_msr() to simplify the handling of userspace filters that disallow access to an MSR. The more complicated logic is no longer needed or justified now that KVM recalculates all MSR intercepts on a userspace MSR filter change, i.e. now that KVM doesn't need to also update shadow bitmaps.

No functional change intended.

Suggested-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-32-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Add a helper to allocate and initialize permissions bitmaps  (Sean Christopherson)

Add a helper to allocate and initialize an MSR or I/O permissions map, as the logic is identical between the two map types, the only difference is the size of the bitmap. Opportunistically add a comment to explain why the bitmaps are initialized with 0xff, e.g. instead of the more common zero-initialized behavior, which is the main motivation for deduplicating the code.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-31-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

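[Editor's sketch] A minimal sketch of what such a shared allocator could look like; the helper name is hypothetical:

  /*
   * SVM permission bitmaps are "inverted": a 1 bit means intercept.
   * Initializing with 0xff therefore intercepts everything by default,
   * and later code clears bits to pass specific MSRs/ports through.
   */
  static void *svm_alloc_permissions_map(unsigned long size, gfp_t gfp)
  {
          struct page *pages = alloc_pages(gfp, get_order(size));

          if (!pages)
                  return NULL;

          memset(page_address(pages), 0xff, size);
          return page_address(pages);
  }
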
2025-06-20  KVM: nSVM: Merge MSRPM in 64-bit chunks on 64-bit kernels  (Sean Christopherson)

When merging L0 and L1 MSRPMs as part of nested VMRUN emulation, access the bitmaps using "unsigned long" chunks, i.e. use 8-byte access for 64-bit kernels instead of arbitrarily working on 4-byte chunks.

Opportunistically rename local variables in nested_svm_merge_msrpm() to more precisely/accurately reflect their purpose ("offset" in particular is extremely ambiguous).

Link: https://lore.kernel.org/r/20250610225737.156318-30-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

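[Editor's sketch] The core of the chunked merge reduces to a loop like this, assuming a set bit means "intercept" so that OR-ing honors both vmcb01 and vmcb12; the real code also has to read L1's bitmap out of guest memory:

  static void merge_msrpm_chunks(unsigned long *msrpm02,
                                 const unsigned long *msrpm01,
                                 const unsigned long *msrpm12,
                                 size_t bytes)
  {
          size_t i;

          /* 8-byte chunks on 64-bit kernels, 4-byte chunks on 32-bit. */
          for (i = 0; i < bytes / sizeof(unsigned long); i++)
                  msrpm02[i] = msrpm01[i] | msrpm12[i];
  }
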
2025-06-20  KVM: SVM: Return -EINVAL instead of MSR_INVALID to signal out-of-range MSR  (Sean Christopherson)

Return -EINVAL instead of MSR_INVALID from svm_msrpm_bit_nr() to indicate that the MSR isn't covered by one of the (currently) three MSRPM ranges, and delete the MSR_INVALID macro now that all users are gone.

Link: https://lore.kernel.org/r/20250610225737.156318-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: nSVM: Access MSRPM in 4-byte chunks only for merging L0 and L1 bitmaps  (Sean Christopherson)

Access the MSRPM using u32/4-byte chunks (and appropriately adjusted offsets) only when merging L0 and L1 bitmaps as part of emulating VMRUN. The only reason to batch accesses to MSRPMs is to avoid the overhead of uaccess operations (e.g. STAC/CLAC and bounds checks) when reading L1's bitmap pointed at by vmcb12. For all other uses, either per-bit accesses are more than fast enough (no uaccess), or KVM is only accessing a single bit (nested_svm_exit_handled_msr()) and so there's nothing to batch.

In addition to (hopefully) documenting the uniqueness of the merging code, restricting chunked access to _just_ the merging code will allow for increasing the chunk size (to unsigned long) with minimal risk.

Link: https://lore.kernel.org/r/20250610225737.156318-28-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Store MSRPM pointer as "void *" instead of "u32 *"  (Sean Christopherson)

Store KVM's MSRPM pointers as "void *" instead of "u32 *" to guard against directly accessing the bitmaps outside of code that is explicitly written to access the bitmaps with a specific type.

Opportunistically use svm_vcpu_free_msrpm() in svm_vcpu_free() instead of open coding an equivalent.

Link: https://lore.kernel.org/r/20250610225737.156318-27-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Move svm_msrpm_offset() to nested.c  (Sean Christopherson)

Move svm_msrpm_offset() from svm.c to nested.c now that all usage of the u32-index offsets is nested virtualization specific.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-26-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Drop explicit check on MSRPM offset when emulating SEV-ES accesses  (Sean Christopherson)

Now that msr_write_intercepted() defaults to true, i.e. accurately reflects hardware behavior for out-of-range MSRs, and doesn't WARN (or BUG) on an out-of-range MSR, drop sev_es_prevent_msr_access()'s svm_msrpm_offset() check that guarded against calling msr_write_intercepted() with a "bad" index.

Opportunistically clean up the helper's formatting.

Link: https://lore.kernel.org/r/20250610225737.156318-25-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20KVM: SVM: Merge "after set CPUID" intercept recalc helpersSean Christopherson
Merge svm_recalc_intercepts_after_set_cpuid() and svm_recalc_instruction_intercepts() such that the "after set CPUID" helper simply invokes the type-specific helpers (MSRs vs. instructions), i.e. make svm_recalc_intercepts_after_set_cpuid() a single entry point for all intercept updates that need to be performed after a CPUID change. No functional change intended. Link: https://lore.kernel.org/r/20250610225737.156318-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Fold svm_vcpu_init_msrpm() into its sole caller  (Sean Christopherson)

Fold svm_vcpu_init_msrpm() into svm_recalc_msr_intercepts() now that there is only the one caller (and because the "init" misnomer is even more misleading than it was in the past).

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Rename init_vmcb_after_set_cpuid() to make it intercepts specific  (Sean Christopherson)

Rename init_vmcb_after_set_cpuid() to svm_recalc_intercepts_after_set_cpuid() to more precisely describe its role. Strictly speaking, the name isn't perfect as toggling virtual VM{LOAD,SAVE} is arguably not recalculating an intercept, but practically speaking it's close enough.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts()  (Sean Christopherson)

Rename msr_filter_changed() to recalc_msr_intercepts() and drop the trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc helper to react to the new userspace filter.

No functional change intended.

Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Manually recalc all MSR intercepts on userspace MSR filter change  (Sean Christopherson)

On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden.

Given that KVM *must* be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps.

Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate.

Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Francesco Lavra <francescolavra.fl@gmail.com>
Link: https://lore.kernel.org/r/20250610225737.156318-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: VMX: Manually recalc all MSR intercepts on userspace MSR filter change  (Sean Christopherson)

On a userspace MSR filter change, recalculate all MSR intercepts using the filter-agnostic logic instead of maintaining a "shadow copy" of KVM's desired intercepts. The shadow bitmaps add yet another point of failure, are confusing (e.g. what does "handled specially" mean!?!?), an eyesore, and a maintenance burden.

Given that KVM *must* be able to recalculate the correct intercepts at any given time, and that MSR filter updates are not hot paths, there is zero benefit to maintaining the shadow bitmaps.

Opportunistically switch from boot_cpu_has() to cpu_feature_enabled() as appropriate.

Link: https://lore.kernel.org/all/aCdPbZiYmtni4Bjs@google.com
Link: https://lore.kernel.org/all/20241126180253.GAZ0YNTdXH1UGeqsu6@fat_crate.local
Cc: Borislav Petkov <bp@alien8.de>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Xin Li (Intel) <xin@zytor.com>
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Move definition of X2APIC_MSR() to lapic.h  (Sean Christopherson)

Dedup the definition of X2APIC_MSR and put it in the local APIC code where it belongs.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20KVM: SVM: Drop "always" flag from list of possible passthrough MSRsSean Christopherson
Drop the "always" flag from the array of possible passthrough MSRs, and instead manually initialize the permissions for the handful of MSRs that KVM passes through by default. In addition to cutting down on boilerplate copy+paste code and eliminating a misleading flag (the MSRs aren't always passed through, e.g. thanks to MSR filters), this will allow for removing the direct_access_msrs array entirely. Link: https://lore.kernel.org/r/20250610225737.156318-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20  KVM: SVM: Pass through GHCB MSR if and only if VM is an SEV-ES guest  (Sean Christopherson)

Disable interception of the GHCB MSR if and only if the VM is an SEV-ES guest. While the exact behavior is completely undocumented in the APM, common sense and testing on SEV-ES capable CPUs says that accesses to the GHCB from non-SEV-ES guests will #GP. I.e. from the guest's perspective, no functional change intended.

Fixes: 376c6d285017 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
Link: https://lore.kernel.org/r/20250610225737.156318-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Implement and adopt VMX style MSR intercepts APIs  (Sean Christopherson)

Add and use SVM MSR interception APIs (in most paths) to match VMX's APIs and nomenclature. Specifically, add SVM variants of:

  vmx_disable_intercept_for_msr(vcpu, msr, type)
  vmx_enable_intercept_for_msr(vcpu, msr, type)
  vmx_set_intercept_for_msr(vcpu, msr, type, intercept)

to eventually replace SVM's single helper:

  set_msr_interception(vcpu, msrpm, msr, allow_read, allow_write)

which is awkward to use (in all cases, KVM either applies the same logic for both reads and writes, or intercepts one of read or write), and is unintuitive due to using '0' to indicate interception should be *set*.

Keep the guts of the old API for the moment to avoid churning the MSR filter code, as that mess will be overhauled in the near future. Leave behind a temporary comment to call out that the shadow bitmaps have inverted polarity relative to the bitmaps consumed by hardware.

No functional change intended.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

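[Editor's sketch] Per the changelog, the new SVM surface mirrors VMX's helpers; the set() variant is presumably a thin dispatcher, as it is on the VMX side:

  static inline void svm_set_intercept_for_msr(struct kvm_vcpu *vcpu, u32 msr,
                                               int type, bool intercept)
  {
          if (intercept)
                  svm_enable_intercept_for_msr(vcpu, msr, type);
          else
                  svm_disable_intercept_for_msr(vcpu, msr, type);
  }
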
2025-06-20  KVM: SVM: Add helpers for accessing MSR bitmap that don't rely on offsets  (Sean Christopherson)

Add macro-built helpers for testing, setting, and clearing MSRPM entries without relying on precomputed offsets. This sets the stage for eventually removing general KVM use of precomputed offsets, which are quite confusing and rather inefficient for the vast majority of KVM's usage.

Outside of merging L0 and L1 bitmaps for nested SVM, using u32-indexed offsets and accesses is at best unnecessary, and at worst introduces extra operations to retrieve the individual bit from within the offset u32 value. And simply calling them "offsets" is very confusing, as the "unit" of the offset isn't immediately obvious.

Use the new helpers in set_msr_interception_bitmap() and msr_write_intercepted() to verify the math and operations, but keep the existing offset-based logic in set_msr_interception_bitmap() to sanity check the "clear" and "set" operations. Manipulating MSR interceptions isn't a hot path and no kernel release is ever expected to contain this specific version of set_msr_interception_bitmap() (it will be removed entirely in the near future).

Link: https://lore.kernel.org/r/20250610225737.156318-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

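[Editor's sketch] What an offset-free accessor can look like, assuming svm_msrpm_bit_nr() (introduced later in the series) maps an MSR to the absolute bit index of its read bit, with the write bit immediately following:

  static inline bool svm_test_msr_write_intercepted(const unsigned long *msrpm,
                                                    u32 msr)
  {
          int bit_nr = svm_msrpm_bit_nr(msr);

          /* Out-of-range MSRs are always intercepted by hardware. */
          if (bit_nr < 0)
                  return true;

          return test_bit(bit_nr + 1, msrpm);  /* +1 == the write bit */
  }
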
2025-06-20  KVM: nSVM: Don't initialize vmcb02 MSRPM with vmcb01's "always passthrough"  (Sean Christopherson)

Don't initialize vmcb02's MSRPM with KVM's set of "always passthrough" MSRs, as KVM always needs to consult L1's intercepts, i.e. needs to merge vmcb01 with vmcb12 and write the result to vmcb02. This will eventually allow for the removal of svm_vcpu_init_msrpm().

Note, the bitmaps are truly initialized by svm_vcpu_alloc_msrpm(), which defaults to intercepting all MSRs, e.g. if there is a bug lurking elsewhere, the worst case scenario from dropping the call to svm_vcpu_init_msrpm() should be that KVM would fail to passthrough MSRs to L2.

Link: https://lore.kernel.org/r/20250610225737.156318-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: nSVM: Omit SEV-ES specific passthrough MSRs from L0+L1 bitmap merge  (Sean Christopherson)

Don't merge bitmaps on nested VMRUN for MSRs that KVM passes through only for SEV-ES guests. KVM doesn't support nested virtualization for SEV-ES, and likely never will.

Link: https://lore.kernel.org/r/20250610225737.156318-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: nSVM: Use dedicated array of MSRPM offsets to merge L0 and L1 bitmaps  (Sean Christopherson)

Use a dedicated array of MSRPM offsets to merge L0 and L1 bitmaps, i.e. to merge KVM's vmcb01 bitmap with L1's vmcb12 bitmap. This will eventually allow for the removal of direct_access_msrs, as the only path where tracking the offsets is truly justified is the merge for nested SVM, where merging in chunks is an easy way to batch uaccess reads/writes.

Opportunistically omit the x2APIC MSRs from the merge-specific array instead of filtering them out at runtime.

Note, disabling interception of DEBUGCTL, XSS, EFER, PAT, GHCB, and TSC_AUX is mutually exclusive with nested virtualization, as KVM passes through those MSRs only for SEV-ES guests, and KVM doesn't support nested virtualization for SEV+ guests. Defer removing those MSRs to a future cleanup in order to make this refactoring as benign as possible.

Link: https://lore.kernel.org/r/20250610225737.156318-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Clean up macros related to architectural MSRPM definitions  (Sean Christopherson)

Move SVM's MSR Permissions Map macros to svm.h in anticipation of adding helpers that are available to SVM code, and opportunistically replace a variety of open-coded literals with (hopefully) informative macros.

Opportunistically open code ARRAY_SIZE(msrpm_ranges) instead of wrapping it as NUM_MSR_MAPS, which is an ambiguous name even if it were qualified with "SVM_MSRPM".

Deliberately leave the ranges as open coded literals, as using macros to define the ranges actually introduces more potential failure points, since both the definitions and the usage have to be careful to use the correct index. The lack of clear intent behind the ranges will be addressed in future patches.

No functional change intended.

Link: https://lore.kernel.org/r/20250610225737.156318-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Massage name and param of helper that merges vmcb01 and vmcb12 MSRPMs  (Sean Christopherson)

Rename nested_svm_vmrun_msrpm() to nested_svm_merge_msrpm() to better capture its role, and opportunistically feed it @vcpu instead of @svm, as grabbing "svm" only to turn around and grab svm->vcpu is rather silly.

No functional change intended.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: x86: Use non-atomic bit ops to manipulate "shadow" MSR intercepts  (Sean Christopherson)

Manipulate the MSR bitmaps using non-atomic bit ops APIs (two underscores), as the bitmaps are per-vCPU and are only ever accessed while vcpu->mutex is held.

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

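[Editor's note] The change swaps the atomic bitops for their double-underscore counterparts; on x86 this drops the LOCK prefix:

  /* Atomic (LOCK-prefixed RMW on x86), needed only for concurrent access: */
  set_bit(bit_nr, bitmap);
  clear_bit(bit_nr, bitmap);

  /* Non-atomic, safe here because vcpu->mutex serializes all writers: */
  __set_bit(bit_nr, bitmap);
  __clear_bit(bit_nr, bitmap);
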
2025-06-20  KVM: SVM: Kill the VM instead of the host if MSR interception is buggy  (Sean Christopherson)

WARN and kill the VM instead of panicking the host if KVM attempts to set or query MSR interception for an unsupported MSR. Accessing the MSR interception bitmaps only meaningfully affects post-VMRUN behavior, and KVM_BUG_ON() is guaranteed to prevent the current vCPU from doing VMRUN, i.e. there is no need to panic the entire host.

Opportunistically move the sanity checks so they run before the result is used to index into the MSRPM, e.g. so that bugs only WARN and terminate the VM, as opposed to doing that _and_ generating an out-of-bounds load.

Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

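[Editor's sketch] The resulting pattern: validate the index first, and use KVM_BUG_ON() (which WARNs and marks the VM as bugged, blocking further VMRUNs) rather than BUG(). The conversion helper name is hypothetical:

  bit_nr = svm_msrpm_bit_nr(msr);  /* hypothetical conversion helper */
  if (KVM_BUG_ON(bit_nr < 0, vcpu->kvm))
          return;  /* bail before the out-of-bounds access */

  __set_bit(bit_nr, msrpm);
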
2025-06-20  KVM: SVM: Use ARRAY_SIZE() to iterate over direct_access_msrs  (Sean Christopherson)

Drop the unnecessary and dangerous value-terminated behavior of direct_access_msrs, and simply iterate over the actual size of the array. The use in svm_set_x2apic_msr_interception() is especially sketchy, as it relies on unused capacity being zero-initialized, and '0' being outside the range of x2APIC MSRs.

To ensure the array and shadow_msr_intercept stay synchronized, simply assert that their sizes are identical (note the six 64-bit-only MSRs).

Note, direct_access_msrs will soon be removed entirely; keeping the assert synchronized with the array isn't expected to be a long-term maintenance burden.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Tag MSR bitmap initialization helpers with __init  (Sean Christopherson)

Tag init_msrpm_offsets() and add_msr_offset() with __init, as they're used only during hardware setup to map potential passthrough MSRs to offsets in the bitmap.

Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Don't BUG if setting up the MSR intercept bitmaps fails  (Sean Christopherson)

WARN and reject module loading if there is a problem with KVM's MSR interception bitmaps. Panicking the host in this situation is inexcusable since it is trivially easy to propagate the error up the stack.

Link: https://lore.kernel.org/r/20250610225737.156318-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Allocate IOPM pages after initial setup in svm_hardware_setup()  (Sean Christopherson)

Allocate pages for the IOPM after initial setup has been completed in svm_hardware_setup(), so that sanity checks can be added in the setup flow without needing to free the IOPM pages. The IOPM is only referenced (via iopm_base) in init_vmcb() and svm_hardware_unsetup(), so there's no need to allocate it early on.

No functional change intended (beyond the obvious ordering differences, e.g. if the allocation fails).

Link: https://lore.kernel.org/r/20250610225737.156318-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: SVM: Disable interception of SPEC_CTRL iff the MSR exists for the guest  (Sean Christopherson)

Disable interception of SPEC_CTRL when the CPU virtualizes (i.e. context switches) SPEC_CTRL if and only if the MSR exists according to the vCPU's CPUID model. Letting the guest access SPEC_CTRL is generally benign, but the guest would see inconsistent behavior if KVM happened to emulate an access to the MSR.

Fixes: d00b99c514b3 ("KVM: SVM: Add support for Virtual SPEC_CTRL")
Reported-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20250610225737.156318-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

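[Editor's sketch] Conceptually, the fix gates the passthrough on the guest's CPUID, e.g. along these lines, using the series' svm_set_intercept_for_msr() and assuming a guest_has_spec_ctrl_msr()-style helper (KVM's x86.c has one that checks all of the CPUID bits that enumerate SPEC_CTRL):

  if (cpu_feature_enabled(X86_FEATURE_V_SPEC_CTRL))
          svm_set_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL, MSR_TYPE_RW,
                                    !guest_has_spec_ctrl_msr(vcpu));
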
2025-06-20  KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guest  (Maxim Levitsky)

Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting while running the guest.

When running with the "default treatment of SMIs" in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that is visible to host (non-SMM) software; the CPU instead transitions directly from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched by hardware on SMI or RSM, i.e. SMM will run with whatever value was resident in hardware at the time of the SMI.

Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting events while the CPU is executing in SMM, which can pollute profiling and potentially leak information into the guest.

Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner run loop, as the bit can be toggled in IRQ context via IPI callback (SMP function call), by way of /sys/devices/cpu/freeze_on_smi.

Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs, i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and at worst could lead to undesirable behavior in the future if AMD CPUs ever happened to pick up a collision with the bit.

Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module owns and controls GUEST_IA32_DEBUGCTL. WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the lack of handling isn't a KVM bug (TDX already WARNs on any run_flag).

Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state(). Doing so avoids the need to track host_debugctl on a per-VMCS basis, as GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't have actually entered the guest, vcpu_enter_guest() will have run with vmcs02 active and thus could result in vmcs01 being run with a stale value.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

2025-06-20  KVM: VMX: Wrap all accesses to IA32_DEBUGCTL with getter/setter APIs  (Maxim Levitsky)

Introduce vmx_guest_debugctl_{read,write}() to handle all accesses to vmcs.GUEST_IA32_DEBUGCTL. This will allow stuffing FREEZE_IN_SMM into GUEST_IA32_DEBUGCTL based on the host setting without bleeding the state into the guest, and without needing to copy+paste the FREEZE_IN_SMM logic into every patch that accesses GUEST_IA32_DEBUGCTL.

No functional change intended.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: massage changelog, make inline, use in all prepare_vmcs02() cases]
Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
Link: https://lore.kernel.org/r/20250610232010.162191-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

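[Editor's sketch] A plausible shape for the accessors at this point in the series, i.e. trivial wrappers, with the FREEZE_IN_SMM stuffing landing in the follow-up patch:

  static inline u64 vmx_guest_debugctl_read(void)
  {
          return vmcs_read64(GUEST_IA32_DEBUGCTL);
  }

  static inline void vmx_guest_debugctl_write(struct kvm_vcpu *vcpu, u64 val)
  {
          vmcs_write64(GUEST_IA32_DEBUGCTL, val);
  }
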
2025-06-20  KVM: nVMX: Check vmcs12->guest_ia32_debugctl on nested VM-Enter  (Maxim Levitsky)

Add a consistency check for L2's guest_ia32_debugctl, as KVM only supports a subset of hardware functionality, i.e. KVM can't rely on hardware to detect illegal/unsupported values. Failure to check the vmcs12 value would allow the guest to load any hardware-supported value while running L2.

Take care to exempt BTF and LBR from the validity check in order to match KVM's behavior for writes via WRMSR, but without clobbering vmcs12. Even if VM_EXIT_SAVE_DEBUG_CONTROLS is set in vmcs12, L1 can reasonably expect that vmcs12->guest_ia32_debugctl will not be modified if writes to the MSR are being intercepted.

Arguably, KVM _should_ update vmcs12 if VM_EXIT_SAVE_DEBUG_CONTROLS is set *and* writes to MSR_IA32_DEBUGCTLMSR are not being intercepted by L1, but that would incur non-trivial complexity and wouldn't change the fact that KVM's handling of DEBUGCTL is blatantly broken. I.e. the extra complexity is not worth carrying.

Cc: stable@vger.kernel.org
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250610232010.162191-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>

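[Editor's sketch] The exemption boils down to a check along these lines; vmx_get_supported_debugctl() is assumed here to return the DEBUGCTL bits KVM supports for the vCPU:

  static bool nested_vmx_valid_debugctl(struct kvm_vcpu *vcpu, u64 debugctl)
  {
          /* BTF and LBR are exempted to match KVM's WRMSR leniency. */
          u64 valid = vmx_get_supported_debugctl(vcpu, false) |
                      DEBUGCTLMSR_BTF | DEBUGCTLMSR_LBR;

          return !(debugctl & ~valid);
  }
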